Raft FAQ 1

Q: Does Raft sacrifice anything for simplicity?

A: Raft gives up some performance in return for clarity; for example:

  • Every operation must be written to disk for persistence; good
    performance probably requires batching many operations into each
    disk write (see the sketch after this answer).

  • There can only usefully be a single AppendEntries in flight from the
    leader to each follower: followers reject out-of-order
    AppendEntries, and the sender’s nextIndex[] mechanism requires
    sending them one at a time. A provision for pipelining many
    AppendEntries would be better.

  • The snapshotting design is only practical for relatively small
    states, since it writes the entire state to disk. If the state is
    big (e.g. if it’s a big database), you’d want a way to write just
    parts of the state that have changed recently.

  • Similarly, bringing recovering replicas up to date by sending them a
    complete snapshot will be slow, needlessly so if the replica already
    has an old snapshot.

  • Servers may not be able to take much advantage of multi-core because
    operations must be executed one at a time (in log order).

These could be fixed by modifying Raft, but the result might have less
value as a tutorial.
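
As an illustration of the batching point above, here is a rough sketch
(my own, not from the paper) of how a leader might drain every command
that is already queued and then do a single disk write for the whole
batch. The names pending, appendToLog, and persist are made up.

  func batchLoop(pending chan string, appendToLog func(string), persist func()) {
    for cmd := range pending {
      appendToLog(cmd)
      for {
        select {
        case more := <-pending:
          appendToLog(more) // piggy-back on the same disk write
          continue
        default:
        }
        break
      }
      persist() // one fsync covers the whole batch of operations
    }
  }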

Q: Is Raft used in real-world software, or do companies generally roll
their own flavor of Paxos (or use a different consensus protocol)?

A: There are several real-world users of Raft: Docker
(https://docs.docker.com/engine/swarm/raft/), etcd (https://etcd.io),
and MongoDB. Other systems said to be using Raft include CockroachDB,
RethinkDB, and TiKV. Maybe you can find more starting at
http://raft.github.io/

On the other hand, my impression is that most state-machine
replication systems are based on the Multi-Paxos and Viewstamped
Replication protocols.

Q: What is Paxos? In what sense is Raft simpler?

A: There is a protocol called Paxos that allows a set of servers to
agree on a single value. While Paxos requires some thought to
understand, it is far simpler than Raft. Here’s an easy-to-read paper
about Paxos:

http://css.csail.mit.edu/6.824/2014/papers/paxos-simple.pdf

However, Paxos solves a much smaller problem than Raft. To build a
real-world replicated service, the replicas need to agree on an
indefinite sequence of values (the client commands), and they need
ways to efficiently recover when servers crash and restart or miss
messages. People have built such systems with Paxos as the starting
point; look up Google’s Chubby and Paxos Made Live papers, and
ZooKeeper/ZAB. There is also a protocol called Viewstamped
Replication; it’s a good design, and similar to Raft, but the paper
about it is hard to understand.

These real-world protocols are complex, and (before Raft) there was
not a good introductory paper describing how they work. The Raft
paper, in contrast, is relatively easy to read and fairly detailed.
That’s a big contribution.

Whether the Raft protocol is inherently easier to understand than
something else is not clear. The issue is clouded by a lack of good
descriptions of other real-world protocols. In addition, Raft
sacrifices performance for clarity in a number of ways; that’s fine
for a tutorial but not always desirable in a real-world protocol.

Q: How long had Paxos existed before the authors created Raft?

A: Paxos was invented in the late 1980s. Raft was developed around
2012.

Raft closely resembles a protocol called Viewstamped Replication,
originally published in 1988. There were replicated fault-tolerant file
servers built on top of Viewstamped Replication in the early 1990s,
though not in production use.

A bunch of real-world systems are derived from Paxos: Chubby, Spanner,
Megastore, and ZooKeeper/ZAB. Starting in the early 2000s, big web
sites and cloud providers needed fault-tolerant services, and Paxos
was more or less re-discovered at that time and put into production
use.

Q: How does Raft’s performance compare to Paxos in real-world applications?

A: The fastest Paxos-derived protocols are probably faster than
Raft as described in the paper; have a look at ZAB/ZooKeeper and Paxos
Made Live. On the other hand, etcd3 (using Raft) claims to have
achieved better performance than ZooKeeper and many Paxos-based
implementations (https://www.youtube.com/watch?v=hQigKX0MxPw).

There are situations where Raft’s leader is not so great. If the
datacenters containing replicas and clients are distant from each
other, people sometimes use agreement protocols derived from original
Paxos. The reason is that Paxos has no leader: any replica can
initiate agreement, so clients can talk to a replica in their local
datacenter rather than having to talk to a leader in a distant
datacenter. ePaxos is an example.

Q: Why are we learning/implementing Raft instead of Paxos?

A: We’re using Raft in 6.824 because there is a paper that clearly
describes how to build a complete replicated service using Raft. I
know of no satisfactory paper that describes how to build a complete
replicated server system based on Paxos.

Q: Are there systems like Raft that can survive and continue to
operate when only a minority of the cluster is active?

A: Not with Raft’s properties. But you can do it with different
assumptions, or different client-visible semantics. The basic problem
is split-brain – the possibility of multiple diverging copies of the
state, caused by multiple subsets of the replicas mutating the state
without being aware of each other. There are two solution approaches
that I know of.

If somehow clients and servers can learn exactly which servers are live
and which are dead (as opposed to live but unreachable due to network
failure), then one can build a system that can function as long as at
least one server is alive, picking (say) the lowest-numbered server
known to be alive.
However, it’s hard for one computer to decide if another computer is
dead, as opposed to the network losing the messages between them. One
way to do it is to have a human decide – the human can inspect each
server and decide which are alive and dead.

The other approach is to allow split-brain operation, and to have a
way for servers to reconcile the resulting diverging state after
partitions are healed. This can be made to work for some kinds of
services, but has complex client-visible semantics (usually called
“eventual consistency”). Have a look at the COPS, FuzzyLog, and
Bitcoin papers which are assigned later in the course.

Q: In Raft, the service which is being replicated is not available to
the clients during an election process. In practice how much of a
problem does this cause?

A: The client-visible pause seems likely to be on the order of a tenth of a
second. The authors expect failures (and thus elections) to be rare,
since they only happen if machines or the network fails. Many servers
and networks stay up continuously for months or even years at a time, so
this doesn’t seem like a huge problem for many applications.

Q: Are there other consensus systems that don’t have leader-election
pauses?

A: There are versions of Paxos-based replication that do not have a leader
or elections, and thus don’t suffer from pauses during elections.
Instead, any server can effectively act as leader at any time. The cost
of not having a leader is that more messages are required for each
agreement.

Q: How are Raft and VMware FT related?

A: Raft has no single point of failure, while VMware FT does have a
single point of failure in the form of the test-and-set server. In
that sense Raft is fundamentally more fault-tolerant than VMware FT.
One could fix this by implementing FT’s test-and-set server as a
replicated service using Raft or Paxos.

VMware FT can replicate any virtual machine guest, and thus any
server-style software. Raft can only replicate software designed
specifically to be replicated with Raft. For such software, Raft would
likely be more efficient than VMware FT.

Q: Why can’t a malicious person take over a Raft server, or forge
incorrect Raft messages?

A: Raft doesn’t include defenses against attacks like this. It assumes
that all participants are following the protocol, and that only the
correct set of servers is participating.

A real deployment would have to keep out malicious attackers. The most
straightforward option is to place the servers behind a firewall to
filter out packets from random people on the Internet, and to ensure
that all computers and people inside the firewall are trustworthy.

There may be situations where Raft has to operate on the same network as
potential attackers. In that case a good plan would be to authenticate
the Raft packets with some cryptographic scheme. For example, give each
legitimate Raft server a public/private key pair, have it sign all the
packets it sends, give each server a list of the public keys of
legitimate Raft servers, and have the servers ignore packets that aren’t
signed by a key on that list.
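
For concreteness, here is a minimal sketch of signing and verifying a
message with Go’s crypto/ed25519 package; this is just an illustration,
not part of Raft, and a real deployment would also need key
distribution and protection against replayed packets.

  package main

  import (
    "crypto/ed25519"
    "crypto/rand"
    "fmt"
  )

  func main() {
    // Each legitimate Raft server would hold a key pair like this, plus
    // a list of the public keys of the other legitimate servers.
    pub, priv, _ := ed25519.GenerateKey(rand.Reader)

    packet := []byte("AppendEntries term=5 leaderId=2 ...")
    sig := ed25519.Sign(priv, packet)

    // A receiver checks the signature against the sender's known public
    // key, and ignores the packet if verification fails.
    if ed25519.Verify(pub, packet, sig) {
      fmt.Println("packet accepted")
    } else {
      fmt.Println("packet rejected")
    }
  }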

Q: The paper mentions that Raft works under all non-Byzantine
conditions. What are Byzantine conditions and why could they make Raft
fail?

A: “Non-Byzantine conditions” means that the servers are fail-stop:
they either follow the Raft protocol correctly, or they halt. For
example, most power failures are non-Byzantine because they cause
computers to simply stop executing instructions; if a power failure
occurs, Raft may stop operating, but it won’t send incorrect results
to clients.

Byzantine failure refers to situations in which some computers execute
incorrectly, because of bugs or because someone malicious is
controlling the computers. If a failure like this occurs, Raft may
send incorrect results to clients.

Most of 6.824 is about tolerating non-Byzantine faults. Correct
operation despite Byzantine faults is more difficult; we’ll touch on
this topic at the end of the term.

Q: In Figure 1, what does the interface between client and server look
like?

A: Typically an RPC interface to the server. For a key/value storage
server such as you’ll build in Lab 3, it’s Put(key,value) and
Get(key) RPCs. The RPCs are handled by a key/value module in the
server, which calls Raft.Start() to ask Raft to put a client RPC in
the log, and reads the applyCh to learn of newly committed log
entries.
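
Here is a rough sketch of a Put RPC handler in that key/value module;
it assumes a Lab 3 style Start() interface, and the Op type, reply
fields, and waitForCommit helper are made up for illustration.

  func (kv *KVServer) Put(args *PutArgs, reply *PutReply) {
    op := Op{Key: args.Key, Value: args.Value}
    index, _, isLeader := kv.rf.Start(op) // ask Raft to append the op
    if !isLeader {
      reply.Err = "wrong leader" // client should try another server
      return
    }
    // Block until the goroutine that reads applyCh reports that the
    // entry at this index has been committed and executed.
    kv.waitForCommit(index)
    reply.Err = "OK"
  }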

Q: What if a client sends a request to a leader, but the leader
crashes before sending the client request to all followers, and the
new leader doesn’t have the request in its log? Won’t that cause the
client request to be lost?

A: Yes, the request may be lost. If a log entry isn’t committed, Raft
may not preserve it across a leader change.

That’s OK because the client could not have received a reply to its
request if Raft didn’t commit the request. The client will know (by
seeing a timeout or leader change) that its request may have been
lost, and will re-send it.

The fact that clients can re-send requests means that the system has
to be on its guard against duplicate requests; you’ll deal with this
in Lab 3.
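
One common way to do this (a sketch, not the only possible design) is
for each client to tag its requests with a client ID and a per-client
sequence number, and for the service to remember, for each client, the
highest sequence number it has already executed:

  // Called once per committed log entry, in log order.
  // lastSeq maps client ID -> highest sequence number already applied.
  func (kv *KVServer) apply(op Op, lastSeq map[int64]int64) {
    if op.Seq <= lastSeq[op.ClientId] {
      return // a duplicate of a request we already executed; ignore it
    }
    lastSeq[op.ClientId] = op.Seq
    kv.data[op.Key] = op.Value // actually execute the operation
  }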

Q: If there’s a network partition, can Raft end up with two leaders
and split brain?

A: No. There can be at most one active leader.

A new leader can only be elected if it can contact a majority of servers
(including itself) with RequestVote RPCs. So if there’s a partition, and
one of the partitions contains a majority of the servers, that one
partition can elect a new leader. Other partitions must have only a
minority, so they cannot elect a leader. If there is no majority
partition, there will be no leader (until someone repairs the network
partition).

Q: Suppose a new leader is elected while the network is partitioned,
but the old leader is in a different partition. How will the old
leader know to stop committing new entries?

A: The old leader will either not be able to get a majority of
successful responses to its AppendEntries RPCs (if it’s in a minority
partition), or if it can talk to a majority, that majority must
overlap with the new leader’s majority, and the servers in the overlap
will tell the old leader that there’s a higher term. That will cause
the old leader to switch to follower.
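
In code this usually amounts to a term check on every reply; a minimal
sketch (field and constant names such as Follower come from a typical
lab-style implementation, not from the paper):

  // When the old leader processes an AppendEntries reply:
  if reply.Term > rf.currentTerm {
    rf.currentTerm = reply.Term
    rf.state = Follower // stop acting as leader
    rf.votedFor = -1
    rf.persist()
  }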

Q: When some servers have failed, does “majority” refer to a majority
of the live servers, or a majority of all servers (even the dead
ones)?

A: Always a majority of all servers. So if there are 5 Raft peers in
total, but two have failed, a candidate must still get 3 votes
(including itself) in order to be elected leader.

There are many reasons for this. It could be that the two “failed”
servers are actually up and running in a different partition. From
their point of view, there are three failed servers. If they were
allowed to elect a leader using just two votes (from just the two
live-looking servers), we would get split brain. Another reason is
that we need the majorities assembled by any two leaders to overlap in
at least one server, to guarantee that a new leader sees the previous
term number and any log entries committed in previous terms; this
requires a majority out of all servers, dead and alive.
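
The arithmetic is always over the configured cluster size, not the
number of servers currently reachable; for example:

  nservers := 5              // configured cluster size
  majority := nservers/2 + 1 // = 3, even if two servers are down
  // any two groups of 3 out of 5 servers must share at least one
  // server (3 + 3 > 5), which is what makes the overlap argument work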

Q: What if the election timeout is too short? Will that cause Raft to
malfunction?

A: A bad choice of election timeout does not affect safety, it only
affects liveness.

If the election timeout is too small, then followers may repeatedly
time out before the leader has a chance to send out any AppendEntries.
In that case Raft may spend all its time electing new leaders, and no
time processing client requests. If the election timeout is too large,
then there will be a needlessly large pause after a leader failure
before a new leader is elected.

Q: Why randomize election timeouts?

A: To reduce the chance that two peers simultaneously become
candidates and split the votes evenly between them, preventing anyone
from being elected.
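
A common way to implement this in Go (a sketch; the exact range is a
tuning choice) is to pick a fresh random timeout, using math/rand and
time, each time the election timer is reset:

  // e.g. somewhere between 300 and 600 milliseconds, different for
  // each peer and for each reset, so two peers rarely expire together
  timeout := time.Duration(300+rand.Intn(300)) * time.Millisecond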

Q: Can a candidate declare itself the leader as soon as it receives
votes from a majority, and not bother waiting for further RequestVote
replies?

A: Yes – a majority is sufficient. It would be a mistake to wait
longer, because some peers might have failed and thus not ever reply.

Q: Can a leader in Raft ever stop being a leader except by crashing?

A: Yes. If a leader’s CPU is slow, or its network connection breaks,
loses too many packets, or delivers packets too slowly, the other
servers won’t see its AppendEntries RPCs, and will start an election.

Q: When are followers’ log entries sent to their state machines?

A: Only after the leader says that an entry is committed, using the
leaderCommit field of the AppendEntries RPC. At that point the
follower can execute (or apply) the log entry, which for us means send
it on the applyCh.
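
A typical way to structure this is a loop that applies entries
whenever commitIndex advances (a sketch; the ApplyMsg fields follow
the 6.824 lab skeleton, and locking is omitted):

  for rf.lastApplied < rf.commitIndex {
    rf.lastApplied++
    msg := ApplyMsg{
      CommandValid: true,
      Command:      rf.log[rf.lastApplied].Command,
      CommandIndex: rf.lastApplied,
    }
    rf.applyCh <- msg // hand the committed entry to the state machine
  }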

Q: Should the leader wait for replies to AppendEntries RPCs?

A: The leader should send the AppendEntries RPCs concurrently, without
waiting. As replies come back, the leader should count them, and mark
the log entry as committed only when it has replies from a majority of
servers (including itself).

One way to do this in Go is for the leader to send each AppendEntries
RPC in a separate goroutine, so that the leader sends the RPCs
concurrently. Something like this:

  count := 1   // the leader's own log already contains the entry
  var mu sync.Mutex
  for each server {
    go func() {
      // send the AppendEntries RPC and wait for the reply
      if reply.Success {
        mu.Lock()
        count++
        if count == nservers/2 + 1 {
          // this entry is committed
        }
        mu.Unlock()
      }
    }()
  }
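
Note that count is shared among the goroutines, so it must be
protected (for example with a mutex, as sketched above). A real
implementation also has to check reply.Term in case a follower reports
a higher term, and must retry with a lower nextIndex when a follower
rejects the entries because of a log inconsistency.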

Q: What happens if a half (or more) of the servers die?

A: The service can’t make any progress; it will keep trying to elect a
leader over and over. If/when enough servers come back to life with
persistent Raft state intact, they will be able to elect a leader and
continue.

Q: Why is the Raft log 1-indexed?

A: You should view it as zero-indexed, but starting out with one entry
(at index=0) that has term 0. That allows the very first AppendEntries
RPC to contain 0 as PrevLogIndex, and be a valid index into the log.
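
Concretely, an implementation can start the log with a placeholder
entry (a sketch, assuming a LogEntry struct with a Term field):

  // A sentinel entry at index 0 with term 0, so that PrevLogIndex and
  // PrevLogTerm are well defined for the very first real entry.
  rf.log = []LogEntry{{Term: 0}}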

Q: When network partition happens, wouldn’t client requests in
minority partitions be lost?

A: Yes, only the partition with a majority of servers can commit and
execute client operations. The servers in the minority partition(s)
won’t be able to commit client operations, so they won’t reply to
client requests. Clients will keep re-sending their requests until
they can contact a server in the majority partition.

Q: Is the argument in 5.4.3 a complete proof?

A: 5.4.3 is not a complete proof. Here are some places to look:

http://ramcloud.stanford.edu/~ongaro/thesis.pdf
http://verdi.uwplse.org/raft-proof.pdf

Q: Are there any limitations to what applications can be built on top of Raft?

A: I think that in order to fit cleanly into a replicated state
machine framework like Raft, the replicated service has to be
self-contained – it can have private state, and accept commands from
clients that update the state, but it can’t contact outside entities
without special precautions. If the replicated application interacts
with the outside world, the outside world has to be able to deal
correctly with repeated requests (due to replication and replay of
logs after reboot), and it has to never contradict itself (i.e. it has
to be careful to send exactly the same answer to all replicas, and to
all re-executions of log entries). That in turn seems to require that
any outside entity that a Raft-based application contacts must itself
be fault-tolerant, i.e. probably has to use Raft or something like it.
That’s fairly limiting.

As an example, imagine a replicated online ordering system sending
credit card charging requests to some external credit card processor
service. That external processor will see repeated requests (one or more
from each replica). Will it respond exactly the same way to each
request? Will it charge the credit card more than once? If it does do
the right thing, will it still do the right thing if it crashes and
reboots at an awkward time?