Raft Locking Advice

If you are wondering how to use locks in the 6.824 Raft labs, here are
some rules and ways of thinking that might be helpful.

Rule 1: Whenever you have data that more than one goroutine uses, and
at least one goroutine might modify the data, the goroutines should
use locks to prevent simultaneous use of the data. The Go race
detector is pretty good at detecting violations of this rule (though
it won’t help with any of the rules below).
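
For example, here is a minimal sketch of Rule 1 (the Counter type is purely illustrative and not part of the labs): two goroutines may call Inc() and Get() at the same time, so both methods take the same mutex before touching the shared field.

import "sync"

type Counter struct {
  mu sync.Mutex // protects n
  n  int
}

func (c *Counter) Inc() {
  c.mu.Lock()
  c.n += 1
  c.mu.Unlock()
}

func (c *Counter) Get() int {
  c.mu.Lock()
  defer c.mu.Unlock()
  return c.n
}

Without the Lock()/Unlock() calls, running the tests with go test -race would report a data race as soon as two goroutines used the counter concurrently.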

Rule 2: Whenever code makes a sequence of modifications to shared
data, and other goroutines might malfunction if they looked at the
data midway through the sequence, you should use a lock around the
whole sequence.

An example:

rf.mu.Lock()
rf.currentTerm += 1
rf.state = Candidate
rf.mu.Unlock()

It would be a mistake for another goroutine to see either of these
updates alone (i.e. the old state with the new term, or the new term
with the old state). So we need to hold the lock continuously over the
whole sequence of updates. All other code that uses rf.currentTerm or
rf.state must also hold the lock, in order to ensure exclusive access
for all uses.

The code between Lock() and Unlock() is often called a “critical
section.” The locking rules a programmer chooses (e.g. “a goroutine
must hold rf.mu when using rf.currentTerm or rf.state”) are often
called a “locking protocol”.

Rule 3: Whenever code does a sequence of reads of shared data (or
reads and writes), and would malfunction if another goroutine modified
the data midway through the sequence, you should use a lock around the
whole sequence.

An example that could occur in a Raft RPC handler:

rf.mu.Lock()
if args.Term > rf.currentTerm {
  rf.currentTerm = args.Term
}
rf.mu.Unlock()

This code needs to hold the lock continuously for the whole sequence.
Raft requires that currentTerm only increases, and never decreases.
Another RPC handler could be executing in a separate goroutine; if it
were allowed to modify rf.currentTerm between the if statement and the
update to rf.currentTerm, this code might end up decreasing
rf.currentTerm. Hence the lock must be held continuously over the
whole sequence. In addition, every other use of currentTerm must hold
the lock, to ensure that no other goroutine modifies currentTerm
during our critical section.
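
To see why, here is a broken variant (illustrative only, not code from the labs) that drops the lock between the check and the write:

rf.mu.Lock()
needsUpdate := args.Term > rf.currentTerm
rf.mu.Unlock()
// another RPC handler may raise rf.currentTerm here
if needsUpdate {
  rf.mu.Lock()
  rf.currentTerm = args.Term // may now move currentTerm backwards
  rf.mu.Unlock()
}

If another handler raised rf.currentTerm above args.Term in the gap, the final assignment decreases it, violating the only-increases rule.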

Real Raft code would need to use longer critical sections than these
examples; for example, a Raft RPC handler should probably hold the
lock for the entire handler.
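
One common way to do that (a sketch assuming the RequestVote handler signature from the lab skeleton; the body is elided) is to take the lock on entry and defer the release:

func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
  rf.mu.Lock()
  defer rf.mu.Unlock()
  // the whole handler is one critical section: it can read and modify
  // rf.currentTerm, rf.state, the log, &c without interleaving
}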

Rule 4: It’s usually a bad idea to hold a lock while doing anything
that might wait: reading a Go channel, sending on a channel, waiting
for a timer, calling time.Sleep(), or sending an RPC (and waiting for the
reply). One reason is that you probably want other goroutines to make
progress during the wait. Another reason is deadlock avoidance. Imagine
two peers sending each other RPCs while holding locks; both RPC
handlers need the receiving peer’s lock; neither RPC handler can ever
complete because it needs the lock held by the waiting RPC call.

Code that waits should first release locks. If that’s not convenient,
sometimes it’s useful to create a separate goroutine to do the wait.
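
For example (a sketch; ApplyMsg, applyCh, cmd, and index stand in for the apply machinery and for values prepared under the lock):

rf.mu.Lock()
msg := ApplyMsg{CommandValid: true, Command: cmd, CommandIndex: index}
rf.mu.Unlock()
applyCh <- msg // the blocking send happens with no lock held

// or, if releasing the lock at this point is inconvenient:
go func(m ApplyMsg) { applyCh <- m }(msg)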

Rule 5: Be careful about assumptions across a drop and re-acquire of a
lock. One place this can arise is when avoiding waiting with locks
held. For example, this code to send vote RPCs is incorrect:

rf.mu.Lock()
rf.currentTerm += 1
rf.state = Candidate
for <each peer> {
  go func() {
    rf.mu.Lock()
    args.Term = rf.currentTerm
    rf.mu.Unlock()
    Call("Raft.RequestVote", &args, ...)
    // handle the reply...
  } ()
}
rf.mu.Unlock()

The code sends each RPC in a separate goroutine. It’s incorrect
because args.Term may not be the same as the rf.currentTerm at which
the surrounding code decided to become a Candidate. Lots of time may
pass between when the surrounding code creates the goroutine and when
the goroutine reads rf.currentTerm; for example, multiple terms may
come and go, and the peer may no longer be a candidate. One way to fix
this is for the created goroutine to use a copy of rf.currentTerm made
while the outer code holds the lock. Similarly, reply-handling code
after the Call() must re-check all relevant assumptions after
re-acquiring the lock; for example, it should check that
rf.currentTerm hasn’t changed since the decision to become a
candidate.
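
A corrected sketch along those lines (the reply handling is abbreviated, and the args/reply types are assumed to match the lab skeleton):

rf.mu.Lock()
rf.currentTerm += 1
rf.state = Candidate
term := rf.currentTerm // copy made while holding the lock
for <each peer> {
  go func() {
    args := RequestVoteArgs{Term: term} // uses the copy, not rf.currentTerm
    var reply RequestVoteReply
    Call("Raft.RequestVote", &args, &reply)
    rf.mu.Lock()
    defer rf.mu.Unlock()
    if rf.currentTerm != term || rf.state != Candidate {
      return // assumptions no longer hold; ignore the reply
    }
    // handle the reply...
  } ()
}
rf.mu.Unlock()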

It can be difficult to interpret and apply these rules. Perhaps most
puzzling is the notion in Rules 2 and 3 of code sequences that
shouldn’t be interleaved with other goroutines’ reads or writes. How
can one recognize such sequences? How should one decide where a
sequence ought to start and end?

One approach is to start with code that has no locks, and think
carefully about where one needs to add locks to attain correctness.
This approach can be difficult since it requires reasoning about the
correctness of concurrent code.

A more pragmatic approach starts with the observation that if there
were no concurrency (no simultaneously executing goroutines), you
would not need locks at all. But you have concurrency forced on you
when the RPC system creates goroutines to execute RPC handlers, and
because you need to send RPCs in separate goroutines to avoid waiting.
You can effectively eliminate this concurrency by identifying all
places where goroutines start (RPC handlers, background goroutines you
create in Make(), &c), acquiring the lock at the very start of each
goroutine, and only releasing the lock when that goroutine has
completely finished and returns. This locking protocol ensures that
nothing significant ever executes in parallel; the locks ensure that
each goroutine executes to completion before any other goroutine is
allowed to start. With no parallel execution, it’s hard to violate
Rules 1, 2, 3, or 5. If each goroutine’s code is correct in isolation
(when executed alone, with no concurrent goroutines), it’s likely to
still be correct when you use locks to suppress concurrency. So you
can avoid explicit reasoning about correctness, or explicitly
identifying critical sections.
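
Concretely, under this coarse protocol every goroutine body starts and ends the same way (a sketch; the handler and the background goroutine shown are just placeholders):

func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
  rf.mu.Lock()
  defer rf.mu.Unlock()
  // the entire handler runs with the lock held
}

// background goroutine created in Make()
go func() {
  rf.mu.Lock()
  defer rf.mu.Unlock()
  // runs to completion with the lock held, then returns
}()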

However, Rule 4 is likely to be a problem. So the next step is to find
places where the code waits, and to add lock releases and re-acquires
(and/or goroutine creation) as needed, being careful to re-establish
assumptions after each re-acquire. You may find this process easier to
get right than directly identifying sequences that must be locked for
correctness.
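
For example, a background loop can drop the lock around its wait and then re-check what it assumed before the wait (a sketch; the particular check is just an illustration):

for {
  rf.mu.Lock()
  term := rf.currentTerm
  rf.mu.Unlock()

  time.Sleep(50 * time.Millisecond) // wait with no lock held

  rf.mu.Lock()
  if rf.currentTerm != term {
    // the term changed during the sleep; don't act on the old assumption
  }
  rf.mu.Unlock()
}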

(As an aside, what this approach sacrifices is any opportunity for
better performance via parallel execution on multiple cores: your code
is likely to hold locks when it doesn’t need to, and may thus
unnecessarily prohibit parallel execution of goroutines. On the other
hand, there is not much opportunity for CPU parallelism within a
single Raft peer.)