Jason Gustafson created KAFKA-14077:
---------------------------------------

             Summary: KRaft should support recovery from failed disk
                 Key: KAFKA-14077
                 URL: https://issues.apache.org/jira/browse/KAFKA-14077
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
             Fix For: 3.3.0


If one of the nodes in the metadata quorum has a disk failure, there is 
currently no safe way to bring the node back into the quorum. When we lose disk 
state, we are at risk of losing committed data even if the failure only affects 
a minority of the cluster.

Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and 
v3. Initially, v1 is the leader and writes a record at offset 1. After v2 
acknowledges replication of the record, it becomes committed. Suppose that v1 
fails before v3 has a chance to replicate this record. As long as v1 remains 
down, the raft protocol guarantees that only v2 can become leader, so the 
record cannot be lost. The raft protocol expects that when v1 returns, it will 
still have that record. But what if a disk failure makes the state 
unrecoverable and v1 then participates in leader election? The committed data 
would then exist on only a minority of the voters. The main problem here is how 
to recover from this impaired state without risking the loss of this data.
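
As a rough illustration of the example above, the following toy model (not 
KRaft code) treats each voter's log as a plain list and checks whether a record 
is held by a majority. The names `holders` and `on_majority` and the record 
label "r1" are made up for this sketch.

```python
def holders(logs, record):
    """Return the voters whose log contains the given record."""
    return sorted(v for v, log in logs.items() if record in log)


def on_majority(logs, record):
    """True if strictly more than half of the voters hold the record."""
    return len(holders(logs, record)) > len(logs) // 2


# v1 (leader) wrote the record and v2 replicated it; v3 has not caught up yet.
logs = {"v1": ["r1"], "v2": ["r1"], "v3": []}
committed_before = on_majority(logs, "r1")  # True: v1 and v2 hold it

# v1's disk fails and the node comes back with an empty log.
logs["v1"] = []
majority_after = on_majority(logs, "r1")    # False: only v2 holds it
survivors = holders(logs, "r1")             # ["v2"]
```

The record was committed under the protocol's rules, yet after the disk loss it 
survives on a single voter.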

Consider a naive solution which brings v1 back with an empty disk. Since the 
node has lost its prior knowledge of the state of the quorum, it will vote for 
any candidate that comes along. If v3 becomes a candidate, it will vote for 
itself and then only needs the vote from v1 to become leader. If that happens, 
the committed data on v2 will be lost.
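
The election in this scenario can be sketched as follows. This is a simplified 
model under stated assumptions, not the KRaft implementation: a voter grants 
its vote when the candidate's log is at least as up to date as its own, 
compared by (epoch of last record, log length). All function names here are 
hypothetical.

```python
def log_position(log):
    """(epoch of the last record, number of records); an empty log sorts lowest."""
    return (log[-1][0], len(log)) if log else (0, 0)


def grants_vote(voter_log, candidate_log):
    """Simplified raft rule: vote only for a candidate that is not behind us."""
    return log_position(candidate_log) >= log_position(voter_log)


def wins_election(candidate, logs):
    """The candidate votes for itself and needs a strict majority overall."""
    votes = sum(
        1
        for voter, log in logs.items()
        if voter == candidate or grants_vote(log, logs[candidate])
    )
    return votes > len(logs) // 2


# Records are (epoch, value). The record is committed on v1 and v2.
logs = {"v1": [(1, "r1")], "v2": [(1, "r1")], "v3": []}
v3_wins_before = wins_election("v3", logs)  # False: v1 and v2 both refuse

# Naive recovery: v1 rejoins with an empty disk.
logs["v1"] = []
v3_wins_after = wins_election("v3", logs)   # True: v3 plus the amnesiac v1
```

Once v3 leads with an empty log, v2 would have to truncate its log to match the 
leader, which discards the committed record.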

This is just one scenario. In general, the invariants that the raft protocol is 
designed to preserve go out the window when disk state is lost. For example, it 
is also possible to contrive a scenario where the loss of disk state leads to 
multiple leaders. Raft requires every vote cast by a voter to be written to 
disk for good reason: otherwise, the voter could vote for different candidates 
in the same epoch.
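
A minimal sketch of the multiple-leaders case, assuming a toy `Voter` class 
whose `disk` dictionary stands in for durable storage (both names are invented 
for illustration and do not correspond to KRaft classes):

```python
class Voter:
    def __init__(self):
        # Stand-in for durable state: epoch -> candidate this voter voted for.
        self.disk = {}

    def request_vote(self, epoch, candidate):
        """Record the first vote in an epoch and refuse any other candidate."""
        voted_for = self.disk.setdefault(epoch, candidate)
        return voted_for == candidate

    def lose_disk(self):
        """Simulate a disk failure: all recorded votes are gone."""
        self.disk = {}


def count_votes(voters, epoch, candidate):
    return sum(1 for v in voters if v.request_vote(epoch, candidate))


a, b, c = Voter(), Voter(), Voter()
epoch = 5

# Candidate "a" collects votes from a and b: 2 of 3, so "a" is leader.
votes_for_a = count_votes([a, b], epoch, "a")

# b loses its disk and forgets that it already voted in this epoch.
b.lose_disk()

# Candidate "c" collects votes from c and b in the same epoch: also 2 of 3.
votes_for_c = count_votes([c, b], epoch, "c")
```

Both candidates reach a majority in epoch 5. If b had kept its disk, the second 
request would have been refused and "c" would have ended with a single vote.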

Many systems solve this problem with a unique identifier which is generated 
automatically and stored on disk. This identifier is then committed to the raft 
log. If a disk is replaced, the node comes back with a new identifier, and we 
can prevent it from breaking raft invariants. Recovery from a failed disk then 
requires a quorum reconfiguration. We need something like this in KRaft to make 
disk recovery possible.
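
One way to picture the identifier approach is sketched below. Everything here 
is hypothetical: the `Disk` class, the `directory_id` field, and the 
`committed_voters` mapping are illustration only and do not describe an 
existing KRaft API or a decided design.

```python
import uuid


class Disk:
    def __init__(self):
        # Generated once when the disk is first formatted, then stored on it.
        self.directory_id = str(uuid.uuid4())


def is_recognized_voter(committed_voters, node_id, disk):
    """A node counts as a voter only if its on-disk id matches the committed one."""
    return committed_voters.get(node_id) == disk.directory_id


disks = {"v1": Disk(), "v2": Disk(), "v3": Disk()}

# Voter set as committed to the raft log: node id -> disk identifier.
committed_voters = {node: disk.directory_id for node, disk in disks.items()}
recognized_before = is_recognized_voter(committed_voters, "v1", disks["v1"])

# v1's disk is replaced, so the node restarts with a fresh identifier.
disks["v1"] = Disk()
recognized_after_failure = is_recognized_voter(committed_voters, "v1", disks["v1"])

# Recovery: a reconfiguration, committed by the remaining majority, records
# the new identifier for v1.
committed_voters["v1"] = disks["v1"].directory_id
recognized_after_reconfig = is_recognized_voter(committed_voters, "v1", disks["v1"])
```

Until the reconfiguration commits, the node with the new identifier is not 
counted as a voter, so it cannot grant votes based on an empty log.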

--
This message was sent by Atlassian Jira
(v8.20.10#820010)