Jack Vanlightly created KAFKA-16281:
---------------------------------------

             Summary: Possible IllegalState with KIP-996
                 Key: KAFKA-16281
                 URL: https://issues.apache.org/jira/browse/KAFKA-16281
             Project: Kafka
          Issue Type: Task
          Components: kraft
            Reporter: Jack Vanlightly


I have a TLA+ model of KIP-996 (pre-vote) and I have identified an IllegalState 
exception that would occur with the existing MaybeHandleCommonResponse behavior.

The issue stems from the fact that a leader, let's call it r1, can resign 
(either due to a restart or a check quorum failure) and then later initiate a 
pre-vote in the same epoch as before, but with a cleared local leader id, 
because the transition to Prospective clears it. When r1 then receives a 
response from r2, which believes that r1 is still the leader, the logic in 
MaybeHandleCommonResponse tries to transition r1 to follower of itself, 
causing an IllegalState exception to be raised.

This is an example history:
 # r1 is the leader in epoch 1.
 # r1 resigns because of check quorum, or restarts and resigns.
 # r1 experiences an election timeout and transitions to Prospective, clearing 
its local leader id.
 # r1 sends a pre-vote request to its peers.
 # r2 thinks r1 is still the leader and sends a vote response that does not 
grant its vote and sets leaderId=r1 and epoch=1.
 # r1 receives the vote response and executes MaybeHandleCommonResponse, which 
tries to transition r1 to Follower of itself, and an illegal state occurs.

The relevant else if statement in MaybeHandleCommonResponse is here: 
https://github.com/apache/kafka/blob/a26a1d847f1884a519561e7a4fb4cd13e051c824/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1538

In the TLA+ specification, I fixed this issue by adding a fourth condition to 
this statement, that the leaderId also does not equal this server's id. 
[https://github.com/Vanlightly/kafka-tlaplus/blob/9b2600d1cd5c65930d666b12792d47362b64c015/kraft/kip_996/kraft_kip_996_functions.tla#L336]
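
The effect of the extra condition can be sketched in Java. This is a simplified, hypothetical model and not the actual KafkaRaftClient code: the class name, the method names, and the exact form of the three existing conditions (same epoch, response carries a leader id, no locally known leader) are assumptions made for illustration. Only the fourth condition is taken from the description above.

```java
import java.util.OptionalInt;

// Hypothetical, simplified model of the same-epoch branch described in this
// report. It is NOT the real KafkaRaftClient logic; names are illustrative.
final class PreVoteGuardSketch {

    // Assumed shape of the existing three conditions: the response is for the
    // local epoch, it names a leader, and no leader is known locally.
    static boolean followsLeaderUnguarded(int localEpoch,
                                          OptionalInt localLeaderId,
                                          int responseEpoch,
                                          OptionalInt responseLeaderId) {
        return responseEpoch == localEpoch
            && responseLeaderId.isPresent()
            && !localLeaderId.isPresent();
    }

    // Same check plus the fourth condition from the TLA+ fix: the leader id
    // in the response must not be this server's own id.
    static boolean followsLeaderGuarded(int localId,
                                        int localEpoch,
                                        OptionalInt localLeaderId,
                                        int responseEpoch,
                                        OptionalInt responseLeaderId) {
        return followsLeaderUnguarded(localEpoch, localLeaderId,
                                      responseEpoch, responseLeaderId)
            && responseLeaderId.getAsInt() != localId;
    }

    public static void main(String[] args) {
        int r1 = 1;                                    // local server id
        int epoch = 1;                                 // r1 was leader in epoch 1
        OptionalInt cleared = OptionalInt.empty();     // leader id cleared on Prospective
        OptionalInt leaderIsR1 = OptionalInt.of(r1);   // r2 still reports r1 as leader

        System.out.println("unguarded would follow self: "
            + followsLeaderUnguarded(epoch, cleared, epoch, leaderIsR1));
        System.out.println("guarded would follow self: "
            + followsLeaderGuarded(r1, epoch, cleared, epoch, leaderIsR1));
    }
}
```

In this model the unguarded check accepts the response from r2 in the history above, which corresponds to the attempted transition to follower of itself, while the guarded check rejects it and still accepts a response naming a different server as leader.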

We should probably create a test to confirm the issue first and then look at 
applying the fix I made in the TLA+ specification, though there may be other 
options.



