Jason Gustafson created KAFKA-7408:
--------------------------------------

             Summary: Truncate to LSO on unclean leader election
                 Key: KAFKA-7408
                 URL: https://issues.apache.org/jira/browse/KAFKA-7408
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


If an unclean leader is elected, we may lose committed transaction data. That 
alone is expected, but what is worse is that a transaction which was previously 
completed (either committed or aborted) may lose its marker and become 
dangling. The transaction coordinator will not know about the unclean leader 
election, so it will not know to resend the transaction markers. Consumers with 
read_committed isolation will then be stuck because the LSO (last stable 
offset) cannot advance.
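
To make the stuck-consumer problem concrete, here is a minimal sketch of how 
the LSO follows from the log contents. It is a simplified model, not broker 
code: the Record type and last_stable_offset function are hypothetical names, 
and every record is assumed to be transactional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    producer_id: int
    marker: Optional[str] = None  # None = data; "COMMIT"/"ABORT" = transaction marker


def last_stable_offset(log: List[Record], high_watermark: int) -> int:
    """First offset of the earliest open transaction below the high
    watermark, or the high watermark itself if no transaction is open."""
    open_txn_start = {}  # producer_id -> first offset of its open transaction
    for rec in log:
        if rec.offset >= high_watermark:
            break
        if rec.marker is None:
            open_txn_start.setdefault(rec.producer_id, rec.offset)
        else:
            # A marker completes the producer's open transaction.
            open_txn_start.pop(rec.producer_id, None)
    return min(open_txn_start.values(), default=high_watermark)


# Old leader: producer 1 wrote two records, then its COMMIT marker landed.
old_leader_log = [Record(0, 1), Record(1, 1), Record(2, 1, "COMMIT")]
lso_before = last_stable_offset(old_leader_log, high_watermark=3)  # 3

# Unclean leader: it never replicated the marker at offset 2.
unclean_leader_log = old_leader_log[:2]
lso_after = last_stable_offset(unclean_leader_log, high_watermark=2)  # 0
```

In the second log the LSO stays at 0 until a marker for producer 1 arrives. 
Since the coordinator considers the transaction complete, no such marker is 
sent, and that is the dangling transaction described above.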

To keep this scenario from occurring, it would be better to have the unclean 
leader truncate to the LSO so that there are no dangling transactions. 
Truncating to the LSO is not sufficient on its own, because the markers which 
allowed the LSO to advance may be at higher offsets. What we can do is let the 
newly elected leader truncate to the LSO and then rewrite all the markers that 
followed it using its own leader epoch (to avoid divergence from followers).
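
The truncate-then-rewrite step can be sketched as follows. This is only an 
illustration of the idea under stated assumptions, not the proposed 
implementation: Record, stable_offset and truncate_and_rewrite_markers are 
hypothetical names, every record is assumed to be transactional, and the log 
end offset of the unclean leader stands in for the upper bound of the LSO.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    leader_epoch: int
    producer_id: int
    marker: Optional[str] = None  # None = data; "COMMIT"/"ABORT" = transaction marker


def stable_offset(log: List[Record]) -> int:
    """First offset of the earliest open transaction, else the log end offset."""
    log_end = log[-1].offset + 1 if log else 0
    open_txn_start = {}  # producer_id -> first offset of its open transaction
    for rec in log:
        if rec.marker is None:
            open_txn_start.setdefault(rec.producer_id, rec.offset)
        else:
            open_txn_start.pop(rec.producer_id, None)
    return min(open_txn_start.values(), default=log_end)


def truncate_and_rewrite_markers(log: List[Record], new_leader_epoch: int) -> List[Record]:
    """Drop everything at or above the LSO, then re-append the markers that
    were dropped, stamped with the new leader's epoch."""
    lso = stable_offset(log)
    kept = [rec for rec in log if rec.offset < lso]
    lost_markers = [rec for rec in log if rec.offset >= lso and rec.marker is not None]
    for next_offset, m in enumerate(lost_markers, start=lso):
        kept.append(Record(next_offset, new_leader_epoch, m.producer_id, m.marker))
    return kept


# Producer 1 committed at offset 3, but producer 2 has been open since
# offset 1, so the LSO is 1 and the marker sits above it.
log = [
    Record(0, 1, producer_id=1),
    Record(1, 1, producer_id=2),
    Record(2, 1, producer_id=1),
    Record(3, 1, producer_id=1, marker="COMMIT"),
    Record(4, 1, producer_id=2),
]
new_log = truncate_and_rewrite_markers(log, new_leader_epoch=2)
```

Truncation alone would keep only offset 0 and leave producer 1's transaction 
dangling. Re-appending its COMMIT marker at offset 1 under leader epoch 2 
closes that transaction, while producer 2's open transaction is removed in 
whole.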

The interesting cases are those in which a transaction is ongoing when the 
unclean leader election occurs. 

1. If a producer is in the middle of a transaction commit, then the coordinator 
may still attempt to write transaction markers. This will either succeed or 
fail depending on the producer epoch in the unclean leader. If the epoch 
matches, then the WriteTxnMarker call will succeed, and the resulting marker 
will simply be ignored by the consumer. If the epoch doesn't match, the 
WriteTxnMarker call will fail and the transaction coordinator can potentially 
remove the partition from the transaction.

2. If a producer is still writing the transaction, then what happens depends on 
the producer state in the unclean leader. If no producer state has been lost, 
then the transaction can continue without impact. Otherwise, the producer will 
likely fail with an OUT_OF_ORDER_SEQUENCE error, which will cause the 
transaction to be aborted by the coordinator. That takes us back to the first 
case.
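
The second case can be illustrated with a small model of the sequence check. 
This is a sketch under simplifying assumptions, not the broker's validation 
logic: has_sequence_gap is a hypothetical helper, and duplicate detection and 
sequence-number wraparound are not modeled.

```python
def has_sequence_gap(last_appended_seq: int, incoming_first_seq: int) -> bool:
    """True if the incoming batch skips past the sequence number the leader
    expects next, which is the condition that the issue expects to surface
    as OUT_OF_ORDER_SEQUENCE."""
    return incoming_first_seq > last_appended_seq + 1


# The producer has had sequences 0..9 acknowledged and sends 10 next.
next_seq = 10

# No producer state lost: the unclean leader also has 0..9.
gap_without_loss = has_sequence_gap(last_appended_seq=9, incoming_first_seq=next_seq)

# Producer state lost: the unclean leader only replicated 0..6.
gap_with_loss = has_sequence_gap(last_appended_seq=6, incoming_first_seq=next_seq)
```

Here gap_without_loss is False, so the transaction continues, while 
gap_with_loss is True, so the write fails and the transaction ends up being 
aborted as described in the first case.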

By truncating to the LSO, we ensure that transactions are either preserved in 
whole or removed from the log in whole. For an unclean leader election, that's 
probably as good as we can do. But we are assured that consumers will not be 
blocked by dangling transactions. The only remaining situation where a 
dangling transaction might be left is if one of the transaction state 
partitions has an unclean leader election.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
