Chris Egerton created KAFKA-14091:
-------------------------------------

             Summary: Suddenly-killed tasks can leave hanging transactions open
                 Key: KAFKA-14091
                 URL: https://issues.apache.org/jira/browse/KAFKA-14091
             Project: Kafka
          Issue Type: Improvement
          Components: KafkaConnect
            Reporter: Chris Egerton


Right now, if a task running with exactly-once support is killed ungracefully, 
it may leave a hanging transaction open. If that transaction included writes to 
the offsets topic, then startup of any subsequent worker is blocked until the 
transaction expires.
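
To illustrate why startup blocks, here is a toy model (not Kafka code) of how 
an open transaction holds back a reader. It assumes the worker reads the 
offsets topic with read-committed isolation, so it cannot read past the first 
offset of the earliest still-open transaction (the last stable offset). All 
class and method names are hypothetical.

```java
import java.util.Map;

// Toy model only; none of these names exist in Kafka.
final class HangingTransactionModel {

    // Last stable offset: the first offset of the earliest open transaction,
    // or the log end offset if no transaction is open.
    // ASSUMPTION: the reader uses read-committed isolation.
    static long lastStableOffset(long logEndOffset,
                                 Map<String, Long> openTransactionFirstOffsets) {
        long lso = logEndOffset;
        for (long firstOffset : openTransactionFirstOffsets.values()) {
            lso = Math.min(lso, firstOffset);
        }
        return lso;
    }

    // A worker that must read to the end of the topic can only finish once
    // the last stable offset has caught up with the log end offset.
    static boolean canReadToEnd(long logEndOffset,
                                Map<String, Long> openTransactionFirstOffsets) {
        return lastStableOffset(logEndOffset, openTransactionFirstOffsets) >= logEndOffset;
    }
}
```

In this model, a single hanging transaction that started at offset 42 keeps 
the reader at offset 42 no matter how much committed data follows it.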

Ideally, we could identify these kinds of hanging transactions and proactively 
abort them.

Unfortunately, two facts make this fairly complicated:
 # Workers read to the end of the offsets topic during startup, before joining 
the cluster
 # Workers do not know which tasks they are assigned until they join the cluster

As a result, we cannot rely on workers that are restarted shortly after an 
ungraceful shutdown to fence out their own hanging transactions: any hanging 
transaction would prevent them from joining the group and receiving their task 
assignment in the first place.

One possible approach is to have the leader proactively abort any open 
transactions for tasks on workers that appear to have left the cluster during 
a rebalance. This would not need to wait for the scheduled rebalance delay to 
elapse. The intent of that delay is to provide a buffer between when workers 
leave and when their connectors/tasks are reallocated across the cluster (and, 
if a worker is able to rejoin before that buffer is consumed, to give it back 
the same connectors/tasks it was running previously); aborting transactions 
for tasks on these workers would not interfere with that goal.
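
A minimal sketch of how the leader might work out which transactional IDs to 
abort, assuming it still holds the previous assignment and the current member 
list at rebalance time. The class and method names are hypothetical, and the 
"connect-cluster-<groupId>-<task>" ID format is an assumption about how task 
producers are named. The resulting IDs could then be handed to something like 
Admin#fenceProducers, which is not called here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of the Connect runtime.
final class DepartedTaskFencing {

    // ASSUMPTION: task producers use this transactional ID format, where
    // taskName is "<connector>-<taskIndex>".
    static String transactionalId(String groupId, String taskName) {
        return "connect-cluster-" + groupId + "-" + taskName;
    }

    // Collects transactional IDs for every task that was assigned to a
    // worker which is no longer a member of the group.
    static List<String> idsToAbort(String groupId,
                                   Map<String, List<String>> previousAssignment,
                                   Set<String> currentMembers) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : previousAssignment.entrySet()) {
            if (currentMembers.contains(entry.getKey())) {
                continue; // worker is still present; leave its tasks alone
            }
            for (String taskName : entry.getValue()) {
                ids.add(transactionalId(groupId, taskName));
            }
        }
        Collections.sort(ids); // deterministic order for logging and testing
        return ids;
    }
}
```

Because this only looks at membership, it can run as soon as a worker is seen 
to be missing, independently of when (or whether) its tasks are reassigned.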

 

We may also have to handle the case where a 
[cancelled|https://github.com/apache/kafka/blob/badfbacdd09a9ee8821847f4b28d98625f354ed7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L274-L287]
 task leaves a transaction open; I have yet to confirm whether this can 
actually happen, though.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
