A. Sophie Blee-Goldman created KAFKA-13333:
----------------------------------------------

             Summary: Optimize condition for triggering rebalance after wiping 
out corrupted task
                 Key: KAFKA-13333
                 URL: https://issues.apache.org/jira/browse/KAFKA-13333
             Project: Kafka
          Issue Type: Improvement
          Components: streams
            Reporter: A. Sophie Blee-Goldman


Just filing a ticket to list some thoughts I had on optimizing 
https://issues.apache.org/jira/browse/KAFKA-12486. 

The idea here is to trigger a rebalance upon detecting corruption of some task. 
This task may have had a large amount of state that had to be wiped out under 
eos, so we might be able to avoid a long downtime due to restoration if we can 
utilize the HA TaskAssignor to temporarily move that active task to another 
node that has some state for it already (eg had a standby task for it).

Right now, we trigger that rebalance under the condition that (a) eos is 
enabled, and (b) at least one of the corrupted tasks was an active task. This 
is a pretty safe bet, but it's worth jotting down some potential optimizations 
of this condition so we can trim down the occurrences of unnecessary rebalances 
that wouldn't have helped. For example:

1) Don't kick off a rebalance if the corrupted task is in CREATED or RESTORING, 
and is not within the acceptable.recovery.lag from the end of the changelog. If 
the task wasn't caught up on this host but assigned to it anyway, that 
indicates there wasn't any other host with enough state for this task and 
therefore no one to temporarily take it over

2) Only trigger a rebalance if standbys are configured, and/or parse the 
standby host info to verify whether this task has a standby copy on another 
live client. It's still possible to have a copy of this task's state on another 
host even without standbys, but the odds are greatly reduced.

3) If we want to get really fancy (and I'm not quite sure we do), we could have 
the assignor report not just the names but also the lag of each standby task on 
another host, and then trigger the rebalance depending on whether this task has 
a hot standby within the acceptable.recovery.lag



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to