vamossagar12 opened a new pull request, #13453:
URL: https://github.com/apache/kafka/pull/13453

   During fast consecutive rebalances in which a task is revoked from one worker 
and assigned to another, there is a small time window that creates a race 
condition: a RUNNING status record from the new generation is produced and is 
immediately followed by a delayed UNASSIGNED status record belonging to the 
same or a previous generation. This happens when the worker that sends the 
UNASSIGNED record has not yet read the RUNNING status record that corresponds 
to the latest generation.
   Although this doesn't inhibit the actual execution of tasks, it causes an 
incorrect status (i.e. UNASSIGNED) to be reported for those tasks. If users have 
set up monitoring on task status, this could lead to false alarms, for example.
   This PR aims to solve this problem by checking whether a status message is 
stale after reading it and updating the stored status only when it is safe to 
do so. Note that it reuses the existing method `canWriteSafely` to check for 
staleness, so that method can be renamed if needed (which I haven't done in 
this PR). Also, the original description of the ticket only talks about the 
RUNNING/UNASSIGNED states, but this PR should ideally help filter out all stale 
messages (which might still be infrequent but are worth handling, imo).
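
   To illustrate the idea of filtering stale status records on read, here is a 
minimal, self-contained sketch. It is not the code in this PR and not the 
Kafka Connect implementation: `StaleStatusFilter`, `TaskStatus`, 
`isSafeToApply` and `onRead` are hypothetical names, and the rule used (apply 
a record unless it comes from a different worker with a strictly older 
generation) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a status store that drops stale records on read.
class StaleStatusFilter {
    enum State { RUNNING, UNASSIGNED }

    // Illustrative status record: which worker reported which state, in which generation.
    static final class TaskStatus {
        final String taskId;
        final State state;
        final String workerId;
        final int generation;

        TaskStatus(String taskId, State state, String workerId, int generation) {
            this.taskId = taskId;
            this.state = state;
            this.workerId = workerId;
            this.generation = generation;
        }
    }

    // Latest applied status per task id.
    private final Map<String, TaskStatus> latest = new HashMap<>();

    // Assumed staleness rule: a record is safe to apply if nothing is stored yet,
    // if it comes from the same worker as the stored record, or if its generation
    // is not older than the stored one.
    boolean isSafeToApply(TaskStatus incoming) {
        TaskStatus current = latest.get(incoming.taskId);
        return current == null
                || current.workerId.equals(incoming.workerId)
                || current.generation <= incoming.generation;
    }

    // Called for each record read from the status log; returns false if it was dropped.
    boolean onRead(TaskStatus incoming) {
        if (!isSafeToApply(incoming)) {
            return false; // stale record: keep the newer status
        }
        latest.put(incoming.taskId, incoming);
        return true;
    }

    TaskStatus get(String taskId) {
        return latest.get(taskId);
    }

    public static void main(String[] args) {
        StaleStatusFilter store = new StaleStatusFilter();
        // The new owner reports RUNNING in generation 5.
        store.onRead(new TaskStatus("task-0", State.RUNNING, "worker-b", 5));
        // The previous owner's delayed UNASSIGNED from generation 4 arrives afterwards.
        store.onRead(new TaskStatus("task-0", State.UNASSIGNED, "worker-a", 4));
        System.out.println(store.get("task-0").state); // prints "RUNNING"
    }
}
```

   Note that this sketch only drops records from a strictly older generation; 
a delayed record from the same generation would still be applied under the 
assumed rule.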


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.