[ https://issues.apache.org/jira/browse/FLINK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler updated FLINK-29396:
-------------------------------------
    Priority: Critical  (was: Blocker)

> Race condition in JobMaster shutdown can leak resource requirements
> -------------------------------------------------------------------
>
>                 Key: FLINK-29396
>                 URL: https://issues.apache.org/jira/browse/FLINK-29396
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Chesnay Schepler
>            Priority: Critical
>
> When a JobMaster is stopped it
> a) sends a message to the RM informing it of the final job status
> b) removes itself as the leader.
> Once the JM loses leadership the RM is also informed about that.
> With that we have 2 messages being sent to the RM at about the same time.
> If the shutdown notification arrives first (and the job is in a terminal
> state) we wipe the resource requirements, and the leader loss notification
> is effectively ignored.
> If the leader loss notification arrives first we keep the resource
> requirements, assuming that another JM will pick the job up later on, and
> the shutdown notification will be ignored.
> This can cause a session cluster to essentially do nothing until the job
> timeout is triggered due to no leader being present (default 5 minutes).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
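
The ordering dependence described in the issue can be illustrated with a toy model. This is only a sketch, not Flink's ResourceManager code: the class `ToyResourceManager`, its method names, the job id `job-1`, and the slot count are all hypothetical. The one assumption it encodes is that whichever of the two messages arrives first drops the JM registration, so the second message is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the RM's per-job bookkeeping (not Flink's API)."""

    requirements: dict = field(default_factory=dict)  # job_id -> required slots
    registered: set = field(default_factory=set)      # jobs with a connected JM

    def register(self, job_id, slots):
        self.requirements[job_id] = slots
        self.registered.add(job_id)

    def on_job_shutdown(self, job_id, terminal):
        # Ignored if the JM connection was already dropped by a leader loss.
        if job_id not in self.registered:
            return
        self.registered.discard(job_id)
        if terminal:
            # Job reached a terminal state: wipe its resource requirements.
            self.requirements.pop(job_id, None)

    def on_leader_loss(self, job_id):
        # Ignored if the JM already disconnected via the shutdown message.
        if job_id not in self.registered:
            return
        self.registered.discard(job_id)
        # Requirements are kept on the assumption that another JM takes over.


def simulate(order):
    """Deliver the two messages in the given order; return leftover requirements."""
    rm = ToyResourceManager()
    rm.register("job-1", slots=4)
    for message in order:
        if message == "shutdown":
            rm.on_job_shutdown("job-1", terminal=True)
        elif message == "leader_loss":
            rm.on_leader_loss("job-1")
    return rm.requirements


if __name__ == "__main__":
    print(simulate(["shutdown", "leader_loss"]))  # {}
    print(simulate(["leader_loss", "shutdown"]))  # {'job-1': 4}
```

In the second ordering the requirements for a finished job stay behind, which corresponds to the leak the issue reports: in this model nothing clears them until some separate timeout does.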