[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332834#comment-17332834 ]

Zhu Zhu edited comment on FLINK-17726 at 4/27/21, 1:50 AM:
-----------------------------------------------------------

I think this is a potential issue and not a real production problem yet. The problem would occur only if a task is directly canceled by the TM and no other task in the same pipelined region has failed. So far, I think this case will not happen.

was (Author: zhuzh):
I think this is a potential issue and not a real production problem yet. The problem would occur only if a task is directly canceled by the TM without failing any other task in the same pipelined region. So far, I think this case will not happen.

> Scheduler should take care of tasks directly canceled by TaskManager
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Labels: stale-critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from any state except CANCELING as failed tasks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
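The "possible solution" in the issue description above boils down to a single check on the JobManager side. The following is only a minimal sketch of that rule, not Flink's actual scheduler code: `TaskState` is a hypothetical stand-in for the execution states named in this thread, and `CanceledUpdateRule.needsFailureHandling` is an invented helper that assumes the JM knows the state it last recorded for the task.

```java
/** Hypothetical stand-in for the task states mentioned in this thread. */
enum TaskState { CREATED, SCHEDULED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

final class CanceledUpdateRule {
    private CanceledUpdateRule() {}

    /**
     * Decides whether a task state update should trigger failure handling.
     *
     * @param lastKnown state the JM had recorded for the task before the update
     * @param reported  state reported by the TaskManager
     */
    static boolean needsFailureHandling(TaskState lastKnown, TaskState reported) {
        if (reported == TaskState.FAILED) {
            return true; // regular failure, handled as today
        }
        // CANCELED is only expected as the end of a cancellation the JM asked for
        // (CANCELING -> CANCELED). Any other path into CANCELED is treated as a failure.
        return reported == TaskState.CANCELED && lastKnown != TaskState.CANCELING;
    }
}
```

With this rule, a `RUNNING` -> `CANCELED` update would be handled like a failure, while `CANCELING` -> `CANCELED` would still be ignored. Note that this is exactly the variant that later comments in the thread argue could flood the user with secondary failures.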
[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409 ]

Zhu Zhu edited comment on FLINK-17726 at 2/7/21, 8:23 AM:
----------------------------------------------------------

Thanks for relaunching this discussion and proposing a solution! [~pnowojski]

I'd like to double-check the proposal. Please correct me if I understand it incorrectly:
1. A task should not be CANCELED in the TM unless it was CANCELING. Instead, it should be transitioned into FAILED with a "secondary" failure that carries the information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling. To further improve it, JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING to CANCELED in the TM, although this is never expected to happen after the work of #1.

was (Author: zhuzh):
Thanks for relaunching this discussion and proposing a solution! [~pnowojski]

I'd like to double-check the proposal. Please correct me if I understand it incorrectly:
1. A task should not be CANCELED in the TM unless it was CANCELING. It should be transitioned into FAILED with a "secondary" failure that carries the information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling. To further improve it, JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING to CANCELED in the TM, although this is never expected to happen after the work of #1.

> Scheduler should take care of tasks directly canceled by TaskManager
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from any state except CANCELING as failed tasks.
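Points 1 and 2 of the summary above can be illustrated with a small sketch. It assumes that a failure report optionally names the task that caused it; `TaskFailureReport`, `JmAction` and `FailureClassifier` are hypothetical names made up for this illustration and are not part of Flink.

```java
import java.util.Optional;

/** Hypothetical failure report sent from the TM to the JM. */
final class TaskFailureReport {
    final String taskId;
    /** Present only for "secondary" failures: the task believed to be the root cause. */
    final Optional<String> rootCauseTaskId;

    private TaskFailureReport(String taskId, Optional<String> rootCauseTaskId) {
        this.taskId = taskId;
        this.rootCauseTaskId = rootCauseTaskId;
    }

    static TaskFailureReport primary(String taskId) {
        return new TaskFailureReport(taskId, Optional.empty());
    }

    static TaskFailureReport secondary(String taskId, String rootCauseTaskId) {
        return new TaskFailureReport(taskId, Optional.of(rootCauseTaskId));
    }

    boolean isSecondary() {
        return rootCauseTaskId.isPresent();
    }
}

/** What the JM does with a reported failure in this sketch. */
enum JmAction { TRIGGER_FAILOVER, MARK_CANCELED_AND_SKIP }

final class FailureClassifier {
    private FailureClassifier() {}

    static JmAction onFailureReported(TaskFailureReport report) {
        // Secondary failures are not root causes: record the task as CANCELED and
        // rely on the related primary failure to drive the failover.
        return report.isSecondary() ? JmAction.MARK_CANCELED_AND_SKIP : JmAction.TRIGGER_FAILOVER;
    }
}
```

For example, `FailureClassifier.onFailureReported(TaskFailureReport.secondary("B1", "A1"))` yields `MARK_CANCELED_AND_SKIP`, so only the report for A1 itself would start a failover.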
[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409 ]

Zhu Zhu edited comment on FLINK-17726 at 2/7/21, 8:23 AM:
----------------------------------------------------------

Thanks for relaunching this discussion and proposing a solution! [~pnowojski]

I'd like to double-check the proposal. Please correct me if I understand it incorrectly:
1. A task should not be CANCELED in the TM unless it was CANCELING. It should be transitioned into FAILED with a "secondary" failure that carries the information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling. To further improve it, JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING to CANCELED in the TM, although this is never expected to happen after the work of #1.

was (Author: zhuzh):
Thanks for relaunching this discussion and proposing a solution.

I'd like to double-check the proposal. Please correct me if I understand it incorrectly:
1. A task should not be CANCELED in the TM unless it was CANCELING. It should be transitioned into FAILED with a "secondary" failure that carries the information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling. To further improve it, JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING to CANCELED in the TM, although this is never expected to happen after the work of #1.

> Scheduler should take care of tasks directly canceled by TaskManager
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from any state except CANCELING as failed tasks.
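The timeout mentioned in point 2 above (failing over when the related "primary" failure never arrives) could look roughly like the sketch below. It is an illustration under assumptions only: `SecondaryFailureWatchdog` and its methods are hypothetical, time is passed in explicitly as milliseconds so that the logic stays deterministic, and a real implementation would have to hook into the JM's own timers and threading model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker for secondary failures whose primary failure is still missing. */
final class SecondaryFailureWatchdog {
    private final long timeoutMillis;
    /** Root cause task id -> deadline by which its primary failure should have arrived. */
    private final Map<String, Long> deadlineByRootCause = new HashMap<>();

    SecondaryFailureWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called when a secondary failure blames rootCauseTaskId; the first report starts the timer. */
    void onSecondaryFailure(String rootCauseTaskId, long nowMillis) {
        deadlineByRootCause.putIfAbsent(rootCauseTaskId, nowMillis + timeoutMillis);
    }

    /** Called when a primary failure for taskId arrives; the regular failover takes over. */
    void onPrimaryFailure(String taskId) {
        deadlineByRootCause.remove(taskId);
    }

    /** Returns (and forgets) root causes whose primary failure did not arrive in time. */
    List<String> expired(long nowMillis) {
        List<String> overdue = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlineByRootCause.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                overdue.add(entry.getKey());
                it.remove();
            }
        }
        return overdue;
    }
}
```

In this sketch the JM would poll `expired(now)` periodically and trigger a failover for every returned task id, instead of waiting for a heartbeat timeout.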
[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279743#comment-17279743 ]

Piotr Nowojski edited comment on FLINK-17726 at 2/5/21, 3:13 PM:
-----------------------------------------------------------------

We discovered this issue when we introduced a bug that caused a Task to switch to the `CANCELED` state incorrectly (it should have switched to `FINISHED`), which deadlocked the job. This led me to some discussion with [~trohrmann] about how to handle this kind of issue.

For me the most important part would be to avoid a regression in how we report the root/primary/real cause of the failure. Currently, switching from `RUNNING` -> `CANCELED` is a valid thing for a task to do if this is a "secondary" failure caused by an upstream/downstream task issue. This currently allows the JobManager to easily tell those "secondary" failures apart from the real failures and pick the first reported "real" failure as the root cause of the job/region failure. If we followed the solution proposed in this ticket, that is, disallowed the `RUNNING` -> `CANCELED` transition and simply treated it as a regular "primary" failure, I would expect the user to be flooded with hundreds of secondary failures. In that case it would be extremely difficult for them to figure out what has happened.

Primary example: the "real" failure is the loss of a Task Manager that was either not detected or would be detected only after the heartbeat timeout, and that caused hundreds/thousands of "secondary" failures (currently `RUNNING` -> `CANCELED` transitions).

My proposal for dealing with this situation would be to keep the distinction of the "secondary" failure, but also enrich it with the information about which task was the reason behind it. The JobManager would receive the information "Task B1 failed because something has happened to Task A1". That would let us do two things:
* If the Job Manager managed to detect some primary failure, it could ignore (or batch together) all of the secondary failures
* If no primary failure was detected, and we want to fail over the job without waiting, for example, for the heartbeat pointing to the primary failure, the Job Manager could connect the secondary failures and the DAG to deduce that something bad has happened with "Task B1"

was (Author: pnowojski):
We discovered this issue when we introduced a bug that caused a Task to switch to the `CANCELED` state incorrectly (it should have switched to `FINISHED`), which deadlocked the job. This led me to have some discussion with [~trohrmann] about a good way to handle this kind of issue.

For me the most important part would be to avoid a regression in how we report the root/primary/real cause of the failure. Currently, switching from `RUNNING` -> `CANCELED` is a valid thing for a task to do if this is a "secondary" failure caused by an upstream/downstream task issue. This currently allows the JobManager to easily tell those "secondary" failures apart from the real failures and pick the first reported "real" failure as the root cause of the job/region failure. If we followed the solution proposed in this ticket, that is, disallowed the `RUNNING` -> `CANCELED` transition and simply treated it as a regular "primary" failure, I would expect the user to be flooded with hundreds of secondary failures, which would make it extremely difficult for them to figure out what has happened.

Primary example: the "real" failure is the loss of a Task Manager that was either not detected or would be detected only after the heartbeat timeout, and that caused hundreds/thousands of "secondary" failures (currently `RUNNING` -> `CANCELED` transitions).

My proposal for dealing with this situation would be to keep the distinction of the "secondary" failure, but also enrich it with the information about which task was the reason behind it. The JobManager would receive the information "Task B1 failed because something has happened to Task A1". That would let us do two things:
* If the JobManager managed to detect some primary failure, it could ignore (or batch together) all of the secondary failures
* If no primary failure was detected, and we want to fail over the job without waiting, for example, for the heartbeat pointing to the primary failure, the Job Manager could connect the secondary failures and the DAG to deduce that something bad has happened with "Task B1"

> Scheduler should take care of tasks directly canceled by TaskManager
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.