[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332834#comment-17332834 ]

Zhu Zhu commented on FLINK-17726:

I think this is a potential issue and not a real production problem yet. The problem would happen only if a task is directly cancelled by the TM without failing any other task in the same pipelined region. So far I think this case will not happen.

> Scheduler should take care of tasks directly canceled by TaskManager
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Labels: stale-critical
>
> JobManager will not trigger failure handling when receiving CANCELED task update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to CANCELED from all states except CANCELING as failed tasks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
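The "possible solution" in the quoted description can be sketched as a small decision rule on the JobManager side. This is only an illustration under assumptions: `SketchState` and `CanceledUpdateRule` are hypothetical names, not Flink classes, and the enum lists only the states mentioned in this thread.

```java
// Hypothetical stand-in for the task states discussed in the ticket.
enum SketchState { DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

// Hypothetical helper, not part of Flink: decides whether a task state update
// received by the JobManager should start failure handling.
final class CanceledUpdateRule {
    private CanceledUpdateRule() {}

    /**
     * @param stateOnJm the state the JobManager currently tracks for the task
     * @param reported  the state reported by the TaskManager
     */
    static boolean needsFailureHandling(SketchState stateOnJm, SketchState reported) {
        if (reported == SketchState.FAILED) {
            return true; // regular failure, already handled today
        }
        // CANCELED is expected only if the JobManager asked for the cancellation,
        // i.e. the task was CANCELING on the JobManager side. Anything else is
        // treated like a failure so that the task gets recovered.
        return reported == SketchState.CANCELED && stateOnJm != SketchState.CANCELING;
    }
}
```

With this rule a RUNNING task that reports CANCELED would trigger failure handling, while a CANCELING task that reports CANCELED would not.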
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332430#comment-17332430 ]

Till Rohrmann commented on FLINK-17726:

[~zhuzh] do you remember how you stumbled upon this problem? Did a user report a problem with it?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Labels: stale-critical
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329039#comment-17329039 ]

Flink Jira Bot commented on FLINK-17726:

This critical issue is unassigned and itself and all of its Sub-Tasks have not been updated for 7 days. So, it has been labeled "stale-critical". If this ticket is indeed critical, please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Labels: stale-critical
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280870#comment-17280870 ]

Piotr Nowojski commented on FLINK-17726:

Yes/maybe [~zhuzh]. I think you summarised the gist of the idea correctly. However, there is one potential improvement:
{quote}
For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling.
{quote}
Maybe not in the first version, or maybe already in the first version, [~trohrmann] would like to tackle the problem of speeding up failover, so that we do not have to wait for the primary failure to arrive. If JM already knows that some tasks started to fail (with secondary failures), it can already fail over the job/region, instead of waiting, for example, 1 minute for the heartbeats to time out.

One thing that is not clear to me is how to detect the primary failure in such a case. Maybe we would need to fail over the job but still keep collecting the failure reasons for the previous attempt, and keep updating the detected root cause lazily? For example, if we have a chain of 4 tasks A->B->C->D, maybe the TaskManager handling A will fail silently, but the first error message JM receives will be from D, then a second later from C, then a second later from B, and 1 minute later a timeout of A.

Also note that we don't have any pressure at the moment to fix this right now.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
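The idea of updating the detected root cause lazily can be illustrated with a tiny holder that accepts the first report as a provisional root cause and replaces it once a primary failure arrives. This is a sketch under assumptions only: `LazyRootCause` and its `Kind` enum are hypothetical, and treating the heartbeat timeout of A as a "primary" report is an assumption made for the example.

```java
// Hypothetical sketch, not Flink code: keeps the best known root cause of one
// failover attempt and upgrades it when better information arrives late.
final class LazyRootCause {
    enum Kind { SECONDARY, PRIMARY }

    private String task; // task currently believed to be the root cause
    private Kind kind;   // null until the first report arrives

    void report(String reportingTask, Kind reportedKind) {
        boolean nothingYet = kind == null;
        boolean upgrade = kind == Kind.SECONDARY && reportedKind == Kind.PRIMARY;
        if (nothingYet || upgrade) {
            task = reportingTask;
            kind = reportedKind;
        }
        // Later secondary reports, and primaries after the first primary, are ignored.
    }

    String currentRootCause() {
        return task;
    }

    boolean isConfirmed() {
        return kind == Kind.PRIMARY;
    }
}
```

For the A->B->C->D chain above, reports from D, C and B would leave D as the provisional cause, and the later timeout of A would replace it.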
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409 ]

Zhu Zhu commented on FLINK-17726:

Thanks for relaunching this discussion and proposing a solution. I'd like to double-check the proposal. Please correct me if I understand it incorrectly:
1. A task should not become CANCELED in TM unless it was CANCELING. It should be transitioned into FAILED with a "secondary" failure that carries the information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary failures. For "secondary" failures, given that the related "primary" failure should always be reported sooner or later, JM can simply mark the task as CANCELED and skip the failure handling. To further improve it, JM can register a timeout on secondary failures in case the related "primary" failure is not reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING to CANCELED in TM, although this is never expected to happen after the work of #1.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
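The timeout mentioned in point 2 can be sketched as follows. This is a minimal illustration, not a design: `SecondaryFailureTimeout` is a hypothetical class, time is passed in explicitly as milliseconds so the logic stays deterministic, and the timeout value is an arbitrary parameter.

```java
// Hypothetical sketch, not Flink code: decides when the JobManager should
// stop waiting for a "primary" failure after it has seen "secondary" ones.
final class SecondaryFailureTimeout {
    private final long timeoutMillis;
    private long firstSecondaryAt = -1; // -1 means no secondary failure seen yet
    private boolean primarySeen;

    SecondaryFailureTimeout(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onSecondaryFailure(long nowMillis) {
        if (firstSecondaryAt < 0) {
            firstSecondaryAt = nowMillis; // the timer starts at the first secondary failure
        }
    }

    void onPrimaryFailure() {
        primarySeen = true;
    }

    boolean shouldTriggerFailover(long nowMillis) {
        if (primarySeen) {
            return true; // primary failures always trigger failover
        }
        // Otherwise fail over only once the primary failure is overdue.
        return firstSecondaryAt >= 0 && nowMillis - firstSecondaryAt >= timeoutMillis;
    }
}
```

A short timeout speeds up recovery at the price of sometimes failing over before the root cause is known; a long one behaves like waiting for the heartbeat timeout.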
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279743#comment-17279743 ]

Piotr Nowojski commented on FLINK-17726:

We discovered this issue when we introduced a bug that caused a Task to switch to the `CANCELED` state incorrectly (it should have switched to `FINISHED`) and deadlocked the job. This led me to have some discussion with [~trohrmann] about a good way to handle this kind of issue.

For me the most important part would be to not have a regression in the way we report the root/primary/real cause of the failure. Currently, switching from `RUNNING` -> `CANCELED` is a valid thing for the task to do if this is a "secondary" failure caused by an upstream/downstream task issue. This currently allows JobManager to easily tell those "secondary" failures apart from the real failures and pick the first reported "real" failure as the root cause of the job/region failure. If we followed the solution proposed in this ticket, to not allow the `RUNNING` -> `CANCELED` transition but simply treat it as a regular "primary" failure, I would expect the user to be flooded with hundreds of secondary failures, which would make it extremely difficult for them to figure out what has happened. Primary example: the "real" failure is the loss of a Task Manager that was either not detected, or would be detected only after the heartbeat timeout, which caused hundreds/thousands of "secondary" failures (currently `RUNNING` -> `CANCELED` transitions).

My proposal for dealing with this situation would be to keep the distinction of the "secondary" failure, but also enrich it with the information about which task was the reason behind it. JobManager would receive the information "Task B1 failed because something has happened to Task A1". That would let us do two things:
* If JobManager managed to detect some primary failure, it could ignore (or batch together) all of the secondary failures
* If no primary failure was detected, and we want to fail over the job without waiting, for example, for the heartbeat pointing to the primary failure, Job Manager could connect the secondary failures and the DAG to deduce that something bad has happened with the "Task B1"

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
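One way to picture the enriched "secondary" failure is a log on the JobManager side that records which task each secondary failure blames. The sketch below is hypothetical throughout (`SecondaryFailureLog` and its methods are not Flink APIs) and assumes tasks are identified by plain strings; it deduces suspects as tasks that are blamed by others but have not reported anything themselves.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Flink code: collects failure reports of one
// failover attempt and deduces which silent tasks are likely root causes.
final class SecondaryFailureLog {
    private final Set<String> primaryFailed = new HashSet<>();
    private final Set<String> secondaryReporters = new HashSet<>();
    private final Set<String> blamed = new HashSet<>();

    void onPrimaryFailure(String task) {
        primaryFailed.add(task);
    }

    // "task failed because something has happened to causedBy"
    void onSecondaryFailure(String task, String causedBy) {
        secondaryReporters.add(task);
        blamed.add(causedBy);
    }

    // If true, secondary failures can be ignored or batched together.
    boolean hasPrimaryFailure() {
        return !primaryFailed.isEmpty();
    }

    // Tasks that others blame but that reported neither a primary nor a
    // secondary failure themselves, for example because their TaskManager
    // was lost and the heartbeat has not timed out yet.
    Set<String> suspectedRootCauses() {
        Set<String> suspects = new TreeSet<>(blamed);
        suspects.removeAll(secondaryReporters);
        suspects.removeAll(primaryFailed);
        return suspects;
    }
}
```

If C1 blames B1 and B1 blames A1, the only suspect is A1; once A1 reports a primary failure, the suspect set becomes empty and the secondary failures can be ignored.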
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232653#comment-17232653 ]

Till Rohrmann commented on FLINK-17726:

I think we first need a design because there were still some open questions about how to consolidate root cause messages which arrive out of order or late at the JM. Hence, I would suggest not including it in the {{1.12.0}} release.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Fix For: 1.12.0, 1.11.3, 1.13.0
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232595#comment-17232595 ]

Yuan Mei commented on FLINK-17726:

I double-checked with [~nicholasjiang]; he said he would finish this ticket by the end of this week.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Fix For: 1.12.0, 1.11.3, 1.13.0
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228482#comment-17228482 ]

Till Rohrmann commented on FLINK-17726:

I guess this ticket won't make it into the {{1.12.0}} release, right [~nicholasjiang]?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.3
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210780#comment-17210780 ]

Zhu Zhu commented on FLINK-17726:

Hi [~nicholasjiang], are there any updates? Or would you still like to work on this item?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.3
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170531#comment-17170531 ]

Zhu Zhu commented on FLINK-17726:

Hi [~nicholasjiang], are there any updates on the design?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.2
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125653#comment-17125653 ]

Zhu Zhu commented on FLINK-17726:

Thanks for the updates, [~nicholasjiang]. Looking forward to your design.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.1
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125632#comment-17125632 ]

Nicholas Jiang commented on FLINK-17726:

I have already discussed this with [~trohrmann] offline. Considering that I have completed the other issues I claimed, I would like to write a design for this in the next few days and discuss it with you, [~zhuzh]. Thanks to [~zhuzh] and [~trohrmann] for explaining the solution.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.1
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125588#comment-17125588 ]

Zhu Zhu commented on FLINK-17726:

I think it should be that in TM we do not allow `transitionState` from a non-CANCELING state to CANCELED. For example, `PartitionProducerStateResponseHandle#cancelConsumption()` should also be covered even though it does not throw `CancelTaskException`. However, given that there are still open questions about "how to consolidate root cause messages which arrive out of order or late at the JobManager", I think the design is not finalized.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.1
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125565#comment-17125565 ]

Nicholas Jiang commented on FLINK-17726:

In general, [~zhuzh] [~trohrmann], the solution could be to transition the state into FAILED when a CancelTaskException occurs and the current Task state is not CANCELING, am I right?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.12.0, 1.11.1
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124912#comment-17124912 ]

Till Rohrmann commented on FLINK-17726:

After an offline discussion with [~zhuzh] we agreed not to do this feature for the {{1.11.0}} release because there are still open questions about how to consolidate root cause messages which arrive out of order or late at the {{JobManager}}.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.11.0, 1.12.0
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124808#comment-17124808 ]

Till Rohrmann commented on FLINK-17726:

Sorry for my late reply. I think we should handle every state transition to {{CANCELED}} from a state other than {{CANCELING}} on the {{JobMaster}} as a failure. This effectively means that the {{JobMaster}} must initiate the cancellation in one form or another.

If we wanted the {{Task}} to be smart and to initiate the cancellation, then it would have to be sure that there is another {{Task}} which reported a failure back to the {{JobMaster}}. I think in the general case this is very hard to guarantee (it would be ok only if the other task sent a message saying that it had successfully transmitted this state transition to the {{JobMaster}}). Consequently, if there is the situation of {{A1 -> B1}} and {{A1}} fails and {{B1}} realizes it, then {{B1}} cannot be sure that {{A1}} could update the {{JM}} and has to fail. Only if the {{JM}} sent the cancellation request does {{B1}} know that the failure of {{A1}} has been successfully reported and that it can cancel.

One way to ensure this contract on the {{Task}} side could be to only allow the state transition from {{CANCELING}} to {{CANCELED}}. Concretely, this means that we transition into {{FAILED}} if we see a {{CancelTaskException}} while the current {{Task}} state is not {{CANCELING}}. Does this make sense?

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.11.0, 1.12.0
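The Task-side contract described in the last paragraph could look roughly like the sketch below. It is an illustration under assumptions: `TaskSideRule`, its `State` enum and `CancelSignal` are hypothetical stand-ins (the latter for {{CancelTaskException}}), and the choice that any error seen while CANCELING ends in CANCELED is an assumption of this sketch.

```java
// Hypothetical sketch, not Flink code: picks the terminal state of a task
// whose invocation ended with an error.
final class TaskSideRule {
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    // Stand-in for the cancel exception discussed in the thread.
    static final class CancelSignal extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    private TaskSideRule() {}

    static State terminalStateFor(State current, Throwable error) {
        if (current == State.CANCELING) {
            // The JobMaster asked for the cancellation, so CANCELED is allowed.
            return State.CANCELED;
        }
        // Not CANCELING: even a cancel signal must surface as FAILED, because the
        // task cannot know whether the JobMaster has heard about the real failure.
        return State.FAILED;
    }
}
```

Under this rule `terminalStateFor(State.RUNNING, new CancelSignal())` yields FAILED, so the JobMaster always sees a failure it can react to.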
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123422#comment-17123422 ]

Nicholas Jiang commented on FLINK-17726:

[~zhuzh] Thanks for your explanation. This then needs [~trohrmann] to confirm whether to use the implementation that marks directly CANCELED tasks with a dedicated exception.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.11.0, 1.12.0
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121514#comment-17121514 ]

Zhu Zhu commented on FLINK-17726:

[~nicholasjiang] What I mean is that not all kinds of directly CANCELED tasks should trigger failovers. There might be directly CANCELED tasks that were not caused by FAILED/CANCELED upstream tasks. We still need to trigger failovers on this kind of directly CANCELED task, otherwise they would stay in CANCELED forever. A dedicated exception means the exception would only be thrown in this specific case. It does not affect error handling of other cases.

> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Assignee: Nicholas Jiang
> Priority: Critical
> Fix For: 1.11.0, 1.12.0
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121477#comment-17121477 ] Nicholas Jiang commented on FLINK-17726: [~zhuzh] I agree that directly CANCELED tasks should not trigger failover, and that failover should be triggered only for tasks in the FAILED state. However, whether to mark directly CANCELED tasks with a dedicated exception is not yet confirmed; I am concerned that, with this marking, failover might not be triggered in all cases. I lean towards using a new state such as DIRECT_CANCELED to handle this case. Please point out any problem with my understanding.
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120997#comment-17120997 ] Zhu Zhu commented on FLINK-17726: - I just thought of a case that might be problematic with the proposed change. Imagine a job {A1 -> B1} in which A1 and B1 are both running. Later A1 fails and B1 is CANCELED due to A1's failure. However, the CANCELED state of B1 is reported earlier than the FAILED state of A1. If we trigger a failover on receiving the directly CANCELED state of B1 and start canceling A1, the failure cause of A1 will be discarded because it will not be treated as the root failure. Maybe we should mark this kind of directly CANCELED task with a dedicated exception and not trigger failover on it on the JM side. [~trohrmann] [~nicholasjiang] WDYT?
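The ordering problem described in the comment above can be replayed with a small sketch. It is not Flink code: MiniJobManager, State, onUpdate and the markedAsUpstreamCaused flag are hypothetical names made up for illustration, and the model assumes that a FAILED report from a task that is already CANCELING is not treated as a root failure, as the comment describes.

```java
import java.util.HashMap;
import java.util.Map;

class RootCauseRace {
    // Hypothetical, simplified task states; not Flink's own enum.
    enum State { RUNNING, CANCELING, CANCELED, FAILED }

    /** Hypothetical stand-in for the JobManager-side bookkeeping. */
    static final class MiniJobManager {
        final Map<String, State> states = new HashMap<>();
        final boolean ignoreMarkedCancellations;
        String rootCause; // cause of the first update that triggered a failover, or null

        MiniJobManager(boolean ignoreMarkedCancellations, String... tasks) {
            this.ignoreMarkedCancellations = ignoreMarkedCancellations;
            for (String task : tasks) {
                states.put(task, State.RUNNING);
            }
        }

        void onUpdate(String task, State reported, String cause, boolean markedAsUpstreamCaused) {
            State previous = states.put(task, reported);
            // "Directly" canceled: CANCELED was reached without the JM asking for it.
            boolean directlyCanceled = reported == State.CANCELED && previous != State.CANCELING;
            // Assumption: a FAILED report from a CANCELING task is not a root failure.
            boolean failedOnItsOwn = reported == State.FAILED && previous != State.CANCELING;
            boolean skip = directlyCanceled && markedAsUpstreamCaused && ignoreMarkedCancellations;
            if ((directlyCanceled || failedOnItsOwn) && !skip && rootCause == null) {
                rootCause = cause;
                // Failover: start canceling every task that is still running.
                states.replaceAll((t, s) -> s == State.RUNNING ? State.CANCELING : s);
            }
        }
    }

    /** Replays the scenario: B1's CANCELED report arrives before A1's FAILED report. */
    static String replay(boolean ignoreMarkedCancellations) {
        MiniJobManager jm = new MiniJobManager(ignoreMarkedCancellations, "A1", "B1");
        jm.onUpdate("B1", State.CANCELED, "B1 canceled by TM", true);
        jm.onUpdate("A1", State.FAILED, "A1: original exception", false);
        return jm.rootCause;
    }

    public static void main(String[] args) {
        System.out.println("failover on every direct CANCELED: " + replay(false));
        System.out.println("marked cancellations ignored:      " + replay(true));
    }
}
```

In this model, triggering failover on every directly CANCELED report records B1's cancellation as the root cause and drops A1's exception, while ignoring cancellations that carry the marker lets A1's failure become the root cause.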
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120883#comment-17120883 ] Nicholas Jiang commented on FLINK-17726: [~zhuzh] I have double-checked the methods that trigger the cancel-task operation: cancelOrFailAndCancelInvokable and cancelExecution, which are based on TaskCanceler; cancelInvokable, which is based on the invokable's cancel method; and the callers of the transitionState method. After checking again, I found no case in which a task becomes directly CANCELED because its upstream task was canceled/failed. IMO, the solution would be to change tasks that transition to CANCELED from any state except CANCELING to the FAILED status, the same as the solution you mentioned. What do you think?
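The rule proposed in the issue description and repeated in the comment above (treat a transition to CANCELED from any state except CANCELING as a failure) can be written down as a minimal sketch. TaskState and shouldTreatAsFailure are hypothetical names used only for illustration; they are not taken from the Flink code base, and the state list is simplified.

```java
class CanceledUpdateRule {
    // Hypothetical, simplified task states; not Flink's own enum.
    enum TaskState { CREATED, SCHEDULED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    /**
     * Returns true if the reported state should start failure handling:
     * either the task FAILED, or it reached CANCELED without the
     * JobManager having asked for it (previous state was not CANCELING).
     */
    static boolean shouldTreatAsFailure(TaskState previous, TaskState reported) {
        if (reported == TaskState.FAILED) {
            return true;
        }
        return reported == TaskState.CANCELED && previous != TaskState.CANCELING;
    }

    public static void main(String[] args) {
        // Directly canceled by the TaskManager while running: handled as a failure.
        System.out.println(shouldTreatAsFailure(TaskState.RUNNING, TaskState.CANCELED));
        // Canceled on request of the JobManager: not a failure.
        System.out.println(shouldTreatAsFailure(TaskState.CANCELING, TaskState.CANCELED));
    }
}
```

The sketch only classifies a single update; it does not address the reporting-order concern raised elsewhere in this thread.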
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120725#comment-17120725 ] Zhu Zhu commented on FLINK-17726: - [~nicholasjiang] What's the current state? Is there any problem with opening the PR?
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119326#comment-17119326 ] Zhu Zhu commented on FLINK-17726: - Thanks for the updates [~nicholasjiang]. Please ping me once the PR is ready.
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119166#comment-17119166 ] Nicholas Jiang commented on FLINK-17726: Hi [~trohrmann], I previously double-checked whether a directly CANCELED task can happen in the case [~zhuzh] mentioned; sorry for not updating the state of this issue in time. I would like to open the pull request today and sync the state with [~zhuzh].
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118408#comment-17118408 ] Till Rohrmann commented on FLINK-17726: --- Hi [~nicholasjiang], what's the state of this issue?
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113820#comment-17113820 ] Zhu Zhu commented on FLINK-17726: - [~nicholasjiang] I have assigned the ticket to you. But before starting to apply the changes, could you help double-check whether a directly CANCELED task can happen in the case that its upstream task was canceled/failed?
[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager
[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113809#comment-17113809 ] Nicholas Jiang commented on FLINK-17726: [~zhuzh] Could you please assign this to me? I would like to follow up on your suggestion for this issue. Thanks.