[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-04-26 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332834#comment-17332834
 ] 

Zhu Zhu commented on FLINK-17726:
-

I think this is a potential issue and not a real production problem yet. The 
problem would happen only if a task is directly cancelled by the TM without 
failing any other task in the same pipelined region. So far, I think this case 
does not happen.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>  Labels: stale-critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task 
> update. 
> This is because CANCELED tasks are usually caused by another FAILED task. 
> These CANCELED tasks will be restarted by the failover process triggered by 
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime 
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to 
> CANCELED from any state other than CANCELING as failed tasks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-04-26 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332430#comment-17332430
 ] 

Till Rohrmann commented on FLINK-17726:
---

[~zhuzh] do you remember how you stumbled upon this problem? Did a user report 
a problem with it?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>  Labels: stale-critical
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329039#comment-17329039
 ] 

Flink Jira Bot commented on FLINK-17726:


This critical issue is unassigned and itself and all of its Sub-Tasks have not 
been updated for 7 days. So, it has been labeled "stale-critical". If this 
ticket is indeed critical, please either assign yourself or give an update. 
Afterwards, please remove the label. In 7 days the issue will be deprioritized.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>  Labels: stale-critical
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-08 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280870#comment-17280870
 ] 

Piotr Nowojski commented on FLINK-17726:


Yes/maybe [~zhuzh]. I think you summarised the gist of the idea correctly. 
However, there is one potential improvement:
{quote}
For "secondary" failures, given that the related "primary" failure should 
always be reported sooner or later, JM can simply mark the task as CANCELED and 
skip the failure handling.
{quote}
Maybe not in the first version, or maybe already in the first version, 
[~trohrmann] would like to tackle the problem to speed up failover, so that we 
do not have to wait for the primary failure to arrive. If the JM already knows 
that some tasks have started to fail (with secondary failures), it can fail 
over the job/region right away instead of waiting, for example, 1 minute for 
the heartbeats to time out. 

One thing that is not clear to me is how to detect the primary failure in such 
a case. Maybe we would need to fail over the job but still keep collecting the 
failure reasons for the previous attempt, and keep updating the detected root 
cause lazily? For example, if we have a chain of 4 tasks:
A->B->C->D
Maybe the TaskManager handling A fails silently, but the first error message 
the JM receives is from D, then a second later from C, then a second later from 
B, and 1 minute later a timeout of A.
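
The "keep updating the detected root cause lazily" idea can be sketched for the A->B->C->D example as follows. This is only an illustration under a strong assumption: the chain is linear and the real cause is the most upstream task that has reported so far. The class `LazyRootCauseSketch` and its methods are hypothetical and do not exist in Flink.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker; assumes the root cause is the most upstream reporter. */
final class LazyRootCauseSketch {

    // Position of each task in the chain, 0 being the most upstream one.
    private final Map<String, Integer> position = new HashMap<>();

    // Best guess so far; null until the first report arrives.
    private String rootCause;

    LazyRootCauseSketch(List<String> chainUpstreamFirst) {
        for (int i = 0; i < chainUpstreamFirst.size(); i++) {
            position.put(chainUpstreamFirst.get(i), i);
        }
    }

    /** Records a failure or timeout report and returns the current best guess. */
    String report(String task) {
        Integer pos = position.get(task);
        if (pos == null) {
            throw new IllegalArgumentException("unknown task: " + task);
        }
        // A later report replaces the guess only if it comes from further upstream.
        if (rootCause == null || pos < position.get(rootCause)) {
            rootCause = task;
        }
        return rootCause;
    }
}
```

With reports arriving in the order D, C, B, A, the guess moves upstream step by step and ends at A. A report from D that arrives after the timeout of A would leave the guess unchanged.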

Also note that there is no pressure to fix this right now.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-07 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409
 ] 

Zhu Zhu commented on FLINK-17726:
-

Thanks for relaunching this discussion and proposing a solution.
I'd like to double-check the proposal. Please correct me if I have 
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. 
Otherwise it should transition to FAILED with a "secondary" failure that 
carries information about the root cause task.
2. JM triggers failovers on "primary" failures and ignores related secondary 
failures. For "secondary" failures, given that the related "primary" failure 
should always be reported sooner or later, JM can simply mark the task as 
CANCELED and skip the failure handling. To further improve it, JM can register 
a timeout on secondary failures in case the related "primary" failure is not 
reported, or to speed up the recovery without waiting for a heartbeat timeout.
3. JM triggers a failover if a task directly transitions from DEPLOYING/RUNNING 
to CANCELED in the TM, although this should never happen after the work of #1.
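
The three points above can be read as a decision rule on the JM side. The following is a minimal sketch of that reading, not Flink code: `JmFailureHandlingSketch`, `Update`, `Action` and the `causeTaskId` field are hypothetical names, and a non-null `causeTaskId` is assumed to be what marks a failure as "secondary". The timeout from point 2 is left out.

```java
/** Hypothetical JM-side decision rule for incoming task state updates. */
final class JmFailureHandlingSketch {

    /** Subset of the states named in this thread. */
    enum State { DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED }

    enum Action { TRIGGER_FAILOVER, MARK_CANCELED_AND_AWAIT_PRIMARY, ACCEPT }

    /** Hypothetical update message sent from the TM to the JM. */
    static final class Update {
        final State previous;
        final State reported;
        final String causeTaskId; // null for a "primary" failure (assumption)

        Update(State previous, State reported, String causeTaskId) {
            this.previous = previous;
            this.reported = reported;
            this.causeTaskId = causeTaskId;
        }
    }

    static Action decide(Update update) {
        if (update.reported == State.FAILED) {
            // Point 2: only primary failures trigger a failover directly.
            return update.causeTaskId == null
                    ? Action.TRIGGER_FAILOVER
                    : Action.MARK_CANCELED_AND_AWAIT_PRIMARY;
        }
        if (update.reported == State.CANCELED && update.previous != State.CANCELING) {
            // Point 3: a direct jump to CANCELED is treated like a failure.
            return Action.TRIGGER_FAILOVER;
        }
        return Action.ACCEPT;
    }
}
```

A real implementation would also need the timeout on secondary failures, so that a missing primary report cannot leave the job waiting forever.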

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-05 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279743#comment-17279743
 ] 

Piotr Nowojski commented on FLINK-17726:


We discovered this issue when we introduced a bug that caused a Task to switch 
to the `CANCELED` state incorrectly (it should have switched to `FINISHED`), 
which deadlocked the job.

This led me to discuss with [~trohrmann] a good way to handle this kind of 
issue. For me the most important part would be to avoid a regression in how we 
report the root/primary/real cause of the failure. Currently, switching from 
`RUNNING` -> `CANCELED` is a valid thing for a task to do if this is a 
"secondary" failure caused by an upstream/downstream task issue. This currently 
allows the JobManager to easily separate those "secondary" failures from the 
real failures, ignore them, and pick the first reported "real" failure as the 
root cause of the job/region failure.

If we followed the solution proposed in this ticket, that is, to not allow the 
`RUNNING` -> `CANCELED` transition but simply treat it as a regular "primary" 
failure, I would expect users to be flooded with hundreds of secondary 
failures, which would make it extremely difficult for them to figure out what 
has happened. Prime example: the "real" failure is the loss of a Task Manager 
that was either not detected or would only be detected after the heartbeat 
timeout, and which caused hundreds/thousands of "secondary" failures (currently 
`RUNNING` -> `CANCELED` transitions). 

My proposal for dealing with this situation would be to keep the distinction of 
the "secondary" failure, but also enrich it with the information about which 
task was the reason behind it. The JobManager would receive the information 
"Task B1 failed because something has happened to Task A1". 

That would let us do two things:
* If the JobManager managed to detect some primary failure, it could ignore (or 
batch together) all of the secondary failures
* If no primary failure was detected, and we want to fail over the job without 
waiting, for example, for the heartbeat timeout pointing to the primary 
failure, the JobManager could connect the secondary failures and the DAG to 
deduce that something bad has happened to "Task A1"
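
The second bullet can be illustrated with a small sketch. It assumes that every secondary failure names the task it blames, and it treats a blamed task that has not reported anything itself as a suspected primary failure. `SecondaryFailureSketch` and its members are hypothetical names, and the blame relation stands in for the job DAG here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that infers suspected primary failures from blame reports. */
final class SecondaryFailureSketch {

    /** "task failed because something happened to blamed". */
    static final class SecondaryFailure {
        final String task;
        final String blamed;

        SecondaryFailure(String task, String blamed) {
            this.task = task;
            this.blamed = blamed;
        }
    }

    /** Returns the blamed tasks that never reported a failure themselves. */
    static Set<String> suspectedRoots(List<SecondaryFailure> reports) {
        Set<String> reporters = new HashSet<>();
        for (SecondaryFailure report : reports) {
            reporters.add(report.task);
        }
        Set<String> roots = new TreeSet<>(); // sorted for stable output
        for (SecondaryFailure report : reports) {
            if (!reporters.contains(report.blamed)) {
                roots.add(report.blamed);
            }
        }
        return roots;
    }
}
```

For the reports "C1 blames B1" and "B1 blames A1", the only suspect is A1, because B1 reported a secondary failure of its own.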

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-11-16 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232653#comment-17232653
 ] 

Till Rohrmann commented on FLINK-17726:
---

I think we first need a design because there are still some open questions 
about how to consolidate root cause messages which arrive out of order or late 
at the JM. Hence, I would suggest not including it in the {{1.12.0}} release.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
> Fix For: 1.12.0, 1.11.3, 1.13.0
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-11-15 Thread Yuan Mei (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17232595#comment-17232595
 ] 

Yuan Mei commented on FLINK-17726:
--

Double checked with [~nicholasjiang], he said he would finish this ticket by 
the end of this week.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Priority: Critical
> Fix For: 1.12.0, 1.11.3, 1.13.0
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-11-09 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228482#comment-17228482
 ] 

Till Rohrmann commented on FLINK-17726:
---

I guess this ticket won't make it into the {{1.12.0}} release, right 
[~nicholasjiang]?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.3
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-10-09 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210780#comment-17210780
 ] 

Zhu Zhu commented on FLINK-17726:
-

Hi [~nicholasjiang], are there any updates? Would you still like to work on 
this item?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.3
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-08-03 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170531#comment-17170531
 ] 

Zhu Zhu commented on FLINK-17726:
-

Hi [~nicholasjiang], are there any updates on the design?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.2
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-04 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125653#comment-17125653
 ] 

Zhu Zhu commented on FLINK-17726:
-

Thanks for the updates. [~nicholasjiang] 
Looking forward to your design.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.1
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-04 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125632#comment-17125632
 ] 

Nicholas Jiang commented on FLINK-17726:


I have already discussed this with [~trohrmann] offline. Since I have completed 
the other issues I claimed, I would like to prepare a design for this in the 
next few days and discuss it with you, [~zhuzh]. Thanks to [~zhuzh] and 
[~trohrmann] for explaining the solution.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.1
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-04 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125588#comment-17125588
 ] 

Zhu Zhu commented on FLINK-17726:
-

I think the rule should be that in the TM we do not allow `transitionState` 
from a non-CANCELING state to CANCELED.
E.g. `PartitionProducerStateResponseHandle#cancelConsumption()` should also be 
covered even though it does not throw `CancelTaskException`.
However, given that there are still open questions about "how to consolidate 
root cause messages which arrive out of order or late at the JobManager", I 
think the design is not finalized.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.1
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-03 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125565#comment-17125565
 ] 

Nicholas Jiang commented on FLINK-17726:


In general, [~zhuzh] [~trohrmann], the solution could be to transition the 
state to FAILED when a CancelTaskException occurs and the current Task state is 
not CANCELING, am I right?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.12.0, 1.11.1
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-03 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124912#comment-17124912
 ] 

Till Rohrmann commented on FLINK-17726:
---

After an offline discussion with [~zhuzh] we agreed not to do this feature for 
the {{1.11.0}} release because there are still open questions about how to 
consolidate root cause messages which arrive out of order or late at the 
{{JobManager}}.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.11.0, 1.12.0
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-03 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124808#comment-17124808
 ] 

Till Rohrmann commented on FLINK-17726:
---

Sorry for my late reply. I think the {{JobMaster}} should handle every state 
transition to {{CANCELED}} from a state other than {{CANCELING}} as a failure. 
This effectively means that the {{JobMaster}} must initiate the cancellation in 
one form or another. If we wanted the {{Task}} to be smart and to initiate the 
cancellation, then it would have to be sure that there is another {{Task}} 
which reported a failure back to the {{JobMaster}}. I think in the general case 
this is very hard to guarantee (it would only be ok if the other task sent a 
message saying that it had successfully transmitted this state transition to 
the {{JobMaster}}).

Consequently, if there is the situation of {{A1 -> B1}} and {{A1}} fails and 
{{B1}} realizes it, then {{B1}} cannot be sure that {{A1}} could update the 
{{JM}} and has to fail. Only if the {{JM}} sent the cancellation request does 
{{B1}} know that the failure of {{A1}} has been successfully reported, and only 
then can it cancel.

One way to ensure this contract on the {{Task}} side could be to only allow the 
state transition from {{CANCELING}} to {{CANCELED}}. Concretely, this means 
that we transition into {{FAILED}} if we see a {{CancelTaskException}} while 
the current {{Task}} state is not {{CANCELING}}.
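
A minimal sketch of this contract, for illustration only: `TaskStateSketch` is a hypothetical stand-in and not Flink's `Task` class, `CancelSignal` plays the role of `CancelTaskException`, and only the states named in this thread are modelled.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for task-side state handling; not Flink's Task class. */
final class TaskStateSketch {

    /** Subset of the states named in this thread. */
    enum State { DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED }

    /** Hypothetical stand-in for the CancelTaskException mentioned above. */
    static final class CancelSignal extends RuntimeException {
        CancelSignal(String message) {
            super(message);
        }
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);

    State current() {
        return state.get();
    }

    /** Models a cancellation request sent by the JobMaster. */
    boolean requestCancel() {
        return state.compareAndSet(State.RUNNING, State.CANCELING)
                || state.compareAndSet(State.DEPLOYING, State.CANCELING);
    }

    /** Picks the terminal state after the task's code stopped with an error. */
    State onTermination(Throwable error) {
        while (true) {
            State s = state.get();
            if (s == State.CANCELED || s == State.FAILED) {
                return s; // already terminal, nothing to change
            }
            // CANCELED is only reachable from CANCELING; every other case,
            // including a CancelSignal, ends in FAILED.
            State target = (s == State.CANCELING) ? State.CANCELED : State.FAILED;
            if (state.compareAndSet(s, target)) {
                return target;
            }
        }
    }
}
```

In this sketch a `CancelSignal` seen while `RUNNING` ends in `FAILED`, whereas the same signal after `requestCancel()` ends in `CANCELED`, so the JobMaster only ever sees `CANCELED` for cancellations it asked for.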

Does this make sense?

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Task
>Affects Versions: 1.11.0, 1.12.0
>Reporter: Zhu Zhu
>Assignee: Nicholas Jiang
>Priority: Critical
> Fix For: 1.11.0, 1.12.0
>
>


[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-02 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123422#comment-17123422
 ] 

Nicholas Jiang commented on FLINK-17726:


[~zhuzh] Thanks for your explanation. So we need [~trohrmann] to confirm 
whether to go with the implementation that marks directly CANCELED tasks with a 
dedicated exception.



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-01 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121514#comment-17121514
 ] 

Zhu Zhu commented on FLINK-17726:
-

[~nicholasjiang] What I mean is that not all kinds of directly CANCELED tasks 
should trigger failovers. 
There might be directly CANCELED tasks that were not caused by FAILED/CANCELED 
upstream tasks. We still need to trigger failovers for such tasks, otherwise 
they would stay in CANCELED forever.
A dedicated exception means the exception would only be thrown in this specific 
case. It does not affect the error handling of other cases.



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-01 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121477#comment-17121477
 ] 

Nicholas Jiang commented on FLINK-17726:


[~zhuzh] I agree that directly CANCELED tasks should not trigger a failover, 
and that a failover should only be triggered for tasks in the FAILED state. 
However, whether to mark directly CANCELED tasks with a dedicated exception is 
not settled yet; my concern is that such a mark might suppress the failover in 
all cases. I lean towards a new state such as DIRECT_CANCELED to handle this 
case. Please point out anything wrong in my understanding.




[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-01 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120997#comment-17120997
 ] 

Zhu Zhu commented on FLINK-17726:
-

I just thought of a case that might be problematic with the proposed change.
Imagine a case like this: job {A1 -> B1}. A1 and B1 were running. Later A1 
failed and B1 was CANCELED due to A1's failure.
However, the CANCELED state of B1 was reported earlier than the FAILED state of 
A1. 
If we trigger a failover on receiving the directly CANCELED state of B1 and 
start canceling A1, the failure cause of A1 will be discarded because it will 
not be treated as the root failure.

Maybe we should mark this kind of directly CANCELED task with a dedicated 
exception and not trigger a failover for it on the JM side.
[~trohrmann] [~nicholasjiang] WDYT?
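
The ordering problem above can be reproduced with a toy model. This is a sketch under assumptions, not scheduler code: {{Update}}, {{rootFailure}} and {{scenario}} are hypothetical names, and the model assumes that the first update that triggers a failover is recorded as the root failure while later updates are ignored.

```java
import java.util.Arrays;
import java.util.List;

final class RootFailureOrderSketch {

    enum State { FAILED, CANCELED }

    /** Simplified task state update as seen by the JobManager. */
    static final class Update {
        final String task;
        final State state;

        Update(String task, State state) {
            this.task = task;
            this.state = state;
        }
    }

    /**
     * Returns the task recorded as root failure, assuming the first update
     * that triggers a failover wins and later updates are ignored.
     */
    static String rootFailure(List<Update> arrivalOrder, boolean canceledTriggersFailover) {
        for (Update u : arrivalOrder) {
            boolean triggers = u.state == State.FAILED
                    || (canceledTriggersFailover && u.state == State.CANCELED);
            if (triggers) {
                return u.task;
            }
        }
        return null; // no update triggered a failover
    }

    /** The scenario from the comment: B1's CANCELED arrives before A1's FAILED. */
    static List<Update> scenario() {
        return Arrays.asList(
                new Update("B1", State.CANCELED),
                new Update("A1", State.FAILED));
    }
}
```

In this model, {{rootFailure(scenario(), true)}} yields B1, so A1's failure cause is lost, while {{rootFailure(scenario(), false)}} yields A1.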



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-06-01 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120883#comment-17120883
 ] 

Nicholas Jiang commented on FLINK-17726:


[~zhuzh] I have double-checked the methods that trigger task cancellation: 
cancelOrFailAndCancelInvokable and cancelExecution, which are based on 
TaskCanceler; cancelInvokable, which is based on the invokable's cancel method; 
and the callers of transitionState. After checking again, I found no case in 
which a task is directly CANCELED because its upstream task was 
canceled/failed. IMO, the solution would be to turn transitions to CANCELED 
from any state except CANCELING into FAILED, the same as the solution you 
mentioned. What do you think?
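
As a rough illustration of that rule on the JobManager side, the check could look like the sketch below. The {{State}} enum is a reduced stand-in for the execution states and {{triggersFailureHandling}} is a hypothetical helper; neither is taken from the Flink code base.

```java
final class JobManagerCanceledRuleSketch {

    /** Reduced stand-in for the execution states; not the full set. */
    enum State { CREATED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED }

    /**
     * Hypothetical helper: decides whether a reported state update should
     * start failure handling, given the state the JobManager had recorded
     * for the task before the update arrived.
     */
    static boolean triggersFailureHandling(State previous, State reported) {
        if (reported == State.FAILED) {
            return true;
        }
        // CANCELED is only expected after the JobManager asked for it,
        // i.e. when the task was already in CANCELING.
        return reported == State.CANCELED && previous != State.CANCELING;
    }
}
```

A task going from RUNNING straight to CANCELED would then be handled like a failed task, while CANCELING to CANCELED stays a normal cancellation.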



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-31 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120725#comment-17120725
 ] 

Zhu Zhu commented on FLINK-17726:
-

[~nicholasjiang] What's the current state? Is there any problem with opening the PR?



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-29 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119326#comment-17119326
 ] 

Zhu Zhu commented on FLINK-17726:
-

Thanks for the updates [~nicholasjiang].
Please ping me once the PR is ready.



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-28 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119166#comment-17119166
 ] 

Nicholas Jiang commented on FLINK-17726:


Hi [~trohrmann], I previously double-checked whether a directly CANCELED task 
can happen in the case [~zhuzh] mentioned. Sorry for not updating the state of 
this issue in time. I would like to open the pull request today and sync the 
state with [~zhuzh].



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-28 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118408#comment-17118408
 ] 

Till Rohrmann commented on FLINK-17726:
---

Hi [~nicholasjiang], what's the state of this issue?



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-22 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113820#comment-17113820
 ] 

Zhu Zhu commented on FLINK-17726:
-

I have assigned the ticket to you, [~nicholasjiang].
But before starting to apply the changes, could you double-check whether a 
directly CANCELED task can happen when its upstream task was canceled/failed?



[jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2020-05-22 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113809#comment-17113809
 ] 

Nicholas Jiang commented on FLINK-17726:


[~zhuzh] Could you please assign this to me? I would like to follow up on your 
suggestion for this issue. Thanks.
