[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-04-26 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332834#comment-17332834
 ] 

Zhu Zhu edited comment on FLINK-17726 at 4/27/21, 1:50 AM:
---

I think this is a potential issue rather than a real production problem so far. 
The problem would occur only if a task is directly canceled by the TM and no 
other task in the same pipelined region has failed. So far I think this case 
will not happen.


was (Author: zhuzh):
I think it is a potential issue and is not a real production problem yet. The 
problem would happen only if a task is directly cancelled by TM without failing 
any other task in the same pipelined region. So far I think this case will not 
happen.

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
> Labels: stale-critical
>
> JobManager will not trigger failure handling when receiving CANCELED task 
> update. 
> This is because CANCELED tasks are usually caused by another FAILED task. 
> These CANCELED tasks will be restarted by the failover process triggered by 
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime 
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to 
> CANCELED from all states except from CANCELING as failed tasks. 
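The possible solution quoted above can be sketched as a small decision rule. This is only an illustration under stated assumptions: `TaskState` and `CanceledUpdateRule` are hypothetical names invented here, not Flink classes, and the real scheduler would have to hook such a check into its task state update handling.

```java
// Hypothetical, simplified set of task states named in this ticket.
// This is an illustration only, not Flink's own ExecutionState enum.
enum TaskState { CREATED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

// Hypothetical helper showing the proposed JobManager-side rule.
final class CanceledUpdateRule {
    private CanceledUpdateRule() {}

    /**
     * Decides whether a task state update should go through failure handling.
     *
     * @param previous the state the JobManager last recorded for the task
     * @param reported the state reported by the TaskManager
     */
    static boolean needsFailureHandling(TaskState previous, TaskState reported) {
        if (reported == TaskState.FAILED) {
            return true;
        }
        // CANCELED is expected only as the end of a cancellation (CANCELING).
        // From any other state, the proposal treats it like a failure.
        return reported == TaskState.CANCELED && previous != TaskState.CANCELING;
    }

    public static void main(String[] args) {
        System.out.println(needsFailureHandling(TaskState.RUNNING, TaskState.CANCELED));   // true
        System.out.println(needsFailureHandling(TaskState.CANCELING, TaskState.CANCELED)); // false
        System.out.println(needsFailureHandling(TaskState.RUNNING, TaskState.FINISHED));   // false
    }
}
```

With this rule a RUNNING -> CANCELED update would be recovered like a failure, while a CANCELING -> CANCELED update would still be ignored.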



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-07 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409
 ] 

Zhu Zhu edited comment on FLINK-17726 at 2/7/21, 8:23 AM:
--

Thanks for relaunching this discussion and proposing a solution! [~pnowojski]
I'd like to double-check the proposal. Please correct me if I have 
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. 
Instead, it should transition to FAILED with a "secondary" failure that carries 
information about the root-cause task.
2. The JM triggers failovers on "primary" failures and ignores the related 
"secondary" failures. For "secondary" failures, given that the related 
"primary" failure should always be reported sooner or later, the JM can simply 
mark the task as CANCELED and skip the failure handling. As a further 
improvement, the JM can register a timeout on secondary failures, either to 
cover the case where the related "primary" failure is never reported or to 
speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from 
DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen once 
#1 is done.
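Point 2 above could look roughly like the following on the JM side. This is a sketch under assumptions, not Flink code: `SecondaryFailureTracker`, its `Action` enum, and the method names are all hypothetical, and time is passed in explicitly as milliseconds so the example stays deterministic instead of using a real timer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical JobManager-side bookkeeping; none of these names are Flink APIs.
final class SecondaryFailureTracker {
    enum Action { TRIGGER_FAILOVER, MARK_CANCELED }

    // A secondary failure still waiting for its primary failure to be reported.
    private static final class Pending {
        final String causeTaskId;
        final long deadlineMillis;

        Pending(String causeTaskId, long deadlineMillis) {
            this.causeTaskId = causeTaskId;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final long timeoutMillis;
    private final Set<String> reportedPrimaries = new HashSet<>();
    private final Map<String, Pending> pending = new LinkedHashMap<>();

    SecondaryFailureTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // A primary failure always triggers a failover and resolves waiting secondaries.
    Action onPrimaryFailure(String taskId) {
        reportedPrimaries.add(taskId);
        pending.values().removeIf(p -> p.causeTaskId.equals(taskId));
        return Action.TRIGGER_FAILOVER;
    }

    // A secondary failure is only marked CANCELED; a timeout is registered
    // if the primary failure it points to has not been reported yet.
    Action onSecondaryFailure(String taskId, String causeTaskId, long nowMillis) {
        if (!reportedPrimaries.contains(causeTaskId)) {
            pending.put(taskId, new Pending(causeTaskId, nowMillis + timeoutMillis));
        }
        return Action.MARK_CANCELED;
    }

    // Returns (and forgets) the tasks whose primary failure never arrived in time.
    // The caller would trigger a failover for these.
    List<String> pollTimedOut(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Pending>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Pending> entry = it.next();
            if (entry.getValue().deadlineMillis <= nowMillis) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }
}
```

The explicit `pollTimedOut` call stands in for whatever timer mechanism the JM would really use; the point is only that a secondary failure with no matching primary failure eventually leads to a failover instead of a hanging job.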


was (Author: zhuzh):
Thanks for relaunching this discussion and proposing a solution! [~pnowojski]
I'd like to double-check the proposal. Please correct me if I have 
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. It 
should transition to FAILED with a "secondary" failure that carries information 
about the root-cause task.
2. The JM triggers failovers on "primary" failures and ignores the related 
"secondary" failures. For "secondary" failures, given that the related 
"primary" failure should always be reported sooner or later, the JM can simply 
mark the task as CANCELED and skip the failure handling. As a further 
improvement, the JM can register a timeout on secondary failures, either to 
cover the case where the related "primary" failure is never reported or to 
speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from 
DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen once 
#1 is done.



[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-07 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280409#comment-17280409
 ] 

Zhu Zhu edited comment on FLINK-17726 at 2/7/21, 8:23 AM:
--

Thanks for relaunching this discussion and proposing a solution! [~pnowojski]
I'd like to double-check the proposal. Please correct me if I have 
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. It 
should transition to FAILED with a "secondary" failure that carries information 
about the root-cause task.
2. The JM triggers failovers on "primary" failures and ignores the related 
"secondary" failures. For "secondary" failures, given that the related 
"primary" failure should always be reported sooner or later, the JM can simply 
mark the task as CANCELED and skip the failure handling. As a further 
improvement, the JM can register a timeout on secondary failures, either to 
cover the case where the related "primary" failure is never reported or to 
speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from 
DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen once 
#1 is done.


was (Author: zhuzh):
Thanks for relaunching this discussion and proposing a solution.
I'd like to double-check the proposal. Please correct me if I have 
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING. It 
should transition to FAILED with a "secondary" failure that carries information 
about the root-cause task.
2. The JM triggers failovers on "primary" failures and ignores the related 
"secondary" failures. For "secondary" failures, given that the related 
"primary" failure should always be reported sooner or later, the JM can simply 
mark the task as CANCELED and skip the failure handling. As a further 
improvement, the JM can register a timeout on secondary failures, either to 
cover the case where the related "primary" failure is never reported or to 
speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from 
DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen once 
#1 is done.



[jira] [Comment Edited] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

2021-02-05 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279743#comment-17279743
 ] 

Piotr Nowojski edited comment on FLINK-17726 at 2/5/21, 3:13 PM:
-

We discovered this issue when we introduced a bug that caused a Task to switch 
to the `CANCELED` state incorrectly (it should have switched to `FINISHED`), 
which deadlocked the job.

This led me to a discussion with [~trohrmann] about how to handle this kind of 
issue. For me the most important part would be to avoid a regression in how we 
report the root/primary/real cause of the failure. Currently, switching from 
`RUNNING` -> `CANCELED` is a valid thing for a task to do if this is a 
"secondary" failure caused by an upstream/downstream task issue. This currently 
allows the JobManager to easily tell those "secondary" failures apart from the 
real failures, ignore them, and pick the first reported "real" failure as the 
root cause of the job/region failure.

If we followed the solution proposed here in the ticket, which is to not allow 
the `RUNNING` -> `CANCELED` transition and simply treat it as a regular 
"primary" failure, I would expect the user to be flooded with hundreds of 
secondary failures. In that case it would be extremely difficult for them to 
figure out what has happened. Primary example: the "real" failure is the loss 
of a Task Manager that was either not detected, or would only be detected after 
the heartbeat timeout, and which caused hundreds/thousands of "secondary" 
failures (currently `RUNNING` -> `CANCELED` transitions).

My proposal for dealing with this situation would be to keep the distinction of 
the "secondary" failure, but also to enrich it with information about which 
task was the reason behind it. The JobManager would receive the information 
"Task B1 failed because something has happened to Task A1".

That would let us do two things:
* If the JobManager managed to detect some primary failure, it could ignore (or 
batch together) all of the secondary failures.
* If no primary failure was detected, and we want to fail over the job without 
waiting, for example, for the heartbeat timeout pointing to the primary 
failure, the JobManager could connect the secondary failures with the DAG to 
deduce that something bad has happened to "Task A1".
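To make the second bullet more concrete, here is a rough sketch of how secondary failure reports could be combined with the DAG. Everything in it is an assumption for illustration: `RootCauseDeduction`, the string task ids, and the two maps are hypothetical and do not correspond to existing Flink classes. A task is suspected as the root cause if a direct neighbor blames it and it has not reported a secondary failure itself.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of Flink.
final class RootCauseDeduction {
    private RootCauseDeduction() {}

    /**
     * @param blamedByReporter reporting task -> task it names as the reason
     *                         for its secondary failure
     * @param neighbors        task -> its direct upstream/downstream tasks in the DAG
     * @return tasks that are blamed by a direct neighbor but did not report a
     *         secondary failure themselves
     */
    static Set<String> suspectedRootCauses(
            Map<String, String> blamedByReporter, Map<String, Set<String>> neighbors) {
        Set<String> suspects = new TreeSet<>();
        for (Map.Entry<String, String> report : blamedByReporter.entrySet()) {
            Set<String> adjacent =
                    neighbors.getOrDefault(report.getKey(), Collections.emptySet());
            // Only trust reports that blame a task actually connected in the DAG.
            if (adjacent.contains(report.getValue())) {
                suspects.add(report.getValue());
            }
        }
        // A task that itself reported a secondary failure is a victim, not the root.
        suspects.removeAll(blamedByReporter.keySet());
        return suspects;
    }
}
```

For a chain A1 -> B1 -> C1 where B1 blames A1 and C1 blames B1, this yields A1 as the only suspect, so the JobManager could fail over without waiting for the heartbeat timeout.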


was (Author: pnowojski):
We discovered this issue when we introduced a bug that caused a Task to switch 
to the `CANCELED` state incorrectly (it should have switched to `FINISHED`), 
which deadlocked the job.

This led me to have a discussion with [~trohrmann] about a good way to handle 
this kind of issue. For me the most important part would be to avoid a 
regression in how we report the root/primary/real cause of the failure. 
Currently, switching from `RUNNING` -> `CANCELED` is a valid thing for a task 
to do if this is a "secondary" failure caused by an upstream/downstream task 
issue. This currently allows the JobManager to easily tell those "secondary" 
failures apart from the real failures, ignore them, and pick the first reported 
"real" failure as the root cause of the job/region failure.

If we followed the solution proposed here in the ticket, which is to not allow 
the `RUNNING` -> `CANCELED` transition and simply treat it as a regular 
"primary" failure, I would expect the user to be flooded with hundreds of 
secondary failures, which would make it extremely difficult for them to figure 
out what has happened. Primary example: the "real" failure is the loss of a 
Task Manager that was either not detected, or would only be detected after the 
heartbeat timeout, and which caused hundreds/thousands of "secondary" failures 
(currently `RUNNING` -> `CANCELED` transitions).

My proposal for dealing with this situation would be to keep the distinction of 
the "secondary" failure, but also to enrich it with information about which 
task was the reason behind it. The JobManager would receive the information 
"Task B1 failed because something has happened to Task A1".

That would let us do two things:
* If the JobManager managed to detect some primary failure, it could ignore (or 
batch together) all of the secondary failures.
* If no primary failure was detected, and we want to fail over the job without 
waiting, for example, for the heartbeat timeout pointing to the primary 
failure, the JobManager could connect the secondary failures with the DAG to 
deduce that something bad has happened to "Task A1".

> Scheduler should take care of tasks directly canceled by TaskManager
> 
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving CANCELED task 
> update.