[ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010451#comment-17010451 ]

Xintong Song edited comment on FLINK-13554 at 1/8/20 8:08 AM:
--------------------------------------------------------------

We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused 
by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a 
blocker. However, how to fix the problem, and whether it needs to be fixed in 
1.10, still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, to avoid 
overlooking it before a decision is made.
cc [~gjy] [~liyu] [~zhuzh] [~chesnay] [~trohrmann] [~karmagyz]


> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case where a TaskExecutor got stuck while 
> launching on Yarn (without failing), so that the job could not recover from 
> continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment: it hung 
> somewhere after the ResourceManager had started it, while the 
> ResourceManager was waiting for it to come up and register. Later, when the 
> slot request timed out and the job failed over and requested slots from the 
> ResourceManager again, the ResourceManager still saw a TaskExecutor (the 
> stuck one) being started and therefore did not request a new container from 
> Yarn. As a result, the job could not recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs 
> a timeout on starting new TaskExecutors. If starting a TaskExecutor takes 
> too long, the ResourceManager should fail it and start a new one (a rough 
> sketch follows below).
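A minimal sketch of the proposed timeout, in Java. The class and method names 
below (TaskExecutorStartupWatchdog, onTaskExecutorStartRequested, 
onTaskExecutorRegistered) are illustrative assumptions, not Flink's actual 
ResourceManager API; a real fix would have to hook into the ResourceManager's 
pending-container bookkeeping instead.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Flink's actual API: tracks TaskExecutors that have
// been started but have not yet registered, and replaces any that take too
// long, so the ResourceManager stops counting a stuck one as "starting".
public class TaskExecutorStartupWatchdog {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // containerId -> action that releases the stuck container and requests a new one
    private final Map<String, Runnable> pendingStarts = new ConcurrentHashMap<>();

    private final long startupTimeoutMillis;

    public TaskExecutorStartupWatchdog(long startupTimeoutMillis) {
        this.startupTimeoutMillis = startupTimeoutMillis;
    }

    /** Call when the ResourceManager requests a new container from Yarn. */
    public void onTaskExecutorStartRequested(String containerId, Runnable replaceAction) {
        pendingStarts.put(containerId, replaceAction);
        scheduler.schedule(() -> {
            // Still pending after the timeout: the TaskExecutor never registered.
            Runnable replace = pendingStarts.remove(containerId);
            if (replace != null) {
                replace.run(); // release the stuck container and request a new one
            }
        }, startupTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Call when the TaskExecutor registers in time; makes the timeout a no-op. */
    public void onTaskExecutorRegistered(String containerId) {
        pendingStarts.remove(containerId);
    }
}
{code}

Note that the timed-out entry is removed from the pending set before the 
replacement is requested, so a subsequent failover sees the correct number of 
starting TaskExecutors and the ResourceManager will ask Yarn for a fresh 
container.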



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
