[ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010451#comment-17010451 ]
Xintong Song edited comment on FLINK-13554 at 1/8/20 8:08 AM:
--------------------------------------------------------------

We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused by the problem described in this ticket. Since this problem was not introduced in 1.10, I believe it should not be a blocker. However, how to fix the problem, and whether it needs to be fixed in 1.10, still need to be discussed. I'm setting this ticket to release-1.10 critical for now, to avoid overlooking it before a decision is made.

cc [~gjy] [~liyu] [~zhuzh] [~chesnay] [~trohrmann] [~karmagyz]

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case where one TaskExecutor got stuck during launching on YARN (without failing), so the job could not recover from continuous failovers.
> The TaskExecutor got stuck because of an environment problem on our side. It got stuck at some point after the ResourceManager had started it, while the ResourceManager was waiting for it to come up and register.
> Later, when the slot request times out, the job fails over and requests slots from the ResourceManager again. The ResourceManager still sees a TaskExecutor (the stuck one) being started and therefore will not request a new container from YARN. As a result, the job cannot recover from the failure.
> I think that to avoid such an unrecoverable state, the ResourceManager needs a timeout on starting new TaskExecutors. If starting a TaskExecutor takes too long, the ResourceManager should fail that TaskExecutor and start a new one.

-- 
This message was sent by Atlassian Jira
(v8.3.4#803005)
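The proposed timeout could look roughly like the following minimal Java sketch. All class and method names here are hypothetical illustrations, not Flink's actual ResourceManager API: the idea is simply that each requested container carries a registration deadline, and an entry that misses the deadline is dropped from the "being started" bookkeeping so a replacement container can be requested.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: track TaskExecutors whose containers have been
// requested but which have not yet registered. If registration does not
// happen within the timeout, the pending entry is removed and a caller-
// supplied action (e.g. release the container and request a new one) runs,
// instead of counting the stuck TaskExecutor as "being started" forever.
public class PendingTaskExecutorTracker {
    private final Map<String, Runnable> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final long timeoutMillis;

    public PendingTaskExecutorTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called when a container is requested; onTimeout would release it and ask for a new one. */
    public void containerRequested(String containerId, Runnable onTimeout) {
        pending.put(containerId, onTimeout);
        timer.schedule(() -> {
            // Still pending after the timeout: treat the start as failed.
            if (pending.remove(containerId) != null) {
                onTimeout.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Called when the TaskExecutor registers in time; disarms the timeout path. */
    public void taskExecutorRegistered(String containerId) {
        pending.remove(containerId);
    }

    /** Number of TaskExecutors currently counted as "being started". */
    public int numPending() {
        return pending.size();
    }
}
```

With such bookkeeping, a stuck TaskExecutor would only suppress new container requests until the timeout fires, after which the ResourceManager's pending count drops back and a fresh container can be requested.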