[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337517#comment-17337517 ]

Paul Lin commented on FLINK-22506:
----------------------------------

[~trohrmann] Thanks a lot for the input. I now suspect that the value of `yarn.application-attempt-failures-validity-interval` is too low (I'm using the default), given that a retry may take about 1 min in my case. I'll investigate further and close the issue if it turns out to be a configuration problem. Thanks again!

> YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-22506
>                 URL: https://issues.apache.org/jira/browse/FLINK-22506
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.11.3
>            Reporter: Paul Lin
>            Priority: Major
>         Attachments: corrupted_savepoint.log, yarn application attempts.png
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) occurs during the initialization of the job manager, the job cluster exits with an error code. But since it does not mark the attempt as failed, the exit is not counted as a failed attempt, and YARN keeps retrying forever.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
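The suspicion in the comment above (a failure window that is shorter than the time one retry takes) can be illustrated with a small simulation. This is a simplified model written for illustration, not YARN's or Flink's actual code: the function names are hypothetical, the model assumes the application is given up once the number of failed attempts inside the validity window reaches the maximum number of attempts, and it assumes a non-positive interval means "count every failure". The 10 000 ms value used in the usage note is an assumed default; check the configuration reference of your Flink version.

```python
from typing import List, Optional


def failures_in_window(finish_times_ms: List[int], now_ms: int,
                       validity_interval_ms: int) -> int:
    """Count failed attempts that still count at time now_ms.

    Assumption: a non-positive interval disables the window, so every
    failure counts.
    """
    if validity_interval_ms <= 0:
        return len(finish_times_ms)
    window_start = now_ms - validity_interval_ms
    return sum(1 for t in finish_times_ms if t > window_start)


def attempts_until_giving_up(retry_period_ms: int, max_attempts: int,
                             validity_interval_ms: int,
                             cap: int = 100) -> Optional[int]:
    """Simulate attempts that fail every retry_period_ms.

    Returns the attempt number at which the application would be given up,
    or None if that never happens within `cap` attempts.
    """
    finish_times: List[int] = []
    for attempt in range(1, cap + 1):
        now = attempt * retry_period_ms  # this attempt fails at time `now`
        finish_times.append(now)
        counted = failures_in_window(finish_times, now, validity_interval_ms)
        if counted >= max_attempts:
            return attempt
    return None
```

Under these assumptions, `attempts_until_giving_up(60_000, 2, 10_000)` returns `None`: with one failure per minute and a 10 s window, only the latest failure is ever counted, so a limit of 2 attempts is never reached. With the window disabled (`-1`) or widened to `120_000`, the same call returns `2`.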
[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337510#comment-17337510 ]

Till Rohrmann commented on FLINK-22506:
---------------------------------------

Ok, then I misunderstood the ticket a bit. I thought that any application master failure would be handled as a failed attempt and count towards {{yarn.application-attempts}}. I don't think that we ever mark a YARN attempt explicitly as failed. Hence, I thought that it should work with {{yarn.application-attempts}} and {{yarn.application-attempt-failures-validity-interval}}.
[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337499#comment-17337499 ]

Paul Lin commented on FLINK-22506:
----------------------------------

[~knaufk] I've attached the JobManager logs and a screenshot of the YARN application web UI. Please take a look. I first reported the issue as a bug because I think the maximum number of attempts (which is set to 2) is not respected in this case, but I'm fine with making it an improvement.
[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337421#comment-17337421 ]

Paul Lin commented on FLINK-22506:
----------------------------------

[~knaufk] Actually, I'm not using application mode, and this issue has been around for a very long time. I've tried 1.12.1, and the problem still exists.

[~trohrmann] I agree that it's hard to distinguish non-retryable errors from the other ones. I think a simple way to solve the problem is to mark the attempt as failed whenever an error occurs, retryable or not, and leave it to YARN to decide whether the application should be restarted. The total number of restarts would then be limited by `yarn.application-attempts` and `yarn.application-attempt-failures-validity-interval`.
[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335373#comment-17335373 ]

Till Rohrmann commented on FLINK-22506:
---------------------------------------

I've talked to [~rmetzger] and he said that we did not change the behaviour with FLINK-16866. So it basically boils down to what Konstantin said: it is hard to distinguish between a transient and a permanent error. That's why Flink retries the operation by killing the process and letting YARN restart it.
[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
[ https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335225#comment-17335225 ]

Konstantin Knauf commented on FLINK-22506:
------------------------------------------

I assume you are using application mode on YARN?

Generally, it is hard for the JobManager to distinguish between a retryable error (e.g. S3 being temporarily unavailable) and a non-retryable error like the one you mention. So the current behavior is to retry in any case. I would therefore move this to an "Improvement".

Could you share the logs of the JobManager? If possible, you could also try out Flink 1.12; https://issues.apache.org/jira/browse/FLINK-16866 might have already changed this behavior, but I would have to test this myself.

> YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-22506
>                 URL: https://issues.apache.org/jira/browse/FLINK-22506
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.11.3
>            Reporter: Paul Lin
>            Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) occurs during the initialization of the job manager, the job cluster exits with an error code. But since it does not mark the attempt as failed, the exit is not counted as a failed attempt, and YARN keeps retrying forever.