[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-30 Thread Paul Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337517#comment-17337517
 ] 

Paul Lin commented on FLINK-22506:
--

[~trohrmann] Thanks a lot for the input. I now suspect that the value of 
`yarn.application-attempt-failures-validity-interval` is too low (I'm using the 
default), given that a single retry can take 1 minute in my case. I'll 
investigate further and close the issue if it turns out to be a configuration 
problem. Thanks again! 
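
To illustrate the suspicion above, here is a minimal sketch of how a failure validity window could interact with slow retries. It is a simplified model and not YARN's actual implementation: the function name, the strict window boundary, and the assumption that every attempt fails after the same duration are all hypothetical. The 10-second interval used in the usage note is an assumption about the Flink default and should be checked against the documentation for the version in use.

```python
from typing import List, Optional


def attempt_that_exhausts_retries(
    max_attempts: int,
    validity_interval_ms: int,
    retry_duration_ms: int,
    horizon: int = 1000,
) -> Optional[int]:
    """Return the attempt number at which the application is failed for good,
    or None if that never happens within `horizon` attempts.

    Simplified model (an assumption, not YARN's code): every attempt fails
    after `retry_duration_ms`, and only failures that finished within the
    last `validity_interval_ms` count towards `max_attempts`.
    """
    failure_times: List[int] = []
    now = 0
    for attempt in range(1, horizon + 1):
        now += retry_duration_ms  # the attempt runs, then fails
        failure_times.append(now)
        window_start = now - validity_interval_ms
        # Failures older than the window are forgotten.
        recent = [t for t in failure_times if t > window_start]
        if len(recent) >= max_attempts:
            return attempt
    return None
```

Under this model, `attempt_that_exhausts_retries(2, 10_000, 60_000)` returns `None`: with two allowed attempts, a 10-second window, and retries that take 1 minute, each failure has already left the window by the time the next one happens, so the limit is never reached. Raising the window to 120 seconds makes the same call return `2`.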

> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> 
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
> Attachments: corrupted_savepoint.log, yarn application attempts.png
>
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits with 
> an error code. But since the attempt is not marked as failed, it is not 
> counted as a failed attempt, and YARN keeps retrying forever.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-30 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337510#comment-17337510
 ] 

Till Rohrmann commented on FLINK-22506:
---

OK, then I misunderstood the ticket a bit. I thought that any application 
master failure would be handled as a failed attempt and would count towards 
{{yarn.application-attempts}}. I don't think that we ever explicitly mark a 
YARN attempt as failed. Hence, I thought that it should work with 
{{yarn.application-attempts}} and 
{{yarn.application-attempt-failures-validity-interval}}.






[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-30 Thread Paul Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337499#comment-17337499
 ] 

Paul Lin commented on FLINK-22506:
--

[~knaufk] I've attached the JobManager logs and a screenshot of the YARN 
application web UI. Please take a look. I first reported the issue as a bug 
because I think the maximum number of attempts (which is set to 2) is not 
respected in this case, but I'm fine with making it an improvement.






[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-30 Thread Paul Lin (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337421#comment-17337421
 ] 

Paul Lin commented on FLINK-22506:
--

[~knaufk] Actually, I'm not using application mode, and this issue has been 
around for a very long time. I've tried 1.12.1, and the problem still exists.

[~trohrmann] I agree that it's hard to distinguish non-retryable errors from 
the other ones. I think a simple way to solve the problem is to mark the 
attempt as failed whenever an error occurs, retryable or not, and leave it to 
YARN to decide whether the application should be restarted. The total number of 
restarts would then be bounded by `yarn.application-attempts` and 
`yarn.application-attempt-failures-validity-interval`. 

> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> 
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits with 
> an error code. But since the attempt is not marked as failed, it is not 
> counted as a failed attempt, and YARN keeps retrying forever.





[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-29 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335373#comment-17335373
 ] 

Till Rohrmann commented on FLINK-22506:
---

I've talked to [~rmetzger] and he said that we did not change the behaviour 
with FLINK-16866. So it basically boils down to what Konstantin said: it is 
hard to distinguish between a transient and a permanent error. That's why Flink 
retries the operation by killing the process and letting YARN restart it.

> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> 
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits with 
> an error code. But since the attempt is not marked as failed, it is not 
> counted as a failed attempt, and YARN keeps retrying forever.





[jira] [Commented] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-29 Thread Konstantin Knauf (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335225#comment-17335225
 ] 

Konstantin Knauf commented on FLINK-22506:
--

I assume you are using application mode on YARN? 

Generally, it is hard for the JobManager to distinguish between a retryable 
error (e.g. S3 being temporarily unavailable) and a non-retryable error like 
the one you mention. So the current behavior is to retry in any case. I would 
therefore change the issue type to "Improvement". Could you share the 
JobManager logs?

If possible, you could also try out Flink 1.12. 
https://issues.apache.org/jira/browse/FLINK-16866 might have already changed 
this behavior, but I would have to test this myself. 

> YARN job cluster stuck in retrying creating JobManager if savepoint is 
> corrupted
> 
>
> Key: FLINK-22506
> URL: https://issues.apache.org/jira/browse/FLINK-22506
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.11.3
> Reporter: Paul Lin
> Priority: Major
>
> If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
> occurs during the initialization of the job manager, the job cluster exits with 
> an error code. But since the attempt is not marked as failed, it is not 
> counted as a failed attempt, and YARN keeps retrying forever.


