[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526195#comment-14526195
 ] 

Jun Gong commented on YARN-3480:
--------------------------------

[~vinodkv] Thank you for the comments.

{quote}
No, as you noted later, the right solution is for apps to set the 
attempt-failures validity-interval.
{quote}
Yes, I agree with it.

{quote}
We already have a yarn.resourcemanager.am.max-attempts that acts as a global 
limit. Is that not sufficient? A more practical problem is the number of apps 
itself. And we do have an upper limit of 10K by default for this. Is that not 
enough? Are you seeing issues in a real-life scenario?
{quote}
yarn.resourcemanager.am.max-attempts only limits the number of failed attempts
within the time window configured through 'attemptFailuresValidityInterval'.
Consider the following scenario: the app's am.max-attempts is set to 2 and its
attemptFailuresValidityInterval is set to 30. If the app fails at 00:00, 00:31,
01:02, ..., it will keep retrying and running, because the number of failed
attempts inside the time window (attemptFailuresValidityInterval) is always 1.
The total number of attempts will then increase continuously.
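
To make the scenario concrete, here is a minimal sketch of the windowed count.
It is not the RM's actual code: the function name, the arguments, and the
abstract time units are assumptions made for illustration only.

```python
def simulate(failure_times, max_attempts, validity_interval):
    """Replay attempt failures against a sliding validity window.

    Hypothetical helper, not a YARN API. Times and the interval share
    one arbitrary unit. Returns (attempts_recorded, app_failed).
    """
    recorded = []  # every failed attempt is kept, as in the state store
    for now in failure_times:
        recorded.append(now)
        # Only failures newer than (now - validity_interval) are counted.
        in_window = [t for t in recorded if t > now - validity_interval]
        if len(in_window) >= max_attempts:
            return len(recorded), True  # limit reached, app fails for good
    return len(recorded), False  # app is still being retried


# Failures spaced 31 apart with a window of 30: never more than one in window.
print(simulate([0, 31, 62, 93], max_attempts=2, validity_interval=30))
# Two failures inside the same window: the limit is hit.
print(simulate([0, 10], max_attempts=2, validity_interval=30))
```

In the first call the app is never failed, yet the list of recorded attempts
grows with every failure, which is the growth described above.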

{quote}
I think we need to have a lower limit on the failure-validity interval to avoid 
situations like this. If others agree too, will file a ticket.
{quote}
Please see the above scenario.

> Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
> ----------------------------------------------------------------------------
>
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries (attempts), so it 
> would be better to set 'yarn.resourcemanager.am.max-attempts' larger. However, 
> this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes 
> the RM recovery process much slower. It might be better to limit the max 
> attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
> a small value, the number of retried attempts might be very large. So we need 
> to delete some attempts stored in RMAppImpl and RMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
