[
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526195#comment-14526195
]
Jun Gong commented on YARN-3480:
--------------------------------
[~vinodkv] Thank you for the comments.
{quote}
No, as you noted later, the right solution is for apps to set the
attempt-failures validity-interval.
{quote}
Yes, I agree with it.
{quote}
We already have a yarn.resourcemanager.am.max-attempts that acts as a global
limit. Is that not sufficient? A more practical problem is the number of apps
itself. And we do have an upper limit of 10K by default for this. Is that not
enough? Are you seeing issues in a real-life scenario?
{quote}
yarn.resourcemanager.am.max-attempts only limits the number of failed attempts
within the time window configured through 'attemptFailuresValidityInterval'.
Consider the following scenario: the app's am.max-attempts is set to 2 and its
attemptFailuresValidityInterval is set to 30 minutes. If the app fails at
00:00, 00:31, 01:02, ..., it will keep retrying, because the number of failed
attempts inside the time window (attemptFailuresValidityInterval) is always 1.
The number of attempts will then increase continuously.
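
To make the scenario above concrete, here is a minimal sketch of the windowed
failure counting. It is not the actual RMAppImpl code: the class and method
names are hypothetical, times are plain minutes, and it assumes a failure stops
counting once it is at least one full interval old.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; not the ResourceManager implementation. */
final class AttemptWindowSketch {

    /**
     * Returns true if, at some failure, the number of failures still inside
     * the validity window reaches maxAttempts (the app would then be failed).
     * failureTimes must be sorted in ascending order.
     */
    static boolean exhaustsAttempts(long[] failureTimes, int maxAttempts,
                                    long validityInterval) {
        Deque<Long> recent = new ArrayDeque<>();
        for (long now : failureTimes) {
            recent.addLast(now);
            // Assumption: a failure that is a full interval old no longer counts.
            while (now - recent.peekFirst() >= validityInterval) {
                recent.removeFirst();
            }
            if (recent.size() >= maxAttempts) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Failures 31 minutes apart with a 30-minute window: never exhausted.
        System.out.println(exhaustsAttempts(new long[] {0, 31, 62, 93}, 2, 30));
        // Two failures 10 minutes apart: the limit of 2 is reached.
        System.out.println(exhaustsAttempts(new long[] {0, 10}, 2, 30));
    }
}
```

With failures spaced slightly wider than the interval, the count inside the
window never exceeds 1, so the limit of 2 is never hit and the attempt history
keeps growing.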
{quote}
I think we need to have a lower limit on the failure-validity interval to avoid
situations like this. If others agree too, will file a ticket.
{quote}
Please see the above scenario.
> Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
> ----------------------------------------------------------------------------
>
> Key: YARN-3480
> URL: https://issues.apache.org/jira/browse/YARN-3480
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3480.01.patch, YARN-3480.02.patch,
> YARN-3480.03.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps
> are more likely to finish successfully with more retries (attempts), so it
> is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value.
> However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and
> makes the RM recovery process much slower. It might be better to make the
> maximum number of attempts stored in RMStateStore configurable.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to
> a small value, the number of retried attempts might be very large. So we need
> to delete some attempts stored in RMAppImpl and RMStateStore.