[
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039613#comment-15039613
]
Jun Gong commented on YARN-3480:
--------------------------------
[~jianhe] thanks for the reminder. I thought the final solution was "we only have
(limits + asynchronous recovery) for services, once YARN-1039 goes in", so I have
been waiting for YARN-1039.
However, what you just suggested is reasonable too; it depends on how important
we think app history information is. We have already implemented it and it
works well in our cluster, so I could port it to trunk. I will attach a patch
against trunk later.
> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>
> Key: YARN-3480
> URL: https://issues.apache.org/jira/browse/YARN-3480
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3480.01.patch, YARN-3480.02.patch,
> YARN-3480.03.patch, YARN-3480.04.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps
> are more likely to finish successfully with more retries (attempts), so it
> is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value.
> However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and
> makes the RM recovery process much slower. It might be better to limit the
> maximum number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to
> a small value, the number of retried attempts might be very large. So we need
> to delete some of the attempts stored in RMStateStore.
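
The idea in the description above (cap how many attempts are kept per app, and delete the oldest ones once the cap is exceeded) can be sketched roughly as follows. This is only an illustration in Python, not the attached patch and not the RMStateStore API: the class BoundedAttemptStore, the parameter max_stored_attempts, and the method names are all hypothetical.

```python
from collections import OrderedDict


class BoundedAttemptStore:
    """Hypothetical stand-in for a state store that keeps at most
    max_stored_attempts attempts per application (not the real RMStateStore)."""

    def __init__(self, max_stored_attempts):
        if max_stored_attempts < 1:
            raise ValueError("max_stored_attempts must be >= 1")
        self.max_stored_attempts = max_stored_attempts
        # app_id -> OrderedDict(attempt_id -> state), oldest attempt first
        self._attempts = {}

    def store_attempt(self, app_id, attempt_id, state):
        """Store one attempt; return the ids of attempts evicted to make room."""
        attempts = self._attempts.setdefault(app_id, OrderedDict())
        attempts[attempt_id] = state
        evicted = []
        # Drop the oldest attempts until the per-app cap is respected.
        while len(attempts) > self.max_stored_attempts:
            old_id, _ = attempts.popitem(last=False)
            evicted.append(old_id)
        return evicted

    def load_attempts(self, app_id):
        """Return the stored attempt ids for an app, oldest first.
        Recovery would only need to replay these."""
        return list(self._attempts.get(app_id, {}))


store = BoundedAttemptStore(max_stored_attempts=2)
for i in range(1, 6):
    store.store_attempt("app_1", "attempt_%d" % i, "FAILED")
remaining = store.load_attempts("app_1")
```

With a cap of 2 and five attempts stored, only the two most recent attempts remain to be recovered, so recovery cost is bounded by the cap rather than by the number of retries.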
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)