[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts

Jian He (JIRA) Mon, 14 Dec 2015 14:33:07 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056844#comment-15056844
 ]


Jian He commented on YARN-3480:
-------------------------------

[~hex108], 
how about removing the attempts beyond max-allowed-attempts instead of the 
ones beyond the validity interval? That way we keep a more reasonable amount 
of history.
Instead of introducing the dummyAttempt in RMApp, we can change the caller to 
always look up the current attempt for a container via the 
AbstractYarnScheduler#getCurrentAttemptForContainer API. That way, container 
events are routed to the current attempt instead of an old one.
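
To make the two suggestions concrete, here is a minimal, self-contained sketch. 
It is not YARN code: AttemptHistory and all of its methods are hypothetical 
names made up for illustration, and the real change would live in RMAppImpl, 
RMStateStore and the scheduler. It only shows the idea of keeping at most 
max-allowed-attempts entries and of resolving container events against the 
current attempt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for per-application attempt bookkeeping; not a YARN class.
class AttemptHistory {
    private final int maxAttempts;
    // Attempt ids in creation order, oldest first.
    private final Deque<Integer> attemptIds = new ArrayDeque<>();

    AttemptHistory(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    // Records a new attempt and returns the ids of the attempts that fell
    // beyond the max-allowed-attempts window, so a caller could also delete
    // them from its state store.
    List<Integer> addAttempt(int attemptId) {
        attemptIds.addLast(attemptId);
        List<Integer> evicted = new ArrayList<>();
        while (attemptIds.size() > maxAttempts) {
            evicted.add(attemptIds.removeFirst());
        }
        return evicted;
    }

    // Number of attempts currently retained.
    int storedAttempts() {
        return attemptIds.size();
    }

    // The most recent attempt; throws NoSuchElementException if none exists.
    int currentAttempt() {
        return attemptIds.getLast();
    }

    // Container events always resolve to the current attempt, even when the
    // container was launched by an older attempt that has since been removed.
    int attemptForContainerEvent(int launchedByAttemptId) {
        return currentAttempt();
    }
}
```

With maxAttempts = 2, adding attempts 1, 2 and 3 evicts attempt 1, and an 
event for a container launched by attempt 1 resolves to attempt 3, so no 
dummy attempt is needed for the removed one.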

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, YARN-3480.06.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries (attempts), so it 
> is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. 
> However, this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and 
> makes the RM recovery process much slower. It might be better to cap the 
> number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
> a small value, the number of retried attempts might be very large. So we need 
> to delete some of the attempts stored in RMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
