[jira] [Commented] (YARN-3480) Recovery may get very slow with lots of services with lots of app-attempts

Vinod Kumar Vavilapalli (JIRA) Fri, 08 May 2015 09:01:41 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534761#comment-14534761
 ]


Vinod Kumar Vavilapalli commented on YARN-3480:
-----------------------------------------------

bq. RM HA is enabled, use ZK to store apps' info, most apps running in the 
cluster are long running(service) apps, yarn.resourcemanager.am.max-attempts is 
set to 10000 because we have not patched YARN-611 and we want apps to retry 
more times. There are 10K apps with 1~10000 attempts stored in ZK. It will take 
about 6 mins to recover those apps when RM HA.
Part of the reason you are seeing the problem today is precisely that you 
don't have YARN-611.

Once you have YARN-611, assuming a validity interval on the order of tens of 
minutes, you would need consistent failures for >100 days to reach 10K 
attempt objects and see what you are seeing.
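
As a rough check of that estimate, here is a minimal sketch of the arithmetic. 
It assumes roughly one stored failed attempt per validity interval, which is a 
simplification and not something YARN guarantees; the class and method names 
are hypothetical.

```java
// Back-of-the-envelope estimate only; not part of YARN.
class AttemptAccumulation {

    // Days needed to accumulate targetAttempts stored attempts, ASSUMING
    // one failed attempt is stored per validity interval.
    static double daysToAccumulate(long targetAttempts, double intervalMinutes) {
        double totalMinutes = targetAttempts * intervalMinutes;
        return totalMinutes / (60.0 * 24.0);
    }

    public static void main(String[] args) {
        // Try a few intervals in the "tens of minutes" range.
        for (double interval : new double[] {10, 15, 30}) {
            System.out.printf("interval=%.0f min -> %.1f days%n",
                    interval, daysToAccumulate(10_000, interval));
        }
    }
}
```

Under that assumption, a 15-minute interval needs about 104 days of 
back-to-back failures to reach 10K attempts.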

That said, I can definitely see issues going forward. We can do two things.
 - Assuming _some_ history is important, we can limit the amount of 
completed app-attempts' history that the platform remembers. Apps can control 
how much they want the platform to remember, but they cannot specify more than 
a cluster-configured global limit.
 - Instead of throwing away all older history, we could also recover very old 
attempts outside of the main recovery path. That way recovery can still be 
fast (only the few most recent attempts are recovered synchronously) and, 
given enough time, older history will get read offline.
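
The two ideas above could be sketched roughly as follows. This is only an 
illustration under stated assumptions: the class, the method names, and the 
convention that a non-positive app request means "use the cluster limit" are 
all hypothetical and are not existing YARN APIs or configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in YARN.
class AttemptRecoveryPlan {

    // Idea 1: an app may ask for some amount of attempt history, but never
    // more than the cluster-wide limit. ASSUMPTION: a non-positive request
    // falls back to the cluster limit.
    static int effectiveLimit(int appRequested, int clusterMax) {
        if (appRequested <= 0) {
            return clusterMax;
        }
        return Math.min(appRequested, clusterMax);
    }

    // Idea 2: split attempt ids (ordered oldest to newest) into two parts.
    // Element 0 holds older attempts to load later, outside the recovery
    // path; element 1 holds the most recent syncCount attempts to recover
    // synchronously.
    static List<List<Integer>> split(List<Integer> attemptIds, int syncCount) {
        int n = attemptIds.size();
        int cut = Math.max(0, n - Math.max(0, syncCount));
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(attemptIds.subList(0, cut)));
        parts.add(new ArrayList<>(attemptIds.subList(cut, n)));
        return parts;
    }

    public static void main(String[] args) {
        int keep = effectiveLimit(50, 20);
        List<List<Integer>> parts = split(Arrays.asList(1, 2, 3, 4, 5), 2);
        System.out.println("history limit: " + keep);
        System.out.println("deferred: " + parts.get(0));
        System.out.println("synchronous: " + parts.get(1));
    }
}
```

The split keeps the synchronous part bounded regardless of how many attempts 
are stored, which is the property the second proposal relies on.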

Thoughts?

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch
>
>
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries(attempts), so it 
> will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However 
> it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make 
> RM recover process much slower. It might be better to set max attempts to be 
> stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to 
> a small value, retried attempts might be very large. So we need to delete 
> some attempts stored in RMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
