Jun Gong commented on YARN-3480:

[~vinodkv] Thanks for the suggestions.

Part of why you are seeing the problem today itself is precisely because you 
don't have YARN-611.
Once you have YARN-611, assuming a validity interval in the order of 10s of 
minutes, to reach 10K objects, you need consistent failures for >100 days to 
see what you are seeing.
Yes, YARN-611 will benefit us a lot. Our own AM fails under some conditions, 
which also makes the number of retried attempts very large.
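
To make the ">100 days" estimate above concrete, here is a rough back-of-the-envelope sketch. The class and method names are hypothetical, and the failure rate (one failed attempt roughly every 15 minutes, i.e. a validity interval "in the order of 10s of minutes") is an assumption for illustration, not a measured value.

```java
// Hypothetical helper for illustration only; not part of YARN.
final class AttemptAccumulationSketch {

    /**
     * Days of continuous failures needed to accumulate the given number of
     * stored attempts, assuming one failed attempt every
     * minutesPerFailedAttempt minutes (an assumed rate).
     */
    static double daysToReach(long attempts, double minutesPerFailedAttempt) {
        double totalMinutes = attempts * minutesPerFailedAttempt;
        return totalMinutes / (60.0 * 24.0); // minutes per day
    }
}
```

Under that assumption, daysToReach(10000, 15) comes out to a little over 104 days, which is consistent with the estimate quoted above.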

Assuming some history is important, we can limit the amount of completed 
app-attempts' history that the platform remembers. Apps can control how much 
they want the platform to remember but they cannot specify more than a cluster 
configured global limit.
Some details to clarify: we may need to keep the failed attempts that fall 
within the validity window, so that count is a minimum number of attempts we 
must keep. When apps specify how much they want the platform to remember, we 
need to treat that as another minimum number of attempts to keep.
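
One possible reading of the retention rule discussed above, as a minimal sketch. All names here (AttemptRetentionSketch, attemptsToKeep, removable) are hypothetical and do not come from YARN or from the attached patches. It assumes the app's requested history size is capped by the cluster-wide limit, and that the attempts inside the validity window act as a floor on top of that.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of YARN.
final class AttemptRetentionSketch {

    /** Number of completed attempts to keep for one application. */
    static int attemptsToKeep(int failedInValidityWindow,
                              int appRequested,
                              int clusterGlobalLimit) {
        // An app cannot ask the platform to remember more than the global limit.
        int cappedRequest = Math.min(appRequested, clusterGlobalLimit);
        // Attempts still inside the validity window must always be kept,
        // so both values are treated as minimums and the larger one wins.
        return Math.max(failedInValidityWindow, cappedRequest);
    }

    /**
     * Returns the attempt ids that may be removed from the store.
     * attemptIds is assumed to be ordered oldest first.
     */
    static List<Integer> removable(List<Integer> attemptIds, int keep) {
        int removeCount = Math.max(0, attemptIds.size() - keep);
        return new ArrayList<>(attemptIds.subList(0, removeCount));
    }
}
```

For example, with 3 failed attempts in the validity window, an app request of 10 and a global limit of 5, this sketch keeps 5 attempts. If 8 attempts are still inside the validity window, it keeps 8 even though the global limit is 5.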

Instead of throwing away all history, I'd instead also do the recovery of very 
old attempts outside of the recovery path. That way recovery can still be fast 
(only recovering few of the most recent attempts synchronously) and given 
enough time, older history will get read offline.
This makes recovery faster and does not lose any attempts' history. However, 
it will make the recovery process a little more complicated.
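
A minimal sketch of the two-phase recovery idea above, to show where the extra complexity comes from. The names are hypothetical and the load() method only stands in for reading an attempt from the state store; this is not how the RM recovery code is actually structured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper for illustration only; not part of YARN.
final class TwoPhaseRecoverySketch {

    // Attempts recovered so far; synchronized because two threads append to it.
    final List<Integer> recovered = Collections.synchronizedList(new ArrayList<>());

    /**
     * Recovers the syncCount most recent attempts on the calling thread and
     * the older ones in the background. storedAttemptIds is assumed to be
     * ordered oldest first. The returned future completes once the older
     * history has been read as well.
     */
    CompletableFuture<Void> recover(List<Integer> storedAttemptIds, int syncCount) {
        int split = Math.max(0, storedAttemptIds.size() - syncCount);
        List<Integer> older = new ArrayList<>(storedAttemptIds.subList(0, split));
        List<Integer> recent =
            new ArrayList<>(storedAttemptIds.subList(split, storedAttemptIds.size()));

        // Phase 1: synchronous, keeps the recovery path short.
        for (Integer id : recent) {
            load(id);
        }
        // Phase 2: older history is read outside of the recovery path.
        return CompletableFuture.runAsync(() -> {
            for (Integer id : older) {
                load(id);
            }
        });
    }

    // Stand-in for reading one attempt from the state store.
    private void load(int attemptId) {
        recovered.add(attemptId);
    }
}
```

The complication is that anything reading attempt history must cope with the window in which only the recent attempts are available.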

The former method (removing attempts) is simpler and works much like 
logrotate. If we can accept losing some attempts' history, I would prefer it.

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch, YARN-3480.04.patch
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries (attempts), so it 
> will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However, 
> it will make RMStateStore (FileSystem/HDFS/ZK) store more attempts, and make 
> the RM recovery process much slower. It might be better to set a maximum 
> number of attempts to be stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
> a small value, the number of retried attempts might be very large. So we need 
> to delete some attempts stored in RMStateStore.

This message was sent by Atlassian JIRA
