Jun Gong commented on YARN-3480:

Thanks for the suggestion!

I meant that we can reuse the "yarn.resourcemanager.am.max-attempts" config. In 
the regular case, without validityInterval enabled, the number of attempts will 
never go over this limit. If validityInterval is enabled, we can remove the 
attempts that exceed this limit.

I think we don't need to remove the attempt from memory; we only need to 
remove it from the store.
That is reasonable. Keeping the attempts in memory also avoids the following 
problem: only those attempts that satisfy 'shouldCountTowardsMaxAttemptRetry()' 
are counted as completed attempts. If validityInterval is enabled and we also 
removed from memory the attempts that exceed 
"yarn.resourcemanager.am.max-attempts", the app would retry forever whenever 
some of the attempts we kept do not count towards the max attempt retry.
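A minimal sketch of the idea being discussed (the class and field names here are stand-ins, not the actual RMAppImpl/RMStateStore code): every attempt stays in the in-memory list so that retry accounting over shouldCountTowardsMaxAttemptRetry() remains correct, while the state store is trimmed to at most max-attempts entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified sketch; not the real ResourceManager classes.
class AttemptTrimSketch {
    static class Attempt {
        final int id;
        // Mirrors RMAppAttempt#shouldCountTowardsMaxAttemptRetry().
        final boolean countsTowardsMaxAttemptRetry;
        Attempt(int id, boolean counts) {
            this.id = id;
            this.countsTowardsMaxAttemptRetry = counts;
        }
    }

    final int maxAttempts;                            // yarn.resourcemanager.am.max-attempts
    final List<Attempt> inMemory = new ArrayList<>(); // all attempts stay here
    final Deque<Attempt> stored = new ArrayDeque<>(); // what the RMStateStore would hold

    AttemptTrimSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void addAttempt(Attempt a) {
        inMemory.add(a);
        stored.addLast(a);
        // Trim only the store: drop the oldest stored attempts beyond the limit.
        while (stored.size() > maxAttempts) {
            stored.removeFirst(); // would be a removeApplicationAttempt call on the store
        }
    }

    // Retry accounting still sees every attempt, so counting is unaffected.
    long countedFailures() {
        return inMemory.stream()
                .filter(at -> at.countsTowardsMaxAttemptRetry)
                .count();
    }
}
```

After five attempts with maxAttempts = 2, the store holds only the last two, while the in-memory list (and therefore the failure count) still covers all five.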

The current change will affect all other events too. I suggest the logic below 
in ApplicationAttemptEventDispatcher, and also adding a comment explaining why 
it is needed:
else if (app.getSubmissionContext().getKeepContainersAcrossAttempts()
    && event.getType() == containerFinished)
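The suggested dispatch rule could look roughly like this (a sketch with made-up stand-in types; the real ApplicationAttemptEventDispatcher, event-type enum, and getters differ): an event is delivered to a non-current attempt only when it is a container-finished event and the app keeps containers across attempts.

```java
// Simplified sketch of the proposed dispatch rule; types here are stand-ins,
// not the real ResourceManager classes.
class DispatcherSketch {
    enum EventType { CONTAINER_FINISHED, ATTEMPT_KILLED, OTHER }

    static class App {
        final boolean keepContainersAcrossAttempts;
        final int currentAttemptId;
        App(boolean keep, int current) {
            this.keepContainersAcrossAttempts = keep;
            this.currentAttemptId = current;
        }
    }

    /** Returns true if the event should be delivered to attemptId. */
    static boolean shouldDispatch(App app, int attemptId, EventType type) {
        if (attemptId == app.currentAttemptId) {
            return true; // the current attempt receives every event
        }
        // Proposed rule: older attempts only see CONTAINER_FINISHED, and only
        // when containers are kept across attempts (otherwise their containers
        // died with the attempt and no such event can reach them).
        return app.keepContainersAcrossAttempts
                && type == EventType.CONTAINER_FINISHED;
    }
}
```

This keeps the previous behavior for the current attempt while restricting which events can wake a finished attempt.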

dummyAttempt - is it OK to just return the first attempt in the RMApp#attempts 
map? Rename it to previousFailedAttempt.
OK. I will fix them.

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, 
> YARN-3480.06.patch, YARN-3480.07.patch
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries(attempts), so it 
> will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However 
> it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make 
> RM recover process much slower. It might be better to set max attempts to be 
> stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
> a small value, the number of retried attempts might become very large, so we 
> need to delete some of the attempts stored in RMStateStore.

This message was sent by Atlassian JIRA
