Jun Gong commented on YARN-3480:

Thanks for explaining.

These cases make removing attempts complex. We remove attempts 
asynchronously. If RMStateStore does not transition to 'FENCED' on failed 
operations, we might fail to remove some attempts and succeed in removing 
others. Suppose there were 4 attempts: attempt01, attempt02, attempt03 and 
attempt04. We wanted to remove 2 attempts (attempt01 and attempt02), but we 
failed to remove attempt01, so the remaining attempts are attempt01, attempt03 
and attempt04. They are not consistent. When recovering these attempts for RM 
restart, we will fail to recover attempts because we could not recover 

To make things simple, how about just removing attempts if HA is enabled (or 
'RMFailFast' is set)?
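
To illustrate the partial-removal case above, here is a minimal sketch. It is
not RMStateStore code: the class and method names are hypothetical, removal is
done in a plain loop rather than asynchronously, and treating "consistent" as
"the remaining attempts are a contiguous tail of the stored ones" is my
assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names are YARN's actual API.
class AttemptRemovalSketch {

    // Removes every attempt in toRemove from a copy of stored, except those
    // whose removal "fails" (simulated by membership in failing). The store
    // is not fenced on failure, so later removals still go ahead.
    static List<String> removeAttempts(List<String> stored,
                                       List<String> toRemove,
                                       Set<String> failing) {
        List<String> remaining = new ArrayList<>(stored);
        for (String attempt : toRemove) {
            if (failing.contains(attempt)) {
                continue; // simulated failed removal; keep going
            }
            remaining.remove(attempt);
        }
        return remaining;
    }

    // Assumed consistency rule: the oldest attempts are removed first, so
    // what remains must be a contiguous tail of the original list.
    static boolean isContiguousSuffix(List<String> stored,
                                      List<String> remaining) {
        int offset = stored.size() - remaining.size();
        return offset >= 0
            && stored.subList(offset, stored.size()).equals(remaining);
    }

    public static void main(String[] args) {
        List<String> stored =
            List.of("attempt01", "attempt02", "attempt03", "attempt04");
        List<String> remaining = removeAttempts(
            stored, List.of("attempt01", "attempt02"), Set.of("attempt01"));
        // Prints: [attempt01, attempt03, attempt04] consistent=false
        System.out.println(
            remaining + " consistent=" + isContiguousSuffix(stored, remaining));
    }
}
```

If the failed removal instead stopped all later removals (which is what
fencing the store would amount to in this sketch), the stored attempts would
stay as they were and no gap would appear.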

> Recovery may get very slow with lots of services with lots of app-attempts
> --------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch, YARN-3480.04.patch, YARN-3480.05.patch, 
> YARN-3480.06.patch, YARN-3480.07.patch, YARN-3480.08.patch, 
> YARN-3480.09.patch, YARN-3480.10.patch, YARN-3480.11.patch
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries (attempts), so it 
> is better to set 'yarn.resourcemanager.am.max-attempts' larger. However, 
> this makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and makes 
> the RM recovery process much slower. It might be better to limit the number 
> of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
> a small value, the number of retried attempts might be very large. So we need to delete 
> some attempts stored in RMStateStore and RMStateStore.

This message was sent by Atlassian JIRA
