Jun Gong commented on YARN-3480:

[~jianhe], sorry for not describing our scenario. RM HA is enabled and ZK is 
used to store app info. Most apps running in the cluster are long-running 
(service) apps, and yarn.resourcemanager.am.max-attempts is set to 10000 
because we have not patched YARN-611 and we want apps to be able to retry more 
times. There are 10K apps, with 1~10000 attempts each, stored in ZK. It takes 
about 6 minutes to recover those apps on RM failover.

> 1. How often do you see an app fail with a large number of attempts? If it's 
> limited to a few apps, I wouldn't worry so much.
> 2. How much slower is it in reality in your case? We've done some benchmarks; 
> recovering 10k apps (with 1 attempt each) on ZK is pretty fast, within 20 
> seconds or so.

Please see above. I think it will be OK for map-reduce jobs, but it might not 
be OK for service apps that have been running for several months.

> 3. Limiting the attempts to be recorded means we are losing history. It's a 
> trade-off.

Yes, I agree.
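
To make the trade-off concrete, here is a minimal sketch of keeping only the 
most recent N attempts of an app. It is an illustration only, not the attached 
patch and not existing YARN code: the class BoundedAttemptHistory, its methods, 
and the maxStoredAttempts limit are all hypothetical names, and plain integer 
ids stand in for real attempt records.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration; not a YARN class and not the YARN-3480 patch. */
final class BoundedAttemptHistory {
    // Assumed configurable limit on how many attempts are kept per app.
    private final int maxStoredAttempts;
    // Attempt ids in the order they were recorded, oldest first.
    private final Deque<Integer> storedAttemptIds = new ArrayDeque<>();

    BoundedAttemptHistory(int maxStoredAttempts) {
        if (maxStoredAttempts < 1) {
            throw new IllegalArgumentException("maxStoredAttempts must be >= 1");
        }
        this.maxStoredAttempts = maxStoredAttempts;
    }

    /**
     * Records a new attempt and returns the ids of the oldest attempts that
     * fell out of the window. A real store would delete those entries.
     */
    List<Integer> recordAttempt(int attemptId) {
        storedAttemptIds.addLast(attemptId);
        List<Integer> evicted = new ArrayList<>();
        while (storedAttemptIds.size() > maxStoredAttempts) {
            evicted.add(storedAttemptIds.removeFirst());
        }
        return evicted;
    }

    /** Returns a copy of the attempt ids currently kept, oldest first. */
    List<Integer> storedAttempts() {
        return new ArrayList<>(storedAttemptIds);
    }
}
```

With a limit like this, the number of attempts to read back on recovery is 
bounded per app, at the cost of losing the history of the evicted attempts.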

> Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
> ----------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch
>
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries (attempts), so it 
> is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. 
> However, that makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and 
> makes the RM recovery process much slower. It might be better to limit the 
> number of attempts stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set 
> to a small value, the number of retried attempts might become very large, so 
> we need to delete some of the attempts stored in RMAppImpl and RMStateStore.

This message was sent by Atlassian JIRA