Jian He commented on YARN-3480:

[~hex108], generally, it's better to avoid a global config for an outlier app. 
1. How often do you see an app fail with a large number of attempts? If it's 
limited to a few apps, I wouldn't worry so much.
bq.  make RM recover process much slower.
2. How much slower is it in reality in your case? We've done some benchmarking: 
recovering 10k apps (with 1 attempt) on ZK is pretty fast, within 20 seconds or so.
3. Limiting the attempts to be recorded means we are losing history. It's a 
trade-off.

My main point is that if you can provide some real numbers showing how slow the 
recovery process is in a real scenario, we can figure out where the bottleneck is 
and how to improve it.
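
To illustrate why measured numbers matter here, the following is a rough back-of-the-envelope sketch, not a measurement. It extrapolates from the benchmark figure quoted above (10k apps with 1 attempt recovered from ZK in about 20 seconds) under the assumption that recovery cost grows linearly with the number of stored records, counting one record per app plus one per attempt. The class name, method name, and the linear model are all hypothetical; real recovery time depends on the state store, record sizes, and where the bottleneck actually is.

```java
public class RecoveryEstimate {
    // Benchmark figure taken from the comment: ~10k apps, 1 attempt each, ~20 s on ZK.
    static final int BENCH_APPS = 10_000;
    static final int BENCH_ATTEMPTS_PER_APP = 1;
    static final double BENCH_SECONDS = 20.0;

    /**
     * Estimates recovery time in seconds.
     * ASSUMPTION (not from the source): cost is linear in the number of stored
     * records, where each app contributes one app record plus one record per attempt.
     */
    static double estimateSeconds(int apps, int attemptsPerApp) {
        double benchRecords = (double) BENCH_APPS * (1 + BENCH_ATTEMPTS_PER_APP);
        double secondsPerRecord = BENCH_SECONDS / benchRecords;
        double records = (double) apps * (1 + attemptsPerApp);
        return secondsPerRecord * records;
    }

    public static void main(String[] args) {
        // Reproduces the benchmark point, then extrapolates to more attempts per app.
        System.out.printf("10k apps x 1 attempt:   ~%.0f s%n", estimateSeconds(10_000, 1));
        System.out.printf("10k apps x 10 attempts: ~%.0f s%n", estimateSeconds(10_000, 10));
    }
}
```

Under this simplified model, only a large number of apps each carrying many attempts would change recovery time noticeably, which is why numbers from a real cluster are needed to confirm or refute it.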

> Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
> ----------------------------------------------------------------------------
>                 Key: YARN-3480
>                 URL: https://issues.apache.org/jira/browse/YARN-3480
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
> YARN-3480.03.patch
> When RM HA is enabled and running containers are kept across attempts, apps 
> are more likely to finish successfully with more retries(attempts), so it 
> will be better to set 'yarn.resourcemanager.am.max-attempts' larger. However 
> it will make RMStateStore(FileSystem/HDFS/ZK) store more attempts, and make 
> RM recover process much slower. It might be better to set max attempts to be 
> stored in RMStateStore.
> BTW: When 'attemptFailuresValidityInterval'(introduced in YARN-611) is set to 
> a small value, retried attempts might be very large. So we need to delete 
> some attempts stored in RMAppImpl and RMStateStore.

