[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119191#comment-14119191 ]

Vinod Kumar Vavilapalli commented on YARN-611:
----------------------------------------------

bq. 1. API Change: I'm not sure whether it is really necessary to have 
completely standalone proto messages for ApplicationRetryPolicy's 
implementations. It sounds like overkill to me. In fact, 
MaxApplicationRetriesPolicy seems to be a special case of 
WindowedApplicationRetriesPolicy, where the window size is infinitely 
large, such that the number of failures is never reset. Therefore, why not 
simply add one more field (i.e., resetTimeWindow) to 
ApplicationSubmissionContext? When resetTimeWindow = 0 or -1, the window 
size is unbounded, and the failure count is never reset. On the other 
hand, when resetTimeWindow is set to > 0, the failure count will not take 
failures that happen outside the window into account.
On second read, this doesn't look like a bad idea. I am okay with adding a 
new resetTimeWindow field and being done with it.
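
To make that concrete, the whole API change could be as small as the 
following client-side snippet, assuming a hypothetical setResetTimeWindow 
accessor on ApplicationSubmissionContext (the field name and setter are 
illustrative only, not a committed API):

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.util.Records;

ApplicationSubmissionContext ctx =
    Records.newRecord(ApplicationSubmissionContext.class);
ctx.setMaxAppAttempts(2);
// Hypothetical single-field API: <= 0 keeps today's never-reset behavior,
// while > 0 counts only the failures inside the trailing window.
ctx.setResetTimeWindow(10 * 60 * 1000L); // e.g. a 10-minute window, in ms
{code}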

bq. 4. Affecting RMStateStore: I'm not sure why it is necessary to persist 
the "end time" into RMStateStore, which does not seem to be really used for 
resetting the window.
Again on second read, this isn't terrible in combination with 
resetTimeWindow. Once we have the end time of each app attempt, the RM can 
figure out how many retries happened for this app in the last 
resetTimeWindow.
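
Concretely, once the per-attempt end times are available, the check reduces 
to a scan like the following (a sketch only; the method and parameter names 
are illustrative, not actual RM code):

{code:java}
import java.util.List;

// Count the attempt failures whose end time falls inside the trailing
// resetTimeWindow; <= 0 means the window is unbounded and every failure
// counts, matching the proposed semantics above.
static int failuresInWindow(List<Long> attemptEndTimes,
                            long resetTimeWindowMs, long nowMs) {
  if (resetTimeWindowMs <= 0) {
    return attemptEndTimes.size();
  }
  int count = 0;
  for (long endTime : attemptEndTimes) {
    if (nowMs - endTime <= resetTimeWindowMs) {
      count++;
    }
  }
  return count;
}
{code}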

> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>
>                 Key: YARN-611
>                 URL: https://issues.apache.org/jira/browse/YARN-611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.3-alpha
>            Reporter: Chris Riccomini
>            Assignee: Xuan Gong
>         Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" 
> AM before failing the whole YARN job. YARN counts an AM as failed if the 
> node that it was running on dies (the NM will time out, which counts as a 
> failure for the AM), or if the AM itself dies.
> This configuration is insufficient for long-running (or infinitely 
> running) YARN jobs, since the machine (or NM) that the AM is running on 
> will eventually need to be restarted (or the machine/NM will fail). In 
> such an event, the AM has not done anything wrong, but this is counted as 
> a "failure" by the RM. Since the retry count for the AM is never reset, 
> the number of machine/NM failures will eventually push the AM failure 
> count above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM marks the 
> job as failed and shuts it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM must 
> stay "well behaved" before it is safe to reset its failure count back to 
> zero. Every time an AM fails, the RMAppImpl would check the last time 
> that the AM failed. If the last failure was less than 
> retry-count-window-ms ago, and the new failure count is > max-retries, 
> then the job should fail. If the AM has never failed, the retry count is 
> < max-retries, or the last failure was outside the retry-count-window-ms, 
> then the job should be restarted. Additionally, if the last failure was 
> outside the retry-count-window-ms, then the failure count should be set 
> back to 0.
> This would give developers a way to have well-behaved AMs run forever, 
> while still failing misbehaving AMs after a short period of time.
> I think the work to be done here is to change RMAppImpl to actually look 
> at app.attempts and see whether there have been more than max-retries 
> failures in the last retry-count-window-ms milliseconds. If there have, 
> the job should fail; if not, the job should go forward. Additionally, we 
> might also need to add an endTime to either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that RMAppImpl can check the time of the 
> failure (see the sketch after this description).
> Thoughts?
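
A minimal sketch of the check the description proposes, assuming the RM 
keeps a per-app failure count and last-failure timestamp (every name below 
is illustrative, not actual RM code):

{code:java}
// Sketch of the proposed RMAppImpl logic; all names are illustrative.
class AmRetryTracker {
  private final int maxRetries;    // yarn.resourcemanager.am.max-retries
  private final long windowMs;     // yarn.resourcemanager.am.retry-count-window-ms
  private int failureCount = 0;
  private long lastFailureMs = -1; // -1: the AM has never failed

  AmRetryTracker(int maxRetries, long windowMs) {
    this.maxRetries = maxRetries;
    this.windowMs = windowMs;
  }

  /** Returns true if the job should fail, false if the AM is restarted. */
  boolean onAttemptFailed(long nowMs) {
    if (lastFailureMs < 0 || nowMs - lastFailureMs >= windowMs) {
      failureCount = 0;            // last failure was outside the window
    }
    failureCount++;
    lastFailureMs = nowMs;
    return failureCount > maxRetries;
  }
}
{code}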



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
