[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055373#comment-14055373
 ] 

Vinod Kumar Vavilapalli commented on YARN-611:
----------------------------------------------

Scanned through the patch; some quick comments:

 - We need to express our existing max-retries behavior as one of the 
policies
 - I think we should have the following API structure
{code}
message ApplicationSubmissionContext {
  ...
  ...
  // [default = max_retries_policy]
  optional ApplicationRetryPolicy app_retry_policy;
}

enum ApplicationRetryPolicyType {
  MAX_RETRIES_POLICY,
  WINDOWED_RETRIES_POLICY
}

message ApplicationRetryPolicy {
  // The following is really a required field
  optional ApplicationRetryPolicyType appRetryPolicyType [default = MAX_RETRIES_POLICY];

  // Only one of the following is accepted, based on the type
  // Each is a context object specific to the policy type
  optional MaxApplicationRetriesPolicyContext;
  optional WindowedApplicationRetriesPolicyContext;
  ..
}

message MaxApplicationRetriesPolicyContext {
  optional int32 maxAppAttempts [default = 2];
}

message WindowedApplicationRetriesPolicyContext {
  optional int64 reset_time_window [default = 86400];
}
{code} 
 - Please don't add any setters/mutators to read-only interfaces like 
RMApp and RMAppAttempt. Push them down into the implementation classes instead.
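
To make the "only one context per policy type" rule concrete, here is a minimal Java sketch. It is not the patch and not an existing YARN API: the class RetryPolicySketch, the plain-Java stand-in for the proposed ApplicationRetryPolicy message, and the methods isValid() and effectiveMaxAppAttempts() are all hypothetical names. It also assumes that a missing context simply falls back to the proposed defaults, and that supplying a context for the other policy type is an error.

```java
// Hypothetical sketch of the proposed policy structure; none of these
// names exist in YARN.
final class RetryPolicySketch {
    enum PolicyType { MAX_RETRIES_POLICY, WINDOWED_RETRIES_POLICY }

    // Defaults taken from the proposed message definitions.
    static final int DEFAULT_MAX_APP_ATTEMPTS = 2;
    static final long DEFAULT_RESET_TIME_WINDOW = 86400L;

    /** Plain-Java stand-in for the proposed ApplicationRetryPolicy message. */
    static final class RetryPolicy {
        final PolicyType type;
        final Integer maxAppAttempts; // null: max-retries context not supplied
        final Long resetTimeWindow;   // null: windowed context not supplied

        RetryPolicy(PolicyType type, Integer maxAppAttempts, Long resetTimeWindow) {
            // An unset type falls back to the existing max-retries behavior.
            this.type = (type == null) ? PolicyType.MAX_RETRIES_POLICY : type;
            this.maxAppAttempts = maxAppAttempts;
            this.resetTimeWindow = resetTimeWindow;
        }

        /** Assumption: a context belonging to the other policy type is rejected. */
        boolean isValid() {
            if (type == PolicyType.MAX_RETRIES_POLICY) {
                return resetTimeWindow == null;
            }
            return maxAppAttempts == null;
        }

        int effectiveMaxAppAttempts() {
            return (maxAppAttempts != null) ? maxAppAttempts : DEFAULT_MAX_APP_ATTEMPTS;
        }

        long effectiveResetTimeWindow() {
            return (resetTimeWindow != null) ? resetTimeWindow : DEFAULT_RESET_TIME_WINDOW;
        }
    }

    public static void main(String[] args) {
        // No policy given at all: behaves like today's max-retries setting.
        RetryPolicy legacy = new RetryPolicy(null, null, null);
        System.out.println(legacy.type + " valid=" + legacy.isValid());

        // Windowed type with a max-retries context attached: rejected.
        RetryPolicy mixed = new RetryPolicy(PolicyType.WINDOWED_RETRIES_POLICY, 3, 3600L);
        System.out.println(mixed.type + " valid=" + mixed.isValid());
    }
}
```

Note that the fields are final and there are no setters, in the same spirit as keeping mutators out of the read-only interfaces.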


> Add an AM retry count reset window to YARN RM
> ---------------------------------------------
>
>                 Key: YARN-611
>                 URL: https://issues.apache.org/jira/browse/YARN-611
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.3-alpha
>            Reporter: Chris Riccomini
>            Assignee: Xuan Gong
>         Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will time out, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long-running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, the number of 
> machine/NM failures will eventually push the AM failure count above the 
> configured value for yarn.resourcemanager.am.max-retries. Once this happens, 
> the RM will mark the job as failed and shut it down. This behavior is not 
> ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define the window of time after which an AM is 
> considered "well behaved" and it is safe to reset its failure count back to 
> zero. Every time an AM fails, the RMAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing misbehaving AMs after a short period of time.
> I think the work to be done here is to change the RMAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail; if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RMAppImpl can check the time of the 
> failure.
> Thoughts?
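
The check described in the issue (fail only when more than max-retries failures fall inside the last retry-count-window-ms) could look roughly like the following Java sketch. It is an illustration only, not RMAppImpl code: the class WindowedRetrySketch and the method recordFailureAndShouldRetry are made-up names, and it assumes failure timestamps arrive in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window failure counter; not part of YARN.
final class WindowedRetrySketch {
    private final int maxRetries;
    private final long windowMs;
    // Timestamps (ms) of failures still inside the window, oldest first.
    private final Deque<Long> failureTimesMs = new ArrayDeque<>();

    WindowedRetrySketch(int maxRetries, long windowMs) {
        this.maxRetries = maxRetries;
        this.windowMs = windowMs;
    }

    /**
     * Records an AM failure at nowMs and returns true if the application
     * should be restarted, false if it should be failed.
     */
    boolean recordFailureAndShouldRetry(long nowMs) {
        // Forget failures that are older than the window; this plays the
        // role of resetting the failure count for a well-behaved AM.
        while (!failureTimesMs.isEmpty()
                && nowMs - failureTimesMs.peekFirst() >= windowMs) {
            failureTimesMs.pollFirst();
        }
        failureTimesMs.addLast(nowMs);
        // Fail only when more than maxRetries failures fall inside the window.
        return failureTimesMs.size() <= maxRetries;
    }

    public static void main(String[] args) {
        WindowedRetrySketch app = new WindowedRetrySketch(2, 10_000L);
        System.out.println(app.recordFailureAndShouldRetry(0L));
        System.out.println(app.recordFailureAndShouldRetry(1_000L));
        System.out.println(app.recordFailureAndShouldRetry(2_000L));
    }
}
```

Keeping the timestamps rather than a single counter is what makes an endTime on the attempt (or on the failure event) necessary, as the description notes.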



--
This message was sent by Atlassian JIRA
(v6.2#6252)
