[jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1

Chris Riccomini (JIRA) Fri, 03 May 2013 10:56:17 -0700

    [ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648624#comment-13648624 ]


Chris Riccomini commented on YARN-614:
--------------------------------------

Looking into #1 a bit more.

The AM's finished container is added in 
RMAppAttemptImpl.AMFinishingContainerFinishedTransition.

{code}
appAttempt.justFinishedContainers.add(containerStatus);
{code}

That transition is registered in RMAppAttemptImpl for the FINISHING state:

{code}
      .addTransition(RMAppAttemptState.FINISHING,
          EnumSet.of(RMAppAttemptState.FINISHING, RMAppAttemptState.FINISHED),
          RMAppAttemptEventType.CONTAINER_FINISHED,
          new AMFinishingContainerFinishedTransition())
{code}

The RMAppAttemptEventType.CONTAINER_FINISHED event type is set by the 
RMAppAttemptContainerFinishedEvent constructor:

{code}
  public RMAppAttemptContainerFinishedEvent(ApplicationAttemptId appAttemptId, 
      ContainerStatus containerStatus) {
    super(appAttemptId, RMAppAttemptEventType.CONTAINER_FINISHED);
    this.containerStatus = containerStatus;
  }
{code}

That event is sent by two transitions in RMContainerImpl: 
ContainerFinishedAtAcquiredState and KillTransition. During failure scenarios, 
only KillTransition is triggered. KillTransition runs on these events:

{code}
RMContainerEventType.RELEASED
RMContainerEventType.EXPIRE
RMContainerEventType.KILL
{code}

From RMContainerEventType:

{code}
  // Source: SchedulerApp
  START,
  ACQUIRED,
  KILL, // Also from Node on NodeRemoval
  RESERVED,

  LAUNCHED,
  FINISHED,

  // Source: ApplicationMasterService->Scheduler
  RELEASED,

  // Source: ContainerAllocationExpirer  
  EXPIRE
{code}

When a node is lost, the scheduler triggers the KILL signal (see removeNode in 
FairScheduler, FifoScheduler, and CapacityScheduler).

So it looks like KILL is triggered by NodeRemoval, which happens when a node 
fails. I believe this means that the AM's container will be added to 
justFinishedContainers when a node is lost.
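
To make the chain above easier to follow, here is a toy model of it. This is 
only a sketch of the reasoning in this comment, not the real ResourceManager 
code: ToyScheduler, ToyContainer, ToyAppAttempt and ToyContainerEvent are 
hypothetical stand-ins, and the real classes dispatch events asynchronously 
through state machines rather than direct method calls. It assumes, as argued 
above, that a KILL from node removal ends with the container being added to 
justFinishedContainers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical subset of container events; only the three that the comment
// says lead to KillTransition are handled below.
enum ToyContainerEvent { START, ACQUIRED, KILL, RELEASED, EXPIRE }

// Stand-in for the app attempt: records finished containers.
final class ToyAppAttempt {
    final List<String> justFinishedContainers = new ArrayList<>();

    // Models handling of a CONTAINER_FINISHED event.
    void containerFinished(String containerId) {
        justFinishedContainers.add(containerId);
    }
}

// Stand-in for a container: KILL, RELEASED and EXPIRE all finish it.
final class ToyContainer {
    private final String id;
    private final ToyAppAttempt attempt;
    private boolean finished = false;

    ToyContainer(String id, ToyAppAttempt attempt) {
        this.id = id;
        this.attempt = attempt;
    }

    void handle(ToyContainerEvent event) {
        switch (event) {
            case KILL:
            case RELEASED:
            case EXPIRE:
                if (!finished) {
                    finished = true;
                    // Models sending the container-finished event to the attempt.
                    attempt.containerFinished(id);
                }
                break;
            default:
                break; // Other events are ignored in this sketch.
        }
    }
}

// Stand-in for a scheduler: removing a node kills every container on it.
final class ToyScheduler {
    private final Map<String, List<ToyContainer>> containersByNode = new HashMap<>();

    void place(String node, ToyContainer container) {
        containersByNode.computeIfAbsent(node, k -> new ArrayList<>()).add(container);
    }

    void removeNode(String node) {
        List<ToyContainer> lost = containersByNode.remove(node);
        if (lost == null) {
            return; // Unknown or already removed node.
        }
        for (ToyContainer container : lost) {
            container.handle(ToyContainerEvent.KILL);
        }
    }
}

final class NodeLossSketch {
    public static void main(String[] args) {
        ToyAppAttempt attempt = new ToyAppAttempt();
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.place("node-1", new ToyContainer("am-container", attempt));
        scheduler.removeNode("node-1");
        System.out.println(attempt.justFinishedContainers);
    }
}
```

In this model, losing node-1 is enough for the AM's container to show up in 
justFinishedContainers, which is the behavior I believe the real code has.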

                
> Retry attempts automatically for hardware failures or YARN issues and set 
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be 
> retried unnecessarily. The only reason YARN should retry an attempt is when 
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk 
> errors are the hardware errors that come to mind.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
