[
https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648624#comment-13648624
]
Chris Riccomini commented on YARN-614:
--------------------------------------
Looking into #1 a bit more.
The AM's finished container is added to justFinishedContainers in
RMAppAttemptImpl.AMFinishingContainerFinishedTransition:
{code}
appAttempt.justFinishedContainers.add(containerStatus);
{code}
That transition is registered in RMAppAttemptImpl for CONTAINER_FINISHED
events received in the FINISHING state:
{code}
.addTransition(RMAppAttemptState.FINISHING,
    EnumSet.of(RMAppAttemptState.FINISHING, RMAppAttemptState.FINISHED),
    RMAppAttemptEventType.CONTAINER_FINISHED,
    new AMFinishingContainerFinishedTransition())
{code}
The RMAppAttemptEventType.CONTAINER_FINISHED event is carried by
RMAppAttemptContainerFinishedEvent:
{code}
public RMAppAttemptContainerFinishedEvent(ApplicationAttemptId appAttemptId,
    ContainerStatus containerStatus) {
  super(appAttemptId, RMAppAttemptEventType.CONTAINER_FINISHED);
  this.containerStatus = containerStatus;
}
{code}
This event is created by two transitions in RMContainerImpl:
ContainerFinishedAtAcquiredState and KillTransition. During failure scenarios,
only KillTransition is triggered. KillTransition is in turn triggered by these
events:
{code}
RMContainerEventType.RELEASED
RMContainerEventType.EXPIRE
RMContainerEventType.KILL
{code}
From RMContainerEventType:
{code}
// Source: SchedulerApp
START,
ACQUIRED,
KILL, // Also from Node on NodeRemoval
RESERVED,
LAUNCHED,
FINISHED,
// Source: ApplicationMasterService->Scheduler
RELEASED,
// Source: ContainerAllocationExpirer
EXPIRE
{code}
When a node is lost, the scheduler sends the KILL event (see removeNode in
FairScheduler, FifoScheduler, and CapacityScheduler).
So it looks like KILL is sent on node removal, which happens when a node
fails. I believe this means that the AM's container will be added to
justFinishedContainers when a node is lost.
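To make the chain above easier to follow, here is a minimal, self-contained
sketch of it. It is not the real YARN code: NodeLossSketch, AppAttempt,
Container, ContainerEvent, and onContainerFinished are hypothetical stand-ins
for RMAppAttemptImpl, RMContainerImpl, RMContainerEventType, and the
CONTAINER_FINISHED handling, and the state machines are collapsed into plain
method calls. It only illustrates the assumed flow: removeNode sends KILL, the
kill path reports the container as finished, and the attempt records it in
justFinishedContainers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified model of the event chain; none of these types are real YARN classes.
public class NodeLossSketch {

    // Hypothetical subset of RMContainerEventType: the events that reach KillTransition.
    enum ContainerEvent { RELEASED, EXPIRE, KILL }

    // Stand-in for RMAppAttemptImpl and its justFinishedContainers list.
    static final class AppAttempt {
        final List<String> justFinishedContainers = new ArrayList<>();

        // Stand-in for handling RMAppAttemptEventType.CONTAINER_FINISHED.
        void onContainerFinished(String containerId) {
            justFinishedContainers.add(containerId);
        }
    }

    // Stand-in for RMContainerImpl.
    static final class Container {
        final String id;
        final AppAttempt attempt;

        Container(String id, AppAttempt attempt) {
            this.id = id;
            this.attempt = attempt;
        }

        // Stand-in for KillTransition: all three events report the container as finished.
        void handle(ContainerEvent event) {
            switch (event) {
                case RELEASED:
                case EXPIRE:
                case KILL:
                    attempt.onContainerFinished(id);
                    break;
                default:
                    break;
            }
        }
    }

    // Stand-in for the scheduler's removeNode: kill every container on the lost node.
    static void removeNode(List<Container> containersOnNode) {
        for (Container container : containersOnNode) {
            container.handle(ContainerEvent.KILL);
        }
    }

    public static void main(String[] args) {
        AppAttempt attempt = new AppAttempt();
        Container amContainer = new Container("am-container", attempt);
        removeNode(Collections.singletonList(amContainer));
        System.out.println(attempt.justFinishedContainers);
    }
}
```

In this model the AM's container ends up in justFinishedContainers after the
node is removed, which matches the behavior I believe the real code has.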
> Retry attempts automatically for hardware failures or YARN issues and set
> default app retries to 1
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-614
> URL: https://issues.apache.org/jira/browse/YARN-614
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Bikas Saha
> Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be
> retried unnecessarily. The only reason YARN should retry an attempt is when
> the hardware fails or YARN has an error. NM failing, lost NM and NM disk
> errors are the hardware errors that come to mind.