[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088047#comment-14088047
 ] 

zhihai xu commented on YARN-2359:
---------------------------------

[~jianhe] The code is in pullNewlyAllocatedContainersAndNMTokens of 
SchedulerApplicationAttempt.java
{code}
      try {
        // create container token and NMToken altogether.
        container.setContainerToken(rmContext.getContainerTokenSecretManager()
          .createContainerToken(container.getId(), container.getNodeId(),
            getUser(), container.getResource(), container.getPriority(),
            rmContainer.getCreationTime()));
        NMToken nmToken =
            rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
              getApplicationAttemptId(), container);
        if (nmToken != null) {
          nmTokens.add(nmToken);
        }
      } catch (IllegalArgumentException e) {
        // DNS might be down, skip returning this container.
        LOG.error("Error trying to assign container token and NM token to" +
            " an allocated container " + container.getId(), e);
        continue;
      }
{code}

When createContainerToken throws an IllegalArgumentException, the code skips 
the container, so zero containers are returned in amContainerAllocation.
The following code in AMContainerAllocatedTransition in RMAppAttemptImpl.java 
then keeps retrying CONTAINER_ALLOCATED, so RMAppAttemptImpl stays in state 
RMAppAttemptState.SCHEDULED for as long as the exception persists.

{code}
      if (amContainerAllocation.getContainers().size() == 0) {
        appAttempt.retryFetchingAMContainer(appAttempt);
        return RMAppAttemptState.SCHEDULED;
      }
{code}
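
The retry behaviour described above can be sketched in isolation. This is only a simplified model, not the actual YARN code: the class ScheduledRetrySketch, the createToken helper, the dnsUp flag and the ALLOCATED placeholder state are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the behaviour described above; not the real YARN classes.
class ScheduledRetrySketch {
    // ALLOCATED is a placeholder for "the attempt moved on"; only SCHEDULED comes from the source.
    enum State { SCHEDULED, ALLOCATED }

    // Hypothetical token creation: throws while DNS is down, as createContainerToken is reported to.
    static String createToken(String containerId, boolean dnsUp) {
        if (!dnsUp) {
            throw new IllegalArgumentException("cannot resolve host for " + containerId);
        }
        return "token-" + containerId;
    }

    // Mirrors the pull step: a container whose token cannot be created is skipped.
    static List<String> pullContainers(List<String> allocated, boolean dnsUp) {
        List<String> result = new ArrayList<>();
        for (String id : allocated) {
            try {
                createToken(id, dnsUp);
                result.add(id);
            } catch (IllegalArgumentException e) {
                // Same decision as the original code: skip returning this container.
                continue;
            }
        }
        return result;
    }

    // Mirrors the transition: an empty result means "stay in SCHEDULED and retry".
    static State onContainerAllocated(List<String> allocated, boolean dnsUp) {
        if (pullContainers(allocated, dnsUp).isEmpty()) {
            return State.SCHEDULED;
        }
        return State.ALLOCATED;
    }

    public static void main(String[] args) {
        List<String> allocated = Arrays.asList("container_1");
        State state = State.SCHEDULED;
        // Each retry fails the same way while DNS stays down, so the state never changes.
        for (int i = 0; i < 5 && state == State.SCHEDULED; i++) {
            state = onContainerAllocated(allocated, false);
        }
        System.out.println("DNS down, after 5 retries: " + state);
        System.out.println("DNS up: " + onContainerAllocated(allocated, true));
    }
}
```

In this model nothing bounds the number of retries, which matches the hang reported in the issue: the attempt can only leave SCHEDULED if token creation starts succeeding or some other event is handled in that state.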

> Application is hung without timeout and retry after DNS/network is down. 
> -------------------------------------------------------------------------
>
>                 Key: YARN-2359
>                 URL: https://issues.apache.org/jira/browse/YARN-2359
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2359.000.patch, YARN-2359.001.patch, 
> YARN-2359.002.patch
>
>
> The application hangs without timeout or retry after DNS/network goes down. 
> This happens when, right after the container is allocated for the AM, the 
> DNS/network goes down for the node that hosts the AM container.
> The application attempt is in state RMAppAttemptState.SCHEDULED when it 
> receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event; because an 
> IllegalArgumentException (due to the DNS error) is thrown, it stays in state 
> RMAppAttemptState.SCHEDULED. In the state machine, only two events are 
> processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
> which is generated when the node and container time out. So even after the 
> node is removed, the application still hangs in state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application exit this state is to send an 
> RMAppAttemptEventType.KILL event, which is only generated when you 
> manually kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle 
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
> RMAppAttemptState.SCHEDULED.
> Add the following code in StateMachineFactory:
> {code}
> .addTransition(RMAppAttemptState.SCHEDULED, 
>           RMAppAttemptState.FINAL_SAVING,
>           RMAppAttemptEventType.CONTAINER_FINISHED,
>           new FinalSavingTransition(
>             new AMContainerCrashedBeforeRunningTransition(), 
>             RMAppAttemptState.FAILED))
> {code}
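
The effect of the proposed table entry can be illustrated with a toy transition table. This is a sketch under stated assumptions, not the YARN StateMachineFactory: the class TransitionTableSketch and its methods are hypothetical, the KILLED target state is a simplification, and the sketch simply leaves the state unchanged for an event with no table entry, which is an assumption about how unhandled events behave.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy transition table; hypothetical, loosely modelled on the addTransition style quoted above.
class TransitionTableSketch {
    // Simplified state and event sets; KILLED is a placeholder target for the KILL event.
    enum State { SCHEDULED, KILLED, FINAL_SAVING }
    enum Event { CONTAINER_ALLOCATED, KILL, CONTAINER_FINISHED }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);

    // Registers "in state 'from', on event 'on', move to state 'to'".
    TransitionTableSketch addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    // Assumption of this sketch: an event without a table entry leaves the state unchanged.
    State handle(State current, Event event) {
        Map<Event, State> row = table.get(current);
        if (row == null || !row.containsKey(event)) {
            return current;
        }
        return row.get(event);
    }

    // Table as described in the issue: only two events are handled in SCHEDULED.
    static TransitionTableSketch beforeFix() {
        return new TransitionTableSketch()
            // While DNS is down, CONTAINER_ALLOCATED keeps the attempt in SCHEDULED.
            .addTransition(State.SCHEDULED, State.SCHEDULED, Event.CONTAINER_ALLOCATED)
            .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
    }

    // Same table plus the proposed entry for CONTAINER_FINISHED.
    static TransitionTableSketch afterFix() {
        return beforeFix()
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
    }

    public static void main(String[] args) {
        System.out.println("before fix: "
            + beforeFix().handle(State.SCHEDULED, Event.CONTAINER_FINISHED));
        System.out.println("after fix: "
            + afterFix().handle(State.SCHEDULED, Event.CONTAINER_FINISHED));
    }
}
```

With the extra entry, the CONTAINER_FINISHED event generated after the node times out gives the attempt a way out of SCHEDULED without a manual kill.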



--
This message was sent by Atlassian JIRA
(v6.2#6252)
