Hadoop QA commented on YARN-2359:

{color:green}+1 overall{color}.  Here are the results of testing the latest 
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in 

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4448//console

This message is automatically generated.

> Application is hung without timeout and retry after DNS/network is down. 
> -------------------------------------------------------------------------
>                 Key: YARN-2359
>                 URL: https://issues.apache.org/jira/browse/YARN-2359
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2359.000.patch, YARN-2359.001.patch
> Application is hung without timeout and retry after DNS/network is down. 
> This happens when the DNS/network goes down on the node hosting the AM 
> container right after the container is allocated for the AM.
> The application attempt is in state RMAppAttemptState.SCHEDULED when it 
> receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an 
> IllegalArgumentException (due to the DNS error) is thrown, the attempt stays 
> in state RMAppAttemptState.SCHEDULED. In the state machine, only two events 
> are processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
> which is generated when the node and container time out. So even after the 
> node is removed, the application remains hung in state 
> RMAppAttemptState.SCHEDULED.
> The only way to make the application leave this state is to send an 
> RMAppAttemptEventType.KILL event, which is generated only when you 
> manually kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle 
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
> RMAppAttemptState.SCHEDULED.
> Add the following code in StateMachineFactory:
> {code}.addTransition(RMAppAttemptState.SCHEDULED, 
>           RMAppAttemptState.FINAL_SAVING,
>           RMAppAttemptEventType.CONTAINER_FINISHED,
>           new FinalSavingTransition(
>             new AMContainerCrashedBeforeRunningTransition(), 
>             RMAppAttemptState.FAILED)){code}
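
For illustration only, the hang and the proposed fix can be reproduced with a toy transition table. This is a minimal sketch, not the YARN code: the class ScheduledStateSketch, its Machine type, and the simplified target states ALLOCATED and KILLED are all hypothetical, and an unhandled event is assumed to simply leave the current state unchanged.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model (hypothetical, not YARN's StateMachineFactory) of a transition table.
public class ScheduledStateSketch {
    enum State { SCHEDULED, ALLOCATED, FINAL_SAVING, KILLED }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    static final class Machine {
        private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);
        private State current = State.SCHEDULED;

        // Register "from --on--> to" and return this so calls can be chained.
        Machine addTransition(State from, State to, Event on) {
            table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
                 .put(on, to);
            return this;
        }

        // Assumption: an event with no table entry leaves the state unchanged.
        State handle(Event event) {
            Map<Event, State> row = table.get(current);
            State next = (row == null) ? null : row.get(event);
            if (next != null) {
                current = next;
            }
            return current;
        }
    }

    // Table as described in the report: SCHEDULED handles only two events.
    // The target states here are simplified placeholders.
    static Machine original() {
        return new Machine()
            .addTransition(State.SCHEDULED, State.ALLOCATED, Event.CONTAINER_ALLOCATED)
            .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
    }

    // Same table plus the entry proposed in the description.
    static Machine patched() {
        return original()
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
    }

    public static void main(String[] args) {
        // Without the new entry the attempt never leaves SCHEDULED.
        System.out.println(original().handle(Event.CONTAINER_FINISHED)); // SCHEDULED
        // With the new entry the attempt moves on to FINAL_SAVING.
        System.out.println(patched().handle(Event.CONTAINER_FINISHED));  // FINAL_SAVING
    }
}
```

In the sketch, the only difference between original() and patched() is the single SCHEDULED/CONTAINER_FINISHED entry, which mirrors the one-line addTransition change proposed above.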

This message was sent by Atlassian JIRA
