[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2359:
----------------------------

    Description: 
The application hangs with no timeout and no retry after DNS/network goes 
down. 
This happens when DNS/network goes down on the node hosting the AM container 
right after the container is allocated for the AM.
The application attempt is in state RMAppAttemptState.SCHEDULED when it 
receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an 
IllegalArgumentException (caused by the DNS error) is thrown, the attempt 
stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two 
events are processed in this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, 
which is generated by the node and container timeouts. So even after the node 
is removed, the application remains hung in state 
RMAppAttemptState.SCHEDULED.
The only way to make the application leave this state is to send an 
RMAppAttemptEventType.KILL event, which is generated only when you manually 
kill the application from the Job Client via forceKillApplication.

To fix the issue, we should add an entry to the state machine table to handle 
the RMAppAttemptEventType.CONTAINER_FINISHED event in state 
RMAppAttemptState.SCHEDULED.
Add the following code to the StateMachineFactory:
{code}.addTransition(RMAppAttemptState.SCHEDULED, 
          RMAppAttemptState.FINAL_SAVING,
          RMAppAttemptEventType.CONTAINER_FINISHED,
          new FinalSavingTransition(
            new AMContainerCrashedBeforeRunningTransition(), 
            RMAppAttemptState.FAILED)){code}
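
To illustrate why the missing table entry causes the hang, here is a minimal, 
self-contained sketch of a transition table. It is not the Hadoop 
StateMachineFactory API: the class AttemptStateMachineSketch, its two-state 
enum, the self-loop used to model the failed CONTAINER_ALLOCATED handling, and 
the silent handling of unregistered events are all simplifying assumptions 
made for illustration only.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a transition table.
// This is NOT the Hadoop StateMachineFactory API.
final class AttemptStateMachineSketch {
    enum State { SCHEDULED, FINAL_SAVING }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    // table.get(from).get(event) gives the target state, if one is registered.
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    AttemptStateMachineSketch addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    // Assumption: an event with no registered transition leaves the state unchanged.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next != null) {
            current = next;
        }
        return current;
    }

    // Table before the fix: only CONTAINER_ALLOCATED and KILL are handled in SCHEDULED.
    // The self-loop models the failure path in which the attempt stays in SCHEDULED.
    static AttemptStateMachineSketch withoutFix() {
        return new AttemptStateMachineSketch()
            .addTransition(State.SCHEDULED, State.SCHEDULED, Event.CONTAINER_ALLOCATED)
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.KILL);
    }

    // Table after the fix: CONTAINER_FINISHED also moves the attempt out of SCHEDULED.
    static AttemptStateMachineSketch withFix() {
        return withoutFix()
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
    }

    public static void main(String[] args) {
        AttemptStateMachineSketch before = withoutFix();
        before.handle(Event.CONTAINER_ALLOCATED);
        System.out.println("without fix: " + before.handle(Event.CONTAINER_FINISHED));

        AttemptStateMachineSketch after = withFix();
        after.handle(Event.CONTAINER_ALLOCATED);
        System.out.println("with fix: " + after.handle(Event.CONTAINER_FINISHED));
    }
}
```

In this sketch the attempt without the extra entry stays in SCHEDULED no 
matter how many CONTAINER_FINISHED events arrive, while the table with the 
extra entry moves it to FINAL_SAVING on the first one.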

> Application is hung without timeout and retry after DNS/network is down. 
> -------------------------------------------------------------------------
>
>                 Key: YARN-2359
>                 URL: https://issues.apache.org/jira/browse/YARN-2359
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2359.000.patch
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
