zhihai xu created YARN-2359:
Summary: Application is hung without timeout and retry after
DNS/network is down.
Project: Hadoop YARN
Issue Type: Bug
Reporter: zhihai xu
The application hangs, with no timeout and no retry, after DNS/network goes
down. This happens when DNS/network goes down on the node hosting the AM
container right after that container has been allocated for the AM.
The application attempt is in state RMAppAttemptState.SCHEDULED when it receives
the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an
IllegalArgumentException (due to the DNS error) is thrown, it stays in state
RMAppAttemptState.SCHEDULED. In the state machine, only two events are
processed in this state:
RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event,
which is generated by the node and container timeout. So even after the node is
removed, the application remains hung in this state.
The only way to get the application out of this state is to send the
RMAppAttemptEventType.KILL event, which is generated only when you manually
kill the application from the Job Client via forceKillApplication.
To fix the issue, we should add an entry to the state machine table that handles
the RMAppAttemptEventType.CONTAINER_FINISHED event in this state;
add the following code in StateMachineFactory:
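
As an illustration only (this is not the actual patch and does not use Hadoop's
StateMachineFactory API), the hang and the effect of the extra table entry can
be sketched with a toy transition table. The class ToyAttemptStateMachine, its
reduced State and Event enums, the dnsDown flag, and the choice of FAILED as
the target state are all assumptions made for this sketch; catching the
exception inside handle() is a simplification that stands in for "the state is
not updated when the transition throws".

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: NOT Hadoop's StateMachineFactory or RMAppAttemptImpl.
class ToyAttemptStateMachine {
    // Reduced, hypothetical subsets of the real state and event types.
    enum State { SCHEDULED, ALLOCATED, FAILED, KILLED }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private final boolean dnsDown;
    private State current = State.SCHEDULED;

    ToyAttemptStateMachine(boolean dnsDown, boolean handleContainerFinished) {
        this.dnsDown = dnsDown;
        // The two events described as handled in SCHEDULED.
        addTransition(State.SCHEDULED, State.ALLOCATED, Event.CONTAINER_ALLOCATED);
        addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
        if (handleContainerFinished) {
            // The proposed extra entry; FAILED as target is an assumption.
            addTransition(State.SCHEDULED, State.FAILED, Event.CONTAINER_FINISHED);
        }
    }

    private void addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, k -> new EnumMap<>(Event.class)).put(on, to);
    }

    State getState() {
        return current;
    }

    // Stand-in for the work done during a transition; fails when DNS is down.
    private void runHook(Event event) {
        if (event == Event.CONTAINER_ALLOCATED && dnsDown) {
            throw new IllegalArgumentException("cannot resolve AM host (simulated)");
        }
    }

    // Returns true if the event caused a transition; otherwise the state is unchanged.
    boolean handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            return false; // no table entry for this event in the current state
        }
        try {
            runHook(event);
        } catch (IllegalArgumentException e) {
            return false; // transition failed, so the state is not updated
        }
        current = next;
        return true;
    }

    public static void main(String[] args) {
        // Without the entry: the attempt stays in SCHEDULED.
        ToyAttemptStateMachine stuck = new ToyAttemptStateMachine(true, false);
        stuck.handle(Event.CONTAINER_ALLOCATED);
        stuck.handle(Event.CONTAINER_FINISHED);
        System.out.println("without entry: " + stuck.getState());

        // With the entry: CONTAINER_FINISHED moves the attempt out of SCHEDULED.
        ToyAttemptStateMachine fixed = new ToyAttemptStateMachine(true, true);
        fixed.handle(Event.CONTAINER_ALLOCATED);
        fixed.handle(Event.CONTAINER_FINISHED);
        System.out.println("with entry: " + fixed.getState());
    }
}
```

In this toy model, the first machine can leave SCHEDULED only through KILL,
which mirrors the hang described above, while the second one leaves it as soon
as CONTAINER_FINISHED arrives.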