[
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075537#comment-14075537
]
Hadoop QA commented on YARN-2359:
---------------------------------
{color:green}+1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12658009/YARN-2359.001.patch
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with
eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results:
https://builds.apache.org/job/PreCommit-YARN-Build/4448//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4448//console
This message is automatically generated.
> Application is hung without timeout and retry after DNS/network is down.
> -------------------------------------------------------------------------
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-2359.000.patch, YARN-2359.001.patch
>
>
> Application is hung without timeout and retry after DNS/network is down.
> This happens when DNS/network goes down on the node hosting the AM container
> right after the container has been allocated for the AM.
> The application attempt is in state RMAppAttemptState.SCHEDULED when it receives
> the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an
> IllegalArgumentException (caused by the DNS error) is thrown, the attempt stays in
> RMAppAttemptState.SCHEDULED. In the state machine, only two events are
> processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event,
> which is generated when the node and container time out. So even after the node
> is removed, the application remains hung in RMAppAttemptState.SCHEDULED.
> The only way to get the application out of this state is to send an
> RMAppAttemptEventType.KILL event, which is generated only when you
> manually kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table that handles
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state
> RMAppAttemptState.SCHEDULED.
> Add the following code in StateMachineFactory:
> {code}
> .addTransition(RMAppAttemptState.SCHEDULED,
>     RMAppAttemptState.FINAL_SAVING,
>     RMAppAttemptEventType.CONTAINER_FINISHED,
>     new FinalSavingTransition(
>         new AMContainerCrashedBeforeRunningTransition(),
>         RMAppAttemptState.FAILED))
> {code}
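
To illustrate why the missing table entry leaves the attempt stuck, here is a minimal, self-contained sketch of a table-driven state machine. It is a toy model, not Hadoop's StateMachineFactory or RMAppAttemptImpl: the class AttemptStateMachineSketch, its methods, and the target states ALLOCATED and KILLED are hypothetical simplifications, and an unhandled event is modeled as simply being ignored. Only the SCHEDULED and FINAL_SAVING states and the three event names come from the description above.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: names mirror the issue description, but this is NOT
// Hadoop's StateMachineFactory. ALLOCATED and KILLED are placeholder targets.
class AttemptStateMachineSketch {
    enum State { SCHEDULED, ALLOCATED, FINAL_SAVING, KILLED }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table =
        new EnumMap<State, Map<Event, State>>(State.class);
    private State current = State.SCHEDULED;

    AttemptStateMachineSketch addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
        return this;
    }

    // Returns true if the event was handled. An event with no table entry
    // leaves the state unchanged (assumption: modeled here as "ignored").
    boolean handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            return false;
        }
        current = next;
        return true;
    }

    State state() {
        return current;
    }

    // Table as described in the issue: SCHEDULED handles only two events.
    static AttemptStateMachineSketch beforeFix() {
        return new AttemptStateMachineSketch()
            .addTransition(State.SCHEDULED, State.ALLOCATED, Event.CONTAINER_ALLOCATED)
            .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
    }

    // Same table plus the proposed CONTAINER_FINISHED entry.
    static AttemptStateMachineSketch afterFix() {
        return beforeFix()
            .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
    }

    public static void main(String[] args) {
        AttemptStateMachineSketch before = beforeFix();
        before.handle(Event.CONTAINER_FINISHED);
        System.out.println("before fix: " + before.state());

        AttemptStateMachineSketch after = afterFix();
        after.handle(Event.CONTAINER_FINISHED);
        System.out.println("after fix: " + after.state());
    }
}
```

In this sketch, the "before" machine stays in SCHEDULED when CONTAINER_FINISHED arrives, while the "after" machine moves to FINAL_SAVING. The real patch additionally wraps the transition in FinalSavingTransition with a target of RMAppAttemptState.FAILED, which this toy does not model.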
--
This message was sent by Atlassian JIRA
(v6.2#6252)