[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088479#comment-14088479
 ] 

Karthik Kambatla commented on YARN-2359:


Checking this in..

> Application is hung without timeout and retry after DNS/network is down. 
> -
>
> Key: YARN-2359
> URL: https://issues.apache.org/jira/browse/YARN-2359
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-2359.000.patch, YARN-2359.001.patch, 
> YARN-2359.002.patch
>
>
> The application hangs with no timeout and no retry after DNS/network goes down.
> This happens when DNS/network goes down on the node hosting the AM container
> right after that container is allocated for the AM.
> The application attempt is in state RMAppAttemptState.SCHEDULED when it
> receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an
> IllegalArgumentException (due to the DNS error) is thrown, the attempt stays
> in state RMAppAttemptState.SCHEDULED. In the state machine, only two events
> are processed in this state:
> RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
> The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event,
> which is generated when the node and container time out. So even after the
> node is removed, the application remains hung in state
> RMAppAttemptState.SCHEDULED.
> The only way to make the application leave this state is to send an
> RMAppAttemptEventType.KILL event, which is generated only when you manually
> kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table to handle
> the RMAppAttemptEventType.CONTAINER_FINISHED event in state
> RMAppAttemptState.SCHEDULED.
> Add the following code to the StateMachineFactory:
> {code}
> .addTransition(RMAppAttemptState.SCHEDULED,
>     RMAppAttemptState.FINAL_SAVING,
>     RMAppAttemptEventType.CONTAINER_FINISHED,
>     new FinalSavingTransition(
>         new AMContainerCrashedBeforeRunningTransition(),
>         RMAppAttemptState.FAILED))
> {code}
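
The effect of the missing table entry can be illustrated with a minimal sketch. It is a simplified, hypothetical model, not the real RMAppAttemptImpl or YARN StateMachineFactory: the class AttemptStateMachineSketch, its reduced State and Event enums, and the build helper are assumptions made for illustration. The sketch also assumes that an event with no table entry leaves the state unchanged.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical model of a transition table; this is not the
// real RMAppAttemptImpl or YARN StateMachineFactory.
class AttemptStateMachineSketch {
    enum State { SCHEDULED, FINAL_SAVING, KILLED }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    // Registers "in state `from`, on `event`, move to `to`".
    AttemptStateMachineSketch addTransition(State from, State to, Event event) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(event, to);
        return this;
    }

    // Applies an event; an event with no table entry leaves the state unchanged.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        if (row != null && row.containsKey(event)) {
            current = row.get(event);
        }
        return current;
    }

    // Builds the table described in the report, optionally with the proposed entry.
    static AttemptStateMachineSketch build(boolean withFix) {
        AttemptStateMachineSketch sm = new AttemptStateMachineSketch()
            // Models the DNS-down case: allocation yields no container, so stay put.
            .addTransition(State.SCHEDULED, State.SCHEDULED, Event.CONTAINER_ALLOCATED)
            // Simplified: a manual kill ends the attempt directly.
            .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
        if (withFix) {
            sm.addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
        }
        return sm;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + build(false).handle(Event.CONTAINER_FINISHED));
        System.out.println("with fix:    " + build(true).handle(Event.CONTAINER_FINISHED));
    }
}
```

In this model, CONTAINER_FINISHED leaves the attempt in SCHEDULED unless the extra entry is registered, in which case the attempt moves to FINAL_SAVING.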



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088161#comment-14088161
 ] 

Jian He commented on YARN-2359:
---

I see, thanks for your explanation. Looks good to me too.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088047#comment-14088047
 ] 

zhihai xu commented on YARN-2359:
-

[~jianhe] The code is in pullNewlyAllocatedContainersAndNMTokens of 
SchedulerApplicationAttempt.java
{code}
  try {
    // create container token and NMToken altogether.
    container.setContainerToken(rmContext.getContainerTokenSecretManager()
        .createContainerToken(container.getId(), container.getNodeId(),
            getUser(), container.getResource(), container.getPriority(),
            rmContainer.getCreationTime()));
    NMToken nmToken =
        rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
            getApplicationAttemptId(), container);
    if (nmToken != null) {
      nmTokens.add(nmToken);
    }
  } catch (IllegalArgumentException e) {
    // DNS might be down, skip returning this container.
    LOG.error("Error trying to assign container token and NM token to" +
        " an allocated container " + container.getId(), e);
    continue;
  }
{code}

When createContainerToken throws an IllegalArgumentException, the code skips
the container, so amContainerAllocation contains zero containers.
The following code in AMContainerAllocatedTransition in RMAppAttemptImpl.java
then keeps retrying CONTAINER_ALLOCATED in the SCHEDULED state, so
RMAppAttemptImpl stays in state RMAppAttemptState.SCHEDULED.

{code}
  if (amContainerAllocation.getContainers().size() == 0) {
    appAttempt.retryFetchingAMContainer(appAttempt);
    return RMAppAttemptState.SCHEDULED;
  }
{code}
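
The interaction between the two snippets above can be sketched in isolation. This is a minimal, hypothetical model rather than the Hadoop code: TokenSkipSketch, pullNewlyAllocated, nextState, the string-based containers, and the "ALLOCATED" placeholder state are all assumptions made for illustration. It only shows how skipping a container on IllegalArgumentException produces an empty allocation, which in turn keeps the attempt in SCHEDULED.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the skip-on-exception loop and the empty-allocation
// check; containers and tokens are plain strings here, not YARN types.
class TokenSkipSketch {
    // Returns only the containers for which a token could be created.
    static List<String> pullNewlyAllocated(List<String> allocated,
                                           Function<String, String> tokenFactory) {
        List<String> returned = new ArrayList<>();
        for (String container : allocated) {
            try {
                returned.add(container + ":" + tokenFactory.apply(container));
            } catch (IllegalArgumentException e) {
                // Mirrors "DNS might be down, skip returning this container".
                continue;
            }
        }
        return returned;
    }

    // An empty allocation keeps the attempt in SCHEDULED; "ALLOCATED" is a
    // placeholder for whatever state follows a successful allocation.
    static String nextState(List<String> containers) {
        return containers.isEmpty() ? "SCHEDULED" : "ALLOCATED";
    }

    public static void main(String[] args) {
        List<String> allocated = Arrays.asList("container_1");
        // Simulated DNS failure: token creation always throws.
        Function<String, String> dnsDown = id -> {
            throw new IllegalArgumentException("cannot resolve host for " + id);
        };
        Function<String, String> dnsUp = id -> "token";
        System.out.println("DNS down: " + nextState(pullNewlyAllocated(allocated, dnsDown)));
        System.out.println("DNS up:   " + nextState(pullNewlyAllocated(allocated, dnsUp)));
    }
}
```

As long as the simulated token factory keeps throwing, every retry returns an empty list and the modeled attempt never leaves SCHEDULED.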



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088002#comment-14088002
 ] 

Jian He commented on YARN-2359:
---

[~zxu], thanks for working on it. I have a question:
bq. The application attempt is at state RMAppAttemptState.SCHEDULED, it receive
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the
IllegalArgumentException(due to DNS error) happened, it stay at state
RMAppAttemptState.SCHEDULED.
Where in the code is the IllegalArgumentException thrown?



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087848#comment-14087848
 ] 

Tsuyoshi OZAWA commented on YARN-2359:
--

+1 (non-binding), it looks good to me. I also ran the tests and confirmed that
it works.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087760#comment-14087760
 ] 

Karthik Kambatla commented on YARN-2359:


+1. Will commit this later today if no one objects. 



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087117#comment-14087117
 ] 

Hadoop QA commented on YARN-2359:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/1266/YARN-2359.002.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4526//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4526//console

This message is automatically generated.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086985#comment-14086985
 ] 

zhihai xu commented on YARN-2359:
-

Uploaded a new patch that adds a comment to the unit test.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075537#comment-14075537
 ] 

Hadoop QA commented on YARN-2359:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658009/YARN-2359.001.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4448//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4448//console

This message is automatically generated.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075531#comment-14075531
 ] 

zhihai xu commented on YARN-2359:
-

I just added a unit test case (testAMCrashAtScheduled) to the patch to verify
this state transition in the RMAppAttempt state machine.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074917#comment-14074917
 ] 

zhihai xu commented on YARN-2359:
-

The test TestAMRestart passes in my local build.

---
 T E S T S
---
Running 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 90.076 sec - in 
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Results :

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074873#comment-14074873
 ] 

Hadoop QA commented on YARN-2359:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657887/YARN-2359.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4436//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4436//console

This message is automatically generated.
