[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087760#comment-14087760
 ] 

Karthik Kambatla commented on YARN-2359:


+1. Will commit this later today if no one objects. 

 Application is hung without timeout and retry after DNS/network is down.
 ------------------------------------------------------------------------

                 Key: YARN-2359
                 URL: https://issues.apache.org/jira/browse/YARN-2359
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: zhihai xu
            Assignee: zhihai xu
            Priority: Critical
         Attachments: YARN-2359.000.patch, YARN-2359.001.patch,
                      YARN-2359.002.patch


 The application hangs, with no timeout and no retry, after DNS/network goes down.
 This happens when the DNS/network goes down for the node hosting the AM
 container right after the container is allocated for the AM.
 The application attempt is in state RMAppAttemptState.SCHEDULED when it receives
 the RMAppAttemptEventType.CONTAINER_ALLOCATED event. Because an
 IllegalArgumentException (caused by the DNS error) is thrown, the attempt stays
 in RMAppAttemptState.SCHEDULED. In the state machine, only two events are
 processed in this state:
 RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL.
 The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event,
 which is generated when the node and container time out. So even after the node
 is removed, the application remains hung in RMAppAttemptState.SCHEDULED.
 The only way to make the application leave this state is to send an
 RMAppAttemptEventType.KILL event, which is generated only when you
 manually kill the application from the Job Client via forceKillApplication.
 To fix the issue, we should add an entry to the state machine table that
 handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state
 RMAppAttemptState.SCHEDULED. Add the following code in StateMachineFactory:
 {code}
 .addTransition(RMAppAttemptState.SCHEDULED,
     RMAppAttemptState.FINAL_SAVING,
     RMAppAttemptEventType.CONTAINER_FINISHED,
     new FinalSavingTransition(
         new AMContainerCrashedBeforeRunningTransition(),
         RMAppAttemptState.FAILED))
 {code}
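
 The effect of the missing table entry can be illustrated with a minimal,
 self-contained sketch. It assumes a toy transition table, not the real YARN
 StateMachineFactory: the class AttemptStateMachineSketch, its enums, and the
 beforeFix/afterFix helpers are hypothetical. Transition hooks such as
 FinalSavingTransition are omitted, KILL is simplified to go straight to a
 KILLED state, and an unregistered event is silently ignored.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy model of a transition table; NOT the real YARN StateMachineFactory. */
final class AttemptStateMachineSketch {
  enum State { SCHEDULED, FINAL_SAVING, KILLED }
  enum Event { CONTAINER_ALLOCATED, KILL, CONTAINER_FINISHED }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.SCHEDULED;

  /** Registers a transition: in state 'from', event 'on' moves to state 'to'. */
  AttemptStateMachineSketch addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    return this;
  }

  /** Applies the event if a transition is registered; otherwise the state is unchanged. */
  State handle(Event event) {
    Map<Event, State> row = table.get(current);
    if (row != null && row.containsKey(event)) {
      current = row.get(event);
    }
    return current;
  }

  /** Table with only the two events the report says are handled in SCHEDULED. */
  static AttemptStateMachineSketch beforeFix() {
    return new AttemptStateMachineSketch()
        // Simplification: models only the zero-container case, where the attempt stays SCHEDULED.
        .addTransition(State.SCHEDULED, State.SCHEDULED, Event.CONTAINER_ALLOCATED)
        // Simplification: the sketch jumps straight to KILLED.
        .addTransition(State.SCHEDULED, State.KILLED, Event.KILL);
  }

  /** Same table plus the proposed CONTAINER_FINISHED entry. */
  static AttemptStateMachineSketch afterFix() {
    return beforeFix()
        .addTransition(State.SCHEDULED, State.FINAL_SAVING, Event.CONTAINER_FINISHED);
  }

  public static void main(String[] args) {
    System.out.println(beforeFix().handle(Event.CONTAINER_FINISHED)); // SCHEDULED
    System.out.println(afterFix().handle(Event.CONTAINER_FINISHED));  // FINAL_SAVING
  }
}
```

 In this sketch the attempt never leaves SCHEDULED on CONTAINER_FINISHED
 until the extra entry is registered, which is the hang described above.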



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087848#comment-14087848
 ] 

Tsuyoshi OZAWA commented on YARN-2359:
--

+1 (non-binding), it looks good to me. I also ran the tests and confirmed that 
it works.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088002#comment-14088002
 ] 

Jian He commented on YARN-2359:
---

[~zxu], thanks for working on it. I have a question: 
bq. The application attempt is at state RMAppAttemptState.SCHEDULED, it receive 
RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the 
IllegalArgumentException(due to DNS error) happened, it stay at state 
RMAppAttemptState.SCHEDULED. 
Where in the code is the IllegalArgumentException thrown?



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088047#comment-14088047
 ] 

zhihai xu commented on YARN-2359:
-

[~jianhe] The code is in pullNewlyAllocatedContainersAndNMTokens of 
SchedulerApplicationAttempt.java:
{code}
try {
  // create container token and NMToken altogether.
  container.setContainerToken(rmContext.getContainerTokenSecretManager()
      .createContainerToken(container.getId(), container.getNodeId(),
          getUser(), container.getResource(), container.getPriority(),
          rmContainer.getCreationTime()));
  NMToken nmToken =
      rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
          getApplicationAttemptId(), container);
  if (nmToken != null) {
    nmTokens.add(nmToken);
  }
} catch (IllegalArgumentException e) {
  // DNS might be down, skip returning this container.
  LOG.error("Error trying to assign container token and NM token to" +
      " an allocated container " + container.getId(), e);
  continue;
}
{code}

When createContainerToken throws an IllegalArgumentException, the code skips 
the container, so zero containers are returned in amContainerAllocation.
The following code in AMContainerAllocatedTransition in RMAppAttemptImpl.java 
then keeps retrying CONTAINER_ALLOCATED in the SCHEDULED state.
In short, the IllegalArgumentException causes zero containers to be returned in 
amContainerAllocation, which causes RMAppAttemptImpl to stay in state 
RMAppAttemptState.SCHEDULED.

{code}
if (amContainerAllocation.getContainers().size() == 0) {
  appAttempt.retryFetchingAMContainer(appAttempt);
  return RMAppAttemptState.SCHEDULED;
}
{code}
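
The interaction between the skipped container and the retry can be sketched in 
isolation. This is a toy model, not YARN code: SkippedContainerSketch, 
pullContainers, fetchAttemptsUntilContainer, and failingTokenFactory are 
hypothetical names, containers are plain strings, and the retry loop is bounded 
only so that the sketch terminates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Toy model of the skip-on-error loop; all names are illustrative, not YARN APIs. */
final class SkippedContainerSketch {

  /** Returns only the containers for which a token could be created. */
  static List<String> pullContainers(List<String> allocated,
                                     Function<String, String> tokenFactory) {
    List<String> returned = new ArrayList<>();
    for (String containerId : allocated) {
      try {
        tokenFactory.apply(containerId);
      } catch (IllegalArgumentException e) {
        // Mirrors the quoted code: DNS might be down, skip returning this container.
        continue;
      }
      returned.add(containerId);
    }
    return returned;
  }

  /**
   * Retries the fetch up to maxRetries times (an assumption made so the sketch
   * terminates). Returns the attempt number that produced a container, or -1 if
   * every attempt returned zero containers.
   */
  static int fetchAttemptsUntilContainer(List<String> allocated,
                                         Function<String, String> tokenFactory,
                                         int maxRetries) {
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      if (!pullContainers(allocated, tokenFactory).isEmpty()) {
        return attempt;
      }
    }
    return -1;
  }

  /** Stand-in for token creation while the node's host name cannot be resolved. */
  static String failingTokenFactory(String containerId) {
    throw new IllegalArgumentException("cannot resolve host for " + containerId);
  }

  public static void main(String[] args) {
    List<String> allocated = Collections.singletonList("container_1");
    // Token creation always fails: no attempt ever yields a container.
    System.out.println(fetchAttemptsUntilContainer(
        allocated, SkippedContainerSketch::failingTokenFactory, 5));
    // Token creation succeeds: the first attempt yields the container.
    System.out.println(fetchAttemptsUntilContainer(
        allocated, id -> "token-" + id, 5));
  }
}
```

As long as token creation keeps failing, every fetch returns zero containers, 
so in this model nothing other than an external event can end the retries.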



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088161#comment-14088161
 ] 

Jian He commented on YARN-2359:
---

I see, thanks for your explanation. Looks good to me too.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088479#comment-14088479
 ] 

Karthik Kambatla commented on YARN-2359:


Checking this in.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-05 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14086985#comment-14086985
 ] 

zhihai xu commented on YARN-2359:
-

Uploaded a new patch that adds a comment to the unit test.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14087117#comment-14087117
 ] 

Hadoop QA commented on YARN-2359:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/1266/YARN-2359.002.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4526//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4526//console

This message is automatically generated.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-26 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075531#comment-14075531
 ] 

zhihai xu commented on YARN-2359:
-

I just added a unit test case (testAMCrashAtScheduled) to the patch to verify 
this state transition in the RMAppAttempt state machine.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075537#comment-14075537
 ] 

Hadoop QA commented on YARN-2359:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12658009/YARN-2359.001.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4448//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4448//console

This message is automatically generated.



[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.

2014-07-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074873#comment-14074873
 ] 

Hadoop QA commented on YARN-2359:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12657887/YARN-2359.000.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for 
this patch. Also please list what manual steps were performed to verify 
this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4436//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4436//console

This message is automatically generated.
