[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606333#comment-13606333
 ] 

Jason Lowe commented on YARN-472:
-

{{Runtime.halt}} would be one brutally efficient way to stop the AM in its 
tracks, but I agree it's probably best to simply follow the normal shutdown 
sequence but indicate via a flag or other means that we don't want to copy the 
history file to the done_intermediate directory, unregister, or clean the 
staging directory.

 MR app master deletes staging dir when sent a reboot command from the RM
 

 Key: YARN-472
 URL: https://issues.apache.org/jira/browse/YARN-472
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: jian he
Assignee: jian he
 Attachments: YARN-472.1.patch


 If the RM is restarted when the MR job is running, then it sends a reboot 
 command to the job. The job ends up deleting the staging dir and that causes 
 the next attempt to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607195#comment-13607195
 ] 

Hadoop QA commented on YARN-472:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12574478/YARN-472.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 tests included appear to have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/548//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/548//console

This message is automatically generated.



[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-18 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605485#comment-13605485
 ] 

Bikas Saha commented on YARN-472:
-

Can this if statement be simplified? Currently the assignment of false to 
isLastAMRetry is redundant because the if already checks that isLastAMRetry is false.
{code}
+  JobImpl jobImpl = (JobImpl) this.job;
+  if( !isLastAMRetry &&
+      jobImpl.getInternalState() == JobStateInternal.REBOOT) {
+    isLastAMRetry = false;
+  }
+  else {
+    //We are finishing cleanly so this is the last retry
+    isLastAMRetry = true;
+  }
{code}
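
A minimal sketch of one possible simplification, assuming the hunk is meant to leave the flag false only when the attempt is not the last retry and the job is in REBOOT. The enum `InternalState` and the class `LastRetrySketch` are hypothetical stand-ins, not names from the patch:

```java
// Hypothetical stand-in for JobStateInternal; only a few values are listed.
enum InternalState { RUNNING, SUCCEEDED, KILLED, ERROR, REBOOT }

class LastRetrySketch {
    /** The logic as written in the patch hunk. */
    static boolean original(boolean isLastAMRetry, InternalState state) {
        if (!isLastAMRetry && state == InternalState.REBOOT) {
            isLastAMRetry = false; // redundant: the flag is already false here
        } else {
            // We are finishing cleanly so this is the last retry
            isLastAMRetry = true;
        }
        return isLastAMRetry;
    }

    /** Same truth table without the redundant branch. */
    static boolean simplified(boolean isLastAMRetry, InternalState state) {
        return isLastAMRetry || state != InternalState.REBOOT;
    }
}
```

Both methods return false only for the combination "not the last retry, state is REBOOT", so the else branch alone carries the behavior.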

If the application is already in a kill*/fail*/succeeded* state, then it should 
probably ignore the REBOOT since it won't be run again.
{code}
   JobStateInternal.KILLED,
   JobStateInternal.ERROR, JobEventType.INTERNAL_ERROR,
   INTERNAL_ERROR_TRANSITION)
+  .addTransition(JobStateInternal.KILLED, JobStateInternal.REBOOT,
+  JobEventType.AM_REBOOT,
+  INTERNAL_REBOOT_TRANSITION) 
{code}
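
To illustrate the suggestion, here is a rough sketch in which a reboot event is ignored once the job has reached a state from which it will not run again. `SketchJobState` and `RebootEventSketch` are hypothetical, the state list is deliberately shortened, and the real state machine has more states (the kill*/fail* variants) than shown:

```java
// Hypothetical, shortened state list for illustration only.
enum SketchJobState { RUNNING, SUCCEEDED, FAILED, KILLED, REBOOT }

class RebootEventSketch {
    /** True for states from which the job will not run again. */
    static boolean isTerminal(SketchJobState s) {
        switch (s) {
            case SUCCEEDED:
            case FAILED:
            case KILLED:
                return true;
            default:
                return false;
        }
    }

    /** Next state when a reboot event arrives: terminal states ignore it. */
    static SketchJobState onAmReboot(SketchJobState current) {
        return isTerminal(current) ? current : SketchJobState.REBOOT;
    }
}
```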

IMO, changing JobState to add a new state would be bad for MR1 back-compat. I 
think it's OK to transform REBOOT to KILLED since in some sense the RM is 
killing this attempt. Does this sound reasonable to other committers?
{code}
+case REBOOT:
+  return JobState.KILLED;
{code}

This seems inconsistent with the mapping above: REBOOT is reported as 
JobState.KILLED, but here it is counted as a failed job. The metrics counter 
should match the state that REBOOT is mapped to.
{code}
     metrics.killedJob(this);
     break;
   case ERROR:
+  case REBOOT:
   case FAILED:
     metrics.failedJob(this);
{code}
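
One way to keep the two in step, sketched under the assumption that REBOOT is reported externally as KILLED, is to derive the counter from the mapped state rather than from the internal one. All names below (`InternalSketchState`, `ExternalSketchState`, `JobMetricsSketch`) are hypothetical and are not the patch's classes:

```java
// Hypothetical stand-ins for the internal and externally visible job states.
enum InternalSketchState { SUCCEEDED, KILLED, REBOOT, ERROR, FAILED }
enum ExternalSketchState { SUCCEEDED, KILLED, FAILED, ERROR }

class JobMetricsSketch {
    int completed;
    int killed;
    int failed;

    /** Maps an internal state to the state reported to clients. */
    static ExternalSketchState toExternal(InternalSketchState s) {
        switch (s) {
            case REBOOT: // assumed mapping: the RM is effectively killing this attempt
            case KILLED:
                return ExternalSketchState.KILLED;
            case ERROR:
                return ExternalSketchState.ERROR;
            case FAILED:
                return ExternalSketchState.FAILED;
            default:
                return ExternalSketchState.SUCCEEDED;
        }
    }

    /** Bumps the counter chosen from the external state, so the two cannot diverge. */
    void record(InternalSketchState s) {
        switch (toExternal(s)) {
            case KILLED:
                killed++;
                break;
            case ERROR:
            case FAILED:
                failed++;
                break;
            default:
                completed++;
        }
    }
}
```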

IMO, InternalTerminationTransition sounds better since the job is not really 
unsuccessful. Also, it would be clearer to rename stateInternal to 
terminationState because it's the state the job ends up in when done.
{code}
-  private static class InternalErrorTransition implements
+  private static class InternalUnsuccessfulTransition implements
       SingleArcTransition<JobImpl, JobEvent> {
+    JobStateInternal stateInternal = null;
+
+    public InternalUnsuccessfulTransition(JobStateInternal stateInternal){
+      this.stateInternal = stateInternal;
+    }
+
{code}

In addition to the above changes, I think we also need to make sure that the AM 
does not unregister from the RM when it is sent a reboot command. This is 
because if it successfully unregisters from the RM then the RM will internally 
complete the app and not try it again, I think. It may be that RM needs to be 
changed to handle rebooted AMs properly.



[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-18 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605953#comment-13605953
 ] 

Jason Lowe commented on YARN-472:
-

Another cause for the AM to receive a reboot command from the RM is a 
split-brain situation where the RM has expired the AM (e.g., due to a network 
cut) but the AM has not killed itself (e.g., because it is thrashing in garbage 
collection).  In that case, if the AM isn't the last attempt, it needs to get 
out of the way and not do any damage (e.g., not try to commit or create 
history) because the RM could have already started another attempt.

Attempting to unregister is likely a fruitless effort since the RM has 
basically said via the reboot directive it has no idea what this AM is trying 
to do.  If it does succeed in unregistering then that would prevent further app 
attempts from launching as Bikas noted, and that's not desirable.

I agree that adding a new state seems unnecessary.  I've always interpreted the 
reboot directive to indicate the AM is in a bad state and needs to get out, 
fast.  As such, I'd rather keep this simple.  If the attempt isn't the last, 
have the AM log the reception of the reboot and crash without doing any 
filesystem damage.  If it is the last attempt then we can do something like we 
do today, e.g., clean up staging and generate history with an error status.
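
The policy described above could be sketched roughly as follows. `RebootAction` and `RebootPolicySketch` are hypothetical names, not part of the MR app master, and neither branch unregisters from the RM since that would block further attempts:

```java
// Hypothetical names throughout; this is not the MR app master's API.
enum RebootAction {
    // Not the last attempt: log the directive and exit, touching nothing on the filesystem.
    EXIT_WITHOUT_CLEANUP(false, false),
    // Last attempt: behave much like today.
    CLEANUP_WITH_ERROR_HISTORY(true, true);

    final boolean cleansStagingDir;
    final boolean writesHistory;

    RebootAction(boolean cleansStagingDir, boolean writesHistory) {
        this.cleansStagingDir = cleansStagingDir;
        this.writesHistory = writesHistory;
    }
}

class RebootPolicySketch {
    /** Chooses what the AM does when the RM sends a reboot directive. */
    static RebootAction onRebootDirective(boolean isLastAttempt) {
        return isLastAttempt
            ? RebootAction.CLEANUP_WITH_ERROR_HISTORY
            : RebootAction.EXIT_WITHOUT_CLEANUP;
    }
}
```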




[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-15 Thread jian he (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603836#comment-13603836
 ] 

jian he commented on YARN-472:
--

The approach used is to add a separate JobStateInternal.REBOOT state, distinct 
from JobStateInternal.ERROR, so that when the job shuts down it does not delete 
the staging dir if it is in JobStateInternal.REBOOT and it is not the last 
retry.



[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

2013-03-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603864#comment-13603864
 ] 

Hadoop QA commented on YARN-472:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12573938/YARN-472.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

  {color:red}-1 one of tests included doesn't have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/523//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/523//console

This message is automatically generated.
