[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

Bikas Saha (JIRA) Mon, 18 Mar 2013 12:10:15 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605485#comment-13605485
 ]


Bikas Saha commented on YARN-472:
---------------------------------

Can this if stmt be simplified. Currently the assignment of false to 
amLastRetry is redundant because the if checks if amLastRetry is false
{code}
+      JobImpl jobImpl = (JobImpl) this.job;      
+      if( !isLastAMRetry && 
+          jobImpl.getInternalState() == JobStateInternal.REBOOT) {
+        isLastAMRetry = false;
+      }
+      else {
+        //We are finishing cleanly so this is the last retry
+        isLastAMRetry = true;
+      }
{code}

If the application is kill*/fail*/succeeded*/, then it should probably ignore 
the REBOOT since it wont be run again.
{code}
               JobStateInternal.KILLED,
               JobStateInternal.ERROR, JobEventType.INTERNAL_ERROR,
               INTERNAL_ERROR_TRANSITION)
+          .addTransition(JobStateInternal.KILLED, JobStateInternal.REBOOT,
+              JobEventType.AM_REBOOT,
+              INTERNAL_REBOOT_TRANSITION) 
{code}

IMO, changing JobState to add a new state would be bad for MR1 back-compat. I 
think its ok to transform REBOOT to KILLED since in some sense the RM is 
killing this attempt. Does this sound reasonable to other committers?
{code}
+    case REBOOT:
+      return JobState.KILLED;
{code}

This seems inconsistent with the previous KILLED mapping to job state. We 
should set counters for the same type.
{code}
         metrics.killedJob(this);
         break;
       case ERROR:
+      case REBOOT:
       case FAILED:
         metrics.failedJob(this);
{code}

IMO, InternalTerminationTransition sounds better since the job is not really 
unsuccessful. Also, it will be clearer to name stateInternal to 
terminationState because its the state to which the job go end up in when done.
{code}
-  private static class InternalErrorTransition implements
+  private static class InternalUnsuccessfulTransition implements
       SingleArcTransition<JobImpl, JobEvent> {
+    JobStateInternal stateInternal = null;
+    
+    public InternalUnsuccessfulTransition(JobStateInternal stateInternal){
+      this.stateInternal = stateInternal;
+    }
+    
{code}

In addition to the above changes, I think we also need to make sure that the AM 
does not unregister from the RM when it is sent a reboot command. This is 
because if it successfully unregisters from the RM then the RM will internally 
complete the app and not try it again, I think. It may be that RM needs to be 
changed to handle rebooted AM's properly.
                
> MR app master deletes staging dir when sent a reboot command from the RM
> ------------------------------------------------------------------------
>
>                 Key: YARN-472
>                 URL: https://issues.apache.org/jira/browse/YARN-472
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: jian he
>            Assignee: jian he
>         Attachments: YARN-472.1.patch
>
>
> If the RM is restarted when the MR job is running, then it sends a reboot 
> command to the job. The job ends up deleting the staging dir and that causes 
> the next attempt to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM

Reply via email to