[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606333#comment-13606333 ]

Jason Lowe commented on YARN-472:

{{Runtime.halt}} would be one brutally efficient way to stop the AM in its tracks, but I agree it's probably best to simply follow the normal shutdown sequence but indicate via a flag or other means that we don't want to copy the history file to the done_intermediate directory, unregister, or clean the staging directory.

MR app master deletes staging dir when sent a reboot command from the RM

Key: YARN-472
URL: https://issues.apache.org/jira/browse/YARN-472
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: jian he
Assignee: jian he
Attachments: YARN-472.1.patch

If the RM is restarted when the MR job is running, then it sends a reboot command to the job. The job ends up deleting the staging dir and that causes the next attempt to fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
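The flag-based shutdown described in this comment could look roughly like the sketch below. It is only an illustration: the class, method, and step names are hypothetical and are not taken from the MR app master source or from any attached patch. The idea is that the normal shutdown path still runs, but a single flag suppresses the three steps the comment lists (copying history to done_intermediate, unregistering, and cleaning the staging directory).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the MR app master source.
final class ShutdownSketch {

    /**
     * Returns the shutdown steps that would run, in order.
     * skipFinalization stands in for the "flag or other means" in the comment.
     */
    static List<String> shutdownSteps(boolean skipFinalization) {
        List<String> steps = new ArrayList<>();
        // Always part of the normal shutdown sequence in this sketch.
        steps.add("stop-services");
        if (!skipFinalization) {
            // The three actions the comment wants suppressed after a reboot command.
            steps.add("copy-history-to-done_intermediate");
            steps.add("unregister-from-rm");
            steps.add("clean-staging-dir");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("reboot received: " + shutdownSteps(true));
        System.out.println("normal finish:   " + shutdownSteps(false));
    }
}
```

Keeping one sequence with a flag, rather than a separate halt path, means services are still stopped in order while nothing that a later attempt depends on is touched.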
[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607195#comment-13607195 ]

Hadoop QA commented on YARN-472:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12574478/YARN-472.2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 tests included appear to have a timeout.{color}
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/548//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/548//console

This message is automatically generated.

Attachments: YARN-472.1.patch, YARN-472.2.patch
[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605485#comment-13605485 ]

Bikas Saha commented on YARN-472:

Can this if statement be simplified? Currently the assignment of false to isLastAMRetry is redundant because the if already checks that isLastAMRetry is false.
{code}
+    JobImpl jobImpl = (JobImpl) this.job;
+    if( !isLastAMRetry
+        && jobImpl.getInternalState() == JobStateInternal.REBOOT) {
+      isLastAMRetry = false;
+    }
+    else {
+      //We are finishing cleanly so this is the last retry
+      isLastAMRetry = true;
+    }
{code}

If the application is already in a kill*/fail*/succeeded* state, it should probably ignore the REBOOT since it won't be run again.
{code}
           JobStateInternal.KILLED, JobStateInternal.ERROR, JobEventType.INTERNAL_ERROR, INTERNAL_ERROR_TRANSITION)
+      .addTransition(JobStateInternal.KILLED, JobStateInternal.REBOOT,
+          JobEventType.AM_REBOOT,
+          INTERNAL_REBOOT_TRANSITION)
{code}

IMO, changing JobState to add a new state would be bad for MR1 back-compat. I think it's OK to transform REBOOT to KILLED since in some sense the RM is killing this attempt. Does this sound reasonable to other committers?
{code}
+    case REBOOT:
+      return JobState.KILLED;
{code}

This seems inconsistent with the KILLED mapping to job state above. The counters should be updated for the same type that the state maps to.
{code}
       metrics.killedJob(this);
       break;
     case ERROR:
+    case REBOOT:
     case FAILED:
       metrics.failedJob(this);
{code}

IMO, InternalTerminationTransition sounds better since the job is not really unsuccessful. Also, it would be clearer to rename stateInternal to terminationState because it's the state the job ends up in when done.
{code}
-  private static class InternalErrorTransition implements
+  private static class InternalUnsuccessfulTransition implements
       SingleArcTransition<JobImpl, JobEvent> {
+    JobStateInternal stateInternal = null;
+
+    public InternalUnsuccessfulTransition(JobStateInternal stateInternal){
+      this.stateInternal = stateInternal;
+    }
+
{code}

In addition to the above changes, I think we also need to make sure that the AM does not unregister from the RM when it is sent a reboot command. If it successfully unregisters, then I think the RM will internally complete the app and not try it again. It may be that the RM needs to be changed to handle rebooted AMs properly.
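The first point in this review, that the if/else on isLastAMRetry can be simplified, can be checked with a small stand-alone sketch. It assumes the two sub-conditions in the quoted patch are joined by a logical AND, and it uses a cut-down stand-in enum rather than the real JobStateInternal; the class and method names are hypothetical. Under that assumption the whole branch reduces to one boolean expression.

```java
// Hypothetical sketch; the enum is a reduced stand-in, not Hadoop's JobStateInternal.
final class LastRetrySketch {

    enum JobStateInternal { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR, REBOOT }

    /** Branch structure as quoted in the review, read as a conjunction. */
    static boolean original(boolean isLastAMRetry, JobStateInternal state) {
        if (!isLastAMRetry && state == JobStateInternal.REBOOT) {
            isLastAMRetry = false; // redundant: the condition already implies false
        } else {
            // We are finishing cleanly, so this is the last retry.
            isLastAMRetry = true;
        }
        return isLastAMRetry;
    }

    /** Equivalent single expression: the negation of the if condition. */
    static boolean simplified(boolean isLastAMRetry, JobStateInternal state) {
        return isLastAMRetry || state != JobStateInternal.REBOOT;
    }

    public static void main(String[] args) {
        // Compare both forms over every combination of inputs.
        for (JobStateInternal s : JobStateInternal.values()) {
            for (boolean last : new boolean[] {false, true}) {
                System.out.println(s + " last=" + last + " -> "
                    + original(last, s) + " / " + simplified(last, s));
            }
        }
    }
}
```

The only input for which the result is false is a non-final attempt that ended in REBOOT, which is exactly the case where the staging directory has to survive for the next attempt.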
[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605953#comment-13605953 ]

Jason Lowe commented on YARN-472:

Another cause for the AM to receive a reboot command from the RM is a split-brain situation where the RM has expired the AM (e.g., due to a network cut) but the AM has not killed itself (e.g., because it is thrashing in garbage collection). If this were the case and the AM isn't the last attempt, it needs to get out of the way and not do any damage (e.g., not try to commit, create history, etc.) because the RM could have already started the other attempt. Attempting to unregister is likely a fruitless effort, since the RM has basically said via the reboot directive that it has no idea what this AM is trying to do. If it does succeed in unregistering, that would prevent further app attempts from launching, as Bikas noted, and that's not desirable.

I agree that adding a new state seems unnecessary. I've always interpreted the reboot directive to indicate the AM is in a bad state and needs to get out, fast. As such, I'd rather keep this simple. If the attempt isn't the last, have the AM log the reception of the reboot and crash without doing any filesystem damage. If it is the last attempt, then we can do something like we do today, e.g., clean up staging and generate history with an error status.
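The rule proposed in this comment amounts to a two-way decision on receipt of a reboot directive. A minimal sketch follows, with hypothetical names throughout; in particular, comparing an attempt number against a maximum is only an assumed way of deciding whether an attempt is the last one, not how the AM necessarily determines it.

```java
// Hypothetical sketch of the proposed policy; not code from Hadoop.
final class RebootPolicySketch {

    enum Action {
        /** Log the reboot and exit: no commit, no history, no unregister, no cleanup. */
        EXIT_WITHOUT_SIDE_EFFECTS,
        /** Behave as today: clean up staging and write history with an error status. */
        CLEANUP_AND_REPORT_ERROR
    }

    /** Chooses what the AM should do when the RM sends a reboot directive. */
    static Action onReboot(int attempt, int maxAttempts) {
        // Assumption: attempts are numbered from 1 and maxAttempts is the retry limit.
        boolean lastAttempt = attempt >= maxAttempts;
        return lastAttempt
            ? Action.CLEANUP_AND_REPORT_ERROR
            : Action.EXIT_WITHOUT_SIDE_EFFECTS;
    }

    public static void main(String[] args) {
        System.out.println("attempt 1 of 2: " + onReboot(1, 2));
        System.out.println("attempt 2 of 2: " + onReboot(2, 2));
    }
}
```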
[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603836#comment-13603836 ]

jian he commented on YARN-472:

The approach used is to add a separate JobStateInternal.REBOOT to distinguish it from JobStateInternal.ERROR, so that when the job is shut down it will not delete the staging dir if it is in JobStateInternal.REBOOT and it is not the last retry.
[jira] [Commented] (YARN-472) MR app master deletes staging dir when sent a reboot command from the RM
[ https://issues.apache.org/jira/browse/YARN-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603864#comment-13603864 ]

Hadoop QA commented on YARN-472:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12573938/YARN-472.1.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:red}-1 one of tests included doesn't have a timeout.{color}
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/523//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/523//console

This message is automatically generated.