[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166681#comment-14166681 ] Hudson commented on YARN-2617: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #707 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/707/]) YARN-2617. Fixed ApplicationSubmissionContext to still set resource for backward compatibility. Contributed by Wangda Tan. (zjshen: rev e532ed8faa8db4b008a5b8d3f82b48a1b314fa6c) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestPBImplRecords.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166873#comment-14166873 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1897 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1897/]) YARN-2617. Fixed ApplicationSubmissionContext to still set resource for backward compatibility. Contributed by Wangda Tan. (zjshen: rev e532ed8faa8db4b008a5b8d3f82b48a1b314fa6c) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestPBImplRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166962#comment-14166962 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1922 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1922/]) YARN-2617. Fixed ApplicationSubmissionContext to still set resource for backward compatibility. Contributed by Wangda Tan. (zjshen: rev e532ed8faa8db4b008a5b8d3f82b48a1b314fa6c) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestPBImplRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * hadoop-yarn-project/CHANGES.txt NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166104#comment-14166104 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-trunk-Commit #6230 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6230/]) YARN-2617. Fixed ApplicationSubmissionContext to still set resource for backward compatibility. Contributed by Wangda Tan. (zjshen: rev e532ed8faa8db4b008a5b8d3f82b48a1b314fa6c) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestPBImplRecords.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157901#comment-14157901 ] Hudson commented on YARN-2617: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #699 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/699/]) YARN-2617. Fixed NM to not send duplicate container status whose app is not running. Contributed by Jun Gong (jianhe: rev 3ef1cf187faeb530e74606dd7113fd1ba08140d7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158014#comment-14158014 ] Hudson commented on YARN-2617: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1890 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1890/]) YARN-2617. Fixed NM to not send duplicate container status whose app is not running. Contributed by Jun Gong (jianhe: rev 3ef1cf187faeb530e74606dd7113fd1ba08140d7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158063#comment-14158063 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1915 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1915/]) YARN-2617. Fixed NM to not send duplicate container status whose app is not running. Contributed by Jun Gong (jianhe: rev 3ef1cf187faeb530e74606dd7113fd1ba08140d7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156198#comment-14156198 ] Jun Gong commented on YARN-2617: I investigated why TestDirectoryCollection failed. And it might be because of YARN-2640. Could you help check and review it please? Thank you. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156761#comment-14156761 ] Jian He commented on YARN-2617: --- YARN-2640 seems resolved in YARN-1979 already. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156772#comment-14156772 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-trunk-Commit #6176 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6176/]) YARN-2617. Fixed NM to not send duplicate container status whose app is not running. Contributed by Jun Gong (jianhe: rev 3ef1cf187faeb530e74606dd7113fd1ba08140d7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155220#comment-14155220 ] Jian He commented on YARN-2617: --- looks good, +1 NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155477#comment-14155477 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672391/YARN-2617.5.patch against trunk revision 1f5b42a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.TestEventFlow org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5205//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5205//artifact/patchprocess/patchReleaseAuditProblems.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5205//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155715#comment-14155715 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672391/YARN-2617.5.patch against trunk revision dd1b8f2. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5212//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5212//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155766#comment-14155766 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672391/YARN-2617.5.patch against trunk revision 52bbe0f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5218//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5218//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155880#comment-14155880 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672458/YARN-2617.5.patch against trunk revision 0708827. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5222//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5222//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155912#comment-14155912 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672458/YARN-2617.5.patch against trunk revision 0708827. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5225//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155941#comment-14155941 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672468/YARN-2617.5.patch against trunk revision 9e40de6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5227//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155960#comment-14155960 ] Jun Gong commented on YARN-2617: It seems that there is something wrong with Jenkins. From the console output https://builds.apache.org/job/PreCommit-YARN-Build/5227//console, it seems to apply a wrong patch. quote Going to apply patch with: /usr/bin/patch -p0 patching file hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestClientToAMTokens.java quote NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155992#comment-14155992 ] Hadoop QA commented on YARN-2617: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672474/YARN-2617.6.patch against trunk revision 9e40de6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5229//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5229//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154128#comment-14154128 ] Jian He commented on YARN-2617: --- Thanks for updating ! - {{containerStatuses.add(status);}} is moved after this check {{status.getContainerState() == ContainerState.COMPLETE}}. In some cases(e.g. NM decommission), I think we still need to send the completeContainers across so that RM knows this container completes. - we may not need to change {{getNMContainerStatuses}}, as this method will be invoked only once on re-register. I’m afraid not sending the whole containers for recovery will hit some other race conditions. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154334#comment-14154334 ] Jian He commented on YARN-2617: --- bq. send completed container one time even its corresponding Application is stopped then delete it from context.getContainers() yep, because if we gracefully decommission a node, we also need to notify RM that the containers running on this node completes. btw. once you uploaded a patch, you can click the submit patch, which will trigger jenkins to run the corresponding unit tests. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154337#comment-14154337 ] Jun Gong commented on YARN-2617: Get it. Thank you! NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154369#comment-14154369 ] Hadoop QA commented on YARN-2617: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672239/YARN-2617.4.patch against trunk revision 17d1202. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5193//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5193//console This message is automatically generated. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151393#comment-14151393 ] Jian He commented on YARN-2617: --- [~hex108], thanks for reporting this issue ! - Took a look at the patch. For Apps at New/Init state, we still should not prematurely remove the containers from context. I think we should explicitly check if apps are at FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. If they are, we are safe to remove the containers from context. - Also, the following code {code} if (!this.context.getApplications().containsKey(applicationId)) { context.getContainers().remove(containerId); continue; } {code} needs to be moved inside the following check {{ if (containerStatus.getState().equals(ContainerState.COMPLETE))}}, so that if app is at one of the states mentioned above, we don't remove containers from the context until container reaches COMPLETE state. Does this make sense to you ? - Could you add unit test for your change too ? NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151396#comment-14151396 ] Jian He commented on YARN-2617: --- I just added you to the contributor list, thanks again for your contribution ! NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151466#comment-14151466 ] Jun Gong commented on YARN-2617: [~jianhe], thank you for the review! {quote} I think we should explicitly check if apps are at FINISHING_CONTAINERS_WAIT/APPLICATION_RESOURCES_CLEANINGUP/FINISHED state. {quote} My concern is that we will need to modify the code when we add a new state for ApplicationImpl. It will be OK if it is not a problem. BTW: is there any case that APP has containers but APP is not in RUNNING state? {quote} The code needs to be moved inside the following check {{ if (containerStatus.getState().equals(ContainerState.COMPLETE))}} ... {quote} OK. I will change it. And I will add an unit test. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.patch We([~chenchun]) are testing RM work preserving restart and found the following logs when we ran a simple MapReduce task PI. NM continuously reported completed containers whose Application had already finished while AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicaitonImpl received evnet 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED' , however NM might receive this event for a long time or could not receive. * For NonAggregatingLogHandler, it wait for YarnConfiguration.NM_LOG_RETAIN_SECONDS which is 3 * 60 * 60 sec by default, then it will be scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail(e.g. if user does not have HDFS write permission), and it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)