[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212015#comment-14212015 ] Sandy Ryza commented on YARN-2811: -- This looks almost good to go - the last thing is that we should use Resources.fitsIn instead of Resources.lessThanOrEqual(RESOURCE_CALCULATOR...), as the latter will only consider memory. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch, YARN-2811.v6.patch, YARN-2811.v7.patch This has been seen on several queues: the allocated MB goes significantly above the max MB, and it appears to have started with the 2.4 upgrade. It could be a regression introduced between 2.0 and 2.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
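For readers outside the Hadoop code base, the distinction being drawn is that a memory-only comparison can pass even when another dimension (e.g. vcores) exceeds the bound. Below is a minimal self-contained sketch of the two comparison semantics; the simplified Resource type and method names are illustrative stand-ins, not the actual org.apache.hadoop.yarn.util.resource API.
{code}
// Illustrative sketch only: a simplified Resource with the two comparison
// semantics at issue. The real Hadoop classes live in
// org.apache.hadoop.yarn.util.resource; names here are for exposition.
public class ResourceCheckSketch {
    record Resource(long memoryMb, int vcores) {}

    // Memory-only comparison, analogous to lessThanOrEqual with a
    // default (memory-based) resource calculator: vcores are ignored.
    static boolean lessThanOrEqualMemoryOnly(Resource a, Resource b) {
        return a.memoryMb() <= b.memoryMb();
    }

    // Per-dimension comparison, analogous in spirit to Resources.fitsIn:
    // every dimension must be within the bound.
    static boolean fitsIn(Resource smaller, Resource bigger) {
        return smaller.memoryMb() <= bigger.memoryMb()
            && smaller.vcores() <= bigger.vcores();
    }

    public static void main(String[] args) {
        Resource usagePlusRequest = new Resource(4096, 8);
        Resource queueMax = new Resource(8192, 4);
        // Passes the memory-only check even though vcores exceed the max...
        System.out.println(lessThanOrEqualMemoryOnly(usagePlusRequest, queueMax)); // true
        // ...but fails the per-dimension check, which is the safer behavior.
        System.out.println(fitsIn(usagePlusRequest, queueMax)); // false
    }
}
{code}
With per-dimension checking, an allocation that fits in memory but not in another resource is correctly rejected.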
[jira] [Created] (YARN-2863) ResourceManager will shutdown when job's queue is empty
yangping wu created YARN-2863: - Summary: ResourceManager will shutdown when job's queue is empty Key: YARN-2863 URL: https://issues.apache.org/jira/browse/YARN-2863 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: yangping wu
When I submit a job to the hadoop cluster without specifying a queue name, as follows:
{code}
$HADOOP_HOME/bin/hadoop jar statistics.jar com.wyp.Sts -Dmapreduce.job.queuename=
{code}
and *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException:
{code}
2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler
org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
    at java.lang.Thread.run(Thread.java:744)
2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
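The failure pattern behind this report is generic: registering a metrics source under a name that is already taken throws, and because the scheduler's event handler does not catch it, the ResourceManager exits. A minimal sketch of that duplicate-registration behavior plus one defensive option (normalizing an empty queue name); the registry and helper below are hypothetical stand-ins, not the actual QueueManager or DefaultMetricsSystem code.
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins to illustrate the failure mode: a registry that
// rejects duplicate source names, and a guard that normalizes an empty
// queue name instead of letting it collapse onto an existing source.
public class QueueNameSketch {
    static class MetricsRegistry {
        private final Map<String, Object> sources = new HashMap<>();
        void register(String name, Object source) {
            if (sources.putIfAbsent(name, source) != null) {
                // Mirrors "Metrics source QueueMetrics,q0=root already exists!"
                throw new IllegalStateException("Metrics source " + name + " already exists!");
            }
        }
    }

    // Defensive normalization: an empty or null queue name falls back to
    // "default" rather than resolving to a name that is already registered.
    static String normalizeQueueName(String requested) {
        return (requested == null || requested.trim().isEmpty()) ? "default" : requested.trim();
    }

    public static void main(String[] args) {
        MetricsRegistry registry = new MetricsRegistry();
        registry.register("QueueMetrics,q0=root", new Object());
        String queue = normalizeQueueName(""); // the -Dmapreduce.job.queuename= case
        registry.register("QueueMetrics,q0=root.q1=" + queue, new Object()); // ok
    }
}
{code}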
[jira] [Commented] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212162#comment-14212162 ] Hudson commented on YARN-2603: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) Revert YARN-2603. ApplicationConstants missing HADOOP_MAPRED_HOME (Ray Chiang via aw) (vinodkv: rev 4ae9780e6a05bfd6b93f1c871c22761ddd8b19cb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java
* hadoop-yarn-project/CHANGES.txt
ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
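For context, the change being reverted here amounts to adding one constant to an environment-variable enum. A sketch of that shape, written in the style of ApplicationConstants.Environment but simplified rather than copied from the Hadoop source:
{code}
// Sketch of an environment-variable enum in the style of
// ApplicationConstants.Environment; simplified, not the verbatim source.
public enum Environment {
    JAVA_HOME("JAVA_HOME"),
    HADOOP_COMMON_HOME("HADOOP_COMMON_HOME"),
    HADOOP_HDFS_HOME("HADOOP_HDFS_HOME"),
    HADOOP_YARN_HOME("HADOOP_YARN_HOME"),
    // The addition at issue in YARN-2603:
    HADOOP_MAPRED_HOME("HADOOP_MAPRED_HOME");

    private final String variable;

    Environment(String variable) {
        this.variable = variable;
    }

    // Resolve the variable from the process environment, as container
    // launch contexts typically do.
    public String resolve() {
        return System.getenv(variable);
    }
}
{code}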
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212156#comment-14212156 ] Hudson commented on YARN-2846: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2846-demo.patch, YARN-2846.patch The NM restart work-preserving feature can cause a running AM container to be marked LOST and killed while the NM daemon is being stopped. The exception looks like this:
{code}
2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used
2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM
2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060
2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001]
2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454
2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454
2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder
2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001
2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001
2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01
java.io.IOException: Interrupted while waiting for process 20001 to exit
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
{code}
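The trace above also shows why the exit code gets persisted incorrectly: reacquireContainer() surfaces the shutdown interrupt as a generic IOException, so the recovery path cannot distinguish "the NM is stopping" from "the container died" and records a failure exit code for a still-running container. A minimal sketch of a wait loop that keeps the interrupt distinct; the method and probe interface are illustrative, not the actual ContainerExecutor code.
{code}
import java.io.IOException;

// Illustrative wait loop: propagate an interrupt as its own signal instead
// of folding it into IOException, so a caller recovering containers across
// an NM restart does not persist a bogus exit code for a live container.
public class ReacquireSketch {
    interface ProcessProbe {
        boolean isAlive(String pid);
    }

    static void waitForProcess(ProcessProbe probe, String pid)
            throws IOException, InterruptedException {
        while (probe.isAlive(pid)) {
            // An interrupt (e.g. NM shutdown) leaves this loop via
            // InterruptedException, which the caller can treat as
            // "retry after restart" rather than "container failed".
            Thread.sleep(1000);
        }
    }

    public static void main(String[] args) throws Exception {
        ProcessProbe probe = pid -> false; // pretend the process already exited
        waitForProcess(probe, "20001");
        System.out.println("process exited; safe to persist its real exit code");
    }
}
{code}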
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212167#comment-14212167 ] Hudson commented on YARN-2766: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) YARN-2766. Made ApplicationHistoryManager return a sorted list of apps, attempts and containers. Contributed by Robert Kanter. (zjshen: rev 3648cb57c9f018a3a339c26f5a0ca2779485521a)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers -- Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.7.0 Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, YARN-2766.patch {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions assume the returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed; moreover, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap differs between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
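The general point is worth spelling out: HashMap iteration order is unspecified and can differ between Java 7 and 8, so either the producer returns a sorted list (as this fix does) or consumers sort before asserting. A small self-contained sketch of sorting values pulled from a map by a stable key; the types are illustrative, not the actual report classes.
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// HashMap iteration order is unspecified, so sort by a stable key before
// returning (or before asserting in a test). Types here are illustrative.
public class SortedReportsSketch {
    record AppReport(String appId) {}

    static List<AppReport> sortedApps(Map<String, AppReport> byId) {
        List<AppReport> apps = new ArrayList<>(byId.values());
        apps.sort(Comparator.comparing(AppReport::appId));
        return apps;
    }

    public static void main(String[] args) {
        Map<String, AppReport> byId = new HashMap<>();
        byId.put("application_2", new AppReport("application_2"));
        byId.put("application_1", new AppReport("application_1"));
        // Deterministic regardless of HashMap ordering or Java version.
        System.out.println(sortedApps(byId)); // application_1 first
    }
}
{code}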
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212166#comment-14212166 ] Hudson commented on YARN-2853: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) YARN-2853. Fixed a bug in ResourceManager causing apps to hang when the user kill request races with ApplicationMaster finish. Contributed by Jian He. (vinodkv: rev 3651fe1b089851b38be351c00a9899817166bf3e)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
YARN-2853. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev d648e60ebab7f1942dba92e9cd2cb62b8d70419b)
* hadoop-yarn-project/CHANGES.txt
Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2853.1.patch, YARN-2853.1.patch, YARN-2853.2.patch, YARN-2853.3.patch When killing an app, the app first moves to the KILLING state. If the RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it will ignore the later attempt_kill event. Hence, the RMApp cannot move to the KILLED state and stays in the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
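The underlying race is a state-machine gap: once the attempt has already unregistered, the app sitting in KILLING never receives the acknowledgement it waits for. A toy sketch of the shape of such a fix, letting KILLING accept either acknowledgement; the states and events are simplified, not the real RMAppImpl transition tables.
{code}
// Toy state machine showing the gap: if KILLING only accepts ATTEMPT_KILLED,
// an ATTEMPT_UNREGISTERED that wins the race strands the app in KILLING.
// States/events are simplified, not the real RMAppImpl transition tables.
public class KillRaceSketch {
    enum State { RUNNING, KILLING, KILLED, FINISHED }
    enum Event { KILL_REQUEST, ATTEMPT_KILLED, ATTEMPT_UNREGISTERED }

    static State next(State s, Event e) {
        return switch (s) {
            case RUNNING -> (e == Event.KILL_REQUEST) ? State.KILLING : State.RUNNING;
            // The fix's idea: while KILLING, treat either acknowledgement as
            // terminal instead of ignoring ATTEMPT_UNREGISTERED.
            case KILLING -> switch (e) {
                case ATTEMPT_KILLED -> State.KILLED;
                case ATTEMPT_UNREGISTERED -> State.KILLED;
                default -> State.KILLING;
            };
            default -> s;
        };
    }

    public static void main(String[] args) {
        State s = next(State.RUNNING, Event.KILL_REQUEST);   // KILLING
        s = next(s, Event.ATTEMPT_UNREGISTERED);             // KILLED, not stuck
        System.out.println(s);
    }
}
{code}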
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212158#comment-14212158 ] Hudson commented on YARN-2635: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9)
* hadoop-yarn-project/CHANGES.txt
TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
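Running the same test classes against both schedulers is typically expressed with JUnit's parameterized runner. A sketch of that pattern follows; the scheduler class names are the real YARN ones, but the test body and wiring are illustrative, and the actual patch may do this differently.
{code}
import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

// Pattern sketch: run one test class against both schedulers by
// parameterizing on the scheduler class name. The test body is illustrative.
@RunWith(Parameterized.class)
public class TestWithBothSchedulers {
    private final String schedulerClass;

    public TestWithBothSchedulers(String schedulerClass) {
        this.schedulerClass = schedulerClass;
    }

    @Parameterized.Parameters
    public static Collection<Object[]> schedulers() {
        return Arrays.asList(new Object[][] {
            { "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler" },
            { "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler" },
        });
    }

    @Test
    public void testSomethingSchedulerAgnostic() {
        // A real test would set yarn.resourcemanager.scheduler.class to
        // schedulerClass in the RM Configuration before starting the RM.
        System.out.println("running against " + schedulerClass);
    }
}
{code}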
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212157#comment-14212157 ] Hudson commented on YARN-2856: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) YARN-2856. Fixed RMAppImpl to handle ATTEMPT_KILLED event at ACCEPTED state on app recovery. Contributed by Rohith Sharmaks (jianhe: rev d005404ef7211fe96ce1801ed267a249568540fd)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt
Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: YARN-2856.1.patch, YARN-2856.patch Recovering an application whose attempt has KILLED as its final state throws the exception below, and the application remains in the ACCEPTED state forever.
{code}
2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673)
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
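The exception simply means the ACCEPTED state has no registered transition for ATTEMPT_KILLED, even though that event can legitimately arrive when an app is recovered with a killed attempt. A toy transition-table sketch of the fix's shape, registering the missing (state, event) pair; this is simplified, not the real org.apache.hadoop.yarn.state.StateMachineFactory.
{code}
import java.util.HashMap;
import java.util.Map;

// Toy transition table: the bug is a missing (ACCEPTED, ATTEMPT_KILLED)
// entry, so lookup fails during recovery. Simplified, not the real
// org.apache.hadoop.yarn.state.StateMachineFactory.
public class RecoveryTransitionSketch {
    enum State { ACCEPTED, RUNNING, KILLED }
    enum Event { ATTEMPT_REGISTERED, ATTEMPT_KILLED }

    record Key(State state, Event event) {}

    private final Map<Key, State> table = new HashMap<>();

    RecoveryTransitionSketch() {
        table.put(new Key(State.ACCEPTED, Event.ATTEMPT_REGISTERED), State.RUNNING);
        // The fix's shape: handle ATTEMPT_KILLED at ACCEPTED (seen when an
        // app is recovered with a KILLED attempt) instead of omitting it.
        table.put(new Key(State.ACCEPTED, Event.ATTEMPT_KILLED), State.KILLED);
    }

    State transition(State s, Event e) {
        State next = table.get(new Key(s, e));
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + e + " at " + s);
        }
        return next;
    }

    public static void main(String[] args) {
        var sm = new RecoveryTransitionSketch();
        System.out.println(sm.transition(State.ACCEPTED, Event.ATTEMPT_KILLED)); // KILLED
    }
}
{code}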
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212195#comment-14212195 ] Hudson commented on YARN-2853: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) YARN-2853. Fixed a bug in ResourceManager causing apps to hang when the user kill request races with ApplicationMaster finish. Contributed by Jian He. (vinodkv: rev 3651fe1b089851b38be351c00a9899817166bf3e)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
YARN-2853. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev d648e60ebab7f1942dba92e9cd2cb62b8d70419b)
* hadoop-yarn-project/CHANGES.txt
Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2853.1.patch, YARN-2853.1.patch, YARN-2853.2.patch, YARN-2853.3.patch When killing an app, the app first moves to the KILLING state. If the RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it will ignore the later attempt_kill event. Hence, the RMApp cannot move to the KILLED state and stays in the KILLING state forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212187#comment-14212187 ] Hudson commented on YARN-2635: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9)
* hadoop-yarn-project/CHANGES.txt
TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212196#comment-14212196 ] Hudson commented on YARN-2766: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) YARN-2766. Made ApplicationHistoryManager return a sorted list of apps, attempts and containers. Contributed by Robert Kanter. (zjshen: rev 3648cb57c9f018a3a339c26f5a0ca2779485521a)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java
ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers -- Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.7.0 Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, YARN-2766.patch {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions assume the returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed; moreover, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap differs between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212185#comment-14212185 ] Hudson commented on YARN-2846: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java * hadoop-yarn-project/CHANGES.txt Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2846-demo.patch, YARN-2846.patch The NM restart work preserving feature could make running AM container get LOST and killed during stop NM daemon. The exception is like below: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
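To make the failure mode above concrete: reacquireContainer() polls for process exit in a sleep loop, and an NM shutdown interrupts that sleep; if the interruption is wrapped into an IOException, the recovery path cannot tell a shutdown apart from a real process exit and ends up persisting a bogus exit code. The following minimal sketch shows the interruption-propagating shape of the fix; it is not the committed patch, and the isAlive()/readExitCode() helpers are hypothetical stand-ins for the real process checks.
{code}
// Minimal sketch of the reacquire loop and its shutdown race; NOT the
// committed YARN-2846 patch. isAlive()/readExitCode() are hypothetical.
public class ReacquireSketch {
  private static int polls = 0;

  // Hypothetical liveness probe: pretend the process exits after 5 polls.
  static boolean isAlive(String pid) { return ++polls < 5; }

  // Hypothetical exit-code read, valid only after a genuine exit.
  static int readExitCode(String pid) { return 0; }

  // Key point: declare InterruptedException instead of wrapping it in
  // IOException, so callers can tell shutdown apart from process exit.
  static int reacquire(String pid) throws InterruptedException {
    while (isAlive(pid)) {
      Thread.sleep(1000); // an NM stop interrupts this sleep
    }
    return readExitCode(pid); // reached only when the process really exited
  }

  public static void main(String[] args) {
    try {
      System.out.println("persisting exit code " + reacquire("20001"));
    } catch (InterruptedException e) {
      // Shutdown race: do not persist any exit code; the container may still
      // be running and should be reacquired after the NM restarts.
      System.out.println("interrupted during recovery; leaving state untouched");
    }
  }
}
{code}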
[jira] [Commented] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212191#comment-14212191 ] Hudson commented on YARN-2603: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) Revert YARN-2603. ApplicationConstants missing HADOOP_MAPRED_HOME (Ray Chiang via aw) (vinodkv: rev 4ae9780e6a05bfd6b93f1c871c22761ddd8b19cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java * hadoop-yarn-project/CHANGES.txt ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
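For context, the requested change amounts to one more constant in the ApplicationConstants.Environment enum (and, per the commit message above, the earlier addition was reverted on trunk). The sketch below only mirrors the enum's general shape; the constructor and accessor shown here are simplified stand-ins, not the real class.
{code}
// Simplified sketch of the Environment enum shape; not the real
// ApplicationConstants.Environment source. The point is the one-line addition.
public enum Environment {
  JAVA_HOME("JAVA_HOME"),
  HADOOP_COMMON_HOME("HADOOP_COMMON_HOME"),
  HADOOP_HDFS_HOME("HADOOP_HDFS_HOME"),
  HADOOP_YARN_HOME("HADOOP_YARN_HOME"),
  HADOOP_MAPRED_HOME("HADOOP_MAPRED_HOME"); // the constant the issue asks for

  private final String variable;

  Environment(String variable) { this.variable = variable; }

  // Name of the environment variable, e.g. for container launch contexts.
  public String key() { return variable; }
}
{code}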
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212186#comment-14212186 ] Hudson commented on YARN-2856: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) YARN-2856. Fixed RMAppImpl to handle ATTEMPT_KILLED event at ACCEPTED state on app recovery. Contributed by Rohith Sharmaks (jianhe: rev d005404ef7211fe96ce1801ed267a249568540fd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: YARN-2856.1.patch, YARN-2856.patch It is observed that recovering an application whose attempt has KILLED as its final state throws the exception below, and the application remains in the ACCEPTED state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
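The shape of such a fix is an extra arc in the RMAppImpl state machine so that ATTEMPT_KILLED is a legal event while a recovered app sits at ACCEPTED. The fragment below follows the YARN StateMachineFactory.addTransition() pattern used in RMAppImpl, but the target state and transition hook are assumptions, not the committed patch.
{code}
// Hedged fragment of RMAppImpl's transition table: register an arc so
// ATTEMPT_KILLED at ACCEPTED is handled instead of throwing
// InvalidStateTransitonException. Target state and hook are stand-ins;
// the committed patch may differ.
stateMachineFactory = stateMachineFactory.addTransition(
    RMAppState.ACCEPTED,               // state the app is recovered into
    RMAppState.KILLED,                 // honor the attempt's stored final state
    RMAppEventType.ATTEMPT_KILLED,
    new FinalTransition(RMAppState.KILLED)); // stand-in transition hook
{code}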
[jira] [Updated] (YARN-2863) ResourceManager will shutdown when job's queuename is empty
[ https://issues.apache.org/jira/browse/YARN-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated YARN-2863: -- Description: When I submit a job to the hadoop cluster without specifying a queuename, as follows {code} $HADOOP_HOMEhadoop jar statistics.jar com.iteblog.Sts -Dmapreduce.job.queuename= {code} and if *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException: {code} 2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) 2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} was: When I submit a job to the hadoop cluster without specifying a queuename, as follows {code} $HADOOP_HOMEhadoop jar statistics.jar com.wyp.Sts -Dmapreduce.job.queuename= {code} and if *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException: {code} 2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) 2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code}
[jira] [Updated] (YARN-2863) ResourceManager will shutdown when job's queuename is empty
[ https://issues.apache.org/jira/browse/YARN-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated YARN-2863: -- Summary: ResourceManager will shutdown when job's queuename is empty (was: ResourceManager will shutdown when job's queue is empty) ResourceManager will shutdown when job's queuename is empty --- Key: YARN-2863 URL: https://issues.apache.org/jira/browse/YARN-2863 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.2.0 Reporter: yangping wu Original Estimate: 8h Remaining Estimate: 8h When I submit a job to the hadoop cluster without specifying a queuename, as follows {code} $HADOOP_HOMEhadoop jar statistics.jar com.wyp.Sts -Dmapreduce.job.queuename= {code} and if *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException: {code} 2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) 2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2863) ResourceManager will shutdown when job's queuename is empty
[ https://issues.apache.org/jira/browse/YARN-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated YARN-2863: -- Description: When I submit a job to the hadoop cluster without specifying a queuename, as follows {code} $HADOOP_HOME/bin/hadoop jar statistics.jar com.iteblog.Sts -Dmapreduce.job.queuename= {code} and if *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException: {code} 2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) 2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} was: When I submit a job to the hadoop cluster without specifying a queuename, as follows {code} $HADOOP_HOMEhadoop jar statistics.jar com.iteblog.Sts -Dmapreduce.job.queuename= {code} and if *yarn.scheduler.fair.allow-undeclared-pools* is not overridden by the user (the default is true), then QueueManager will call the createLeafQueue method to create the queue, because mapreduce.job.queuename is empty. But this throws a MetricsException: {code} 2014-11-14 16:07:57,358 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ADDED to the scheduler org.apache.hadoop.metrics2.MetricsException: Metrics source QueueMetrics,q0=root already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueueMetrics.forQueue(FSQueueMetrics.java:94) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.init(FSQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.init(FSLeafQueue.java:57) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.createLeafQueue(QueueManager.java:191) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.assignToQueue(FairScheduler.java:652) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:610) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1015) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) 2014-11-14 16:07:57,359 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code}
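The crash is essentially a naming collision: an empty mapreduce.job.queuename leads the Fair Scheduler to create a leaf queue whose metrics source clashes with the already-registered root queue, and the MetricsException escapes as a fatal scheduler-event error. A defensive guard of the following shape (an illustration, not the committed fix) would route empty names to the default queue instead:
{code}
// Illustrative guard, not the committed YARN-2863 fix: normalize the requested
// queue name before asking QueueManager to create/fetch a leaf queue, so an
// empty name cannot produce a queue whose metrics collide with "root".
static String resolveQueueName(String requested) {
  if (requested == null || requested.trim().isEmpty()) {
    return "default"; // i.e. YarnConfiguration.DEFAULT_QUEUE_NAME
  }
  return requested.trim();
}
{code}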
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212279#comment-14212279 ] Hudson commented on YARN-2635: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9) * hadoop-yarn-project/CHANGES.txt TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
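Running one RM test class under both schedulers maps naturally onto JUnit's Parameterized runner. In the sketch below, the class name and the empty test body are placeholders; YarnConfiguration.RM_SCHEDULER, CapacityScheduler, and FairScheduler are real YARN names.
{code}
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Sketch of running one RM test class against both schedulers; the
// configuration wiring is the interesting part, the test body is a stub.
@RunWith(Parameterized.class)
public class TestRMWithBothSchedulers {

  @Parameters
  public static Collection<Object[]> schedulers() {
    return Arrays.asList(new Object[][] {
        { CapacityScheduler.class.getName() },
        { FairScheduler.class.getName() } });
  }

  private final Configuration conf = new YarnConfiguration();

  public TestRMWithBothSchedulers(String schedulerClass) {
    // Each parameterized run points the RM at a different scheduler.
    conf.set(YarnConfiguration.RM_SCHEDULER, schedulerClass);
  }

  @Test
  public void testRestartsUnderConfiguredScheduler() {
    // Placeholder: start an RM with conf and exercise restart paths here.
  }
}
{code}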
[jira] [Commented] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212283#comment-14212283 ] Hudson commented on YARN-2603: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) Revert YARN-2603. ApplicationConstants missing HADOOP_MAPRED_HOME (Ray Chiang via aw) (vinodkv: rev 4ae9780e6a05bfd6b93f1c871c22761ddd8b19cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java * hadoop-yarn-project/CHANGES.txt ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212278#comment-14212278 ] Hudson commented on YARN-2856: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) YARN-2856. Fixed RMAppImpl to handle ATTEMPT_KILLED event at ACCEPTED state on app recovery. Contributed by Rohith Sharmaks (jianhe: rev d005404ef7211fe96ce1801ed267a249568540fd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/CHANGES.txt Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: YARN-2856.1.patch, YARN-2856.patch It is observed that recovering an application whose attempt has KILLED as its final state throws the exception below, and the application remains in the ACCEPTED state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212277#comment-14212277 ] Hudson commented on YARN-2846: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2846-demo.patch, YARN-2846.patch The NM work-preserving restart feature can cause a running AM container to be marked LOST and killed while the NM daemon is being stopped. The exception looks like the following: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:177)
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212292#comment-14212292 ] Hudson commented on YARN-2635: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9) * hadoop-yarn-project/CHANGES.txt TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212300#comment-14212300 ] Hudson commented on YARN-2853: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) YARN-2853. Fixed a bug in ResourceManager causing apps to hang when the user kill request races with ApplicationMaster finish. Contributed by Jian He. (vinodkv: rev 3651fe1b089851b38be351c00a9899817166bf3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java YARN-2853. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev d648e60ebab7f1942dba92e9cd2cb62b8d70419b) * hadoop-yarn-project/CHANGES.txt Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2853.1.patch, YARN-2853.1.patch, YARN-2853.2.patch, YARN-2853.3.patch When killing an app, the app first moves to the KILLING state. If RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it will ignore the later attempt_kill event. Hence, RMApp cannot move to the KILLED state and stays at KILLING forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
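The race reads as a missing state-machine arc: once the app is at KILLING, an attempt-finished event from the racing AM unregistration must still drive the app to a terminal state rather than being dropped. The fragment below reuses RMAppImpl names where possible, but the exact arc and transition hook are assumptions about the fix, not the committed diff.
{code}
// Hedged fragment of RMAppImpl's transition table, illustrating the kind of
// arc that resolves the kill/unregister race: if the AM's finish wins, let
// KILLING still reach a terminal state instead of ignoring the event and
// hanging forever. Target state and hook are assumptions, not YARN-2853's diff.
stateMachineFactory = stateMachineFactory.addTransition(
    RMAppState.KILLING,
    RMAppState.FINAL_SAVING,              // persist the final state, then finish
    RMAppEventType.ATTEMPT_FINISHED,      // AM unregister beat the kill
    new FinalSavingTransition(
        new AppFinishedTransition(), RMAppState.FINISHED));
{code}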
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212291#comment-14212291 ] Hudson commented on YARN-2856: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) YARN-2856. Fixed RMAppImpl to handle ATTEMPT_KILLED event at ACCEPTED state on app recovery. Contributed by Rohith Sharmaks (jianhe: rev d005404ef7211fe96ce1801ed267a249568540fd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: YARN-2856.1.patch, YARN-2856.patch It is observed that recovering an application whose attempt has KILLED as its final state throws the exception below, and the application remains in the ACCEPTED state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212301#comment-14212301 ] Hudson commented on YARN-2766: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) YARN-2766. Made ApplicationHistoryManager return a sorted list of apps, attempts and containers. Contributed by Robert Kanter. (zjshen: rev 3648cb57c9f018a3a339c26f5a0ca2779485521a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers -- Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.7.0 Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, YARN-2766.patch {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions assume a returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed; moreover, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap differs between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
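The contract the fix establishes is determinism: return reports in a well-defined order rather than raw HashMap iteration order, which Java 7 and Java 8 are free to disagree on. A minimal way to express that contract (an illustration; the committed patch may sort differently) is to copy into a TreeMap, since ApplicationId is Comparable.
{code}
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;

// Illustration of the sorted-return contract, not the committed patch: a
// TreeMap iterates keys in natural order, so callers always see the same
// ordering regardless of the JDK's HashMap implementation details.
final class SortedReports {
  static Map<ApplicationId, ApplicationReport> sortById(
      Map<ApplicationId, ApplicationReport> apps) {
    // ApplicationId implements Comparable, so natural ordering is well-defined.
    return new TreeMap<ApplicationId, ApplicationReport>(apps);
  }
}
{code}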
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212353#comment-14212353 ] Hudson commented on YARN-2846: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2846-demo.patch, YARN-2846.patch The NM work-preserving restart feature can cause a running AM container to be marked LOST and killed while the NM daemon is being stopped. The exception looks like the following: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212365#comment-14212365 ] Hudson commented on YARN-2766: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) YARN-2766. Made ApplicationHistoryManager return a sorted list of apps, attempts and containers. Contributed by Robert Kanter. (zjshen: rev 3648cb57c9f018a3a339c26f5a0ca2779485521a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers -- Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.7.0 Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, YARN-2766.patch {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions assume a returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed; moreover, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap differs between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2856) Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED
[ https://issues.apache.org/jira/browse/YARN-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212354#comment-14212354 ] Hudson commented on YARN-2856: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) YARN-2856. Fixed RMAppImpl to handle ATTEMPT_KILLED event at ACCEPTED state on app recovery. Contributed by Rohith Sharmaks (jianhe: rev d005404ef7211fe96ce1801ed267a249568540fd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java Application recovery throw InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED Key: YARN-2856 URL: https://issues.apache.org/jira/browse/YARN-2856 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Rohith Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: YARN-2856.1.patch, YARN-2856.patch It is observed that recovering an application whose attempt has KILLED as its final state throws the exception below, and the application remains in the ACCEPTED state forever. {code} 2014-11-12 02:34:10,602 | ERROR | AsyncDispatcher event handler | Can't handle this event at current state | org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:673) org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at ACCEPTED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:671) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:90) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:730) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212360#comment-14212360 ] Hudson commented on YARN-2603: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) Revert YARN-2603. ApplicationConstants missing HADOOP_MAPRED_HOME (Ray Chiang via aw) (vinodkv: rev 4ae9780e6a05bfd6b93f1c871c22761ddd8b19cb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java * hadoop-yarn-project/CHANGES.txt ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212364#comment-14212364 ] Hudson commented on YARN-2853: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) YARN-2853. Fixed a bug in ResourceManager causing apps to hang when the user kill request races with ApplicationMaster finish. Contributed by Jian He. (vinodkv: rev 3651fe1b089851b38be351c00a9899817166bf3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java * hadoop-yarn-project/CHANGES.txt YARN-2853. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev d648e60ebab7f1942dba92e9cd2cb62b8d70419b) * hadoop-yarn-project/CHANGES.txt Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2853.1.patch, YARN-2853.1.patch, YARN-2853.2.patch, YARN-2853.3.patch When killing an app, the app first moves to the KILLING state. If RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it will ignore the later attempt_kill event. Hence, RMApp cannot move to the KILLED state and stays at KILLING forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212356#comment-14212356 ] Hudson commented on YARN-2635: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9) * hadoop-yarn-project/CHANGES.txt TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212390#comment-14212390 ] Hudson commented on YARN-2766: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) YARN-2766. Made ApplicationHistoryManager return a sorted list of apps, attempts and containers. Contributed by Robert Kanter. (zjshen: rev 3648cb57c9f018a3a339c26f5a0ca2779485521a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryClientService.java ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers -- Key: YARN-2766 URL: https://issues.apache.org/jira/browse/YARN-2766 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Robert Kanter Assignee: Robert Kanter Fix For: 2.7.0 Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, YARN-2766.patch {{TestApplicationHistoryClientService.testContainers}} and {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail because the test assertions assume a returned Collection is in a certain order. The collection comes from a HashMap, so the order is not guaranteed; moreover, according to [this page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], there are situations where the iteration order of a HashMap differs between Java 7 and 8. We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2853) Killing app may hang while AM is unregistering
[ https://issues.apache.org/jira/browse/YARN-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212389#comment-14212389 ] Hudson commented on YARN-2853: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) YARN-2853. Fixed a bug in ResourceManager causing apps to hang when the user kill request races with ApplicationMaster finish. Contributed by Jian He. (vinodkv: rev 3651fe1b089851b38be351c00a9899817166bf3e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java YARN-2853. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev d648e60ebab7f1942dba92e9cd2cb62b8d70419b) * hadoop-yarn-project/CHANGES.txt Killing app may hang while AM is unregistering -- Key: YARN-2853 URL: https://issues.apache.org/jira/browse/YARN-2853 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2853.1.patch, YARN-2853.1.patch, YARN-2853.2.patch, YARN-2853.3.patch When killing an app, the app first moves to the KILLING state. If RMAppAttempt receives the attempt_unregister event before the attempt_kill event, it will ignore the later attempt_kill event. Hence, RMApp cannot move to the KILLED state and stays at KILLING forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212385#comment-14212385 ] Hudson commented on YARN-2603: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) Revert YARN-2603. ApplicationConstants missing HADOOP_MAPRED_HOME (Ray Chiang via aw) (vinodkv: rev 4ae9780e6a05bfd6b93f1c871c22761ddd8b19cb) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212381#comment-14212381 ] Hudson commented on YARN-2635: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) YARN-2635. Merging to branch-2.6 for hadoop-2.6.0-rc1. (acmurthy: rev 81dc0ac6dcf2f34ad607da815ea0144f178691a9) * hadoop-yarn-project/CHANGES.txt TestRM, TestRMRestart, TestClientToAMTokens should run with both CS and FS -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch, yarn-2635-4.patch If we change the scheduler from the Capacity Scheduler to the Fair Scheduler, TestRMRestart fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2846) Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart.
[ https://issues.apache.org/jira/browse/YARN-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212378#comment-14212378 ] Hudson commented on YARN-2846: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) YARN-2846. Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. Contributed by Junping Du (jlowe: rev 33ea5ae92b9dd3abace104903d9a94d17dd75af5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java Incorrect persist exit code for running containers in reacquireContainer() that interrupted by NodeManager restart. --- Key: YARN-2846 URL: https://issues.apache.org/jira/browse/YARN-2846 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2846-demo.patch, YARN-2846.patch The NM work-preserving restart feature can cause a running AM container to be marked LOST and killed while the NM daemon is being stopped. The exception looks like the following: {code} 2014-11-11 00:48:35,214 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 22140 for container-id container_1415666714233_0001_01_84: 53.8 MB of 512 MB physical memory used; 931.3 MB of 1.0 GB virtual memory used 2014-11-11 00:48:35,223 ERROR nodemanager.NodeManager (SignalLogger.java:handle(60)) - RECEIVED SIGNAL 15: SIGTERM 2014-11-11 00:48:35,299 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50060 2014-11-11 00:48:35,337 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1415666714233_0001] 2014-11-11 00:48:35,338 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 45454 2014-11-11 00:48:35,344 INFO ipc.Server (Server.java:run(706)) - Stopping IPC Server listener on 45454 2014-11-11 00:48:35,346 INFO logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit 2014-11-11 00:48:35,347 INFO ipc.Server (Server.java:run(832)) - Stopping IPC Server Responder 2014-11-11 00:48:35,347 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(502)) - Aborting log aggregation for application_1415666714233_0001 2014-11-11 00:48:35,348 WARN logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(382)) - Aggregation did not complete for application application_1415666714233_0001 2014-11-11 00:48:35,358 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(476)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2014-11-11 00:48:35,406 ERROR launcher.RecoveredContainerLaunch (RecoveredContainerLaunch.java:call(87)) - Unable to recover container container_1415666714233_0001_01_01 java.io.IOException: Interrupted while waiting for process 20001 to exit at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:180) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:82) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:46) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at
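For readers following the fix: judging by the files touched in the commit, the direction is to let the interruption propagate out of reacquireContainer() rather than treating it as a container exit, so RecoveredContainerLaunch does not persist a bogus exit code for a still-running container. A minimal sketch of that pattern, with simplified signatures and hypothetical helpers (isAlive, readExitCode, markContainerCompleted), not the actual patch:
{code}
// Sketch only: helper names are illustrative, not the real NM APIs.
int reacquireContainer(String pid)
    throws IOException, InterruptedException {
  while (isAlive(pid)) {
    // An interrupt here means "the NM is stopping", not "the container
    // exited" -- let InterruptedException propagate to the caller.
    Thread.sleep(1000);
  }
  return readExitCode(pid);
}

// Caller (a RecoveredContainerLaunch-style task): persist an exit code
// only for a real exit.
try {
  int exitCode = reacquireContainer(pid);
  markContainerCompleted(containerId, exitCode);
} catch (InterruptedException e) {
  LOG.info("Interrupted while reacquiring " + containerId
      + "; not recording an exit code");
  Thread.currentThread().interrupt();
}
{code}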
[jira] [Created] (YARN-2864) TestRMWebServicesAppsModification fails in trunk
Ted Yu created YARN-2864: Summary: TestRMWebServicesAppsModification fails in trunk Key: YARN-2864 URL: https://issues.apache.org/jira/browse/YARN-2864 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor From https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/console : {code} Tests run: 32, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 151.14 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testGetNewApplicationAndSubmit[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.276 sec ERROR! java.lang.NoClassDefFoundError: org/apache/hadoop/io/FastByteComparisons at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:187) at org.apache.hadoop.io.BinaryComparable.compareTo(BinaryComparable.java:50) at org.apache.hadoop.io.BinaryComparable.equals(BinaryComparable.java:72) at org.apache.hadoop.io.Text.equals(Text.java:348) at java.util.ArrayList.indexOf(ArrayList.java:216) at java.util.ArrayList.contains(ArrayList.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:844) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) testGetNewApplicationAndSubmit[3](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.225 sec ERROR! java.lang.NoClassDefFoundError: org/apache/hadoop/io/FastByteComparisons at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:187) at org.apache.hadoop.io.BinaryComparable.compareTo(BinaryComparable.java:50) at org.apache.hadoop.io.BinaryComparable.equals(BinaryComparable.java:72) at org.apache.hadoop.io.Text.equals(Text.java:348) at java.util.ArrayList.indexOf(ArrayList.java:216) at java.util.ArrayList.contains(ArrayList.java:199) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:844) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) {code} Running on MacBook, I got (with Java 1.7.0_60): {code} Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification Tests run: 32, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 146.749 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification testGetNewApplicationAndSubmit[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) Time elapsed: 0.185 sec FAILURE! 
java.lang.AssertionError: expected:<Accepted> but was:<Internal Server Error> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testAppSubmit(TestRMWebServicesAppsModification.java:799) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testGetNewApplicationAndSubmit(TestRMWebServicesAppsModification.java:726) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2745) YARN new pluggable scheduler which does multi-resource packing
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212586#comment-14212586 ] Srikanth Kandula commented on YARN-2745: Thanks Karthik, that is an interesting thought. It seems that several of the proposed work-items (resource estimation, expanded asks, modifications to task matching on NM heartbeat) have to happen regardless of whether this is a new scheduler or a flag atop existing ones like FairScheduler. Do you foresee any additional complications in building this as a flag as opposed to stand-alone? Will take this offline. YARN new pluggable scheduler which does multi-resource packing -- Key: YARN-2745 URL: https://issues.apache.org/jira/browse/YARN-2745 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Reporter: Robert Grandl Attachments: sigcomm_14_tetris_talk.pptx, tetris_paper.pdf In this umbrella JIRA we propose a new pluggable scheduler that accounts for all resources used by a task (CPU, memory, disk, network) and is able to achieve three competing objectives: fairness, improved cluster utilization, and reduced average job completion time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
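For readers who have not opened the attached tetris_paper.pdf: the core packing heuristic scores each (task, machine) pair by the dot product of the task's demand vector and the machine's free-resource vector across the dimensions the description lists, then schedules the highest-scoring fit. A minimal sketch of that scoring step (the array layout is illustrative):
{code}
// Dot-product alignment score (sketch). The index order, e.g. CPU, memory,
// disk, network, just has to match between the two vectors.
static double alignmentScore(double[] taskDemand, double[] machineFree) {
  double score = 0.0;
  for (int r = 0; r < taskDemand.length; r++) {
    if (taskDemand[r] > machineFree[r]) {
      return -1.0; // the task does not fit on this machine at all
    }
    score += taskDemand[r] * machineFree[r];
  }
  return score; // higher score = better packing alignment
}
{code}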
[jira] [Commented] (YARN-2165) Timelineserver should validate that yarn.timeline-service.ttl-ms is greater than zero
[ https://issues.apache.org/jira/browse/YARN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212595#comment-14212595 ] Zhijie Shen commented on YARN-2165: --- bq. should the check be (<= 0) instead of (< 0)? Since a ttl and ttl-interval of 0 have no real meaning. Agree. To be more general, it's better to do the sanity check for all the numeric configurations while initializing the timeline server, making sure a valid number has been set. Here's the current list.
{code}
<property>
  <description>Time to live for timeline store data in milliseconds.</description>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>60480</value>
</property>
<property>
  <description>Length of time to wait between deletion cycles of leveldb timeline store in milliseconds.</description>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>30</value>
</property>
<property>
  <description>Size of read cache for uncompressed blocks for leveldb timeline store in bytes.</description>
  <name>yarn.timeline-service.leveldb-timeline-store.read-cache-size</name>
  <value>104857600</value>
</property>
<property>
  <description>Size of cache for recently read entity start times for leveldb timeline store in number of entities.</description>
  <name>yarn.timeline-service.leveldb-timeline-store.start-time-read-cache-size</name>
  <value>1</value>
</property>
<property>
  <description>Size of cache for recently written entity start times for leveldb timeline store in number of entities.</description>
  <name>yarn.timeline-service.leveldb-timeline-store.start-time-write-cache-size</name>
  <value>1</value>
</property>
<property>
  <description>Handler thread count to serve the client RPC requests.</description>
  <name>yarn.timeline-service.handler-thread-count</name>
  <value>10</value>
</property>
<property>
  <description>Default maximum number of retries for timeline service client.</description>
  <name>yarn.timeline-service.client.max-retries</name>
  <value>30</value>
</property>
<property>
  <description>Default retry time interval for timeline service client.</description>
  <name>yarn.timeline-service.client.retry-interval-ms</name>
  <value>1000</value>
</property>
{code}
Timelineserver should validate that yarn.timeline-service.ttl-ms is greater than zero - Key: YARN-2165 URL: https://issues.apache.org/jira/browse/YARN-2165 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Attachments: YARN-2165.patch Timelineserver should validate that yarn.timeline-service.ttl-ms is greater than zero. Currently if we set yarn.timeline-service.ttl-ms=0 or yarn.timeline-service.ttl-ms=-86400, the timeline server starts successfully without complaining: {code} 2014-06-15 14:52:16,562 INFO timeline.LeveldbTimelineStore (LeveldbTimelineStore.java:init(247)) - Starting deletion thread with ttl -60480 and cycle interval 30 {code} At startup, the timeline server should validate that yarn.timeline-service.ttl-ms > 0; otherwise, especially for a negative value, the discardOldEntities timestamp will be set to a future value, which may lead to inconsistency in behavior:
{code}
public void run() {
  while (true) {
    long timestamp = System.currentTimeMillis() - ttl;
    try {
      discardOldEntities(timestamp);
      Thread.sleep(ttlInterval);
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
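A minimal sketch of the one-pass sanity check proposed above, assuming it runs while the timeline store initializes; the helper and the default literals are illustrative, while the configuration keys come from the list in the comment:
{code}
import org.apache.hadoop.conf.Configuration;

// Sketch: fail fast at startup when a numeric timeline config is invalid.
static long getPositiveLong(Configuration conf, String key, long defaultValue) {
  long value = conf.getLong(key, defaultValue);
  if (value <= 0) {
    throw new IllegalArgumentException(
        key + " must be greater than zero, but was " + value);
  }
  return value;
}

// e.g. during LeveldbTimelineStore initialization (illustrative call sites
// and defaults):
long ttlMs = getPositiveLong(conf,
    "yarn.timeline-service.ttl-ms", 604800000L);
long ttlIntervalMs = getPositiveLong(conf,
    "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms", 300000L);
{code}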
[jira] [Commented] (YARN-2166) Timelineserver should validate that yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than zero when level db is for timeline store
[ https://issues.apache.org/jira/browse/YARN-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212600#comment-14212600 ] Zhijie Shen commented on YARN-2166: --- See the comments on [YARN-2165|https://issues.apache.org/jira/browse/YARN-2165?focusedCommentId=14212595page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14212595]. How about having one pass that does the sanity check for all numeric configs? Timelineserver should validate that yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than zero when level db is for timeline store - Key: YARN-2166 URL: https://issues.apache.org/jira/browse/YARN-2166 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Karam Singh Timelineserver should validate that yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than zero when leveldb is used for the timeline store. Otherwise, if we start the timeline server with yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms=-5000, the timeline server starts, but the Thread.sleep call in EntityDeletionThread.run keeps throwing an UncaughtException for the negative value: {code} 2014-06-16 10:22:03,537 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4,5,main] threw an Exception. java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore$EntityDeletionThread.run(LeveldbTimelineStore.java:257) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2811: -- Attachment: YARN-2811.v8.patch Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch, YARN-2811.v6.patch, YARN-2811.v7.patch, YARN-2811.v8.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212640#comment-14212640 ] Ming Ma commented on YARN-2862: --- Here are some possible ways to fix it. 1) Fix RMAppManager's recoverApplication to ignore any unrecoverable app. 2) Fix RawLocalFileSystem used by FileSystemRMStateStore to force-sync data to the disk device. 3) Fix FileSystemRMStateStore to skip apps with a null ApplicationState#context. Sounds like #3 is the best given the usage scenario of FileSystemRMStateStore. Also, RM should expect each implementation of RMStateStore#loadState to load valid ApplicationState into RMState. Thoughts? RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
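A minimal sketch of option 3 above, assuming the skip happens where FileSystemRMStateStore iterates application files during state loading; rmAppRoot and loadApplicationState are hypothetical stand-ins for the existing parsing path:
{code}
// Sketch: a zero-length app file (its blocks never hit disk before the
// hard shutdown) is unrecoverable -- skip it instead of building a null
// ApplicationState#context that later NPEs in RMAppManager.
for (FileStatus appFile : fs.listStatus(rmAppRoot)) {
  if (appFile.getLen() == 0) {
    LOG.warn("Skipping unrecoverable (zero-length) state file: "
        + appFile.getPath());
    continue;
  }
  loadApplicationState(rmState, appFile); // hypothetical normal path
}
{code}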
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212671#comment-14212671 ] Gera Shegalov commented on YARN-2862: - [~mingma], It's potentially already fixed by YARN-2010. We can try it for our scenario. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2604) Scheduler should consider max-allocation-* in conjunction with the largest node
[ https://issues.apache.org/jira/browse/YARN-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212672#comment-14212672 ] Karthik Kambatla commented on YARN-2604: +1 to what Jason said. Reusing the configs introduced in YARN-2001 sounds like the right way to me too. Scheduler should consider max-allocation-* in conjunction with the largest node --- Key: YARN-2604 URL: https://issues.apache.org/jira/browse/YARN-2604 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Robert Kanter Attachments: YARN-2604.patch, YARN-2604.patch, YARN-2604.patch If the scheduler max-allocation-* values are larger than the resources available on the largest node in the cluster, an application requesting resources between the two values will be accepted by the scheduler but the requests will never be satisfied. The app essentially hangs forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
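A minimal sketch of the direction under discussion, assuming the scheduler clamps its configured maximums to the largest registered node before validating a request (the placement and variable names are illustrative, not the eventual patch):
{code}
// Sketch: effective max = component-wise min(configured max, largest node),
// so a request no node can ever satisfy is rejected up front instead of
// hanging forever.
Resource effectiveMax = Resource.newInstance(
    Math.min(configuredMax.getMemory(), largestNode.getMemory()),
    Math.min(configuredMax.getVirtualCores(), largestNode.getVirtualCores()));
if (!Resources.fitsIn(request.getCapability(), effectiveMax)) {
  throw new InvalidResourceRequestException("Requested resource "
      + request.getCapability() + " exceeds the maximum any node can satisfy: "
      + effectiveMax);
}
{code}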
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212769#comment-14212769 ] Hadoop QA commented on YARN-2811: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681587/YARN-2811.v8.patch against trunk revision 1a1dcce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5844//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5844//console This message is automatically generated. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch, YARN-2811.v6.patch, YARN-2811.v7.patch, YARN-2811.v8.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned YARN-1530: --- Assignee: Mit Desai [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Mit Desai Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-1530: Assignee: (was: Mit Desai) [Umbrella] Store, manage and serve per-framework application-timeline data -- Key: YARN-1530 URL: https://issues.apache.org/jira/browse/YARN-1530 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, application timeline design-20140116.pdf, application timeline design-20140130.pdf, application timeline design-20140210.pdf This is a sibling JIRA for YARN-321. Today, each application/framework has to do store, and serve per-framework data all by itself as YARN doesn't have a common solution. This JIRA attempts to solve the storage, management and serving of per-framework data from various applications, both running and finished. The aim is to change YARN to collect and store data in a generic manner with plugin points for frameworks to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai reassigned YARN-2375: --- Assignee: Mit Desai Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212775#comment-14212775 ] Rohith commented on YARN-2588: -- There is another similar hidden issue after this patch. Should I raise another JIRA or provide an add-on patch to this one? Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2588.1.patch, YARN-2588.2.patch, YARN-2588.patch Consider a scenario where the standby RM fails to transition to Active because of a ZK exception (connectionLoss or SessionExpired). Then any further transition to Active for the same RM does not move the RM to the Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212777#comment-14212777 ] Karthik Kambatla commented on YARN-2588: Let us do it on another JIRA, given this is already committed to 2.6.0. Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2588.1.patch, YARN-2588.2.patch, YARN-2588.patch Consider a scenario where the standby RM fails to transition to Active because of a ZK exception (connectionLoss or SessionExpired). Then any further transition to Active for the same RM does not move the RM to the Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2857) ConcurrentModificationException in ContainerLogAppender
[ https://issues.apache.org/jira/browse/YARN-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212792#comment-14212792 ] Jason Lowe commented on YARN-2857: -- +1 lgtm. Committing this. ConcurrentModificationException in ContainerLogAppender --- Key: YARN-2857 URL: https://issues.apache.org/jira/browse/YARN-2857 Project: Hadoop YARN Issue Type: Bug Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Priority: Critical Attachments: ContainerLogAppender.java, MAPREDUCE-6139-test.01.patch, MAPREDUCE-6139.1.patch, MAPREDUCE-6139.2.patch, MAPREDUCE-6139.3.patch, YARN-2857.3.patch Context: * Hadoop-2.3.0 * Using Oozie 4.0.1 * Pig version 0.11.x The job is submitted by Oozie to launch Pig script. The following exception traces were found on MR task log: In syslog: {noformat} 2014-10-24 20:37:29,317 WARN [Thread-5] org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '' failed, java.util.ConcurrentModificationException java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.Hierarchy.shutdown(Hierarchy.java:471) at org.apache.log4j.LogManager.shutdown(LogManager.java:267) at org.apache.hadoop.mapred.TaskLog.syncLogsShutdown(TaskLog.java:286) at org.apache.hadoop.mapred.TaskLog$2.run(TaskLog.java:339) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 2014-10-24 20:37:29,395 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 
{noformat} in stderr: {noformat} java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:759) at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514) at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:440) at org.apache.pig.Main.configureLog4J(Main.java:740) at org.apache.pig.Main.run(Main.java:384) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37) at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
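The race being fixed: log4j's shutdown hook calls ContainerLogAppender.close(), which iterates the appender's internal tail buffer while another thread is still appending to it. A minimal sketch of one way to make that safe, synchronizing every touch of the shared buffer (a simplification, not necessarily the committed patch; maxEvents and flushEvent are illustrative):
{code}
import java.util.LinkedList;
import java.util.Queue;
import org.apache.log4j.spi.LoggingEvent;

// Sketch: guard the shared tail queue so close() can't iterate it while
// append() mutates it from another thread.
private final Queue<LoggingEvent> tail = new LinkedList<LoggingEvent>();
private final int maxEvents = 4096; // illustrative cap on the buffered tail

protected synchronized void append(LoggingEvent event) {
  tail.add(event);
  if (tail.size() > maxEvents) {
    tail.remove(); // keep only the tail of the log
  }
}

public synchronized void close() {
  for (LoggingEvent event : tail) {
    flushEvent(event); // illustrative: write the buffered tail out
  }
  tail.clear();
}
{code}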
[jira] [Commented] (YARN-2857) ConcurrentModificationException in ContainerLogAppender
[ https://issues.apache.org/jira/browse/YARN-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212810#comment-14212810 ] Hudson commented on YARN-2857: -- FAILURE: Integrated in Hadoop-trunk-Commit #6545 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6545/]) YARN-2857. ConcurrentModificationException in ContainerLogAppender. Contributed by Mohammad Kamrul Islam (jlowe: rev f2fe8a800e5b0f3875931adba9ae20c6a95aa4ff) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/ContainerLogAppender.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLogAppender.java ConcurrentModificationException in ContainerLogAppender --- Key: YARN-2857 URL: https://issues.apache.org/jira/browse/YARN-2857 Project: Hadoop YARN Issue Type: Bug Reporter: Mohammad Kamrul Islam Assignee: Mohammad Kamrul Islam Priority: Critical Fix For: 2.7.0 Attachments: ContainerLogAppender.java, MAPREDUCE-6139-test.01.patch, MAPREDUCE-6139.1.patch, MAPREDUCE-6139.2.patch, MAPREDUCE-6139.3.patch, YARN-2857.3.patch Context: * Hadoop-2.3.0 * Using Oozie 4.0.1 * Pig version 0.11.x The job is submitted by Oozie to launch Pig script. The following exception traces were found on MR task log: In syslog: {noformat} 2014-10-24 20:37:29,317 WARN [Thread-5] org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '' failed, java.util.ConcurrentModificationException java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.Hierarchy.shutdown(Hierarchy.java:471) at org.apache.log4j.LogManager.shutdown(LogManager.java:267) at org.apache.hadoop.mapred.TaskLog.syncLogsShutdown(TaskLog.java:286) at org.apache.hadoop.mapred.TaskLog$2.run(TaskLog.java:339) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) 2014-10-24 20:37:29,395 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 
{noformat} in stderr: {noformat} java.util.ConcurrentModificationException at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:966) at java.util.LinkedList$ListItr.next(LinkedList.java:888) at org.apache.hadoop.yarn.ContainerLogAppender.close(ContainerLogAppender.java:94) at org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141) at org.apache.log4j.Category.removeAllAppenders(Category.java:891) at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:759) at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514) at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:440) at org.apache.pig.Main.configureLog4J(Main.java:740) at org.apache.pig.Main.run(Main.java:384) at org.apache.pig.PigRunner.run(PigRunner.java:49) at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:283) at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:223) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37) at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201411142002.txt [~leftnoteasy], Thank you for all of your help. Uploading new patch. bq. Instead of multiply you should use multiplyAndNormalizeUp here. Using {{multiplyAndNormalizeUp}} helps. However, for the use case in {{testHierarchicalLarge}}, the rounding is still different with the new algorithm (7 and 5 instead of 9 and 4). bq. Actually I think we should consider minimum_allocation in preemption policy, we can address that in a separated JIRA. Would you please create a new JIRA and elaborate on this further? {quote} bq. {{testDisablePreemptionOverCapPlusPending}} Since the result is not changed before/after we set preemption queue, I think it is unnecessary, I would suggest to take it out. {quote} I removed this test. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, YARN-2056.201411041635.txt, YARN-2056.201411072153.txt, YARN-2056.201411122305.txt, YARN-2056.201411132215.txt, YARN-2056.201411142002.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
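For readers following the rounding discussion: multiplyAndNormalizeUp scales a resource and then rounds each component up to the next multiple of a step (typically the minimum allocation), which is how the test's split can come out as 7 and 5 under one algorithm and 9 and 4 under another. A worked sketch of that arithmetic in plain longs, not the Resources API:
{code}
// multiplyAndRoundUp(value, by, step): scale, then round the result up
// to a multiple of step.
static long multiplyAndRoundUp(long value, double by, long step) {
  long scaled = (long) Math.ceil(value * by);
  return ((scaled + step - 1) / step) * step; // next multiple of step
}

// Example: 10240 MB * 0.66 = 6758.4 -> ceil 6759
// -> rounded up to 7680 MB (5 x a 1536 MB minimum allocation)
{code}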
[jira] [Created] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
Rohith created YARN-2865: Summary: Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith YARN-2588 handles the exception thrown while transitioning to Active and resets activeServices, but it misses clearing the RMContext apps/nodes details and the ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212835#comment-14212835 ] Rohith commented on YARN-2588: -- Thanks Karthik!! I have raised YARN-2865 Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2588.1.patch, YARN-2588.2.patch, YARN-2588.patch Consider a scenario where the standby RM fails to transition to Active because of a ZK exception (connectionLoss or SessionExpired). Then any further transition to Active for the same RM does not move the RM to the Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212838#comment-14212838 ] Jason Lowe commented on YARN-2816: -- +1 lgtm. Committing this. NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2816.000.patch, YARN-2816.001.patch, YARN-2816.002.patch, leveldb_records.txt NM fails to start with an NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is that some DB files used in NMLeveldbStateStoreService were accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves incomplete container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When the container is recovered at ContainerManagerImpl#recoverContainer, the NullPointerException at the following code causes NM shutdown.
{code}
StartContainerRequest req = rcs.getStartRequest();
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
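A minimal sketch of the defensive direction described above, treating a record whose start request is missing as incomplete rather than dereferencing it; whether recovery skips the record or fails fast with a clear error is a design choice, and apart from the recovery types the names here are illustrative:
{code}
// Sketch: don't NPE on a truncated state-store record.
for (RecoveredContainerState rcs : stateStore.loadContainersState()) {
  if (rcs.getStartRequest() == null) {
    LOG.warn("Skipping container with incomplete recovery state "
        + "(missing start request); the leveldb record is truncated");
    continue; // or: throw an IOException to fail fast with context
  }
  recoverContainer(rcs); // normal recovery path
}
{code}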
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212836#comment-14212836 ] Jian He commented on YARN-2862: --- YARN-2010 may not solve this. YARN-1185 might have fixed this. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212842#comment-14212842 ] Rohith commented on YARN-2865: -- I encountered this scenario in my test cluster in a strange way, causing the exception below. {code} 2014-11-14 04:11:33,433 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2 applications 2014-11-14 04:11:33,433 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority level is set to application:application_1415591025732_0001 2014-11-14 04:11:33,433 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application with id application_1415591025732_0001 is already present! Cannot add a duplicate! 2014-11-14 04:11:33,433 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state org.apache.hadoop.yarn.exceptions.YarnException: Application with id application_1415591025732_0001 is already present! Cannot add a duplicate! at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:364) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:332) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1146) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:521) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:925) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:966) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:962) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1612) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:962) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:281) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:805) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:602) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) {code} Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith YARN-2588 handles the exception thrown while transitioning to Active and resets activeServices, but it misses clearing the RMContext apps/nodes details and the ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2865: - Attachment: YARN-2865.patch Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Attachments: YARN-2865.patch YARN-2588 handles the exception thrown while transitioning to Active and resets activeServices, but it misses clearing the RMContext apps/nodes details and the ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212845#comment-14212845 ] Rohith commented on YARN-2865: -- Attaching the patch that clears the RMContext, cluster metrics, and queue metrics. I have also refactored the common methods called from transitionToActive and transitionToStandBy. Please review the patch. Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Attachments: YARN-2865.patch YARN-2588 handles the exception thrown while transitioning to Active and resets activeServices, but it misses clearing the RMContext apps/nodes details and the ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
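A minimal sketch of what the clearing described above could look like before active services are recreated; QueueMetrics.clearQueueMetrics and ClusterMetrics.destroy are existing hooks, while the surrounding method is illustrative of the patch's intent rather than its exact content:
{code}
// Sketch: after a failed transitionToActive, drop everything the previous
// attempt registered so the retry doesn't trip over duplicates
// ("Application with id ... is already present", "Metrics source ...
// already exists").
void resetActiveState(RMContext rmContext, Configuration conf) {
  rmContext.getRMApps().clear();        // apps get re-added by recovery
  rmContext.getRMNodes().clear();       // nodes re-register on their own
  rmContext.getInactiveRMNodes().clear();
  QueueMetrics.clearQueueMetrics(conf); // unregister queue metrics sources
  ClusterMetrics.destroy();             // reset cluster-wide metrics
}
{code}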
[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212860#comment-14212860 ] Hudson commented on YARN-2816: -- FAILURE: Integrated in Hadoop-trunk-Commit #6549 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6549/]) YARN-2816. NM fail to start with NPE during container recovery. Contributed by Zhihai Xu (jlowe: rev 49c38898b0be64fc686d039ed2fb2dea1378df02) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java NM fail to start with NPE during container recovery --- Key: YARN-2816 URL: https://issues.apache.org/jira/browse/YARN-2816 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2816.000.patch, YARN-2816.001.patch, YARN-2816.002.patch, leveldb_records.txt NM fails to start with an NPE during container recovery. We saw the following crash happen: 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492) The reason is that some DB files used in NMLeveldbStateStoreService were accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves incomplete container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When the container is recovered at ContainerManagerImpl#recoverContainer, the NullPointerException at the following code causes NM shutdown.
{code}
StartContainerRequest req = rcs.getStartRequest();
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2375: -- Description: This JIRA is to remove the ATS enabled-flag check within the TimelineClientImpl. An example where this fails: while running a secure timeline server with the ATS flag set to disabled on the resource manager, the timeline delegation token renewer throws an NPE. Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ATS enabled-flag check within the TimelineClientImpl. An example where this fails: while running a secure timeline server with the ATS flag set to disabled on the resource manager, the timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
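A minimal sketch of what a per-framework switch could look like on the client side, with a hypothetical job-level key falling back to the cluster-wide yarn.timeline-service.enabled flag (the per-framework key name is invented for illustration):
{code}
// Sketch: let the framework opt in or out explicitly instead of
// TimelineClientImpl consulting only the cluster-wide flag.
boolean timelineEnabled = conf.getBoolean(
    "myframework.timeline-service.enabled", // hypothetical per-framework key
    conf.getBoolean("yarn.timeline-service.enabled", false));
if (timelineEnabled) {
  TimelineClient client = TimelineClient.createTimelineClient();
  client.init(conf);
  client.start();
  // ... publish entities, then client.stop() on shutdown
}
{code}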
[jira] [Updated] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2811: -- Attachment: YARN-2811.v9.patch Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch, YARN-2811.v6.patch, YARN-2811.v7.patch, YARN-2811.v8.patch, YARN-2811.v9.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2866) Capacity scheduler preemption policy should respect yarn.scheduler.minimum-allocation-mb when computing resource of queues
Wangda Tan created YARN-2866: Summary: Capacity scheduler preemption policy should respect yarn.scheduler.minimum-allocation-mb when computing resource of queues Key: YARN-2866 URL: https://issues.apache.org/jira/browse/YARN-2866 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Currently, capacity scheduler preemption logic doesn't respect minimum_allocation when computing ideal_assign/guaranteed_resource, etc. We should respect it to avoid some potential rounding issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212899#comment-14212899 ] Wangda Tan commented on YARN-2056: -- [~eepayne], Thanks for the update. bq. Would you please create a new JIRA and elaborate on this further? Created YARN-2866 to track this issue. The latest patch LGTM, +1. Would you like to take a look, [~vinodkv], [~mayank_bansal]? Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, YARN-2056.201411041635.txt, YARN-2056.201411072153.txt, YARN-2056.201411122305.txt, YARN-2056.201411132215.txt, YARN-2056.201411142002.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2865: -- Priority: Critical (was: Major) Target Version/s: 2.7.0 Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Critical Attachments: YARN-2865.patch YARN-2588 handles the exception thrown while transitioning to active and resets activeServices, but it misses clearing the RMContext apps/nodes details as well as ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
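As a hedged sketch of the kind of cleanup the YARN-2865 description calls for (the reset points are assumptions for illustration, not the attached patch; ClusterMetrics.destroy() and QueueMetrics.clearQueueMetrics() are existing ResourceManager helpers):
{code}
// Illustration: when transitionToActive fails and active services are reset,
// also drop recovered state so a retried recovery does not hit
// "Application with id already present".
rmContext.getRMApps().clear();
rmContext.getRMNodes().clear();
ClusterMetrics.destroy();           // reset the cluster metrics singleton
QueueMetrics.clearQueueMetrics();   // drop cached per-queue metrics
{code}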
[jira] [Updated] (YARN-2414) RM web UI: app page will crash if app is failed before any attempt has been created
[ https://issues.apache.org/jira/browse/YARN-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2414: -- Target Version/s: 2.7.0 RM web UI: app page will crash if app is failed before any attempt has been created --- Key: YARN-2414 URL: https://issues.apache.org/jira/browse/YARN-2414 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Zhijie Shen Assignee: Wangda Tan Attachments: YARN-2414.patch {code} 2014-08-12 16:45:13,573 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/app/application_1407887030038_0001 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:460) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1191) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: java.lang.NullPointerException at
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212928#comment-14212928 ] Zhijie Shen commented on YARN-2375: --- bq. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. This is a bug. DT-related API methods don't check whether isEnabled == true. On the other hand, the internals are only initialized when isEnabled == true. This is why the NPE happens. I will file a separate JIRA for it. As to removing the global flag, I'm not sure we should do that. Today we still don't assume the timeline server is always up, unlike other components in a YARN cluster such as the RM and NM. So if the timeline server is not set up but the YARN cluster assumes it is up, it will result in problems. For example, app submission fails at getting the timeline DT in a secure cluster. Therefore, this config should be kept to serve as the flag indicating whether we have set up the timeline server for the YARN cluster, until we promote it to be an always-on daemon like the RM and NM. Thoughts? Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ats enabled flag check within the TimelineClientImpl. Example where this fails is below. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2404) Remove ApplicationAttemptState and ApplicationState class in RMStateStore class
[ https://issues.apache.org/jira/browse/YARN-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2404: -- Target Version/s: 2.7.0 Remove ApplicationAttemptState and ApplicationState class in RMStateStore class Key: YARN-2404 URL: https://issues.apache.org/jira/browse/YARN-2404 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Tsuyoshi OZAWA Attachments: YARN-2404.1.patch, YARN-2404.2.patch, YARN-2404.3.patch, YARN-2404.4.patch We can remove the ApplicationState and ApplicationAttemptState classes in RMStateStore, given that we already have the ApplicationStateData and ApplicationAttemptStateData records. We may just replace ApplicationState with ApplicationStateData, and similarly for ApplicationAttemptState. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2867) TimelineClient DT methods should check if the timeline service is enabled or not
Zhijie Shen created YARN-2867: - Summary: TimelineClient DT methods should check if the timeline service is enabled or not Key: YARN-2867 URL: https://issues.apache.org/jira/browse/YARN-2867 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Zhijie Shen DT-related methods don't check whether isEnabled == true, while the internals are only initialized when isEnabled == true. An NPE happens if users call these methods when the timeline service is not enabled in the config. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
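A minimal sketch of the guard being asked for, assuming a boolean isEnabled field as described above (illustrative only, not the eventual fix):
{code}
// Fail fast with a clear error instead of an NPE when a delegation-token
// method is invoked while the timeline service is disabled.
private void checkTimelineServiceEnabled() throws YarnException {
  if (!isEnabled) {
    throw new YarnException("Timeline service is not enabled; set"
        + " yarn.timeline-service.enabled to true before requesting"
        + " delegation tokens");
  }
}
{code}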
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212932#comment-14212932 ] Zhijie Shen commented on YARN-2375: --- Filed YARN-2867 Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ats enabled flag check within the TimelineClientImpl. Example where this fails is below. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2403) TestNodeManagerResync fails occasionally in trunk
[ https://issues.apache.org/jira/browse/YARN-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212934#comment-14212934 ] Jian He commented on YARN-2403: --- Is this still happening? Otherwise we can close this. If it's still happening, the patch may not be enough. TestNodeManagerResync fails occasionally in trunk - Key: YARN-2403 URL: https://issues.apache.org/jira/browse/YARN-2403 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor Attachments: YARN-2403.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/640/ : {code} TestNodeManagerResync.testKillContainersOnResync:112-testContainerPreservationOnResyncImpl:146 expected:2 but was:1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212935#comment-14212935 ] Gera Shegalov commented on YARN-2862: - [~jianhe], to add more details: we use 2.4+patches, YARN-1185 is in 2.3. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2868) Add metric for initial container launch time
Ray Chiang created YARN-2868: Summary: Add metric for initial container launch time Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Add a metric to measure the latency between starting container allocation and first container actually allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-2603. --- Resolution: Invalid Thanks for the response Ray. Closing this as invalid. Repeating my previous message in case of more discussion: {quote} This is not correct. We deliberately avoided putting compile time references to MapReduce in all of YARN. You should instead use yarn.nodemanager.env-whitelist and set HADOOP_MAPRED_HOME while starting nodemanager. OTOH, we are moving away from cluster installs of MapReduce to instead use DistributedCache: See MAPREDUCE-4421. {quote} ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Ray Chiang Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2868) Add metric for initial container launch time
[ https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2868: - Attachment: 20141114_FSQueueAllocationMetric-Up-04.patch First attempt at implementation. Add metric for initial container launch time Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Labels: metrics, supportability Add a metric to measure the latency between starting container allocation and first container actually allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2868) Add metric for initial container launch time
[ https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2868: - Attachment: (was: 20141114_FSQueueAllocationMetric-Up-04.patch) Add metric for initial container launch time Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Labels: metrics, supportability Add a metric to measure the latency between starting container allocation and first container actually allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2868) Add metric for initial container launch time
[ https://issues.apache.org/jira/browse/YARN-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2868: - Attachment: YARN-2868-01.patch First attempt at implementation. Second upload. Add metric for initial container launch time Key: YARN-2868 URL: https://issues.apache.org/jira/browse/YARN-2868 Project: Hadoop YARN Issue Type: Improvement Reporter: Ray Chiang Assignee: Ray Chiang Labels: metrics, supportability Attachments: YARN-2868-01.patch Add a metric to measure the latency between starting container allocation and first container actually allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212944#comment-14212944 ] Hadoop QA commented on YARN-2056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681612/YARN-2056.201411142002.txt against trunk revision 10c98ae. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5845//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5845//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt, YARN-2056.201410311746.txt, YARN-2056.201411041635.txt, YARN-2056.201411072153.txt, YARN-2056.201411122305.txt, YARN-2056.201411132215.txt, YARN-2056.201411142002.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212947#comment-14212947 ] Mit Desai commented on YARN-2375: - bq. DT-related API methods don't check whether isEnabled == true If the timeline server is running, we cannot turn on the flag in yarn-site, because if the flag is turned on, all MapReduce applications will automatically try to connect to the timeline server, and that is not something we want at this time. Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ats enabled flag check within the TimelineClientImpl. Example where this fails is below. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2811) Fair Scheduler is violating max memory settings in 2.4
[ https://issues.apache.org/jira/browse/YARN-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212949#comment-14212949 ] Hadoop QA commented on YARN-2811: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681619/YARN-2811.v9.patch against trunk revision 49c3889. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5847//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5847//console This message is automatically generated. Fair Scheduler is violating max memory settings in 2.4 -- Key: YARN-2811 URL: https://issues.apache.org/jira/browse/YARN-2811 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-2811.v1.patch, YARN-2811.v2.patch, YARN-2811.v3.patch, YARN-2811.v4.patch, YARN-2811.v5.patch, YARN-2811.v6.patch, YARN-2811.v7.patch, YARN-2811.v8.patch, YARN-2811.v9.patch This has been seen on several queues showing the allocated MB going significantly above the max MB and it appears to have started with the 2.4 upgrade. It could be a regression bug from 2.0 to 2.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212951#comment-14212951 ] Mit Desai commented on YARN-2375: - And if the flag is turned off in yarn-site, the DT API call will get as far as that validation check and then do nothing. Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ats enabled flag check within the TimelineClientImpl. Example where this fails is below. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2396) RpcClientFactoryPBImpl.stopClient always throws due to missing close method
[ https://issues.apache.org/jira/browse/YARN-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212952#comment-14212952 ] Hadoop QA commented on YARN-2396: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660649/yarn2396.patch against trunk revision 49c3889. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5849//console This message is automatically generated. RpcClientFactoryPBImpl.stopClient always throws due to missing close method --- Key: YARN-2396 URL: https://issues.apache.org/jira/browse/YARN-2396 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.4.1 Reporter: Jason Lowe Assignee: chang li Attachments: yarn2396.patch RpcClientFactoryPBImpl.stopClient will throw a YarnRuntimeException if the protocol does not have a close method, despite the log message indicating it is ignoring errors. It's interesting to note that none of the YARN protocol classes currently have a close method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2403) TestNodeManagerResync fails occasionally in trunk
[ https://issues.apache.org/jira/browse/YARN-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212961#comment-14212961 ] Hadoop QA commented on YARN-2403: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12660964/YARN-2403.patch against trunk revision 49c3889. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5848//console This message is automatically generated. TestNodeManagerResync fails occasionally in trunk - Key: YARN-2403 URL: https://issues.apache.org/jira/browse/YARN-2403 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Minor Attachments: YARN-2403.patch From https://builds.apache.org/job/Hadoop-Yarn-trunk/640/ : {code} TestNodeManagerResync.testKillContainersOnResync:112-testContainerPreservationOnResyncImpl:146 expected:2 but was:1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212964#comment-14212964 ] Jonathan Eagles commented on YARN-2375: --- [~zjshen], you misunderstand my request. I am proposing to retain the flag. However, the responsibility for checking whether the ATS is enabled needs to be outside of the TimelineClientImpl. In fact, the code in YARN already assumes the design I am proposing: YarnClient checks the value of ats.enabled and then creates the TimelineClientImpl, which re-checks ats.enabled. This is the preferred object design. The issue lies in the fact that the timeline delegation token renewer creates a TimelineClient because it has a timeline server delegation token; this is proof enough that a TimelineClient needs to be created. This goes back to my original design constraint that ats.enabled must be able to be turned off globally and enabled at the per-job/framework level. Allow enabling/disabling timeline server per framework -- Key: YARN-2375 URL: https://issues.apache.org/jira/browse/YARN-2375 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Assignee: Mit Desai This JIRA is to remove the ats enabled flag check within the TimelineClientImpl. Example where this fails is below. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
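The caller-side pattern being described looks roughly like the following sketch (the configuration keys and factory method are the real YarnConfiguration/TimelineClient APIs; the surrounding flow is illustrative):
{code}
// The caller, not TimelineClientImpl, consults the flag; a token renewer
// that already holds a timeline DT can therefore always construct a client.
Configuration conf = new YarnConfiguration();
if (conf.getBoolean(YarnConfiguration.TIMELINE_SERVICE_ENABLED,
    YarnConfiguration.DEFAULT_TIMELINE_SERVICE_ENABLED)) {
  TimelineClient client = TimelineClient.createTimelineClient();
  client.init(conf);
  client.start();
}
{code}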
[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212970#comment-14212970 ] Jian He commented on YARN-2392: --- Thanks Steve. The patch no longer applies; mind updating it? add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Attachments: YARN-2392-001.patch, YARN-2392-002.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212974#comment-14212974 ] Jian He commented on YARN-2432: --- Looks good; kicking Jenkins manually. RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2432.patch Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping the order of the following statements in the code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store (in the case of ZKRMStateStore, close the connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
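The flipped order would look like the following (a sketch of the change described above, not the attached patch verbatim):
{code}
protected void serviceStop() throws Exception {
  dispatcher.stop();  // drain and handle pending store events first
  closeInternal();    // then close the store (e.g. the ZK connection)
}
{code}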
[jira] [Commented] (YARN-2432) RMStateStore should process the pending events before close
[ https://issues.apache.org/jira/browse/YARN-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212984#comment-14212984 ] Hadoop QA commented on YARN-2432: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663208/YARN-2432.patch against trunk revision 49c3889. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5851//console This message is automatically generated. RMStateStore should process the pending events before close --- Key: YARN-2432 URL: https://issues.apache.org/jira/browse/YARN-2432 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-2432.patch Refer to discussion on YARN-2136 (https://issues.apache.org/jira/browse/YARN-2136?focusedCommentId=14097266page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14097266). As pointed out by [~jianhe], we should process the dispatcher event queue before closing the state store by flipping over the following statements in code. {code:title=RMStateStore.java|borderStyle=solid} protected void serviceStop() throws Exception { closeInternal(); dispatcher.stop(); } {code} Currently, if the state store is being stopped on events such as switching to standby, it will first close the state store(in case of ZKRMStateStore, close connection with ZK) and then process the pending events. Instead, we should first process the pending events and then call close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212989#comment-14212989 ] Zhijie Shen commented on YARN-2862: --- It is likely that the assumption we made in [YARN-1776|https://issues.apache.org/jira/browse/YARN-1776?focusedCommentId=13942201page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13942201] is not fully correct. When updating a state file, we (1) write the new file to .new, (2) delete the existing one, and (3) rename the .new to the existing file name. If a crash happens before (2), we use .new to recover the state file when loading the state (see FileSystemRMStateStore#checkAndResumeUpdateOperation). According to the description here, the RM can crash while (1) is in progress and leave a corrupted .new file. It seems we have to do additional validation to check whether the .new file is corrupted, or simply ignore it. RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used --- Key: YARN-2862 URL: https://issues.apache.org/jira/browse/YARN-2862 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma This might be a known issue. Given FileSystemRMStateStore isn't used for HA scenario, it might not be that important, unless there is something we need to fix at RM layer to make it more tolerant to RMStore issue. When RM was hard shutdown, OS might not get a chance to persist blocks. Some of the stored application data end up with size zero after reboot. And RM didn't like that. {noformat} ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 total 156 drwxr-xr-x.2 x y 4096 Nov 13 16:45 . drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. -rw-r--r--.1 x y 0 Nov 13 16:45 appattempt_1412702189634_324351_01 -rw-r--r--.1 x y 0 Nov 13 16:45 .appattempt_1412702189634_324351_01.crc -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc {noformat} When RM starts up {noformat} 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. Ignoring exception: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readFully(DataInputStream.java:169) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) ... 
2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
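To make the three-step update protocol from the comment above concrete, here is a sketch with the extra validity check being suggested; the writeFileToStore helper is hypothetical, and the zero-length test is only one cheap heuristic for a corrupt leftover (the actual FileSystemRMStateStore code differs):
{code}
// Update protocol: (1) write .new, (2) delete the old file, (3) rename.
Path tmp = new Path(file.getParent(), file.getName() + ".new");
writeFileToStore(tmp, data);  // (1) a hard shutdown here can leave a corrupt .new
fs.delete(file, false);       // (2)
fs.rename(tmp, file);         // (3)

// On recovery, only resume from a .new file that looks sane; discard a
// zero-length leftover instead of letting it break RM startup.
if (fs.exists(tmp) && fs.getFileStatus(tmp).getLen() == 0) {
  fs.delete(tmp, false);
}
{code}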
[jira] [Commented] (YARN-2392) add more diags about app retry limits on AM failures
[ https://issues.apache.org/jira/browse/YARN-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212991#comment-14212991 ] Hadoop QA commented on YARN-2392: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12665901/YARN-2392-002.patch against trunk revision 49c3889. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5853//console This message is automatically generated. add more diags about app retry limits on AM failures Key: YARN-2392 URL: https://issues.apache.org/jira/browse/YARN-2392 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Steve Loughran Attachments: YARN-2392-001.patch, YARN-2392-002.patch # when an app fails the failure count is shown, but not what the global + local limits are. If the two are different, they should both be printed. # the YARN-2242 strings don't have enough whitespace between text and the URL -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2802) add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue.
[ https://issues.apache.org/jira/browse/YARN-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212992#comment-14212992 ] zhihai xu commented on YARN-2802: - Hi [~jianhe] and [~vinodkv], could you review the patch, since it changes the Capacity Scheduler? It passed Hadoop QA. Thanks, zhihai add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue. -- Key: YARN-2802 URL: https://issues.apache.org/jira/browse/YARN-2802 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2802.000.patch, YARN-2802.001.patch, YARN-2802.002.patch add AM container launch and register delay metrics in QueueMetrics to help diagnose performance issue. Added two metrics in QueueMetrics: aMLaunchDelay: the time spent from sending event AMLauncherEventType.LAUNCH to receiving event RMAppAttemptEventType.LAUNCHED in RMAppAttemptImpl. aMRegisterDelay: the time waiting from receiving event RMAppAttemptEventType.LAUNCHED to receiving event RMAppAttemptEventType.REGISTERED(ApplicationMasterService#registerApplicationMaster) in RMAppAttemptImpl. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
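For reference, metrics like these are typically declared with the metrics2 @Metric/MutableRate pattern. The sketch below uses the field names from the description, while the timestamp variables are assumptions about where the events are observed in RMAppAttemptImpl (this is not the attached patch):
{code}
// Illustrative metrics2 declarations for the two delays described above.
@Metric("AM container launch delay")   MutableRate aMLaunchDelay;
@Metric("AM container register delay") MutableRate aMRegisterDelay;

// Recorded from event timestamps (hypothetical variable names):
aMLaunchDelay.add(launchedTime - launchRequestedTime);  // LAUNCH -> LAUNCHED
aMRegisterDelay.add(registeredTime - launchedTime);     // LAUNCHED -> REGISTERED
{code}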
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213006#comment-14213006 ] chang li commented on YARN-2556: I have run those failed tests on my local machine and they all passed with my patch. Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: chang li Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2865) Application recovery continuously fails with Application with id already present. Cannot duplicate
[ https://issues.apache.org/jira/browse/YARN-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213001#comment-14213001 ] Hadoop QA commented on YARN-2865: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681615/YARN-2865.patch against trunk revision 49c3889. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5846//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5846//console This message is automatically generated. Application recovery continuously fails with Application with id already present. Cannot duplicate Key: YARN-2865 URL: https://issues.apache.org/jira/browse/YARN-2865 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Assignee: Rohith Priority: Critical Attachments: YARN-2865.patch YARN-2588 handles exception thrown while transitioningToActive and reset activeServices. But it misses out clearing RMcontext apps/nodes details and ClusterMetrics and QueueMetrics. This causes application recovery to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)