[jira] [Commented] (YARN-871) Failed to run MR example against latest trunk

2013-06-24 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691868#comment-13691868
 ] 

Devaraj K commented on YARN-871:


{code:xml}
2013-06-24 20:58:05,102 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1372087479441_0002 failed 2 times due to Error launching 
appattempt_1372087479441_0002_02. Got exception: java.io.IOException: 
Failed on local exception: java.io.IOException: java.io.IOException: Server 
asks us to fall back to SIMPLE auth, but this client is configured to only 
allow secure connections.; Host Details : local host is: 
HOST-10-18-91-57/10.18.91.57; destination host is: HOST-10-18-91-57:12356; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1318)
at org.apache.hadoop.ipc.Client.call(Client.java:1266)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy23.startContainer(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainer(ContainerManagementProtocolPBClientImpl.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:228)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: Server asks us to fall 
back to SIMPLE auth, but this client is configured to only allow secure 
connections.
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:589)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
at 
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:552)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:258)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1367)
at org.apache.hadoop.ipc.Client.call(Client.java:1285)
... 9 more
Caused by: java.io.IOException: Server asks us to fall back to SIMPLE auth, but 
this client is configured to only allow secure connections.
at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:250)
at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:464)
at org.apache.hadoop.ipc.Client$Connection.access$1500(Client.java:258)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:628)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:625)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:624)
... 12 more
. Failing the application.
{code}
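
For context on the exception above: the client-side switch that governs whether an RPC client may downgrade from Kerberos to SIMPLE is ipc.client.fallback-to-simple-auth-allowed (default false). The sketch below is an editor's illustration of that switch on an intentionally unsecured test setup, not a fix for the underlying issue, which later comments in this digest relate to HADOOP-9421 / YARN-874.

{code}
// Editor's sketch, not part of this JIRA: the default (false) is what produces
// "Server asks us to fall back to SIMPLE auth, but this client is configured to
// only allow secure connections." when the server side is unsecured.
import org.apache.hadoop.conf.Configuration;

public class SimpleAuthFallbackSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Only appropriate on an intentionally unsecured test cluster.
    conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
    System.out.println("fallback allowed: "
        + conf.getBoolean("ipc.client.fallback-to-simple-auth-allowed", false));
  }
}
{code}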

 Failed to run MR example against latest trunk
 -

 Key: YARN-871
 URL: https://issues.apache.org/jira/browse/YARN-871
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
 Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log


 Built the latest trunk, deployed a single node cluster and ran examples, such 
 as
 {code}
  hadoop jar 
 hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
  teragen 10 out1
 {code}
 The job failed with the following console message:
 {code}
 13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in 
 uber mode : false
 13/06/21 12:51:31 INFO mapreduce.Job:  map 0% reduce 0%
 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with 
 state FAILED due to: Application application_1371844267731_0001 failed 2 
 times due to AM Container for appattempt_1371844267731_0001_02 exited 
 with  exitCode: 127 due to: 
 .Failing this attempt.. Failing the application.
 13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (YARN-880) Configuring map/reduce memory equal to nodemanager's memory, hangs the job execution

2013-06-24 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-880:
--

 Summary: Configuring map/reduce memory equal to nodemanager's 
memory, hangs the job execution
 Key: YARN-880
 URL: https://issues.apache.org/jira/browse/YARN-880
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: Nishan Shetty
Priority: Critical


Scenario:
=
Cluster is installed with 2 Nodemanagers 

Configuration:

NM memory (yarn.nodemanager.resource.memory-mb): 8 GB
map and reduce memory: 8 GB
AppMaster memory: 2 GB

If a map task is reserved on the same NodeManager where the AppMaster of the 
same job is running, then the job execution hangs.
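
In yarn-site.xml / mapred-site.xml terms the reported setup looks like the sketch below (an editor's illustration; the property names are the stock 2.x keys, the values come from the description above). With every map container sized to the whole NM, a container reserved on the NM that already hosts the 2 GB AM can never be satisfied, which matches the observed hang.

{code}
// Editor's sketch of the reported configuration (property names assumed from stock 2.x keys).
import org.apache.hadoop.conf.Configuration;

public class Yarn880Scenario {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192); // each NM offers 8 GB
    conf.setInt("mapreduce.map.memory.mb", 8192);             // one map task needs a whole NM
    conf.setInt("mapreduce.reduce.memory.mb", 8192);
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);   // the AM already holds 2 GB on one NM
    // An 8 GB map container reserved on the AM's NM can never be satisfied,
    // because at most 8 GB - 2 GB = 6 GB will ever be free on that node.
    System.out.println(conf.get("yarn.nodemanager.resource.memory-mb"));
  }
}
{code}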

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-871) Failed to run MR example against latest trunk

2013-06-24 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692108#comment-13692108
 ] 

Zhijie Shen commented on YARN-871:
--

[~devaraj.k], the posted exception seems to be related to HADOOP-9421 and 
YARN-827. YARN-874 is tracking the issue.

 Failed to run MR example against latest trunk
 -

 Key: YARN-871
 URL: https://issues.apache.org/jira/browse/YARN-871
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
 Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log


 Built the latest trunk, deployed a single node cluster and ran examples, such 
 as
 {code}
  hadoop jar 
 hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
  teragen 10 out1
 {code}
 The job failed with the following console message:
 {code}
 13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in 
 uber mode : false
 13/06/21 12:51:31 INFO mapreduce.Job:  map 0% reduce 0%
 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with 
 state FAILED due to: Application application_1371844267731_0001 failed 2 
 times due to AM Container for appattempt_1371844267731_0001_02 exited 
 with  exitCode: 127 due to: 
 .Failing this attempt.. Failing the application.
 13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692138#comment-13692138
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

Awesome. I've patched and started the cluster with YARN-600, YARN-799, and 
YARN-864. I'll keep you posted.

Cheers,
Chris

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
 Attachments: rm-log, YARN-864.1.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 mis-understanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,314  WARN ContainersMonitorImpl:463 - 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
  is interrupted. Exiting.
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 

[jira] [Created] (YARN-881) Priority#compareTo method seems to be wrong.

2013-06-24 Thread Jian He (JIRA)
Jian He created YARN-881:


 Summary: Priority#compareTo method seems to be wrong.
 Key: YARN-881
 URL: https://issues.apache.org/jira/browse/YARN-881
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


if lower int value means higher priority, shouldn't we return 
other.getPriority() - this.getPriority()  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-736) Add a multi-resource fair sharing metric

2013-06-24 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692212#comment-13692212
 ] 

Alejandro Abdelnur commented on YARN-736:
-

+1

 Add a multi-resource fair sharing metric
 

 Key: YARN-736
 URL: https://issues.apache.org/jira/browse/YARN-736
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-736-1.patch, YARN-736-2.patch, YARN-736-3.patch, 
 YARN-736-4.patch, YARN-736.patch


 Currently, at a regular interval, the fair scheduler computes a fair memory 
 share for each queue and application inside it.  This fair share is not used 
 for scheduling decisions, but is displayed in the web UI, exposed as a 
 metric, and used for preemption decisions.
 With DRF and multi-resource scheduling, assigning a memory share as the fair 
 share metric to every queue no longer makes sense.  It's not obvious what the 
 replacement should be, but probably something like fractional fairness within 
 a queue, or distance from an ideal cluster state.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-881) Priority#compareTo method seems to be wrong.

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692224#comment-13692224
 ] 

Sandy Ryza commented on YARN-881:
-

There are places in the code that rely on the current ordering, 
AppSchedulingInfo, for example.  The thinking may have been that we most 
commonly want to traverse priorities from high to low, which is more 
straightforward if the higher ones are at the front of the list.
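
To make the trade-off concrete, here is a schematic sketch over plain ints (an editor's illustration, not the actual org.apache.hadoop.yarn.api.records.Priority source), assuming the existing method is the "this minus other" form the report implies:

{code}
// Schematic only: a lower int value means a higher priority.
import java.util.Arrays;
import java.util.Comparator;

public class PriorityOrderingSketch {
  // Existing style (as implied above): "this - other". Natural sorting then puts the
  // smallest int, i.e. the highest priority, at the front -- the traversal order
  // callers such as AppSchedulingInfo rely on.
  static final Comparator<Integer> EXISTING = new Comparator<Integer>() {
    public int compare(Integer self, Integer other) { return self - other; }
  };

  // Suggested style: "other - this". A higher priority then compares as "greater",
  // but a sorted collection ends with the highest priority instead of starting with it.
  static final Comparator<Integer> SUGGESTED = new Comparator<Integer>() {
    public int compare(Integer self, Integer other) { return other - self; }
  };

  public static void main(String[] args) {
    Integer[] priorities = {5, 1, 3};
    Arrays.sort(priorities, EXISTING);
    System.out.println(Arrays.toString(priorities));  // [1, 3, 5]
    Arrays.sort(priorities, SUGGESTED);
    System.out.println(Arrays.toString(priorities));  // [5, 3, 1]
  }
}
{code}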

 Priority#compareTo method seems to be wrong.
 

 Key: YARN-881
 URL: https://issues.apache.org/jira/browse/YARN-881
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He

 if lower int value means higher priority, shouldn't we return 
 other.getPriority() - this.getPriority()  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-736) Add a multi-resource fair sharing metric

2013-06-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692241#comment-13692241
 ] 

Hudson commented on YARN-736:
-

Integrated in Hadoop-trunk-Commit #4005 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4005/])
YARN-736. Add a multi-resource fair sharing metric. (sandyr via tucu) 
(Revision 1496153)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1496153
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestComputeFairShares.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/TestDominantResourceFairnessPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


 Add a multi-resource fair sharing metric
 

 Key: YARN-736
 URL: https://issues.apache.org/jira/browse/YARN-736
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 2.2.0

 Attachments: YARN-736-1.patch, YARN-736-2.patch, YARN-736-3.patch, 
 YARN-736-4.patch, YARN-736.patch


 Currently, at a regular interval, the fair scheduler computes a fair memory 
 share for each queue and application inside it.  This fair share is not used 
 for scheduling decisions, but is displayed in the web UI, exposed as a 
 metric, and used for preemption decisions.
 With DRF and multi-resource scheduling, assigning a memory share as the fair 
 share metric to every queue no longer makes sense.  It's not obvious what the 
 replacement should be, but probably something like fractional fairness within 
 a queue, or distance from an ideal cluster state.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-339) TestResourceTrackerService is failing intermittently

2013-06-24 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692270#comment-13692270
 ] 

Ravi Prakash commented on YARN-339:
---

Hi Vinod! Nopes! I can't reproduce this anymore. Closing as fixed. Please 
re-open if you think the patch should still go in. Thanks Jianhe and Vinod!

 TestResourceTrackerService is failing intermittently
 

 Key: YARN-339
 URL: https://issues.apache.org/jira/browse/YARN-339
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.5
Reporter: Ravi Prakash
Assignee: Jian He
 Attachments: YARN-339.patch


 The test after testReconnectNode() is failing usually. This might be a race 
 condition in Metrics2 code. 
 Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.127 sec  
 FAILURE!
 testDecommissionWithIncludeHosts(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)
   Time elapsed: 55 sec   ERROR!
 org.apache.hadoop.metrics2.MetricsException: Metrics source ClusterMetrics 
 already exists!
   at 
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:134)
   at 
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:115)
   at 
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.registerMetrics(ClusterMetrics.java:71)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.getMetrics(ClusterMetrics.java:58)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testDecommissionWithIncludeHosts(TestResourceTrackerService.java:74)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-882) Specify per user quota for private/application cache and user log files

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-882:
---

Description: 
At present there is no limit on the number of files / size of the files 
localized by single user. Similarly there is no limit on the size of the log 
files created by user via running containers.

We need to restrict the user for this.
For LocalizedResources; this has serious concerns in case of secured 
environment where malicious user can start one container and localize resources 
whose total size = DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter it 
will either fail (if no extra space is present on disk) or deletion service 
will keep removing localized files for other containers/applications. 
The limit for logs/localized resources should be decided by RM and sent to NM 
via secured containerToken. All these configurations should be per container 
instead of per user or per nm.

  was:
At present there is no limit on the number of files / size of the files 
localized by single user. Similarly there is no limit on the size of the log 
files created by user via running containers.
We need to restrict the user for this. For LocalizedResources; this has serious 
concerns in case of secured environment where malicious user can start one 
container and localize resources whose total size = 
DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter it will either fail (if 
no extra space is present on disk) or deletion service will keep removing 
localized files for other containers/applications. 
The limit for logs/localized resource should be decided by RM and sent to NM 
via secured containerToken. All these configurations should be per container 
instead of per user or per nm.


 Specify per user quota for private/application cache and user log files
 ---

 Key: YARN-882
 URL: https://issues.apache.org/jira/browse/YARN-882
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi

 At present there is no limit on the number of files / size of the files 
 localized by single user. Similarly there is no limit on the size of the log 
 files created by user via running containers.
 We need to restrict the user for this.
 For LocalizedResources; this has serious concerns in case of secured 
 environment where malicious user can start one container and localize 
 resources whose total size = DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. 
 Thereafter it will either fail (if no extra space is present on disk) or 
 deletion service will keep removing localized files for other 
 containers/applications. 
 The limit for logs/localized resources should be decided by RM and sent to NM 
 via secured containerToken. All these configurations should be per container 
 instead of per user or per nm.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-882) Specify per user quota for private/application cache and user log files

2013-06-24 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-882:
--

 Summary: Specify per user quota for private/application cache and 
user log files
 Key: YARN-882
 URL: https://issues.apache.org/jira/browse/YARN-882
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


At present there is no limit on the number of files / size of the files 
localized by single user. Similarly there is no limit on the size of the log 
files created by user via running containers.
We need to restrict the user for this. For LocalizedResources; this has serious 
concerns in case of secured environment where malicious user can start one 
container and localize resources whose total size = 
DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter it will either fail (if 
no extra space is present on disk) or deletion service will keep removing 
localized files for other containers/applications. 
The limit for logs/localized resource should be decided by RM and sent to NM 
via secured containerToken. All these configurations should be per container 
instead of per user or per nm.
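
For reference, the ceiling named above maps to the NodeManager's localizer cache target size; the sketch below is an editor's illustration of reading it (the property name and its 10 GB default are assumptions based on the stock 2.x configuration). There is currently no finer-grained per-user or per-container cap below that node-wide target, which is the gap this issue describes.

{code}
// Editor's sketch: reading the node-wide localizer cache target that
// DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB refers to (key name and default assumed).
import org.apache.hadoop.conf.Configuration;

public class LocalizerCacheSizeSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long targetMb = conf.getLong("yarn.nodemanager.localizer.cache.target-size-mb", 10 * 1024L);
    System.out.println("NM localizer cache target (MB): " + targetMb);
  }
}
{code}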

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692296#comment-13692296
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

With your patch applied, the new error (in the NM) is:

{noformat}
19:33:36,741  INFO NodeStatusUpdaterImpl:365 - Node is out of sync with 
ResourceManager, hence rebooting.
19:33:36,764  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
14751 for container-id container_1372091455469_0002_01_02: 779.3 MB of 1.3 
GB physical memory used; 1.6 GB of 10 GB virtual memory used
19:33:37,239  INFO NodeManager:315 - Rebooting the node manager.
19:33:37,261  INFO NodeManager:229 - Containers still running on shutdown: 
[container_1372091455469_0002_01_02]
19:33:37,278 FATAL AsyncDispatcher:137 - Error in dispatcher thread
org.apache.hadoop.metrics2.MetricsException: Metrics source JvmMetrics already 
exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
at 
org.apache.hadoop.metrics2.source.JvmMetrics.create(JvmMetrics.java:79)
at 
org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:49)
at 
org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:45)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.init(NodeManager.java:75)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.createNewNodeManager(NodeManager.java:357)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.reboot(NodeManager.java:316)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:348)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:619)
{noformat}

For the record, you can reproduce this yourself by:

1. Start a YARN RM and NM.
2. Run a YARN job on the cluster that uses at least one container.
3. Run kill -STOP <NM PID> on the NM.
4. Wait 65 seconds (enough for the NM to time out).
5. Run kill -CONT <NM PID>

You will see the NM trigger a reboot since it's out of sync with the RM.

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
 Attachments: rm-log, YARN-864.1.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 mis-understanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 

[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-864:
-

Attachment: YARN-864.2.patch

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
 Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 mis-understanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,314  WARN ContainersMonitorImpl:463 - 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
  is interrupted. Exiting.
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at 

[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692388#comment-13692388
 ] 

Jian He commented on YARN-864:
--

Hi Chris,
that failure was due to the reboot starting even before the stop fully completes.
Uploaded a new patch, tested locally. Let me know if that works, thx

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
 Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 mis-understanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,314  WARN ContainersMonitorImpl:463 - 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
  is interrupted. Exiting.
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 

[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692436#comment-13692436
 ] 

Sandy Ryza commented on YARN-883:
-

Submitted patch that adds an FSQueueMetrics, which extends QueueMetrics.  
Verified that the metrics show up on a pseudo-distributed cluster.
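
For readers unfamiliar with the metrics2 plumbing, the general shape of such a source looks roughly like the sketch below (an editor's illustration, not the attached FSQueueMetrics patch; the class and field names are made up):

{code}
// Editor's sketch only: the metrics2 pattern for exposing per-queue fair/min/max
// share as gauges. The real patch extends QueueMetrics; this standalone source
// just shows the annotation style.
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

@Metrics(context = "yarn")
public class FairShareGaugesSketch {
  @Metric("Fair share of memory in MB") MutableGaugeInt fairShareMB;
  @Metric("Minimum share of memory in MB") MutableGaugeInt minShareMB;
  @Metric("Maximum share of memory in MB") MutableGaugeInt maxShareMB;

  // The annotated fields are instantiated when this source is registered with the
  // metrics system, e.g. DefaultMetricsSystem.instance().register("FairShareGauges", "sketch", this).
  void set(int fair, int min, int max) {
    fairShareMB.set(fair);
    minShareMB.set(min);
    maxShareMB.set(max);
  }
}
{code}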

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-883:


Attachment: YARN-883.patch

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Moved] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza moved MAPREDUCE-5350 to YARN-883:


  Component/s: (was: scheduler)
   scheduler
Affects Version/s: (was: 2.0.5-alpha)
   2.0.5-alpha
  Key: YARN-883  (was: MAPREDUCE-5350)
  Project: Hadoop YARN  (was: Hadoop Map/Reduce)

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-884:
-

 Summary: AM expiry interval should be set to smaller of {am, 
nm}.liveness-monitor.expiry-interval-ms
 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla


As the AM can't outlive the NM on which it is running, it is a good idea to 
disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
than nm.liveness-monitor.expiry-interval-ms
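
In effect the proposal is to clamp the configured AM expiry to the NM expiry. A minimal sketch of that clamp follows (an editor's illustration; the full "yarn."-prefixed key names and the 600-second defaults are assumptions based on the keys in the summary). Note that a later comment in this digest questions whether the smaller or the larger of the two values is the right choice.

{code}
// Editor's sketch of the proposed clamp; key names and defaults are assumptions.
import org.apache.hadoop.conf.Configuration;

public class AmExpiryClampSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long amExpiryMs = conf.getLong("yarn.am.liveness-monitor.expiry-interval-ms", 600000L);
    long nmExpiryMs = conf.getLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600000L);

    // The AM cannot outlive its NM, so an AM expiry larger than the NM expiry only
    // delays the failure decision; the proposal is to use the smaller of the two.
    long effectiveAmExpiryMs = Math.min(amExpiryMs, nmExpiryMs);
    System.out.println("effective AM expiry (ms): " + effectiveAmExpiryMs);
  }
}
{code}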

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692476#comment-13692476
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

I re-deployed my test cluster with YARN-600, YARN-799, and your latest patch 
(.2.patch) from YARN-864. I simulated the timeout using kill -STOP (as 
described above), and your patch worked! :)

I'm going to let the cluster run for 24h before declaring victory, but this 
looks promising. I'll follow up tomorrow, when I know more.

Cheers,
Chris

 YARN NM leaking containers with CGroups
 ---

 Key: YARN-864
 URL: https://issues.apache.org/jira/browse/YARN-864
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
 Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
 YARN-600.
Reporter: Chris Riccomini
 Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch


 Hey Guys,
 I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
 seeing containers getting leaked by the NMs. I'm not quite sure what's going 
 on -- has anyone seen this before? I'm concerned that maybe it's a 
 mis-understanding on my part about how YARN's lifecycle works.
 When I look in my AM logs for my app (not an MR app master), I see:
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
 This means that container container_1371141151815_0008_03_02 was killed 
 by YARN, either due to being released by the application master or being 
 'lost' due to node failures etc.
 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
 container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
 new container for the task.
 The AM has been running steadily the whole time. Here's what the NM logs say:
 {noformat}
 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
 java.lang.InterruptedException
 at java.lang.Object.wait(Native Method)
 at java.lang.Thread.join(Thread.java:1143)
 at java.lang.Thread.join(Thread.java:1196)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
 at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
 at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
 at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,314  WARN ContainersMonitorImpl:463 - 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
  is interrupted. Exiting.
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
 at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
 at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
 java.io.IOException: java.lang.InterruptedException
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
 at org.apache.hadoop.util.Shell.run(Shell.java:129)
 at 
 

[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692475#comment-13692475
 ] 

Hadoop QA commented on YARN-883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589493/YARN-883.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSLeafQueue
  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1389//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1389//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1389//console

This message is automatically generated.

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692501#comment-13692501
 ] 

Sandy Ryza commented on YARN-649:
-

Uploading a patch that takes Vinod's comments into account. It
* Fixes the SecureIOUtils hole (doh!)
* Makes separate ContainerLogsUtils#getContainerLogFile and getContainerLogDirs
* Throws appropriate error codes instead of just returning a string
* Uses StreamingOutput to avoid unbounded buffering
* Marks the API as evolving

I still need to add documentation.

Regarding logs for old jobs, is there a reason that the implementation choice 
would change the API?
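
For readers following along, the StreamingOutput point is about writing the log to the HTTP response in bounded chunks rather than building the whole body in memory. A rough sketch of the idea (an editor's illustration, not the attached patch; real code would open the file through SecureIOUtils and set appropriate headers):

{code}
// Editor's sketch of streaming a container log file over JAX-RS without
// unbounded buffering: copy the file to the response one fixed-size chunk at a time.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.StreamingOutput;

public class ContainerLogStreamingSketch {
  static StreamingOutput streamLog(final String logPath) {
    return new StreamingOutput() {
      @Override
      public void write(OutputStream out) throws IOException, WebApplicationException {
        InputStream in = new FileInputStream(logPath); // real code would open via SecureIOUtils
        try {
          byte[] buf = new byte[64 * 1024];
          int len;
          while ((len = in.read(buf)) != -1) {
            out.write(buf, 0, len);                    // bounded buffering: one chunk at a time
          }
        } finally {
          in.close();
        }
      }
    };
  }
}
{code}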

 Make container logs available over HTTP in plain text
 -

 Key: YARN-649
 URL: https://issues.apache.org/jira/browse/YARN-649
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
 YARN-752-1.patch


 It would be good to make container logs available over the REST API for 
 MAPREDUCE-4362 and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-649:


Attachment: YARN-649-3.patch

 Make container logs available over HTTP in plain text
 -

 Key: YARN-649
 URL: https://issues.apache.org/jira/browse/YARN-649
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
 YARN-752-1.patch


 It would be good to make container logs available over the REST API for 
 MAPREDUCE-4362 and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated YARN-569:
---

Attachment: YARN-569.10.patch

 CapacityScheduler: support for preemption (using a capacity monitor)
 

 Key: YARN-569
 URL: https://issues.apache.org/jira/browse/YARN-569
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
 Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
 preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
 YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
 YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch


 There is a tension between the fast-paced, reactive role of the 
 CapacityScheduler, which needs to respond quickly to applications' resource 
 requests and node updates, and the more introspective, time-based 
 considerations needed to observe and correct for capacity balance. To this 
 purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
 directly, we opted to add support for preemption by means of a Capacity 
 Monitor, which can optionally be run as a separate service (much like the 
 NMLivelinessMonitor).
 The capacity monitor (similarly to the equivalent functionality in the fair 
 scheduler) runs on an interval (e.g., every 3 seconds), observes the state of 
 the assignment of resources to queues from the capacity scheduler, performs 
 off-line computation to determine whether preemption is needed and how best 
 to edit the current schedule to improve capacity, and generates events that 
 produce four possible actions:
 # Container de-reservations
 # Resource-based preemptions
 # Container-based preemptions
 # Container killing
 The actions listed above are progressively more costly, and it is up to the 
 policy to use them as desired to achieve the rebalancing goals. 
 Note that due to the lag in the effect of these actions the policy should 
 operate at the macroscopic level (e.g., preempt tens of containers
 from a queue) and not try to tightly and consistently micromanage 
 container allocations. 
 - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
 - 
 Preemption policies are by design pluggable, in the following we present an 
 initial policy (ProportionalCapacityPreemptionPolicy) we have been 
 experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
 follows:
 # it gathers from the scheduler the state of the queues, in particular, their 
 current capacity, guaranteed capacity and pending requests (*)
 # if there are pending requests from queues that are under capacity it 
 computes a new ideal balanced state (**)
 # it computes the set of preemptions needed to repair the current schedule 
 and achieve capacity balance (accounting for natural completion rates, and 
 respecting bounds on the amount of preemption we allow for each round)
 # it selects which applications to preempt from each over-capacity queue (the 
 last one in the FIFO order)
 # it remove reservations from the most recently assigned app until the amount 
 of resource to reclaim is obtained, or until no more reservations exits
 # (if not enough) it issues preemptions for containers from the same 
 applications (reverse chronological order, last assigned container first) 
 again until necessary or until no containers except the AM container are left,
 # (if not enough) it moves onto unreserve and preempt from the next 
 application. 
 # containers that have been asked to preempt are tracked across executions. 
 If a containers is among the one to be preempted for more than a certain 
 time, the container is moved in a the list of containers to be forcibly 
 killed. 
 Notes:
 (*) at the moment, in order to avoid double-counting of the requests, we only 
 look at the ANY part of pending resource requests, which means we might not 
 preempt on behalf of AMs that ask only for specific locations but not any. 
 (**) The ideal balance state is one in which each queue has at least its 
 guaranteed capacity, and the spare capacity is distributed among queues (that 
 wants some) as a weighted fair share. Where the weighting is based on the 
 guaranteed capacity of a queue, and the function runs to a fix point.  
 Tunables of the ProportionalCapacityPreemptionPolicy:
 # observe-only mode (i.e., log the actions it would take, but behave as 
 read-only)
 # how frequently to run the policy
 # how long to wait between preemption and kill of a container
 # which fraction of the containers I would like to obtain should I preempt 
 (has to do with the natural rate at which containers are returned)
 # deadzone size, i.e., what % of 
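
For illustration, a minimal, hypothetical sketch of the ideal-balance computation (step 2) and the bounded per-round preemption (step 3) described above. This is not the actual ProportionalCapacityPreemptionPolicy code; the Queue fields, the numbers, and the maxPerRound bound are illustrative assumptions.

{code:java}
// Hypothetical sketch only: each queue gets up to its guarantee, the spare
// capacity is split among still-hungry queues in proportion to their
// guarantees (iterated to a fixed point), and over-capacity queues are asked
// to give back at most maxPerRound resources per policy invocation.
import java.util.*;

public class ProportionalPreemptionSketch {

  static class Queue {
    final String name;
    final double guaranteed;  // guaranteed capacity (absolute, e.g. MB)
    final double used;        // currently allocated resources
    final double pending;     // pending (ANY) resource requests
    double ideal;             // computed ideal assignment
    Queue(String name, double guaranteed, double used, double pending) {
      this.name = name; this.guaranteed = guaranteed;
      this.used = used; this.pending = pending;
    }
    double wanted() { return used + pending; }
  }

  static void computeIdealAssignment(List<Queue> queues, double total) {
    double assigned = 0;
    for (Queue q : queues) {                 // satisfy guarantees first
      q.ideal = Math.min(q.guaranteed, q.wanted());
      assigned += q.ideal;
    }
    double spare = total - assigned;
    while (spare > 1e-6) {                   // distribute spare to a fixed point
      double weightSum = 0;
      List<Queue> hungry = new ArrayList<>();
      for (Queue q : queues) {
        if (q.wanted() > q.ideal) { hungry.add(q); weightSum += q.guaranteed; }
      }
      if (hungry.isEmpty()) break;
      double distributed = 0;
      for (Queue q : hungry) {               // weighted by guaranteed capacity
        double grant = Math.min(spare * q.guaranteed / weightSum,
                                q.wanted() - q.ideal);
        q.ideal += grant;
        distributed += grant;
      }
      spare -= distributed;
      if (distributed < 1e-6) break;
    }
  }

  /** Resources to reclaim from each over-capacity queue, bounded per round. */
  static Map<String, Double> computePreemption(List<Queue> queues,
                                               double maxPerRound) {
    Map<String, Double> toPreempt = new HashMap<>();
    for (Queue q : queues) {
      double over = q.used - q.ideal;
      if (over > 0) toPreempt.put(q.name, Math.min(over, maxPerRound));
    }
    return toPreempt;
  }

  public static void main(String[] args) {
    List<Queue> queues = Arrays.asList(
        new Queue("A", 50, 80, 0),   // over capacity, no pending demand
        new Queue("B", 30, 10, 40),  // under capacity, asking for more
        new Queue("C", 20, 10, 5));
    computeIdealAssignment(queues, 100);
    System.out.println(computePreemption(queues, 10));  // e.g. {A=10.0}
  }
}
{code}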

[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692525#comment-13692525
 ] 

Omkar Vinit Joshi commented on YARN-884:


Probably these two are unrelated. First, if the NM goes down then obviously any 
AM running on it has gone down too, but vice versa is not true. In a 
work-preserving environment we would like to restart/resume the AM, which will 
not be possible if we configure the AM liveness interval as the smaller of 
{am,nm}. For example, the NM might be having trouble connecting to the RM and 
may end up heartbeating with the RM just before the RM decides to start a new 
application attempt, marking the earlier one as failed... even if the AM 
heartbeats immediately after that it would be wasted... right??

I think we need am = largest of {am,nm}

thoughts?

 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration

 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692526#comment-13692526
 ] 

Hadoop QA commented on YARN-649:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589505/YARN-649-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.TestApplication
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1390//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1390//console

This message is automatically generated.

 Make container logs available over HTTP in plain text
 -

 Key: YARN-649
 URL: https://issues.apache.org/jira/browse/YARN-649
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
 YARN-752-1.patch


 It would be good to make container logs available over the REST API for 
 MAPREDUCE-4362, and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-883:


Attachment: YARN-883-1.patch

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883-1.patch, YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692542#comment-13692542
 ] 

Hadoop QA commented on YARN-569:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589506/YARN-569.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1391//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1391//console

This message is automatically generated.

 CapacityScheduler: support for preemption (using a capacity monitor)
 

 Key: YARN-569
 URL: https://issues.apache.org/jira/browse/YARN-569
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
 Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
 preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
 YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
 YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch


 There is a tension between the fast-paced, reactive role of the 
 CapacityScheduler, which needs to respond quickly to application resource 
 requests and node updates, and the more introspective, time-based 
 considerations needed to observe and correct for capacity balance. For this 
 purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
 directly, we opted to add support for preemption by means of a Capacity 
 Monitor, which can optionally be run as a separate service (much like the 
 NMLivelinessMonitor).
 The capacity monitor (similar to equivalent functionality in the fair 
 scheduler) runs at intervals (e.g., every 3 seconds), observes the state of 
 the assignment of resources to queues from the capacity scheduler, performs 
 an off-line computation to determine whether preemption is needed and how 
 best to edit the current schedule to improve capacity, and generates events 
 that produce four possible actions:
 # Container de-reservations
 # Resource-based preemptions
 # Container-based preemptions
 # Container killing
 The actions listed above are progressively more costly, and it is up to the 
 policy to use them as desired to achieve the rebalancing goals. 
 Note that, due to the lag in the effect of these actions, the policy should 
 operate at the macroscopic level (e.g., preempt tens of containers from a 
 queue) and not try to tightly and consistently micromanage container 
 allocations. 
 - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
 - 
 Preemption policies are pluggable by design; in the following we present an 
 initial policy (ProportionalCapacityPreemptionPolicy) we have been 
 experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
 follows:
 # it gathers from the scheduler the state of the queues, in particular their 
 current capacity, guaranteed capacity and pending requests (*)
 # if there are pending requests from queues that are under capacity, it 
 computes a new ideal balanced state (**)
 # it computes the set of preemptions needed to repair the current schedule 
 and achieve capacity balance (accounting for natural completion rates, and 
 respecting bounds on the amount of preemption we allow for each round)
 # it selects which applications to preempt from each over-capacity queue (the 
 last one in the FIFO order)
 # it removes reservations from the most recently assigned app until the 
 amount of resource to 

[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692572#comment-13692572
 ] 

Chris Douglas commented on YARN-569:


{{TestAMAuthorization}} also fails on trunk, YARN-878

 CapacityScheduler: support for preemption (using a capacity monitor)
 

 Key: YARN-569
 URL: https://issues.apache.org/jira/browse/YARN-569
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Carlo Curino
Assignee: Carlo Curino
 Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
 preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
 YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
 YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch


 There is a tension between the fast-paced, reactive role of the 
 CapacityScheduler, which needs to respond quickly to application resource 
 requests and node updates, and the more introspective, time-based 
 considerations needed to observe and correct for capacity balance. For this 
 purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
 directly, we opted to add support for preemption by means of a Capacity 
 Monitor, which can optionally be run as a separate service (much like the 
 NMLivelinessMonitor).
 The capacity monitor (similar to equivalent functionality in the fair 
 scheduler) runs at intervals (e.g., every 3 seconds), observes the state of 
 the assignment of resources to queues from the capacity scheduler, performs 
 an off-line computation to determine whether preemption is needed and how 
 best to edit the current schedule to improve capacity, and generates events 
 that produce four possible actions:
 # Container de-reservations
 # Resource-based preemptions
 # Container-based preemptions
 # Container killing
 The actions listed above are progressively more costly, and it is up to the 
 policy to use them as desired to achieve the rebalancing goals. 
 Note that, due to the lag in the effect of these actions, the policy should 
 operate at the macroscopic level (e.g., preempt tens of containers from a 
 queue) and not try to tightly and consistently micromanage container 
 allocations. 
 - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
 - 
 Preemption policies are pluggable by design; in the following we present an 
 initial policy (ProportionalCapacityPreemptionPolicy) we have been 
 experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
 follows:
 # it gathers from the scheduler the state of the queues, in particular their 
 current capacity, guaranteed capacity and pending requests (*)
 # if there are pending requests from queues that are under capacity, it 
 computes a new ideal balanced state (**)
 # it computes the set of preemptions needed to repair the current schedule 
 and achieve capacity balance (accounting for natural completion rates, and 
 respecting bounds on the amount of preemption we allow for each round)
 # it selects which applications to preempt from each over-capacity queue (the 
 last one in the FIFO order)
 # it removes reservations from the most recently assigned app until the 
 amount of resource to reclaim is obtained, or until no more reservations 
 exist
 # (if not enough) it issues preemptions for containers from the same 
 application (reverse chronological order, last assigned container first), 
 again until the target is met or until no containers except the AM container 
 are left
 # (if not enough) it moves on to unreserve and preempt from the next 
 application. 
 # containers that have been asked to preempt are tracked across executions; 
 if a container has been among those selected for preemption for more than a 
 certain time, it is moved into the list of containers to be forcibly killed. 
 Notes:
 (*) at the moment, in order to avoid double-counting of requests, we only 
 look at the ANY part of pending resource requests, which means we might not 
 preempt on behalf of AMs that ask only for specific locations but not ANY. 
 (**) The ideal balanced state is one in which each queue has at least its 
 guaranteed capacity, and the spare capacity is distributed among the queues 
 that want some as a weighted fair share, where the weighting is based on the 
 guaranteed capacity of a queue and the function runs to a fixed point.  
 Tunables of the ProportionalCapacityPreemptionPolicy:
 # observe-only mode (i.e., log the actions it would take, but behave as 
 read-only)
 # how frequently to run the policy
 # how long to wait between preemption and kill of a container
 # which fraction of the containers I would like to obtain should I preempt 
 (has to do with 

[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692581#comment-13692581
 ] 

Hadoop QA commented on YARN-883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589510/YARN-883-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1392//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1392//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1392//console

This message is automatically generated.

 Expose Fair Scheduler-specific queue metrics
 

 Key: YARN-883
 URL: https://issues.apache.org/jira/browse/YARN-883
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.0.5-alpha
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Attachments: YARN-883-1.patch, YARN-883.patch


 When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
 minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report

2013-06-24 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692601#comment-13692601
 ] 

Xuan Gong commented on YARN-873:


At the command line, if we type yarn application -status $UnKnowAppId, it will 
output: Application with id $UnKnowAppId doesn't exist in RM.
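
For context, a hedged sketch of what a caller currently has to do given the behavior this JIRA describes: treat a null report as "application not found". It assumes the org.apache.hadoop.yarn.client.api.YarnClient API; whether the call should throw instead is exactly what is under discussion here, so this is only a sketch of the present workaround, not a recommendation.

{code:java}
// Hypothetical caller-side check; the null-report behavior is what YARN-873
// reports, and the message string mirrors the CLI output quoted above.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class AppStatusCheck {
  static String describe(YarnClient client, ApplicationId appId) throws Exception {
    ApplicationReport report = client.getApplicationReport(appId);
    if (report == null) {
      // The caller cannot currently distinguish "does not exist" from other
      // conditions except by this null check.
      return "Application with id " + appId + " doesn't exist in RM.";
    }
    return report.getYarnApplicationState().toString();
  }
}
{code}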

 YARNClient.getApplicationReport(unknownAppId) returns a null report
 ---

 Key: YARN-873
 URL: https://issues.apache.org/jira/browse/YARN-873
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Xuan Gong

 How can the client find out that the app does not exist?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-885) TestBinaryTokenFile (and others) fail

2013-06-24 Thread Kam Kasravi (JIRA)
Kam Kasravi created YARN-885:


 Summary: TestBinaryTokenFile (and others) fail
 Key: YARN-885
 URL: https://issues.apache.org/jira/browse/YARN-885
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Kam Kasravi


Seeing the following stack trace and the unit test goes into an infinite loop:

2013-06-24 17:03:58,316 ERROR [LocalizerRunner for 
container_1372118631537_0001_01_01] security.UserGroupInformation 
(UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: Server asks us to fall 
back to SIMPLE auth, but this client is configured to only allow secure 
connections.
2013-06-24 17:03:58,317 WARN  [LocalizerRunner for 
container_1372118631537_0001_01_01] ipc.Client (Client.java:run(579)) - 
Exception encountered while connecting to the server : java.io.IOException: 
Server asks us to fall back to SIMPLE auth, but this client is configured to 
only allow secure connections.
2013-06-24 17:03:58,318 ERROR [LocalizerRunner for 
container_1372118631537_0001_01_01] security.UserGroupInformation 
(UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: java.io.IOException: 
Server asks us to fall back to SIMPLE auth, but this client is configured to 
only allow secure connections.
java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:56)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:247)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:181)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:103)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:859)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-763:
---

Attachment: YARN-763.1.patch

 AMRMClientAsync should stop heartbeating after receiving shutdown from RM
 -

 Key: YARN-763
 URL: https://issues.apache.org/jira/browse/YARN-763
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bikas Saha
Assignee: Xuan Gong
 Attachments: YARN-763.1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-808) ApplicationReport does not clearly tell that the attempt is running or not

2013-06-24 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692632#comment-13692632
 ] 

Xuan Gong commented on YARN-808:


How about we expose the current attempt Id with attempt status, as well as the 
previous attempt Id with attempt status if they exist??
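
Purely as a hypothetical illustration of the shape of that proposal (none of these accessors exist on ApplicationReport today):

{code:java}
// Hypothetical only: what "current and previous attempt id plus attempt
// status" could look like if exposed to clients.
public interface AttemptAwareReport {
  String getCurrentApplicationAttemptId();
  String getCurrentAttemptState();            // e.g. RUNNING, FAILED
  String getPreviousApplicationAttemptId();   // null if there was no earlier attempt
  String getPreviousAttemptState();
}
{code}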

 ApplicationReport does not clearly tell that the attempt is running or not
 --

 Key: YARN-808
 URL: https://issues.apache.org/jira/browse/YARN-808
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Xuan Gong

 When an app attempt fails and is being retried, ApplicationReport immediately 
 gives the new attemptId and non-null values of host etc. There is no way for 
 clients to know whether the attempt is running other than connecting to it and 
 timing out on an invalid host. A solution would be to expose the attempt state 
 or return a null value for host instead of N/A

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692639#comment-13692639
 ] 

Karthik Kambatla commented on YARN-884:
---

If AM_EXPIRY < NM_EXPIRY,
# the user has explicitly set AM_EXPIRY to be smaller than NM_EXPIRY
# I agree it is possible that the RM might expire the first attempt and start 
another attempt, in case the NM fails to connect to the RM for a time 't' such 
that AM_EXPIRY < t < NM_EXPIRY. However, the user has asked for a shorter 
expiry interval for a reason.

If AM_EXPIRY > NM_EXPIRY,
# When the NM dies, the AMs on it also would have died. However, IIUC, the RM 
wouldn't schedule another attempt until AM_EXPIRY is met. Correct me if I am 
wrong.


 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration

 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-758) Fair scheduler has some bug that causes TestRMRestart to fail

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved YARN-758.
-

Resolution: Not A Problem

 Fair scheduler has some bug that causes TestRMRestart to fail
 -

 Key: YARN-758
 URL: https://issues.apache.org/jira/browse/YARN-758
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.0-beta
Reporter: Bikas Saha
Assignee: Sandy Ryza

 YARN-757 got fixed by changing the scheduler from Fair to default (which is 
 capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-874:
-

Attachment: YARN-874.2.txt

Updated patch with a new test validating the common changes.

 Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
 -

 Key: YARN-874
 URL: https://issues.apache.org/jira/browse/YARN-874
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt


 HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692641#comment-13692641
 ] 

Hadoop QA commented on YARN-763:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589525/YARN-763.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1152 javac 
compiler warnings (more than the trunk's current 1151 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

  org.apache.hadoop.yarn.client.api.impl.TestNMClient

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1393//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1393//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1393//console

This message is automatically generated.

 AMRMClientAsync should stop heartbeating after receiving shutdown from RM
 -

 Key: YARN-763
 URL: https://issues.apache.org/jira/browse/YARN-763
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bikas Saha
Assignee: Xuan Gong
 Attachments: YARN-763.1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-884:
--

Attachment: yarn-884-1.patch

Uploading a straightforward patch.
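
The attached patch is not reproduced here; as a hedged sketch, one way such a guard could look, assuming the expiry-interval constants in YarnConfiguration and following the clamping direction in the JIRA summary (the effective AM expiry never exceeds the NM expiry):

{code:java}
// Hypothetical sketch, not the attached yarn-884-1.patch. Assumes the
// YarnConfiguration constants for the two expiry intervals discussed above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmExpiryGuard {
  static long effectiveAmExpiryMs(Configuration conf) {
    long amExpiry = conf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS);
    long nmExpiry = conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS);
    // An AM cannot outlive the NM it runs on, so never wait longer for an AM
    // heartbeat than we would for its NM before declaring it dead.
    return Math.min(amExpiry, nmExpiry);
  }
}
{code}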

 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration
 Attachments: yarn-884-1.patch


 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692664#comment-13692664
 ] 

Omkar Vinit Joshi commented on YARN-874:


Tested YARN-872-2... on a local cluster... with the patch it is running now. 

 Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
 -

 Key: YARN-874
 URL: https://issues.apache.org/jira/browse/YARN-874
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt


 HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692666#comment-13692666
 ] 

Sandy Ryza commented on YARN-763:
-

Can we move all of this into the switch statement, replace break with return, 
and get rid of the stop variable?  Unless the thinking is that returning from a 
method in the middle is bad, I think this would be a lot cleaner.
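
To make the suggestion concrete, a generic stand-in (not the actual AMRMClientAsync heartbeat loop; the Command enum and process() are hypothetical) showing the stop-flag form next to the return-from-the-case form:

{code:java}
// Illustrative only: the same loop written with a stop flag and with a
// direct return from the switch, as suggested in the comment above.
import java.util.Arrays;
import java.util.Iterator;

public class HeartbeatLoopSketch {
  enum Command { SHUTDOWN, NORMAL }

  static void runWithStopFlag(Iterator<Command> responses) {
    boolean stop = false;
    while (!stop && responses.hasNext()) {
      switch (responses.next()) {
        case SHUTDOWN:
          stop = true;   // flag is only checked on the next loop iteration
          break;
        default:
          process();
          break;
      }
    }
  }

  static void runWithReturn(Iterator<Command> responses) {
    while (responses.hasNext()) {
      switch (responses.next()) {
        case SHUTDOWN:
          return;        // stop heartbeating immediately, no flag needed
        default:
          process();
          break;
      }
    }
  }

  static void process() { /* handle a normal response */ }

  public static void main(String[] args) {
    runWithStopFlag(Arrays.asList(Command.NORMAL, Command.SHUTDOWN).iterator());
    runWithReturn(Arrays.asList(Command.NORMAL, Command.SHUTDOWN).iterator());
  }
}
{code}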

 AMRMClientAsync should stop heartbeating after receiving shutdown from RM
 -

 Key: YARN-763
 URL: https://issues.apache.org/jira/browse/YARN-763
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bikas Saha
Assignee: Xuan Gong
 Attachments: YARN-763.1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692668#comment-13692668
 ] 

Omkar Vinit Joshi commented on YARN-884:


[~kkambatl] makes sense...


 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration
 Attachments: yarn-884-1.patch


 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692675#comment-13692675
 ] 

Hadoop QA commented on YARN-874:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589527/YARN-874.2.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1394//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1394//console

This message is automatically generated.

 Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
 -

 Key: YARN-874
 URL: https://issues.apache.org/jira/browse/YARN-874
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt


 HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-885) TestBinaryTokenFile (and others) fail

2013-06-24 Thread Kam Kasravi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692687#comment-13692687
 ] 

Kam Kasravi commented on YARN-885:
--

Changing ContainerLocalizer.runLocalization so that the local context uses the 
same tokens as the user context seems to fix this problem. 
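
As a hedged sketch of the general pattern that comment describes (not the actual ContainerLocalizer change; the user name and credential source are stand-ins): copy the user's tokens into the UGI that performs the localizer heartbeat.

{code:java}
// Hypothetical sketch only; shows the standard UGI/token pattern, not the
// real fix inside ContainerLocalizer.runLocalization.
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class LocalizerUgiSketch {
  static UserGroupInformation buildLocalizerUgi(String user, Credentials creds) {
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
    // Carry every token from the user's credentials on the UGI that will
    // issue the localization heartbeat RPC.
    for (Token<? extends TokenIdentifier> token : creds.getAllTokens()) {
      ugi.addToken(token);
    }
    return ugi;
  }
}
{code}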

 TestBinaryTokenFile (and others) fail
 -

 Key: YARN-885
 URL: https://issues.apache.org/jira/browse/YARN-885
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Kam Kasravi

 Seeing the following stack trace and the unit test goes into an infinite loop:
 2013-06-24 17:03:58,316 ERROR [LocalizerRunner for 
 container_1372118631537_0001_01_01] security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
 as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: Server asks us to fall 
 back to SIMPLE auth, but this client is configured to only allow secure 
 connections.
 2013-06-24 17:03:58,317 WARN  [LocalizerRunner for 
 container_1372118631537_0001_01_01] ipc.Client (Client.java:run(579)) - 
 Exception encountered while connecting to the server : java.io.IOException: 
 Server asks us to fall back to SIMPLE auth, but this client is configured to 
 only allow secure connections.
 2013-06-24 17:03:58,318 ERROR [LocalizerRunner for 
 container_1372118631537_0001_01_01] security.UserGroupInformation 
 (UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
 as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: java.io.IOException: 
 Server asks us to fall back to SIMPLE auth, but this client is configured to 
 only allow secure connections.
 java.lang.reflect.UndeclaredThrowableException
 at 
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
 at 
 org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:56)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:247)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:181)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:103)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:859)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692689#comment-13692689
 ] 

Hadoop QA commented on YARN-884:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589529/yarn-884-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1395//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1395//console

This message is automatically generated.

 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration
 Attachments: yarn-884-1.patch


 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692695#comment-13692695
 ] 

Karthik Kambatla commented on YARN-884:
---

The test TestAMAuthorization fails on trunk as well. Don't think the patch can 
affect the test in any way.

 AM expiry interval should be set to smaller of {am, 
 nm}.liveness-monitor.expiry-interval-ms
 ---

 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla
  Labels: configuration
 Attachments: yarn-884-1.patch


 As the AM can't outlive the NM on which it is running, it is a good idea to 
 disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
 than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira