[jira] [Commented] (YARN-1801) NPE in public localizer

2014-08-22 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106555#comment-14106555
 ] 

Hong Zhiguo commented on YARN-1801:
---

I think YARN-1575 already fixed this NPE. We could mark this one as a duplicate.

 NPE in public localizer
 ---

 Key: YARN-1801
 URL: https://issues.apache.org/jira/browse/YARN-1801
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.2.0
Reporter: Jason Lowe
Assignee: Hong Zhiguo
Priority: Critical
 Attachments: YARN-1801.patch


 While investigating YARN-1800 found this in the NM logs that caused the 
 public localizer to shutdown:
 {noformat}
 2014-01-23 01:26:38,655 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:addResource(651)) - Downloading public 
 rsrc:{ 
 hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar,
  1390440382009, FILE, null }
 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(726)) - Error: Shutting down
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712)
 2014-01-23 01:26:38,656 INFO  localizer.ResourceLocalizationService 
 (ResourceLocalizationService.java:run(728)) - Public cache exiting
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread mai shurong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106583#comment-14106583
 ] 

mai shurong commented on YARN-1458:
---

George Wong,
You can try our YARN-1458.patch; it is easy to understand, but the issue is 
still unresolved. You can also consult the corresponding code in later Hadoop 
versions such as 2.2.1, 2.3.x, and 2.4.x.


zhihai xu,
It seems your approach is more rigorous than our patch.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when 
 clients submitted lots of jobs. The issue is not easy to reproduce; we ran the test 
 cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:


Attachment: YARN-1458.002.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when 
 clients submitted lots of jobs. The issue is not easy to reproduce; we ran the test 
 cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2439) Move winutils task related functionality under a yarn-servers-nodemanager project

2014-08-22 Thread Remus Rusanu (JIRA)
Remus Rusanu created YARN-2439:
--

 Summary: Move winutils task related functionality under a 
yarn-servers-nodemanager project
 Key: YARN-2439
 URL: https://issues.apache.org/jira/browse/YARN-2439
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Remus Rusanu
Priority: Minor


Currently winutils is built as part of hadoop-common, but winutils has features 
that relate strictly to the nodemanager, namely `winutils task`. Being built 
under hadoop-common means that any mvn/pom compile configuration has to be done 
in the hadoop-common project. For example, I wanted to add a configuration file 
similar to the container-executor cfg, which gets the .cfg location from 
${container-executor.conf.dir} in its parent pom. But for winutils I would 
have to add the config to the hadoop-common pom, despite it being specific 
to nodemanager use.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-08-22 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106605#comment-14106605
 ] 

Remus Rusanu commented on YARN-2198:


I have created YARN-2439 to track the separation of the winutils task functionality 
into a nodemanager-related project, away from hadoop-common.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Attachments: YARN-2198.1.patch, YARN-2198.2.patch, 
 YARN-2198.separation.patch


 YARN-1972 introduces a Secure Windows Container Executor. However, this 
 executor requires the process launching the container to be LocalSystem or 
 a member of the local Administrators group. Since the process in question 
 is the NodeManager, the requirement translates to running the entire NM as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106606#comment-14106606
 ] 

zhihai xu commented on YARN-1458:
-

I added a test case, testFairShareWithZeroWeight, in the new patch 
YARN-1458.002.patch to verify that the fix works with zero weight.
Without the fix, testFairShareWithZeroWeight will run forever.
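
For illustration, here is a minimal standalone sketch (hypothetical names, not the actual FairScheduler or ComputeFairShares code) of why an all-zero weight set can keep the fair-share ratio search from converging; in FairScheduler.update() this kind of spin happens while the scheduler lock is held:

{code}
// Standalone sketch only. The method below plays the role of
// ComputeFairShares.resourceUsedWithWeightToResourceRatio: how much resource
// would be handed out if every schedulable received (its weight) * (ratio).
public class ZeroWeightLoopSketch {

  static long resourceUsedWithRatio(double ratio, double[] weights, long cap) {
    long total = 0;
    for (double w : weights) {
      total += Math.min((long) (w * ratio), cap);
    }
    return total;
  }

  public static void main(String[] args) {
    // With size-based weight, apps with no demand/usage can end up with weight 0.
    double[] weights = {0.0, 0.0, 0.0};
    long totalResource = 8192;          // e.g. cluster memory in MB

    double ratio = 1.0;
    int iterations = 0;
    // The real code grows the ratio until enough resource would be handed out,
    // then binary-searches. With all-zero weights the left-hand side is always
    // 0, so the loop condition never becomes false.
    while (resourceUsedWithRatio(ratio, weights, totalResource) < totalResource) {
      ratio *= 2;
      if (++iterations > 60) {          // bound added so this sketch terminates
        System.out.println("ratio search never converges when all weights are 0");
        return;
      }
    }
    System.out.println("converged at ratio " + ratio);
  }
}
{code}

The iteration bound exists only so the sketch exits; the update thread has no such bound, which matches the test running forever without the fix.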

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when 
 clients submitted lots of jobs. The issue is not easy to reproduce; we ran the test 
 cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Hao Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106630#comment-14106630
 ] 

Hao Gao commented on YARN-2345:
---

yarn node -status nodeid 
will list the status of a single node.
I reused the code there to get the status of all nodes.

{code:xml}
Nodes Report :
Node-Id : 192.168.1.6:53239
Rack : /default-rack
Node-State : RUNNING
Node-Http-Address : 192.168.1.6:8042
Last-Health-Update : Fri 22/Aug/14 12:53:38:312PDT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores
{code}

Do we need more information? 
Also, do we need options like -live and -dead?
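
For reference, a rough sketch (illustrative only, not the attached YARN-2345.1.patch) of how the listing above can be assembled from the existing client API, YarnClient.getNodeReports():

{code}
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodesReportSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    try {
      // With no NodeState filter this returns every node the RM knows about.
      List<NodeReport> reports = client.getNodeReports();
      System.out.println("Nodes Report :");
      for (NodeReport r : reports) {
        System.out.println("Node-Id : " + r.getNodeId());
        System.out.println("Rack : " + r.getRackName());
        System.out.println("Node-State : " + r.getNodeState());
        System.out.println("Node-Http-Address : " + r.getHttpAddress());
        System.out.println("Health-Report : " + r.getHealthReport());
        System.out.println("Containers : " + r.getNumContainers());
        Resource used = r.getUsed();          // may be null for non-running nodes
        Resource cap = r.getCapability();
        if (used != null && cap != null) {
          System.out.println("Memory-Used : " + used.getMemory() + "MB");
          System.out.println("Memory-Capacity : " + cap.getMemory() + "MB");
          System.out.println("CPU-Used : " + used.getVirtualCores() + " vcores");
          System.out.println("CPU-Capacity : " + cap.getVirtualCores() + " vcores");
        }
      }
    } finally {
      client.stop();
    }
  }
}
{code}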

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie

 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Hao Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Gao updated YARN-2345:
--

Attachment: YARN-2345.1.patch

Attached the patch.

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie
 Attachments: YARN-2345.1.patch


 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2437) [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info

2014-08-22 Thread Hao Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Gao updated YARN-2437:
--

Assignee: Hao Gao

 [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info
 -

 Key: YARN-2437
 URL: https://issues.apache.org/jira/browse/YARN-2437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie

 With the merger and cleanup of the daemon launch code, yarn-daemons.sh no 
 longer prints Starting information.  This should be made more of an analog 
 of start-dfs.sh/stop-dfs.sh.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-2440:
---

 Summary: Cgroups should limit YARN containers to cores allocated 
in yarn-site.xml
 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev


The current cgroups implementation does not limit YARN containers to the cores 
allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106669#comment-14106669
 ] 

Wangda Tan commented on YARN-2345:
--

Hi Hao,
I think we already have a NodeCLI, which provides yarn node -status nodeid as you 
said. We don't need to add such a method to the RM admin CLI; the RM admin CLI should only 
implement methods contained in ResourceManagerAdministrationProtocol.
I would suggest adding more information when executing yarn node -all -list, 
like memory-used, CPU-used, etc., just like the RM web UI's nodes page. 

Thanks,
Wangda

 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie
 Attachments: YARN-2345.1.patch


 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2440:


Attachment: screenshot-current-implementation.jpg

Screenshot of the CPU usage in the current implementation. In my 
yarn-site.xml, I had set yarn.nodemanager.resource.cpu-vcores to 2. The python 
script takes up as many cores as it can.

The quota for the yarn group was set to -1.

varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat 
/cgroup/cpu/yarn/cpu.cfs_quota_us
-1
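
As a hedged sketch of the intended behaviour (this is not apache-yarn-2440.0.patch; the helper name and period constant below are assumptions), capping the yarn cgroup at the configured vcores boils down to writing a CFS quota worth that many cores:

{code}
// Hypothetical sketch only: derive a cpu.cfs_quota_us hard limit from
// yarn.nodemanager.resource.cpu-vcores. -1 means "no limit", which is what
// the current implementation leaves in place.
public class CfsQuotaSketch {

  static final int CFS_PERIOD_US = 100000;   // default kernel CFS period

  static int quotaForYarnCgroup(int yarnVcores, int nodeProcessors) {
    if (yarnVcores <= 0 || yarnVcores >= nodeProcessors) {
      return -1;                             // nothing to cap
    }
    // quota/period is the fraction of CPU time the group may use; allowing
    // yarnVcores * period lets the group fully use that many cores.
    return CFS_PERIOD_US * yarnVcores;
  }

  public static void main(String[] args) {
    System.out.println(quotaForYarnCgroup(2, 8));  // 200000 (2 cores' worth)
    System.out.println(quotaForYarnCgroup(8, 8));  // -1 (uncapped)
  }
}
{code}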

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2440:


Attachment: apache-yarn-2440.0.patch

Attached patch with fix.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106671#comment-14106671
 ] 

Varun Vasudev commented on YARN-2440:
-

After applying the patch, the quota is set correctly.

{noformat}
varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat 
/cgroup/cpu/yarn/cpu.cfs_quota_us
20
{noformat}

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106687#comment-14106687
 ] 

Varun Vasudev commented on YARN-810:


[~sandyr] [~ywskycn] are you still working on this? If not, I'd like to pick it 
up.

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza
 Attachments: YARN-810.patch, YARN-810.patch


 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 10
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quota:
 {noformat}
 echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
 {noformat}
 It drops to 1% usage, and you can see the box has room to spare:
 {noformat}
 Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si, 
 0.0%st
 {noformat}
 Turning the quota back to -1:
 {noformat}
 echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
 {noformat}
 Burns the cores again:
 {noformat}
 Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
 0.0%st
 CPU
 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
 {noformat}
 On my dev box, I was testing CGroups by running a python process eight times 
 to burn through all the cores, since CGroups was behaving as described above (giving 
 extra CPU to the process, even with a cpu.shares limit). Toggling 
 cfs_quota_us seems to enforce a hard limit.
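
A quick check of the arithmetic above, using the values from the description (the class name is made up):

{code}
// cfs_quota_us divided by cfs_period_us is the fraction of CPU time the
// cgroup may consume per period.
public class CfsCeilingMath {
  public static void main(String[] args) {
    int periodUs = 100000;   // ".1s, not 1s", as observed above
    int quotaUs = 1000;      // value echoed into cpu.cfs_quota_us
    double pct = 100.0 * quotaUs / periodUs;
    System.out.printf("hard ceiling = %.1f%% of one CPU%n", pct);  // 1.0%
  }
}
{code}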
 

[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106720#comment-14106720
 ] 

Hudson commented on YARN-2436:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/654/])
YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619603)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn


 [post-HADOOP-9902] yarn application help doesn't work
 -

 Key: YARN-2436
 URL: https://issues.apache.org/jira/browse/YARN-2436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Allen Wittenauer
Assignee: Allen Wittenauer
  Labels: newbie
 Fix For: 3.0.0

 Attachments: YARN-2436.patch


 The previous version of the yarn command plays games with the command stack 
 for some commands.  The new code needs to duplicate this wackiness.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106717#comment-14106717
 ] 

Hudson commented on YARN-2434:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/654/])
YARN-2434. RM should not recover containers from previously failed attempt when 
AM restart is not enabled. Contributed by Jian He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619614)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java


 RM should not recover containers from previously failed attempt when AM 
 restart is not enabled
 --

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Fix For: 3.0.0, 2.6.0

 Attachments: YARN-2434.1.patch


 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2441) NPE in nodemanager after restart

2014-08-22 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-2441:
---

 Summary: NPE in nodemanager after restart
 Key: YARN-2441
 URL: https://issues.apache.org/jira/browse/YARN-2441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Priority: Minor


{code}
2014-08-22 16:43:19,640 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 Blocking new container-requests as container manager rpc server is still 
starting.
2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 45026: starting
2014-08-22 16:43:20,029 INFO 
org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
 Updating node address : host-10-18-40-95:45026
2014-08-22 16:43:20,029 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 ContainerManager started at /10.18.40.95:45026
2014-08-22 16:43:20,030 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
 ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
callQueue class java.util.concurrent.LinkedBlockingQueue
2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket 
Reader #1 for port 45027
2014-08-22 16:43:20,158 INFO 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding 
protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 
to the server
2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 45027: starting
2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for 
port 45026: readAndProcess from client 10.18.40.84 threw exception 
[java.lang.NullPointerException]
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
at 
org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
at 
org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
at 
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
at 
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
at 
org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
at 
org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
at 
org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
at 
org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
at 
org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
at 
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
at 
org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for 
port 45026: readAndProcess from client 10.18.40.84 threw exception 
[java.lang.NullPointerException]
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2442) ResourceManager JMX UI does not give HA State

2014-08-22 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-2442:
---

 Summary: ResourceManager JMX UI does not give HA State
 Key: YARN-2442
 URL: https://issues.apache.org/jira/browse/YARN-2442
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Priority: Trivial


ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
STOPPED)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106816#comment-14106816
 ] 

Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:26 PM:
-

[~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin 
-report has existed for a very long time while YARN doesn't have one.  From a 
user perspective, it's irrelevant what is happening on the inside, just that 
YARN is weird if the equivalent is yarn node -all -list.




was (Author: aw):
[~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin 
-report has existed for a very long time while the RM doesn't have one.  From a 
user perspective, it's irrelevant what is happening on the inside, just that 
YARN is weird if the equivalent is yarn node -all -list.



 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie
 Attachments: YARN-2345.1.patch


 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106816#comment-14106816
 ] 

Allen Wittenauer commented on YARN-2345:


[~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin 
-report has existed for a very long time while the RM doesn't have one.  From a 
user perspective, it's irrelevant what is happening on the inside, just that 
YARN is weird if the equivalent is yarn node -all -list.



 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie
 Attachments: YARN-2345.1.patch


 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report

2014-08-22 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106816#comment-14106816
 ] 

Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:28 PM:
-

[~leftnoteasy], this is to bring consistency between HDFS and YARN. hdfs 
dfsadmin -report has existed for a very long time while YARN doesn't have one.  
From a user perspective, it's irrelevant what is happening on the inside, just 
that YARN is weird if the equivalent is yarn node -all -list.




was (Author: aw):
[~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin 
-report has existed for a very long time while YARN doesn't have one.  From a 
user perspective, it's irrelevant what is happening on the inside, just that 
YARN is weird if the equivalent is yarn node -all -list.



 yarn rmadmin -report
 

 Key: YARN-2345
 URL: https://issues.apache.org/jira/browse/YARN-2345
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager, resourcemanager
Reporter: Allen Wittenauer
Assignee: Hao Gao
  Labels: newbie
 Attachments: YARN-2345.1.patch


 It would be good to have an equivalent of hdfs dfsadmin -report in YARN.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106863#comment-14106863
 ] 

Hudson commented on YARN-2436:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/])
YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619603)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn


 [post-HADOOP-9902] yarn application help doesn't work
 -

 Key: YARN-2436
 URL: https://issues.apache.org/jira/browse/YARN-2436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Allen Wittenauer
Assignee: Allen Wittenauer
  Labels: newbie
 Fix For: 3.0.0

 Attachments: YARN-2436.patch


 The previous version of the yarn command plays games with the command stack 
 for some commands.  The new code needs to duplicate this wackiness.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106860#comment-14106860
 ] 

Hudson commented on YARN-2434:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/])
YARN-2434. RM should not recover containers from previously failed attempt when 
AM restart is not enabled. Contributed by Jian He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619614)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java


 RM should not recover containers from previously failed attempt when AM 
 restart is not enabled
 --

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Fix For: 3.0.0, 2.6.0

 Attachments: YARN-2434.1.patch


 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106889#comment-14106889
 ] 

Nathan Roberts commented on YARN-2440:
--

Thanks Varun for the patch. I'm wondering if it would be possible to make this 
configurable at the system level and per-app. For example, I'd like an 
application to be able to specify that it wants to run with strict container 
limits (to verify SLAs, for example), but in general I don't want these limits 
in place (why not let a container use additional CPU if it's available?). 

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106893#comment-14106893
 ] 

Varun Vasudev commented on YARN-2440:
-

[~nroberts] there's already a ticket for your request - YARN-810. That's next 
on my todo list. I've left a comment there asking if I can take it over.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2441) NPE in nodemanager after restart

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106902#comment-14106902
 ] 

Jason Lowe commented on YARN-2441:
--

Was this truly running trunk as the Affected Versions field indicates or was 
this some other version of Hadoop?  Also was this a work-preserving NM restart 
scenario (i.e.: yarn.nodemanager.recovery.enabled=true) or a typical NM startup?

 NPE in nodemanager after restart
 

 Key: YARN-2441
 URL: https://issues.apache.org/jira/browse/YARN-2441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Nishan Shetty
Priority: Minor

 {code}
 2014-08-22 16:43:19,640 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Blocking new container-requests as container manager rpc server is still 
 starting.
 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45026: starting
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
  Updating node address : host-10-18-40-95:45026
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager started at /10.18.40.95:45026
 2014-08-22 16:43:20,030 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
 callQueue class java.util.concurrent.LinkedBlockingQueue
 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket 
 Reader #1 for port 45027
 2014-08-22 16:43:20,158 INFO 
 org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding 
 protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 
 to the server
 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45027: starting
 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
   at 
 org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
   at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
   at 
 org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
   at 
 org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
   at 
 org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106932#comment-14106932
 ] 

Hudson commented on YARN-2434:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/])
YARN-2434. RM should not recover containers from previously failed attempt when 
AM restart is not enabled. Contributed by Jian He (jlowe: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619614)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java


 RM should not recover containers from previously failed attempt when AM 
 restart is not enabled
 --

 Key: YARN-2434
 URL: https://issues.apache.org/jira/browse/YARN-2434
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Jian He
Assignee: Jian He
 Fix For: 3.0.0, 2.6.0

 Attachments: YARN-2434.1.patch


 If container-preserving AM restart is not enabled and AM failed during RM 
 restart, RM on restart should not recover containers from previously failed 
 attempt.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106935#comment-14106935
 ] 

Hudson commented on YARN-2436:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/])
YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619603)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn


 [post-HADOOP-9902] yarn application help doesn't work
 -

 Key: YARN-2436
 URL: https://issues.apache.org/jira/browse/YARN-2436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Allen Wittenauer
Assignee: Allen Wittenauer
  Labels: newbie
 Fix For: 3.0.0

 Attachments: YARN-2436.patch


 The previous version of the yarn command plays games with the command stack 
 for some commands.  The new code needs to duplicate this wackiness.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2393) FairScheduler: Implement steady fair share

2014-08-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2393:
---

Summary: FairScheduler: Implement steady fair share  (was: Fair Scheduler : 
Implement steady fair share)

 FairScheduler: Implement steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active and inactive) 
 queues. It would be shown on the UI for better predictability of the finish time 
 of applications.
 We would compute the static fair share only when needed, such as on queue creation 
 or node addition/removal. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2442) ResourceManager JMX UI does not give HA State

2014-08-22 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2442:


Affects Version/s: (was: 3.0.0)
   2.5.0

 ResourceManager JMX UI does not give HA State
 -

 Key: YARN-2442
 URL: https://issues.apache.org/jira/browse/YARN-2442
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.5.0
Reporter: Nishan Shetty
Priority: Trivial

 ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, 
 STOPPED)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2441) NPE in nodemanager after restart

2014-08-22 Thread Nishan Shetty (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106987#comment-14106987
 ] 

Nishan Shetty commented on YARN-2441:
-

[~jlowe] Sorry, I mentioned the wrong Affected Version; it's branch-2. 
Work-preserving NM restart is not enabled; it's just a plain restart.
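
Purely as a hypothetical illustration (this is not the NMTokenSecretManagerInNM code, and the assumed cause, a master key that is simply absent after a plain restart, is only a guess from the stack trace), the kind of guard that turns this NPE into a clean token rejection looks like:

{code}
import org.apache.hadoop.security.token.SecretManager.InvalidToken;

// Hypothetical, simplified secret-manager lookup; not the real NM class.
public class TokenPasswordLookupSketch {

  static byte[] lookupMasterKey(int keyId) {
    // Stand-in for the per-key state the secret manager keeps. After a plain
    // (non-recovering) restart that state is empty, so lookups return null.
    return null;
  }

  static byte[] retrievePassword(int keyId) throws InvalidToken {
    byte[] masterKey = lookupMasterKey(keyId);
    if (masterKey == null) {
      // Reject the stale token instead of letting the IPC reader thread hit an NPE.
      throw new InvalidToken("No master key for key id " + keyId
          + " (was the NM restarted without recovery?)");
    }
    return computePassword(masterKey);
  }

  static byte[] computePassword(byte[] masterKey) {
    return masterKey;  // placeholder for the real HMAC computation
  }

  public static void main(String[] args) {
    try {
      retrievePassword(42);
    } catch (InvalidToken e) {
      System.out.println("rejected cleanly: " + e.getMessage());
    }
  }
}
{code}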

 NPE in nodemanager after restart
 

 Key: YARN-2441
 URL: https://issues.apache.org/jira/browse/YARN-2441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nishan Shetty
Priority: Minor

 {code}
 2014-08-22 16:43:19,640 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Blocking new container-requests as container manager rpc server is still 
 starting.
 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45026: starting
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
  Updating node address : host-10-18-40-95:45026
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager started at /10.18.40.95:45026
 2014-08-22 16:43:20,030 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
 callQueue class java.util.concurrent.LinkedBlockingQueue
 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket 
 Reader #1 for port 45027
 2014-08-22 16:43:20,158 INFO 
 org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding 
 protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 
 to the server
 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45027: starting
 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
   at 
 org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
   at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
   at 
 org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
   at 
 org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
   at 
 org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share

2014-08-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106990#comment-14106990
 ] 

Karthik Kambatla commented on YARN-2393:


One of the reasons we (Sandy and I) wanted to make the fair share used for 
scheduling instantaneous was to address the case where, with multiple queues, 
maxAMResource becomes so small that we can't run any applications at all. I 
think it is better to leave it as is. If anyone runs into issues with 
maxAMResource in testing, we can consider preempting AMs as an alternative. 

 FairScheduler: Implement steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share

2014-08-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106991#comment-14106991
 ] 

Karthik Kambatla commented on YARN-2393:


Committing this. 

 FairScheduler: Implement steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share

2014-08-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2393:
---

Summary: FairScheduler: Add the notion of steady fair share  (was: 
FairScheduler: Implement steady fair share)

 FairScheduler: Add the notion of steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share

2014-08-22 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-2393:
---

Issue Type: New Feature  (was: Improvement)

 FairScheduler: Add the notion of steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106995#comment-14106995
 ] 

Varun Vasudev commented on YARN-160:


[~djp]
{quote}
Both physical id and core id are not guaranteed to be present in /proc/cpuinfo 
(please see below for my local VM's info). We may use the processor number 
instead in case these ids are 0 (like we did on Windows). Again, this weakens 
my confidence that this automatic way of getting CPU/memory resources should 
happen by default (not sure if there are any cross-platform issues). Maybe a 
safer way here is to keep the previous default behavior (with some static 
setting) with an extra config to enable this. We can wait for this feature to 
become more stable before changing the default behavior.
{noformat}

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 70
model name  : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
stepping: 1
cpu MHz : 2295.265
cache size  : 6144 KB
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm 
constant_tsc up arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc 
aperfmperf unfair_spinlock pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 
x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb 
xsaveopt pln pts dts tpr_shadow vnmi ept vpid fsgsbase smep
bogomips: 4590.53
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
{noformat}
{quote}

In the example you gave, where we have processors listed but no physical id or 
core id entries, the numProcessors will be set to the number of entries and 
numCores will be set to 1. From the diff -
{noformat}
+  numCores = 1;
{noformat}
There is also a test case to ensure this behaviour.

In addition, cluster administrators can decide whether the NodeManager should 
report numProcessors or numCores by toggling 
yarn.nodemanager.resource.count-logical-processors-as-vcores, which is true by 
default. In the VM example, the NodeManager will therefore report vcores as the 
number of processor entries in /proc/cpuinfo by default. If 
yarn.nodemanager.resource.count-logical-processors-as-vcores is set to false, 
the NodeManager will report vcores as 1 (if there are no physical id or core id 
entries).
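
For illustration, here is a minimal, hypothetical sketch of that counting logic 
(not the apache-yarn-160 patch itself): count processor entries as logical 
processors and distinct (physical id, core id) pairs as cores, falling back to 
1 core when those fields are absent, as in the VM output above.
{code}
// Sketch only: derive processor/core counts from /proc/cpuinfo.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CpuInfoSketch {
  public static void main(String[] args) throws IOException {
    List<String> lines =
        Files.readAllLines(Paths.get("/proc/cpuinfo"), StandardCharsets.UTF_8);
    int numProcessors = 0;
    Set<String> coreIds = new HashSet<String>();
    String physicalId = null;
    for (String line : lines) {
      String[] kv = line.split(":", 2);
      if (kv.length < 2) {
        continue;
      }
      String key = kv[0].trim();
      String value = kv[1].trim();
      if (key.equals("processor")) {
        numProcessors++;
      } else if (key.equals("physical id")) {
        physicalId = value;
      } else if (key.equals("core id")) {
        coreIds.add(physicalId + "/" + value);
      }
    }
    // No physical id/core id entries at all => treat the node as 1 core.
    int numCores = coreIds.isEmpty() ? 1 : coreIds.size();
    System.out.println("processors=" + numProcessors + ", cores=" + numCores);
  }
}
{code}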

 nodemanagers should obtain cpu/memory values from underlying OS
 ---

 Key: YARN-160
 URL: https://issues.apache.org/jira/browse/YARN-160
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Varun Vasudev
 Fix For: 2.6.0

 Attachments: apache-yarn-160.0.patch, apache-yarn-160.1.patch


 As mentioned in YARN-2
 *NM memory and CPU configs*
 Currently these values are coming from the config of the NM; we should be 
 able to obtain those values from the OS (i.e., in the case of Linux, from 
 /proc/meminfo & /proc/cpuinfo). As this is highly OS dependent, we should have 
 an interface that obtains this information. In addition, implementations of 
 this interface should be able to specify a mem/cpu offset (the amount of 
 mem/cpu not to be available as a YARN resource); this would allow reserving 
 mem/cpu for the OS and other services outside of YARN containers.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2441) NPE in nodemanager after restart

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106999#comment-14106999
 ] 

Jason Lowe commented on YARN-2441:
--

Ah, then this seems like a case where a client (likely an AM) is connecting to 
the NM before the NM has finished registering with the RM to get the secret 
keys.  Trying to block new container requests at the app level probably isn't 
going to work in practice because the SASL layer in RPC doesn't let the 
connection get to the point where the app can try to reject the request.

IMHO we should remove the blocking client requests code and instead do a 
delayed server start, sorta like the delay added by YARN-1337 when NM recovery 
is enabled.  Ideally the RPC layer would support the ability to bind to a 
server socket but not start accepting requests until later.  That would allow 
us to register with the RM knowing what our client port is but without trying 
to let clients through that port until we're really ready.

A shorter-term fix might be to have the secret manager throw an exception that 
clients can retry if the master key isn't set yet.
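
As a sketch of that shorter-term idea (the guard condition, field and helper 
names below are hypothetical, and this is not the committed change), the secret 
manager could surface a retriable error instead of an NPE while the master key 
from the RM is still missing:
{code}
import java.io.IOException;

import org.apache.hadoop.ipc.RetriableException;

// Sketch only: reject-with-retry while the NMToken master key is unset.
public class RetriableGuardSketch {
  // Hypothetical stand-in for the secret manager's current master key state.
  private volatile Object currentMasterKey = null;

  public byte[] retrievePasswordGuarded(byte[] tokenIdentifier)
      throws IOException {
    if (currentMasterKey == null) {
      // RetriableException tells the RPC client the call may be retried later.
      throw new RetriableException(
          "NMToken master key not yet received from the RM; please retry");
    }
    return computePassword(tokenIdentifier); // hypothetical helper
  }

  private byte[] computePassword(byte[] tokenIdentifier) {
    return new byte[0]; // placeholder
  }
}
{code}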

 NPE in nodemanager after restart
 

 Key: YARN-2441
 URL: https://issues.apache.org/jira/browse/YARN-2441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nishan Shetty
Priority: Minor

 {code}
 2014-08-22 16:43:19,640 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Blocking new container-requests as container manager rpc server is still 
 starting.
 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45026: starting
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
  Updating node address : host-10-18-40-95:45026
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager started at /10.18.40.95:45026
 2014-08-22 16:43:20,030 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
 callQueue class java.util.concurrent.LinkedBlockingQueue
 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket 
 Reader #1 for port 45027
 2014-08-22 16:43:20,158 INFO 
 org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding 
 protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 
 to the server
 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45027: starting
 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
   at 
 org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
   at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
   at 
 org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
   at 
 org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
   at 
 org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
 2014-08-22 

[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107000#comment-14107000
 ] 

Wei Yan commented on YARN-810:
--

[~vvasudev], thanks for the offer. I'm still working on this.

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza
 Attachments: YARN-810.patch, YARN-810.patch


 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 100000
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quota:
 {noformat}
 echo 1000 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
 {noformat}
 It drops to 1% usage, and you can see the box has room to spare:
 {noformat}
 Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si, 
 0.0%st
 {noformat}
 Turning the quota back to -1:
 {noformat}
 echo -1 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
 {noformat}
 Burns the cores again:
 {noformat}
 Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
 0.0%st
 CPU
 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
 {noformat}
 On my dev box, I was testing CGroups by running a python process eight times, 
 to burn through all the cores, since it was doing as described above (giving 
 extra CPU to the process, even with a cpu.shares limit). Toggling the 
 cfs_quota_us seems to enforce a hard limit.
 Implementation:
 What do you guys think about 

[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share

2014-08-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107007#comment-14107007
 ] 

Hudson commented on YARN-2393:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6097 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6097/])
YARN-2393. FairScheduler: Add the notion of steady fair share. (Wei Yan via 
kasha) (kasha: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619845)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueueMetrics.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerFairShare.java


 FairScheduler: Add the notion of steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107005#comment-14107005
 ] 

Wei Yan commented on YARN-2440:
---

[~vvasudev], for general cases, we shouldn't strictly limit cfs_quota_us. 
We always want to let co-located containers share the CPU resource 
proportionally, not strictly follow the container_vcores/NM_vcores ratio. We 
have one runnable patch in YARN-810. I'll check with Sandy about the review.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107012#comment-14107012
 ] 

Varun Vasudev commented on YARN-2440:
-

[~ywskycn] this patch doesn't limit containers to the container_vcores/NM_vcores 
ratio. What it does do is limit overall YARN usage to 
yarn.nodemanager.resource.cpu-vcores. If you have 4 cores on a machine and set 
yarn.nodemanager.resource.cpu-vcores to 2, we don't restrict the YARN containers 
to 2 cores. The containers can create threads and use up as many cores as they 
want, which defeats the purpose of setting yarn.nodemanager.resource.cpu-vcores.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2431) NM restart: cgroup is not removed for reacquired containers

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107008#comment-14107008
 ] 

Jason Lowe commented on YARN-2431:
--

Release audit problems are unrelated, see HDFS-6905.

 NM restart: cgroup is not removed for reacquired containers
 ---

 Key: YARN-2431
 URL: https://issues.apache.org/jira/browse/YARN-2431
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Jason Lowe
 Attachments: YARN-2431.patch


 The cgroup for a reacquired container is not being removed when the container 
 exits.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107025#comment-14107025
 ] 

Wei Yan commented on YARN-2440:
---

[~vvasudev], I misunderstood this JIRA. Will post a comment later.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107050#comment-14107050
 ] 

Wei Yan commented on YARN-2393:
---

Thanks, [~kasha], [~ashwinshankar77]. Will post a patch for YARN-2360 for the 
UI.

 FairScheduler: Add the notion of steady fair share
 --

 Key: YARN-2393
 URL: https://issues.apache.org/jira/browse/YARN-2393
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Fix For: 2.6.0

 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, 
 yarn-2393-4.patch


 Static fair share is a fair share allocation considering all (active/inactive) 
 queues. It would be shown on the UI for better predictability of the finish 
 time of applications.
 We would compute static fair share only when needed, like on queue creation, 
 node added/removed. Please see YARN-2026 for discussions on this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2441) NPE in nodemanager after restart

2014-08-22 Thread Nishan Shetty (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishan Shetty updated YARN-2441:


Priority: Major  (was: Minor)

 NPE in nodemanager after restart
 

 Key: YARN-2441
 URL: https://issues.apache.org/jira/browse/YARN-2441
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.5.0
Reporter: Nishan Shetty

 {code}
 2014-08-22 16:43:19,640 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Blocking new container-requests as container manager rpc server is still 
 starting.
 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45026: starting
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager:
  Updating node address : host-10-18-40-95:45026
 2014-08-22 16:43:20,029 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager started at /10.18.40.95:45026
 2014-08-22 16:43:20,030 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  ContainerManager bound to host-10-18-40-95/10.18.40.95:45026
 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
 callQueue class java.util.concurrent.LinkedBlockingQueue
 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket 
 Reader #1 for port 45027
 2014-08-22 16:43:20,158 INFO 
 org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding 
 protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 
 to the server
 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server 
 listener on 45027: starting
 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43)
   at 
 org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278)
   at 
 org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585)
   at 
 com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384)
   at 
 org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361)
   at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275)
   at 
 org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238)
   at 
 org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878)
   at 
 org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755)
   at 
 org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519)
   at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750)
   at 
 org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624)
   at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595)
 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
 for port 45026: readAndProcess from client 10.18.40.84 threw exception 
 [java.lang.NullPointerException]
 java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107057#comment-14107057
 ] 

Jason Lowe commented on YARN-2440:
--

I think cfs_quota_us has a maximum value of 1000000, so we may have an issue if 
vcores > 10.

I don't see how this takes into account the mapping of vcores to actual CPUs.   
It's not safe to assume 1 vcore == 1 physical CPU, as some sites will map 
multiple vcores to a physical core to allow fractions of a physical CPU to be 
allocated or to account for varying CPU performance across a heterogeneous 
cluster.
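
To make the arithmetic behind that concern concrete, a small sketch (the 
per-vcore quota of 100000 us and the 1000000 us cap are assumptions taken from 
this discussion, not values read from the patch):
{code}
// Sketch: why a quota derived as vcores * period can hit an upper bound.
public class CfsQuotaSketch {
  static final long CFS_PERIOD_US = 100000;     // 0.1s, the default seen in YARN-810
  static final long MAX_CFS_QUOTA_US = 1000000; // assumed kernel upper bound

  public static void main(String[] args) {
    for (long vcores : new long[] {4, 10, 16}) {
      long quota = vcores * CFS_PERIOD_US;
      System.out.printf("vcores=%d -> cfs_quota_us=%d (%s)%n", vcores, quota,
          quota > MAX_CFS_QUOTA_US
              ? "exceeds the cap; must clamp or raise the period"
              : "within the cap");
    }
  }
}
{code}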

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107066#comment-14107066
 ] 

zhihai xu commented on YARN-1458:
-

[~shurong.mai], YARN-1458.patch will cause a regression. It won't work if all 
the weights and MinShares in the active queues are less than 1.
The type conversion from double to int in computeShare loses precision.
{code}
private static int computeShare(Schedulable sched, double w2rRatio,
  ResourceType type) {
double share = sched.getWeights().getWeight(type) * w2rRatio;
share = Math.max(share, getResourceValue(sched.getMinShare(), type));
share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
return (int) share;
  }
{code}
In the above code, the initial value of w2rRatio is 1.0. If the weight and 
MinShare are less than 1, computeShare will return 0. 
resourceUsedWithWeightToResourceRatio returns the sum of all these computeShare 
return values (after the precision loss), so it will be zero if all the weights 
and MinShares in the active queues are less than 1. YARN-1458.patch will then 
exit the loop early with an rMax value of 1.0, the right variable will be less 
than rMax (1.0), and every queue's fair share will be set to 0 in the following 
code.
{code}
for (Schedulable sched : schedulables) {
  setResourceValue(computeShare(sched, right, type), sched.getFairShare(), 
type);
}
{code}

This is why TestFairScheduler fails at line 1049.
testIsStarvedForFairShare configures queueA with weight 0.25 and queueB with 
weight 0.75, and a total node resource of 4 * 1024.
It creates two applications: one is assigned to queueA and the other to queueB.
After FairScheduler (update) calculates the fair shares, queueA's fair share 
should be 1 * 1024 and queueB's fair share should be 3 * 1024.
But with YARN-1458.patch, both queueA's and queueB's fair shares are set to 0.
This is because the test has two active queues, queueA and queueB, whose 
weights are both less than 1 (0.25 and 0.75); MinShare (minResources) is not 
configured for either queue, so both use the default value (0).
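
A standalone rendering of that truncation (same logic as the computeShare 
snippet quoted above, with the test's weights; illustrative only):
{code}
// Sketch: with w2rRatio = 1.0 and all weights/minShares below 1, every
// (int) cast truncates to 0, so the summed "resource used" is 0.
public class ComputeShareLossSketch {
  static int computeShare(double weight, double minShare, double maxShare,
      double w2rRatio) {
    double share = weight * w2rRatio;
    share = Math.max(share, minShare);
    share = Math.min(share, maxShare);
    return (int) share;                 // 0.25 -> 0, 0.75 -> 0
  }

  public static void main(String[] args) {
    double[] weights = {0.25, 0.75};    // queueA and queueB from the test
    int sum = 0;
    for (double w : weights) {
      sum += computeShare(w, 0 /* default minShare */, Integer.MAX_VALUE, 1.0);
    }
    // Prints 0, which is why the loop exits early and fair shares become 0.
    System.out.println("resourceUsedWithWeightToResourceRatio = " + sum);
  }
}
{code}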

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots jobs, it is not easy to reapear. We run the test cluster 
 for days to reapear it. The output of  jstack command on resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 

[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107068#comment-14107068
 ] 

Varun Vasudev commented on YARN-2440:
-

[~jlowe] does it make sense to get the number of physical cores on the machine 
and derive the vcore to physical cpu ratio?

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107069#comment-14107069
 ] 

Varun Vasudev commented on YARN-2440:
-

I'll update the patch to limit cfs_quota_us.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107093#comment-14107093
 ] 

Jason Lowe commented on YARN-2440:
--

bq. does it make sense to get the number of physical cores on the machine and 
derive the vcore to physical cpu ratio?

Only if the user can specify the multiplier between a vcore and a physical CPU. 
 Not all physical CPUs are created equal, and as I mentioned earlier, some 
sites will want to allow fractions of a physical CPU to be allocated.  
Otherwise we're limiting the number of containers to the number of physical 
cores, and not all tasks require a full core.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2443) Log Handling for Long Running Service

2014-08-22 Thread Xuan Gong (JIRA)
Xuan Gong created YARN-2443:
---

 Summary: Log Handling for Long Running Service
 Key: YARN-2443
 URL: https://issues.apache.org/jira/browse/YARN-2443
 Project: Hadoop YARN
  Issue Type: Task
  Components: nodemanager, resourcemanager
Reporter: Xuan Gong
Assignee: Xuan Gong






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1104) NMs to support rolling logs of stdout & stderr

2014-08-22 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-1104:


Parent Issue: YARN-2443  (was: YARN-896)

 NMs to support rolling logs of stdout & stderr
 --

 Key: YARN-1104
 URL: https://issues.apache.org/jira/browse/YARN-1104
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.1.0-beta
Reporter: Steve Loughran
Assignee: Xuan Gong

 Currently NMs stream the stdout and stderr streams of a container to a file. 
 For longer lived processes those files need to be rotated so that the log 
 doesn't overflow



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107172#comment-14107172
 ] 

Varun Vasudev commented on YARN-810:


[~ywskycn] thanks for letting me know! Some comments on your patch -

1. In CgroupsLCEResourcesHandler.java, you set cfs_period_us to nmShares and 
cfs_quota_us to cpuShares. Per the RedHat documentation, cfs_period_us and 
cfs_quota_us operate on a per-CPU basis:
{quote}
   Note that the quota and period parameters operate on a CPU basis. To allow a 
process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 
and cpu.cfs_period_us to 100000. 
{quote}
With your current implementation, on a machine with 4 cores (and 4 vcores), a 
container which requests 2 vcores will have cfs_period_us set to 4096 and 
cfs_quota_us set to 2048, which will end up limiting it to 50% of one CPU. Is my 
understanding wrong?

2. This is just nitpicking, but is it possible to change 
CpuEnforceCeilingEnabled (and its variants) to just CpuCeilingEnabled or 
CpuCeilingEnforced?
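
For reference, the arithmetic behind point 1 as a sketch (values are taken from 
the example above, not from any particular patch):
{code}
// Sketch: cfs_quota_us / cfs_period_us is a per-container CPU budget measured
// in CPUs, so the quota must be a multiple of the period to span >1 core.
public class CfsPeriodQuotaSketch {
  public static void main(String[] args) {
    int periodUs = 4096;
    int quotaUs = 2048;                   // the 2-vcore container in the example
    System.out.printf("quota/period = %.2f CPUs (i.e. 50%% of one core)%n",
        (double) quotaUs / periodUs);

    // RedHat-style setting that lets a container use two full CPUs:
    int period = 100000;
    int vcoresRequested = 2;
    int quota = vcoresRequested * period; // 200000
    System.out.printf("to allow %d CPUs: cfs_period_us=%d, cfs_quota_us=%d%n",
        vcoresRequested, period, quota);
  }
}
{code}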

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza
 Attachments: YARN-810.patch, YARN-810.patch


 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 100000
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quota:
 {noformat}
 echo 1000 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
 {noformat}
 It drops to 1% usage, and you can see the box 

[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: (was: Screen_Shot_v3.png)

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt, YARN-2360-v2.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: Screen_Shot_v3.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 YARN-2360-v1.txt, YARN-2360-v2.txt


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: YARN-2360-v3.patch

Updated the patch after YARN-2393. Screen_Shot_v3.png shows the fair scheduler 
web page.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: Screen_Shot_v3.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107221#comment-14107221
 ] 

Wei Yan commented on YARN-810:
--

bq. With your current implementation, on a machine with 4 cores(and 4 vcores), 
a container which requests 2 vcores will have cfs_period_us set to 4096 and 
cfs_quota_us set to 2048 which will end up limiting it to 50% of one CPU. Is my 
understanding wrong?

Thanks, [~vvasudev]. I noticed this problem after reading your YARN-2420 
patch. I'll double-check it and update the patch.

 Support CGroup ceiling enforcement on CPU
 -

 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini
Assignee: Sandy Ryza
 Attachments: YARN-810.patch, YARN-810.patch


 Problem statement:
 YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
 Containers are then allowed to request vcores between the minimum and maximum 
 defined in the yarn-site.xml.
 In the case where a single-threaded container requests 1 vcore, with a 
 pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
 the core it's using, provided that no other container is also using it. This 
 happens, even though the only guarantee that YARN/CGroups is making is that 
 the container will get at least 1/4th of the core.
 If a second container then comes along, the second container can take 
 resources from the first, provided that the first container is still getting 
 at least its fair share (1/4th).
 There are certain cases where this is desirable. There are also certain cases 
 where it might be desirable to have a hard limit on CPU usage, and not allow 
 the process to go above the specified resource requirement, even if it's 
 available.
 Here's an RFC that describes the problem in more detail:
 http://lwn.net/Articles/336127/
 Solution:
 As it happens, when CFS is used in combination with CGroups, you can enforce 
 a ceiling using two files in cgroups:
 {noformat}
 cpu.cfs_quota_us
 cpu.cfs_period_us
 {noformat}
 The usage of these two files is documented in more detail here:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html
 Testing:
 I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, 
 it behaves as described above (it is a soft cap, and allows containers to use 
 more than they asked for). I then tested CFS CPU quotas manually with YARN.
 First, you can see that CFS is in use in the CGroup, based on the file names:
 {noformat}
 [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
 total 0
 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
 drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
 -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
 -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
 -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
 100000
 [criccomi@eat1-qa464 ~]$ sudo -u app cat
 /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
 -1
 {noformat}
 Oddly, it appears that the cfs_period_us is set to .1s, not 1s.
 We can place processes in hard limits. I have process 4370 running YARN 
 container container_1371141151815_0003_01_03 on a host. By default, it's 
 running at ~300% cpu usage.
 {noformat}
 CPU
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
 {noformat}
 When I set the CFS quota:
 {noformat}
 echo 1000 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
  CPU
 4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
 {noformat}
 It drops to 1% usage, and you can see the box has room to spare:
 {noformat}
 Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si, 
 0.0%st
 {noformat}
 Turning the quota back to -1:
 {noformat}
 echo -1 > 
 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
 {noformat}
 Burns the cores again:
 {noformat}
 Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
 0.0%st
 CPU
 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 

[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107244#comment-14107244
 ] 

Karthik Kambatla commented on YARN-2360:


I would rename the legend to "Steady fairshare" and "Instantaneous fairshare". 

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107252#comment-14107252
 ] 

Wei Yan commented on YARN-2360:
---

Thanks, Karthik. I will update the patch with these changes, and also fix another 
problem in FairSchedulerQueueInfo.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.

2014-08-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated YARN-2444:
-

Attachment: ats.java

 Primary filters added after first submission not indexed, cause exceptions in 
 logs.
 ---

 Key: YARN-2444
 URL: https://issues.apache.org/jira/browse/YARN-2444
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.5.0
Reporter: Marcelo Vanzin
 Attachments: ats.java


 See attached code for an example. The code creates an entity with a primary 
 filter, submits it to the ATS. After that, a new primary filter value is 
 added and the entity is resubmitted. At that point two things can be seen:
 - Searching for the new primary filter value does not return the entity
 - The following exception shows up in the logs:
 {noformat}
 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying 
 access for user dr.who (auth:SIMPLE) on the events of the timeline entity { 
 id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test }
 org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the 
 timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test 
 } is corrupted.
 at 
 org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67)
 at 
 org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107269#comment-14107269
 ] 

Marcelo Vanzin commented on YARN-2444:
--

The following search causes the problem described above:

{noformat}/ws/v1/timeline/test?primaryFilter=prop2:val2{noformat}

The following one works as expected:

{noformat}/ws/v1/timeline/test?primaryFilter=prop1:val1{noformat}
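
For reference, the attached ats.java is not reproduced here; the following is a
minimal sketch, assuming the standard TimelineClient API, of the
submit / add-filter / resubmit sequence that triggers the behaviour above:

{code:java}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrimaryFilterRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("test");
      entity.setEntityId("testid-" + java.util.UUID.randomUUID());
      entity.setStartTime(System.currentTimeMillis());

      // First submission: only prop1 gets indexed.
      entity.addPrimaryFilter("prop1", "val1");
      client.putEntities(entity);

      // Second submission adds prop2; queries on primaryFilter=prop2:val2 then
      // miss the entity and the "Owner information ... is corrupted" error
      // appears in the ATS logs.
      entity.addPrimaryFilter("prop2", "val2");
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}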

 Primary filters added after first submission not indexed, cause exceptions in 
 logs.
 ---

 Key: YARN-2444
 URL: https://issues.apache.org/jira/browse/YARN-2444
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.5.0
Reporter: Marcelo Vanzin
 Attachments: ats.java


 See attached code for an example. The code creates an entity with a primary 
 filter, submits it to the ATS. After that, a new primary filter value is 
 added and the entity is resubmitted. At that point two things can be seen:
 - Searching for the new primary filter value does not return the entity
 - The following exception shows up in the logs:
 {noformat}
 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying 
 access for user dr.who (auth:SIMPLE) on the events of the timeline entity { 
 id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test }
 org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the 
 timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test 
 } is corrupted.
 at 
 org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67)
 at 
 org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107275#comment-14107275
 ] 

Varun Vasudev commented on YARN-2440:
-

It might make things easier to go with [~sandyr]'s idea of adding a config which 
expresses the percentage of the node's CPU that is used by YARN. [~jlowe], would 
that address your concerns?

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: Screen_Shot_v4.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: YARN-2360-v4.patch

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107331#comment-14107331
 ] 

Jason Lowe commented on YARN-2440:
--

Sure, for this JIRA we can go with a percentage of total CPU to limit YARN. For 
something like YARN-160 we'd need the user to specify some kind of relationship 
between vcores and physical cores.
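
As a rough sketch of the percentage approach (the configuration key and helper
below are hypothetical, not taken from any attached patch), the node-level limit
could translate into CFS values along these lines:

{code:java}
import org.apache.hadoop.conf.Configuration;

public final class NodeCpuLimit {
  // Hypothetical key name, for illustration only.
  static final String NM_CPU_PERCENTAGE =
      "yarn.nodemanager.resource.percentage-physical-cpu-limit";
  static final int CFS_PERIOD_US = 100000;

  /** cfs_quota_us to write for the top-level hadoop-yarn cgroup. */
  static int yarnCgroupQuotaUs(Configuration conf, int physicalCores) {
    int pct = conf.getInt(NM_CPU_PERCENTAGE, 100);
    // e.g. 8 physical cores at 80% -> 6.4 cores -> 640000 us per 100000 us period.
    return (int) Math.round(physicalCores * (pct / 100.0) * CFS_PERIOD_US);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt(NM_CPU_PERCENTAGE, 80);
    System.out.println(yarnCgroupQuotaUs(conf, 8)); // 640000
  }
}
{code}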

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107351#comment-14107351
 ] 

Ashwin Shankar commented on YARN-2360:
--

[~ywskycn], patch looks good. Should we explain what Instantaneous and 
Steady fair share mean in the fair scheduler doc, i.e. the apt.vm file, so that 
users know what they mean? I'm also torn on whether we should define these 
terms on the UI as part of the legend tooltip or in some other way?


 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107379#comment-14107379
 ] 

Hadoop QA commented on YARN-1458:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663617/YARN-1458.002.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4695//console

This message is automatically generated.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2408) Resource Request REST API for YARN

2014-08-22 Thread Renan DelValle (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renan DelValle updated YARN-2408:
-

Attachment: (was: YARN-2408-2.patch)

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408-3.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested, in order to gain more insight into cluster usage 
 than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API:
 {code:xml}
 <resourceRequests>
   <MB>96256</MB>
   <VCores>94</VCores>
   <appMaster>
     <applicationId>application_</applicationId>
     <applicationAttemptId>appattempt_</applicationAttemptId>
     <queueName>default</queueName>
     <totalPendingMB>96256</totalPendingMB>
     <totalPendingVCores>94</totalPendingVCores>
     <numResourceRequests>3</numResourceRequests>
     <resourceRequests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>/default-rack</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>*</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>master</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
     </resourceRequests>
   </appMaster>
 </resourceRequests>
 {code}
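 For illustration, external monitoring software could consume a response of this
 shape roughly as sketched below; the endpoint URL is a placeholder, since the
 proposal above only defines the payload:
 {code:java}
 import java.io.InputStream;
 import java.net.URL;
 import javax.xml.parsers.DocumentBuilderFactory;
 import org.w3c.dom.Document;
 import org.w3c.dom.Element;
 import org.w3c.dom.NodeList;
 
 public class ResourceRequestMonitor {
   public static void main(String[] args) throws Exception {
     // Placeholder URL; the actual path would be defined by the patch.
     URL url = new URL("http://rm-host:8088/ws/v1/cluster/resourcerequests");
     try (InputStream in = url.openStream()) {
       Document doc =
           DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
       NodeList apps = doc.getElementsByTagName("appMaster");
       for (int i = 0; i < apps.getLength(); i++) {
         Element app = (Element) apps.item(i);
         // A large, long-lived pending total is a hint that the app may be starved.
         System.out.println(text(app, "applicationId") + " pending: "
             + text(app, "totalPendingMB") + " MB, "
             + text(app, "totalPendingVCores") + " vcores");
       }
     }
   }
 
   private static String text(Element parent, String tag) {
     return parent.getElementsByTagName(tag).item(0).getTextContent();
   }
 }
 {code}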



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2408) Resource Request REST API for YARN

2014-08-22 Thread Renan DelValle (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renan DelValle updated YARN-2408:
-

Attachment: YARN-2408-3.patch

Bug fix

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408-3.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested, in order to gain more insight into cluster usage 
 than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API:
 {code:xml}
 <resourceRequests>
   <MB>96256</MB>
   <VCores>94</VCores>
   <appMaster>
     <applicationId>application_</applicationId>
     <applicationAttemptId>appattempt_</applicationAttemptId>
     <queueName>default</queueName>
     <totalPendingMB>96256</totalPendingMB>
     <totalPendingVCores>94</totalPendingVCores>
     <numResourceRequests>3</numResourceRequests>
     <resourceRequests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>/default-rack</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>*</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>master</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
     </resourceRequests>
   </appMaster>
 </resourceRequests>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107429#comment-14107429
 ] 

Hadoop QA commented on YARN-2440:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12663704/apache-yarn-2440.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4694//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4694//console

This message is automatically generated.

 Cgroups should limit YARN containers to cores allocated in yarn-site.xml
 

 Key: YARN-2440
 URL: https://issues.apache.org/jira/browse/YARN-2440
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Varun Vasudev
Assignee: Varun Vasudev
 Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
 screenshot-current-implementation.jpg


 The current cgroups implementation does not limit YARN containers to the 
 cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:


Attachment: YARN-1458.003.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107471#comment-14107471
 ] 

zhihai xu commented on YARN-1458:
-

I uploaded a new patch, YARN-1458.003.patch, to resolve a merge conflict after 
rebasing to the latest code.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107479#comment-14107479
 ] 

Hadoop QA commented on YARN-2360:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663715/YARN-2360-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4696//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4696//console

This message is automatically generated.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: Screen_Shot_v5.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, 
 YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: (was: Screen_Shot_v5.png)

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: YARN-2360-v5.patch

A new patch that adds the descriptions to the fair scheduler .apt.vm file, and 
also shows them in the web UI when the mouse hovers over the steady fair share 
label or the instantaneous fair share label.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, 
 YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2408) Resource Request REST API for YARN

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107563#comment-14107563
 ] 

Hadoop QA commented on YARN-2408:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663726/YARN-2408-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4697//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4697//console

This message is automatically generated.

 Resource Request REST API for YARN
 --

 Key: YARN-2408
 URL: https://issues.apache.org/jira/browse/YARN-2408
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: webapp
Reporter: Renan DelValle
  Labels: features
 Attachments: YARN-2408-3.patch


 I’m proposing a new REST API for YARN which exposes a snapshot of the 
 Resource Requests that exist inside of the Scheduler. My motivation behind 
 this new feature is to allow external software to monitor the amount of 
 resources being requested, in order to gain more insight into cluster usage 
 than is already provided. The API can also be used by external software 
 to detect a starved application and alert the appropriate users and/or sys 
 admin so that the problem may be remedied.
 Here is the proposed API:
 {code:xml}
 <resourceRequests>
   <MB>96256</MB>
   <VCores>94</VCores>
   <appMaster>
     <applicationId>application_</applicationId>
     <applicationAttemptId>appattempt_</applicationAttemptId>
     <queueName>default</queueName>
     <totalPendingMB>96256</totalPendingMB>
     <totalPendingVCores>94</totalPendingVCores>
     <numResourceRequests>3</numResourceRequests>
     <resourceRequests>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>/default-rack</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>*</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
       <request>
         <MB>1024</MB>
         <VCores>1</VCores>
         <resourceName>master</resourceName>
         <numContainers>94</numContainers>
         <relaxLocality>true</relaxLocality>
         <priority>20</priority>
       </request>
     </resourceRequests>
   </appMaster>
 </resourceRequests>
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: (was: YARN-2360-v5.patch)

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, 
 YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: Screen_Shot_v5.png

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, 
 YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2360:
--

Attachment: YARN-2360-v5.patch

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, 
 YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity

2014-08-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created YARN-2445:


 Summary: ATS does not reflect changes to uploaded TimelineEntity
 Key: YARN-2445
 URL: https://issues.apache.org/jira/browse/YARN-2445
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Marcelo Vanzin
Priority: Minor
 Attachments: ats2.java

If you make a change to the TimelineEntity and send it to the ATS, that change 
is not reflected in the stored data.

For example, in the attached code, an existing primary filter is removed and a 
new one is added. When you retrieve the entity from the ATS, it only contains 
the old value:

{noformat}
{"entities":[{"events":[],"entitytype":"test","entity":"testid-ad5380c0-090e-4982-8da8-21676fe4e9f4","starttime":1408746026958,"relatedentities":{},"primaryfilters":{"oldprop":["val"]},"otherinfo":{}}]}
{noformat}

Perhaps this is what the design wanted, but from an API user standpoint, it's 
really confusing, since to upload events I have to upload the entity itself, 
and the changes are not reflected.
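
The attached ats2.java is not reproduced here; a minimal sketch of the sequence,
assuming the standard TimelineClient API, looks like this:

{code:java}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AtsUpdateRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      String id = "testid-" + java.util.UUID.randomUUID();
      long start = System.currentTimeMillis();

      // First upload: the entity carries primary filter oldprop=val.
      TimelineEntity first = new TimelineEntity();
      first.setEntityType("test");
      first.setEntityId(id);
      first.setStartTime(start);
      first.addPrimaryFilter("oldprop", "val");
      client.putEntities(first);

      // Second upload of the same entity id: oldprop dropped, newprop added.
      // On retrieval the stored entity still only shows oldprop.
      TimelineEntity second = new TimelineEntity();
      second.setEntityType("test");
      second.setEntityId(id);
      second.setStartTime(start);
      second.addPrimaryFilter("newprop", "val");
      client.putEntities(second);
    } finally {
      client.stop();
    }
  }
}
{code}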



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity

2014-08-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated YARN-2445:
-

Attachment: ats2.java

 ATS does not reflect changes to uploaded TimelineEntity
 ---

 Key: YARN-2445
 URL: https://issues.apache.org/jira/browse/YARN-2445
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Marcelo Vanzin
Priority: Minor
 Attachments: ats2.java


 If you make a change to the TimelineEntity and send it to the ATS, that 
 change is not reflected in the stored data.
 For example, in the attached code, an existing primary filter is removed and 
 a new one is added. When you retrieve the entity from the ATS, it only 
 contains the old value:
 {noformat}
 {"entities":[{"events":[],"entitytype":"test","entity":"testid-ad5380c0-090e-4982-8da8-21676fe4e9f4","starttime":1408746026958,"relatedentities":{},"primaryfilters":{"oldprop":["val"]},"otherinfo":{}}]}
 {noformat}
 Perhaps this is what the design wanted, but from an API user standpoint, it's 
 really confusing, since to upload events I have to upload the entity itself, 
 and the changes are not reflected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-321) Generic application history service

2014-08-22 Thread Yu Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Gao reassigned YARN-321:
---

Assignee: Yu Gao

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
Assignee: Yu Gao
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is the 
 number of application types and V is the number of application versions) trusted 
 servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107721#comment-14107721
 ] 

Hadoop QA commented on YARN-1458:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663743/YARN-1458.003.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerFairShare

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4698//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4698//console

This message is automatically generated.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster 
 for days to reproduce it. The output of the jstack command on the resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 

[jira] [Updated] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue

2014-08-22 Thread Wei Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Yan updated YARN-2395:
--

Attachment: YARN-2395-2.patch

Uploaded a new patch which addresses Karthik's latest comments, and also adds a 
per-job preemption timeout configuration for min share.

 FairScheduler: Preemption timeout should be configurable per queue
 --

 Key: YARN-2395
 URL: https://issues.apache.org/jira/browse/YARN-2395
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2395-1.patch, YARN-2395-2.patch


 Currently in the fair scheduler, the preemption logic considers fair share 
 starvation only at the leaf queue level. This jira is created to implement it at 
 the parent queue level as well.
 It involves:
 1. Making the check for fair share starvation and the amount of resource to 
 preempt recursive, so that they traverse the queue hierarchy from root to 
 leaf.
 2. Currently fairSharePreemptionTimeout is a global config. We could make it 
 configurable on a per-queue basis, so that we can specify different timeouts 
 for parent queues.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-321) Generic application history service

2014-08-22 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-321:
-

Assignee: (was: Yu Gao)

 Generic application history service
 ---

 Key: YARN-321
 URL: https://issues.apache.org/jira/browse/YARN-321
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Luke Lu
 Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, 
 Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java


 The mapreduce job history server currently needs to be deployed as a trusted 
 server in sync with the mapreduce runtime. Every new application would need a 
 similar application history server. Having to deploy O(T*V) (where T is the 
 number of application types and V is the number of application versions) trusted 
 servers is clearly not scalable.
 Job history storage handling itself is pretty generic: move the logs and 
 history data into a particular directory for later serving. Job history data 
 is already stored as json (or binary avro). I propose that we create only one 
 trusted application history server, which can have a generic UI (display json 
 as a tree of strings) as well. Specific application/version can deploy 
 untrusted webapps (a la AMs) to query the application history server and 
 interpret the json for its specific UI and/or analytics.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107754#comment-14107754
 ] 

Hadoop QA commented on YARN-2360:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663761/YARN-2360-v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4699//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4699//console

This message is automatically generated.

 Fair Scheduler : Display dynamic fair share for queues on the scheduler page
 

 Key: YARN-2360
 URL: https://issues.apache.org/jira/browse/YARN-2360
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Ashwin Shankar
 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, 
 Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, 
 YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch


 Based on the discussion in YARN-2026,  we'd like to display dynamic fair 
 share for queues on the scheduler page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time

2014-08-22 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-1326:
-

Attachment: YARN-1326.4.patch

Fixed failures of TestRMWebServices.

 RM should log using RMStore at startup time
 ---

 Key: YARN-1326
 URL: https://issues.apache.org/jira/browse/YARN-1326
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.5.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, 
 YARN-1326.4.patch, demo.png

   Original Estimate: 3h
  Remaining Estimate: 3h

 Currently there is no way to know which RMStore the RM uses. It would be useful 
 to log this information at the RM's startup time.
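 For illustration, the log line itself could be as simple as the sketch below
 (commons-logging style, not the actual patch):
 {code:java}
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 
 public class RMStoreStartupLog {
   private static final Log LOG = LogFactory.getLog(RMStoreStartupLog.class);
 
   // Sketch: log which RMStateStore implementation the RM ended up with.
   static void logStateStore(Object rmStore) {
     LOG.info("Using RMStateStore implementation: " + rmStore.getClass().getName());
   }
 
   public static void main(String[] args) {
     logStateStore(new Object()); // stand-in for the configured store instance
   }
 }
 {code}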



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107785#comment-14107785
 ] 

Hadoop QA commented on YARN-2395:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663799/YARN-2395-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4700//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4700//console

This message is automatically generated.

 FairScheduler: Preemption timeout should be configurable per queue
 --

 Key: YARN-2395
 URL: https://issues.apache.org/jira/browse/YARN-2395
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: fairscheduler
Reporter: Ashwin Shankar
Assignee: Wei Yan
 Attachments: YARN-2395-1.patch, YARN-2395-2.patch


 Currently in the fair scheduler, the preemption logic considers fair share 
 starvation only at the leaf queue level. This jira is created to implement it 
 at the parent queue level as well.
 It involves:
 1. Making the checks for fair share starvation and the amount of resource to 
 preempt recursive, so that they traverse the queue hierarchy from root to 
 leaf (see the sketch after this list).
 2. Currently fairSharePreemptionTimeout is a global config. We could make it 
 configurable on a per-queue basis, so that we can specify different timeouts 
 for parent queues.
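 A minimal sketch of point 1, using a hypothetical Queue interface rather than 
 the real FSQueue classes, just to show the shape of a root-to-leaf recursive 
 starvation check:
 {code}
import java.util.List;

// Hypothetical queue abstraction for illustration only.
interface Queue {
  boolean isStarvedForFairShare();
  List<Queue> getChildQueues();   // empty for leaf queues
}

public class StarvationWalk {
  // Returns true if this queue or any queue below it is starved for its
  // fair share, visiting the hierarchy from the root down to the leaves.
  static boolean anyStarved(Queue queue) {
    if (queue.isStarvedForFairShare()) {
      return true;
    }
    for (Queue child : queue.getChildQueues()) {
      if (anyStarved(child)) {
        return true;
      }
    }
    return false;
  }
}
 {code}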



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-1458:


Attachment: YARN-1458.004.patch

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107807#comment-14107807
 ] 

zhihai xu commented on YARN-1458:
-

I uploaded a new patch, YARN-1458.004.patch, to fix the test failure.
The test failure is the following:
The parent queue root.parentB has a steady fair share of one vcore,
but root.parentB has two child queues, root.parentB.childB1 and 
root.parentB.childB2, and we cannot split one vcore between two child queues.
The new patch calculates conservatively and assigns 0 vcores to each child 
queue.
The old code assigned 1 vcore to each child queue, which would exceed the total 
resource limit.
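A minimal sketch of the rounding behavior described above (plain integer 
arithmetic, not the actual ComputeFairShares code):
{code}
public class ConservativeShareSplit {
  // Integer division rounds toward zero, so the children never sum to more
  // than the parent's steady fair share.
  static int childShare(int parentShareVcores, int numChildren) {
    return parentShareVcores / numChildren;
  }

  public static void main(String[] args) {
    // One vcore split between two child queues: each gets 0 under the new
    // conservative calculation, instead of 1 each (2 total) under the old code.
    System.out.println(childShare(1, 2));
  }
}
{code}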

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107806#comment-14107806
 ] 

Hadoop QA commented on YARN-1326:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663809/YARN-1326.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4701//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4701//console

This message is automatically generated.

 RM should log using RMStore at startup time
 ---

 Key: YARN-1326
 URL: https://issues.apache.org/jira/browse/YARN-1326
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.5.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, 
 YARN-1326.4.patch, demo.png

   Original Estimate: 3h
  Remaining Estimate: 3h

 Currently there is no way to know which RMStore the RM uses. It would be 
 useful to log this information at the RM's startup time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity

2014-08-22 Thread Billie Rinaldi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107808#comment-14107808
 ] 

Billie Rinaldi commented on YARN-2445:
--

ATS is designed to support only aggregation. In other words, each new primary 
filter or related entity is added to what is already stored for the entity; you 
cannot remove previously put information. In this example, I would expect both 
oldprop and newprop to appear.
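A rough sketch of that aggregation behavior (the entity id and filter names 
here are made up for illustration; only the TimelineEntity/TimelineClient calls 
are the real client API):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AtsAggregationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();
    try {
      // First put: the entity carries only "oldprop".
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("test");
      entity.setEntityId("testid-example");
      entity.setStartTime(System.currentTimeMillis());
      entity.addPrimaryFilter("oldprop", "val");
      client.putEntities(entity);

      // Second put: same entity id, only "newprop" attached this time.
      TimelineEntity update = new TimelineEntity();
      update.setEntityType("test");
      update.setEntityId("testid-example");
      update.addPrimaryFilter("newprop", "val");
      client.putEntities(update);

      // On retrieval, the stored entity is expected to carry BOTH oldprop and
      // newprop; filters from earlier puts cannot be removed.
    } finally {
      client.stop();
    }
  }
}
{code}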

 ATS does not reflect changes to uploaded TimelineEntity
 ---

 Key: YARN-2445
 URL: https://issues.apache.org/jira/browse/YARN-2445
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: Marcelo Vanzin
Priority: Minor
 Attachments: ats2.java


 If you make a change to the TimelineEntity and send it to the ATS, that 
 change is not reflected in the stored data.
 For example, in the attached code, an existing primary filter is removed and 
 a new one is added. When you retrieve the entity from the ATS, it only 
 contains the old value:
 {noformat}
 {"entities":[{"events":[],"entitytype":"test","entity":"testid-ad5380c0-090e-4982-8da8-21676fe4e9f4","starttime":1408746026958,"relatedentities":{},"primaryfilters":{"oldprop":["val"]},"otherinfo":{}}]}
 {noformat}
 Perhaps this is what the design wanted, but from an API user standpoint, it's 
 really confusing, since to upload events I have to upload the entity itself, 
 and the changes are not reflected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107822#comment-14107822
 ] 

Hadoop QA commented on YARN-1458:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663814/YARN-1458.004.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4702//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4702//console

This message is automatically generated.

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}

[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time

2014-08-22 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107823#comment-14107823
 ] 

Tsuyoshi OZAWA commented on YARN-1326:
--

A patch is ready for review. [~kkambatl], could you check it?

 RM should log using RMStore at startup time
 ---

 Key: YARN-1326
 URL: https://issues.apache.org/jira/browse/YARN-1326
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.5.0
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, 
 YARN-1326.4.patch, demo.png

   Original Estimate: 3h
  Remaining Estimate: 3h

 Currently there is no way to know which RMStore the RM uses. It would be 
 useful to log this information at the RM's startup time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-22 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated YARN-2035:
--

Attachment: YARN-2035-v3.patch

Addressed the failing tests from the last patch.

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-08-22 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107851#comment-14107851
 ] 

zhihai xu commented on YARN-1458:
-

The test failure is not related to my change; TestAMRestart passes in my local 
build.


{noformat}
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 89.639 sec - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Results :

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
{noformat}

 In Fair Scheduler, size based weight can cause update thread to hold lock 
 indefinitely
 --

 Key: YARN-1458
 URL: https://issues.apache.org/jira/browse/YARN-1458
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.2.0
 Environment: Centos 2.6.18-238.19.1.el5 X86_64
 hadoop2.2.0
Reporter: qingwu.fu
Assignee: zhihai xu
  Labels: patch
 Fix For: 2.2.1

 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
 YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch

   Original Estimate: 408h
  Remaining Estimate: 408h

 The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when 
 clients submit lots of jobs; it is not easy to reproduce. We ran the test 
 cluster for days to reproduce it. The output of the jstack command on the 
 resourcemanager pid:
 {code}
  ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
 waiting for monitor entry [0x43aa9000]
java.lang.Thread.State: BLOCKED (on object monitor)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
 - waiting to lock 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
 at java.lang.Thread.run(Thread.java:744)
 ……
 FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
 runnable [0x433a2000]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
 - locked 0x00070026b6e0 (a 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
 at 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode

2014-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107852#comment-14107852
 ] 

Hadoop QA commented on YARN-2035:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12663820/YARN-2035-v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4703//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4703//console

This message is automatically generated.

 FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
 ---

 Key: YARN-2035
 URL: https://issues.apache.org/jira/browse/YARN-2035
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.4.1
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles
 Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch


 Small bug that prevents ResourceManager and ApplicationHistoryService from 
 coming up while Namenode is in safemode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


  1   2   >