[jira] [Commented] (YARN-1801) NPE in public localizer
[ https://issues.apache.org/jira/browse/YARN-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106555#comment-14106555 ] Hong Zhiguo commented on YARN-1801: --- I think YARN-1575 already fixed this NPE. We could mark it as a duplicate. NPE in public localizer --- Key: YARN-1801 URL: https://issues.apache.org/jira/browse/YARN-1801 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Hong Zhiguo Priority: Critical Attachments: YARN-1801.patch While investigating YARN-1800 I found this in the NM logs, which caused the public localizer to shut down: {noformat} 2014-01-23 01:26:38,655 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(651)) - Downloading public rsrc:{ hdfs://colo-2:8020/user/fertrist/oozie-oozi/601-140114233013619-oozie-oozi-W/aggregator--map-reduce/map-reduce-launcher.jar, 1390440382009, FILE, null } 2014-01-23 01:26:38,656 FATAL localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(726)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:712) 2014-01-23 01:26:38,656 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(728)) - Public cache exiting {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
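For context, one common way a completion loop like PublicLocalizer.run throws an NPE is dereferencing the result of a pending-map lookup without a null check. The sketch below illustrates that pattern generically; it is not a claim about the actual root cause or the YARN-1575 change, and all names in it are hypothetical.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

// Generic illustration only: a completion loop that looks up bookkeeping for
// a finished download must not dereference the lookup result blindly.
public class PendingDownloads<R> {
  private final Map<Future<?>, R> pending = new ConcurrentHashMap<>();

  public void register(Future<?> download, R resource) {
    pending.put(download, resource);
  }

  public void onCompleted(Future<?> download) {
    R resource = pending.remove(download);
    if (resource == null) {
      // Without a guard like this, an entry removed or never recorded by a
      // racing path yields null, and the dereference below would throw the
      // NullPointerException that shuts the thread down.
      return;
    }
    // ... notify trackers that 'resource' finished localizing ...
  }
}
{code}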
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106583#comment-14106583 ] mai shurong commented on YARN-1458: --- George Wong, you can try our YARN-1458.patch; it is easy to understand, but the issue is still unresolved. You can consult the corresponding code in later Hadoop versions such as 2.2.1, 2.3.x, 2.4.x. zhihai xu, it seems your thinking is more rigorous than our patch. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at 
java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
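The failure mode is easiest to see in a simplified model of the share computation. The sketch below is hypothetical (names and shapes are mine, not the ComputeFairShares code): with size-based weight, idle applications can end up with weight 0, and if every weight is 0 the ratio search can never cover the queue's resources, so the loop spins while the update thread holds the scheduler lock. The patches attached here aim to handle that zero-weight case (see the testFairShareWithZeroWeight comment further down).
{code}
// Simplified, hypothetical model of the fair-share ratio search (names and
// shapes are mine, not the ComputeFairShares code).
public final class FairShareSearch {

  // Resource that would be consumed if every schedulable got weight * ratio,
  // capped at its maximum share.
  static long resourceUsedWithRatio(double ratio, double[] weights, long cap) {
    long total = 0;
    for (double w : weights) {
      total += Math.min((long) (w * ratio), cap);
    }
    return total;
  }

  static double findRatio(double[] weights, long totalResource, long cap) {
    double weightSum = 0;
    for (double w : weights) {
      weightSum += w;
    }
    if (weightSum <= 0) {
      // One possible guard: with size-based weight, idle apps can all end up
      // with weight 0; without this early exit the loop below never
      // terminates because the weighted usage stays at 0 forever, and the
      // caller (the update thread) spins while holding the scheduler lock.
      return 0;
    }
    double ratio = 1.0;
    while (resourceUsedWithRatio(ratio, weights, cap) < totalResource) {
      ratio *= 2.0; // grow until the weighted shares cover the resource
    }
    // ... binary search between ratio / 2 and ratio for the exact value ...
    return ratio;
  }
}
{code}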
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.002.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2439) Move winutils task related functionality under a yarn-servers-nodemanager project
Remus Rusanu created YARN-2439: -- Summary: Move winutils task related functionality under a yarn-servers-nodemanager project Key: YARN-2439 URL: https://issues.apache.org/jira/browse/YARN-2439 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Remus Rusanu Priority: Minor Currently winutils is built as part of hadoop-common. But winutils has features that relate strictly to the nodemanager, namely `winutils task`. Being built under hadoop-common means that any mvn/pom compile configuration has to be done in the hadoop-common project. For example, I wanted to add a configuration file similar to the container-executor cfg, which gets the .cfg location from the ${container-executor.conf.dir} in its parent pom. But for winutils I would have to add the config to the hadoop-common pom, despite it being specific to nodemanager use. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106605#comment-14106605 ] Remus Rusanu commented on YARN-2198: I have created YARN-2439 to track the separation of winutils task functionality into a nodemanager-related project, away from hadoop-common. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.separation.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106606#comment-14106606 ] zhihai xu commented on YARN-1458: - I added a test case testFairShareWithZeroWeight in new patch YARN-1458.002.patch to verify the patch can work with zero weight. Without the patch, testFairShareWithZeroWeight will run forever. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by 
Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106630#comment-14106630 ] Hao Gao commented on YARN-2345: --- yarn node -status nodeid will list the status of a single node. I reused the code there to get the status of all nodes. {code:xml}
Nodes Report :
Node-Id : 192.168.1.6:53239
Rack : /default-rack
Node-State : RUNNING
Node-Http-Address : 192.168.1.6:8042
Last-Health-Update : Fri 22/Aug/14 12:53:38:312PDT
Health-Report :
Containers : 0
Memory-Used : 0MB
Memory-Capacity : 8192MB
CPU-Used : 0 vcores
CPU-Capacity : 8 vcores
{code} Do we need more information? Also, do we need options like -live / -dead? yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
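A minimal client-side sketch of how the same per-node information can be gathered, assuming the Hadoop 2.x YarnClient API; the class name and the output layout are mine, not the attached patch:
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodesReport {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // No filter lists every node; pass NodeState values to narrow it down.
      List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.println("Node-Id : " + node.getNodeId());
        System.out.println("  Node-State : " + node.getNodeState());
        System.out.println("  Containers : " + node.getNumContainers());
        Resource used = node.getUsed();            // may be null if unreported
        Resource capability = node.getCapability();
        if (used != null && capability != null) {
          System.out.println("  Memory-Used : " + used.getMemory() + "MB");
          System.out.println("  Memory-Capacity : " + capability.getMemory() + "MB");
          System.out.println("  CPU-Used : " + used.getVirtualCores() + " vcores");
          System.out.println("  CPU-Capacity : " + capability.getVirtualCores() + " vcores");
        }
      }
    } finally {
      client.stop();
    }
  }
}
{code}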
[jira] [Updated] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Gao updated YARN-2345: -- Attachment: YARN-2345.1.patch Attached the patch. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2437) [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info
[ https://issues.apache.org/jira/browse/YARN-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao Gao updated YARN-2437: -- Assignee: Hao Gao [post-HADOOP-9902] start-yarn.sh/stop-yarn needs to give info - Key: YARN-2437 URL: https://issues.apache.org/jira/browse/YARN-2437 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie With the merger and cleanup of the daemon launch code, yarn-daemons.sh no longer prints Starting information. This should be made more of an analog of start-dfs.sh/stop-dfs.sh. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
Varun Vasudev created YARN-2440: --- Summary: Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106669#comment-14106669 ] Wangda Tan commented on YARN-2345: -- Hi Hao, I think we already have a NodeCLI, which is yarn node -status nodeid as you said. We don't need to add such a method to the RM admin CLI. The RM admin CLI should only implement methods contained in ResourceManagerAdministrationProtocol. I would suggest adding more information when executing yarn node -all -list, like memory-used, CPU-used, etc., just like the RM web UI's nodes page. Thanks, Wangda yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: screenshot-current-implementation.jpg Screenshot with the CPU usage in the current implementation. In my yarn-site.xml, I had set yarn.nodemanager.resource.cpu-vcores to 2. The python script is taking up as many cores as it can. The quota for the yarn group was set to -1. varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat /cgroup/cpu/yarn/cpu.cfs_quota_us -1 Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: apache-yarn-2440.0.patch Attached patch with fix. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106671#comment-14106671 ] Varun Vasudev commented on YARN-2440: - After applying the patch, the quota is set correctly. {noformat} varun@ubuntu:/var/hadoop/hadoop-3.0.0-SNAPSHOT$ cat /cgroup/cpu/yarn/cpu.cfs_quota_us 20 {noformat} Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
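For context, a rough sketch of how a ceiling like this could be derived. This is my own simplification under stated assumptions, not the attached apache-yarn-2440.0.patch: cpu.cfs_quota_us grants the yarn cgroup a slice of CPU time per cpu.cfs_period_us, so confining YARN to its configured vcores amounts to writing roughly period × vcores, or -1 when no cap is needed.
{code}
// Hypothetical helper (not the attached patch): derive a cpu.cfs_quota_us
// value that confines the "yarn" cgroup to the vcores granted to YARN.
public final class CfsLimits {

  /**
   * @param yarnCores cores YARN may use, e.g. the value of
   *                  yarn.nodemanager.resource.cpu-vcores capped at the
   *                  number of physical cores (assumption for illustration)
   * @param nodeCores physical cores on the node
   * @param periodUs  value written to cpu.cfs_period_us
   * @return quota in microseconds, or -1 ("unlimited") when no cap is needed
   */
  static long quotaFor(int yarnCores, int nodeCores, long periodUs) {
    if (yarnCores >= nodeCores) {
      return -1L; // YARN may use every core; leave the cgroup uncapped
    }
    // Grant yarnCores CPUs worth of runtime in every scheduling period.
    return periodUs * yarnCores;
  }
}
{code}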
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106687#comment-14106687 ] Varun Vasudev commented on YARN-810: [~sandyr] [~ywskycn] are you still working on this? If not, I'd like to pick it up. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a Python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit.
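The manual test above boils down to a couple of file writes. The sketch below is illustrative only; the cgroup path and numbers are placeholders, not a proposed YARN API:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Rough Java equivalent of the manual "echo ... > cpu.cfs_quota_us" test
// above; the cgroup path and numbers are placeholders, not a proposed API.
public class HardCpuCap {

  public static void capContainer(String cgroupDir, long quotaUs, long periodUs)
      throws IOException {
    Path dir = Paths.get(cgroupDir);
    // The ceiling is quotaUs of CPU time per periodUs across the group.
    Files.write(dir.resolve("cpu.cfs_period_us"),
        Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
    Files.write(dir.resolve("cpu.cfs_quota_us"),
        Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    // e.g. cap a container at 10% of one core (10,000 us every 100,000 us)
    capContainer("/cgroup/cpu/hadoop-yarn/container_example", 10_000L, 100_000L);
  }
}
{code}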
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106720#comment-14106720 ] Hudson commented on YARN-2436: -- FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/654/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs to duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106717#comment-14106717 ] Hudson commented on YARN-2434: -- FAILURE: Integrated in Hadoop-Yarn-trunk #654 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/654/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2441) NPE in nodemanager after restart
Nishan Shetty created YARN-2441: --- Summary: NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket 
Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2442) ResourceManager JMX UI does not give HA State
Nishan Shetty created YARN-2442: --- Summary: ResourceManager JMX UI does not give HA State Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Trivial ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.2#6252)
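A minimal sketch of what exposing the HA state over JMX could look like, using only the standard platform MBean server; the bean name, attribute, and wiring are assumptions for illustration, not the eventual YARN change.
{code}
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// Illustrative only: a tiny MXBean whose single attribute mirrors the RM's
// current HA state (e.g. INITIALIZING, ACTIVE, STANDBY, STOPPED).
public class HAStateJmx {

  public interface RMHAStateMXBean {
    String getHAState();
  }

  public static class RMHAState implements RMHAStateMXBean {
    private volatile String state = "INITIALIZING";

    public void setState(String newState) {
      state = newState; // the RM would call this on every HA transition
    }

    @Override
    public String getHAState() {
      return state;
    }
  }

  public static RMHAState register() throws Exception {
    RMHAState bean = new RMHAState();
    // The ObjectName below is a made-up example, not an existing RM bean.
    ManagementFactory.getPlatformMBeanServer().registerMBean(bean,
        new ObjectName("Hadoop:service=ResourceManager,name=RMHAState"));
    return bean;
  }
}
{code}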
[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106816#comment-14106816 ] Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:26 PM: - [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. was (Author: aw): [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while the RM doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106816#comment-14106816 ] Allen Wittenauer commented on YARN-2345: [~wangda], this is to bring consistency between HDFS and YARN. hdfs dfsadmin -report has existed for a very long time while the RM doesn't have one. From a user perspective, it's irrelevant what is happening on the inside; YARN just looks odd if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (YARN-2345) yarn rmadmin -report
[ https://issues.apache.org/jira/browse/YARN-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106816#comment-14106816 ] Allen Wittenauer edited comment on YARN-2345 at 8/22/14 1:28 PM: - [~leftnoteasy]], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. was (Author: aw): [~wangda], this is to bring consistency between HDFS and YARN.hdfs dfsadmin -report has existed for a very long time while YARN doesn't have one. From a user perspective, it's irrelevant what is happening on the inside, just that YARN is weird if the equivalent is yarn node -all -list. yarn rmadmin -report Key: YARN-2345 URL: https://issues.apache.org/jira/browse/YARN-2345 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Reporter: Allen Wittenauer Assignee: Hao Gao Labels: newbie Attachments: YARN-2345.1.patch It would be good to have an equivalent of hdfs dfsadmin -report in YARN. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106863#comment-14106863 ] Hudson commented on YARN-2436: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106860#comment-14106860 ] Hudson commented on YARN-2434: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1845 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1845/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106889#comment-14106889 ] Nathan Roberts commented on YARN-2440: -- Thanks Varun for the patch. I'm wondering if it would be possible to make this configurable at the system level and per-app. For example, I'd like an application to be able to specify that it wants to run with strict container limits (to verify SLA's for example), but in general I don't want these limits in place (why not let a container use additional CPU if it's available?). Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106893#comment-14106893 ] Varun Vasudev commented on YARN-2440: - [~nroberts] there's already a ticket for your request - YARN-810. That's next on my todo list. I've left a comment there asking if I can take it over. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106902#comment-14106902 ] Jason Lowe commented on YARN-2441: -- Was this truly running trunk as the Affected Versions field indicates or was this some other version of Hadoop? Also was this a work-preserving NM restart scenario (i.e.: yarn.nodemanager.recovery.enabled=true) or a typical NM startup? NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at 
org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2434) RM should not recover containers from previously failed attempt when AM restart is not enabled
[ https://issues.apache.org/jira/browse/YARN-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106932#comment-14106932 ] Hudson commented on YARN-2434: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/]) YARN-2434. RM should not recover containers from previously failed attempt when AM restart is not enabled. Contributed by Jian He (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619614) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java RM should not recover containers from previously failed attempt when AM restart is not enabled -- Key: YARN-2434 URL: https://issues.apache.org/jira/browse/YARN-2434 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Fix For: 3.0.0, 2.6.0 Attachments: YARN-2434.1.patch If container-preserving AM restart is not enabled and AM failed during RM restart, RM on restart should not recover containers from previously failed attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2436) [post-HADOOP-9902] yarn application help doesn't work
[ https://issues.apache.org/jira/browse/YARN-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106935#comment-14106935 ] Hudson commented on YARN-2436: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1871 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1871/]) YARN-2436. [post-HADOOP-9902] yarn application help doesn't work (aw: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619603) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn [post-HADOOP-9902] yarn application help doesn't work - Key: YARN-2436 URL: https://issues.apache.org/jira/browse/YARN-2436 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Allen Wittenauer Assignee: Allen Wittenauer Labels: newbie Fix For: 3.0.0 Attachments: YARN-2436.patch The previous version of the yarn command plays games with the command stack for some commands. The new code needs duplicate this wackiness. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Summary: FairScheduler: Implement steady fair share (was: Fair Scheduler : Implement steady fair share) FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2442) ResourceManager JMX UI does not give HA State
[ https://issues.apache.org/jira/browse/YARN-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2442: Affects Version/s: (was: 3.0.0) 2.5.0 ResourceManager JMX UI does not give HA State - Key: YARN-2442 URL: https://issues.apache.org/jira/browse/YARN-2442 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Trivial ResourceManager JMX UI can show the haState (INITIALIZING, ACTIVE, STANDBY, STOPPED) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106987#comment-14106987 ] Nishan Shetty commented on YARN-2441: - [~jlowe] Sorry i mentioned the wrong Affected Version. Its branch 2. Work-preserving NM is not enabled, its just plain restart NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at 
org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106990#comment-14106990 ] Karthik Kambatla commented on YARN-2393: One of the reasons we (Sandy and I) wanted to make the fairshare being used for scheduling instantaneous was to address the case where the maxAMResource becomes so small when there are multiple queues that we can't run any applications at all. I think it is better to leave it as is. In case any one runs into (in testing) issues with maxAMResource, we can consider preempting AMs as an alternative. FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Implement steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106991#comment-14106991 ] Karthik Kambatla commented on YARN-2393: Committing this. FairScheduler: Implement steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Summary: FairScheduler: Add the notion of steady fair share (was: FairScheduler: Implement steady fair share) FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2393: --- Issue Type: New Feature (was: Improvement) FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106995#comment-14106995 ] Varun Vasudev commented on YARN-160: [~djp] {quote} Both physical id and core id are not guaranteed to have in /proc/cpuinfo (please see below for my local VM's info). We may use processor number instead in case these ids are 0 (like we did in Windows). Again, this weak my confidence that this automatic way of getting CPU/memory resources should happen by default (not sure if any cross-platform issues). May be a safer way here is to keep previous default behavior (with some static setting) with an extra config to enable this. We can wait this feature to be more stable later to change the default behavior. {noformat} processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 70 model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz stepping: 1 cpu MHz : 2295.265 cache size : 6144 KB fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc up arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi ept vpid fsgsbase smep bogomips: 4590.53 clflush size: 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: {noformat} {quote} In the example you gave, where we have processors listed but no physical id or core id entries, the numProcessors will be set to the number of entries and numCores will be set to 1. From the diff - {noformat} + numCores = 1; {noformat} There is also a test case to ensure this behaviour. In addition, cluster administrators can decide whether the NodeManager should report numProcessors or numCores by toggling yarn.nodemanager.resource.count-logical-processors-as-vcores which by default is true. In the vm example, by default the NodeManager will report vcores as the number of processor entries in /proc/cpuinfo. If yarn.nodemanager.resource.count-logical-processors-as-vcores is set to false, the NodeManager will report vcores as 1(if there are no physical id or core id entries). nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch, apache-yarn-160.1.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
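To make the fallback concrete, here is a minimal sketch of the counting logic Varun describes: count the processor entries, count distinct (physical id, core id) pairs, fall back to a single core when those fields are absent, and pick the vcore count based on the count-logical-processors-as-vcores switch. This is an illustration only, not the code from the apache-yarn-160 patches; the class, field, and method names are invented.
{code}
// Illustration only -- not the actual YARN-160 patch. Names are invented.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class CpuInfoSketch {
  int numProcessors = 0;                      // "processor" entries seen
  Set<String> physicalCores = new HashSet<String>();

  void parse(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line, physicalId = "0";
    while ((line = in.readLine()) != null) {
      if (line.startsWith("processor")) {
        numProcessors++;
      } else if (line.startsWith("physical id")) {
        physicalId = line.substring(line.indexOf(':') + 1).trim();
      } else if (line.startsWith("core id")) {
        physicalCores.add(physicalId + "/"
            + line.substring(line.indexOf(':') + 1).trim());
      }
    }
    in.close();
  }

  // The VM /proc/cpuinfo quoted above has no physical id / core id lines,
  // so the core count falls back to 1, as described in the comment.
  int numCores() {
    return physicalCores.isEmpty() ? 1 : physicalCores.size();
  }

  // Mirrors yarn.nodemanager.resource.count-logical-processors-as-vcores:
  // report logical processors when true, physical cores when false.
  int vcores(boolean countLogicalProcessorsAsVcores) {
    return countLogicalProcessorsAsVcores ? numProcessors : numCores();
  }
}
{code}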
[jira] [Commented] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14106999#comment-14106999 ] Jason Lowe commented on YARN-2441: -- Ah, then this seems like a case where a client (likely an AM) is connecting to the NM before the NM has finished registering with the RM to get the secret keys. Trying to block new container requests at the app level probably isn't going to work in practice because the SASL layer in RPC doesn't let the connection get to the point where the app can try to reject the request. IMHO we should remove the blocking client requests code and instead do a delayed server start, sorta like the delay added by YARN-1337 when NM recovery is enabled. Ideally the RPC layer would support the ability to bind to a server socket but not start accepting requests until later. That would allow us to register with the RM knowing what our client port is but without trying to let clients through that port until we're really ready. Shorter term fix might be to have the secret manager throw an exception that can be retried by clients if the master key isn't set yet. NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty Priority: Minor {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at 
org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22
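A rough sketch of the shorter-term idea Jason mentions above (surface an error the client can retry instead of hitting the NPE), written as a fragment inside NMTokenSecretManagerInNM. This only illustrates the approach, not the committed fix; the currentMasterKey field name and the exact throws clause are assumptions.
{code}
// Sketch only: "throw a retriable exception if the master key isn't set yet".
// Field name and throws clause are assumed for illustration.
@Override
public synchronized byte[] retriableRetrievePassword(NMTokenIdentifier identifier)
    throws InvalidToken, StandbyException, RetriableException, IOException {
  if (currentMasterKey == null) {
    // Before the NM registers with the RM there is no master key; failing
    // with a retriable error lets the AM's RPC client back off and retry
    // instead of the reader thread logging an NPE.
    throw new RetriableException(
        "NMToken master key not yet received from the RM, please retry");
  }
  return retrievePassword(identifier);
}
{code}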
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107000#comment-14107000 ] Wei Yan commented on YARN-810: -- [~vvasudev], thanks for the offer. I'm still working on this. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about
[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107007#comment-14107007 ] Hudson commented on YARN-2393: -- FAILURE: Integrated in Hadoop-trunk-Commit #6097 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6097/]) YARN-2393. FairScheduler: Add the notion of steady fair share. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1619845) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/SchedulingPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerFairShare.java FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107005#comment-14107005 ] Wei Yan commented on YARN-2440: --- [~vvasudev], for general cases, we shouldn't strictly limit the cfs_quota_us. We always want to let co-located containers share the cpu resource in a proportional way rather than strictly following the container_vcores/NM_vcores ratio. We have one runnable patch in YARN-810. I'll check with Sandy about the review. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107012#comment-14107012 ] Varun Vasudev commented on YARN-2440: - [~ywskycn] this patch doesn't limit containers to the container_vcores/NM_vcores ratio. What it does do is limit the overall YARN usage to yarn.nodemanager.resource.cpu-vcores. If you have 4 cores on a machine and set yarn.nodemanager.resource.cpu-vcores to 2, we currently don't restrict the YARN containers to 2 cores. The containers can create threads and use up as many cores as they want, which defeats the purpose of setting yarn.nodemanager.resource.cpu-vcores. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2431) NM restart: cgroup is not removed for reacquired containers
[ https://issues.apache.org/jira/browse/YARN-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107008#comment-14107008 ] Jason Lowe commented on YARN-2431: -- Release audit problems are unrelated, see HDFS-6905. NM restart: cgroup is not removed for reacquired containers --- Key: YARN-2431 URL: https://issues.apache.org/jira/browse/YARN-2431 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2431.patch The cgroup for a reacquired container is not being removed when the container exits. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107025#comment-14107025 ] Wei Yan commented on YARN-2440: --- [~vvasudev], I misunderstood this jira. Will post comment later. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2393) FairScheduler: Add the notion of steady fair share
[ https://issues.apache.org/jira/browse/YARN-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107050#comment-14107050 ] Wei Yan commented on YARN-2393: --- Thanks, [~kasha], [~ashwinshankar77]. Will post a patch for the YARN-2360 for the UI. FairScheduler: Add the notion of steady fair share -- Key: YARN-2393 URL: https://issues.apache.org/jira/browse/YARN-2393 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2393-1.patch, YARN-2393-2.patch, YARN-2393-3.patch, yarn-2393-4.patch Static fair share is a fair share allocation considering all(active/inactive) queues.It would be shown on the UI for better predictability of finish time of applications. We would compute static fair share only when needed, like on queue creation, node added/removed. Please see YARN-2026 for discussions on this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2441) NPE in nodemanager after restart
[ https://issues.apache.org/jira/browse/YARN-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2441: Priority: Major (was: Minor) NPE in nodemanager after restart Key: YARN-2441 URL: https://issues.apache.org/jira/browse/YARN-2441 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: Nishan Shetty {code} 2014-08-22 16:43:19,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting. 2014-08-22 16:43:19,658 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:19,675 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45026: starting 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : host-10-18-40-95:45026 2014-08-22 16:43:20,029 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at /10.18.40.95:45026 2014-08-22 16:43:20,030 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to host-10-18-40-95/10.18.40.95:45026 2014-08-22 16:43:20,073 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue 2014-08-22 16:43:20,098 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 45027 2014-08-22 16:43:20,158 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server 2014-08-22 16:43:20,178 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2014-08-22 16:43:20,192 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 45027: starting 2014-08-22 16:43:20,210 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:43) at org.apache.hadoop.security.token.SecretManager.retriableRetrievePassword(SecretManager.java:91) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.getPassword(SaslRpcServer.java:278) at org.apache.hadoop.security.SaslRpcServer$SaslDigestCallbackHandler.handle(SaslRpcServer.java:305) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:585) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.ipc.Server$Connection.processSaslToken(Server.java:1384) at org.apache.hadoop.ipc.Server$Connection.processSaslMessage(Server.java:1361) at org.apache.hadoop.ipc.Server$Connection.saslProcess(Server.java:1275) at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1238) at org.apache.hadoop.ipc.Server$Connection.processRpcOutOfBandRequest(Server.java:1878) at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1755) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1519) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:750) at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:624) at 
org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:595) 2014-08-22 16:43:20,227 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 45026: readAndProcess from client 10.18.40.84 threw exception [java.lang.NullPointerException] java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM.retrievePassword(NMTokenSecretManagerInNM.java:167) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107057#comment-14107057 ] Jason Lowe commented on YARN-2440: -- I think cfs_quota_us has a maximum value of 1000000, so we may have an issue if vcores > 10. I don't see how this takes into account the mapping of vcores to actual CPUs. It's not safe to assume 1 vcore == 1 physical CPU, as some sites will map multiple vcores to a physical core to allow fractions of a physical CPU to be allocated or to account for varying CPU performance across a heterogeneous cluster. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
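Taking the numbers in the comment above at face value (a 1,000,000 us quota ceiling and the default 100,000 us period), here is a back-of-the-envelope sketch of why a naive quota = vcores * period mapping breaks past 10 vcores, and how a vcore-to-physical-core ratio changes the result. The values are hypothetical, not taken from any attached patch.
{code}
// Back-of-the-envelope sketch, not YARN code. Constants taken from the
// discussion above; the vcore:pcore ratio is a hypothetical example.
public class CfsQuotaSketch {
  static final long MAX_QUOTA_US = 1000000L;  // ceiling mentioned above
  static final long PERIOD_US = 100000L;      // default cfs_period_us

  static long naiveQuota(int vcores) {
    // quota = vcores * period: hits the ceiling once vcores > 10
    return Math.min(vcores * PERIOD_US, MAX_QUOTA_US);
  }

  public static void main(String[] args) {
    int vcores = 16;                          // yarn.nodemanager.resource.cpu-vcores
    int physicalCores = 8;                    // what the node actually has
    double vcoreToPcore = (double) physicalCores / vcores; // 0.5 pcores per vcore

    System.out.println(naiveQuota(vcores));   // 1000000 -> silently capped at 10 CPUs
    System.out.println((long) (vcores * vcoreToPcore * PERIOD_US)); // 800000 -> 8 CPUs
  }
}
{code}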
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107066#comment-14107066 ] zhihai xu commented on YARN-1458: - [~shurong.mai], YARN-1458.patch will cause a regression. It won't work if all the weights and MinShares in the active queues are less than 1, because the type conversion from double to int in computeShare loses precision. {code} private static int computeShare(Schedulable sched, double w2rRatio, ResourceType type) { double share = sched.getWeights().getWeight(type) * w2rRatio; share = Math.max(share, getResourceValue(sched.getMinShare(), type)); share = Math.min(share, getResourceValue(sched.getMaxShare(), type)); return (int) share; } {code} In the above code, the initial value of w2rRatio is 1.0. If the weight and MinShare are less than 1, computeShare will return 0. resourceUsedWithWeightToResourceRatio returns the sum of these computeShare return values (after the precision loss), so it will be zero if all the weights and MinShares in the active queues are less than 1. YARN-1458.patch will then exit the loop early with an rMax value of 1.0, the right variable will be less than rMax (1.0), and all queues' fair shares will be set to 0 in the following code. {code} for (Schedulable sched : schedulables) { setResourceValue(computeShare(sched, right, type), sched.getFairShare(), type); } {code} This is why TestFairScheduler fails at line 1049. testIsStarvedForFairShare configures queueA with weight 0.25, queueB with weight 0.75, and a total node resource of 4 * 1024. It creates two applications: one is assigned to queueA and the other to queueB. After FairScheduler.update() calculates the fair shares, queueA's fair share should be 1 * 1024 and queueB's fair share should be 3 * 1024, but with YARN-1458.patch both queueA's and queueB's fair shares are set to 0. This is because the test has two active queues, queueA and queueB, both weights are less than 1 (0.25 and 0.75), and MinShare (minResources) is not configured for either queue, so both use the default value (0). In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it.
The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at
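A standalone toy version of the truncation zhihai describes, using the weights from testIsStarvedForFairShare (0.25 and 0.75, MinShare 0, initial w2rRatio 1.0). This is a simplified stand-in for ComputeFairShares, only meant to show why the sum comes out as zero.
{code}
// Toy stand-in for the computeShare() snippet quoted above; not the real
// ComputeFairShares class.
public class ShareTruncationDemo {
  static int computeShare(double weight, double minShare, double w2rRatio) {
    double share = Math.max(weight * w2rRatio, minShare);
    return (int) share;                 // (int) 0.25 == 0, (int) 0.75 == 0
  }

  public static void main(String[] args) {
    // queueA weight 0.25, queueB weight 0.75, default MinShare 0, w2rRatio 1.0
    int used = computeShare(0.25, 0, 1.0) + computeShare(0.75, 0, 1.0);
    System.out.println(used);           // 0 -> the patched loop exits early and
                                        // both queues end up with fair share 0
  }
}
{code}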
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107068#comment-14107068 ] Varun Vasudev commented on YARN-2440: - [~jlowe] does it make sense to get the number of physical cores on the machine and derive the vcore to physical cpu ratio? Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107069#comment-14107069 ] Varun Vasudev commented on YARN-2440: - I'll update the patch to limit cfs_quota_us. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107093#comment-14107093 ] Jason Lowe commented on YARN-2440: -- bq. does it make sense to get the number of physical cores on the machine and derive the vcore to physical cpu ratio? Only if the user can specify the multiplier between a vcore and a physical CPU. Not all physical CPUs are created equal, and as I mentioned earlier, some sites will want to allow fractions of a physical CPU to be allocated. Otherwise we're limiting the number of containers to the number of physical cores, and not all tasks require a full core. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2443) Log Handling for Long Running Service
Xuan Gong created YARN-2443: --- Summary: Log Handling for Long Running Service Key: YARN-2443 URL: https://issues.apache.org/jira/browse/YARN-2443 Project: Hadoop YARN Issue Type: Task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1104) NMs to support rolling logs of stdout stderr
[ https://issues.apache.org/jira/browse/YARN-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1104: Parent Issue: YARN-2443 (was: YARN-896) NMs to support rolling logs of stdout stderr -- Key: YARN-1104 URL: https://issues.apache.org/jira/browse/YARN-1104 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.1.0-beta Reporter: Steve Loughran Assignee: Xuan Gong Currently NMs stream the stdout and stderr streams of a container to a file. For longer lived processes those files need to be rotated so that the log doesn't overflow -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107172#comment-14107172 ] Varun Vasudev commented on YARN-810: [~ywskycn] thanks for letting me know! Some comments on your patch - 1. In CgroupsLCEResourcesHandler.java, you set cfs_period_us to nmShares and cfs_quota_us to cpuShares. From the RedHat documentation, cfs_period_us and cfs_quota_us operate on a per-CPU basis: {quote} Note that the quota and period parameters operate on a CPU basis. To allow a process to fully utilize two CPUs, for example, set cpu.cfs_quota_us to 200000 and cpu.cfs_period_us to 100000. {quote} With your current implementation, on a machine with 4 cores (and 4 vcores), a container which requests 2 vcores will have cfs_period_us set to 4096 and cfs_quota_us set to 2048, which will end up limiting it to 50% of one CPU. Is my understanding wrong? 2. This is just nitpicking, but is it possible to change CpuEnforceCeilingEnabled (and its variants) to just CpuCeilingEnabled or CpuCeilingEnforced? Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.
First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quote: {noformat} echo 1000 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box
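To spell out the per-CPU semantics raised in the review comment above: the number of CPUs a cgroup may use is the ratio cfs_quota_us / cfs_period_us, so the quota has to scale with the period if multiple cores are intended. A tiny illustration of that arithmetic (plain Java, not YARN code):
{code}
// Plain arithmetic illustrating the quota/period ratio discussed above.
public class QuotaPeriodRatio {
  public static void main(String[] args) {
    // RedHat doc example: quota 200000 over period 100000 -> 2.0 CPUs.
    System.out.println(200000.0 / 100000.0);
    // Values from the review comment (period 4096, quota 2048) -> 0.5,
    // i.e. half of ONE CPU no matter how many vcores were requested.
    System.out.println(2048.0 / 4096.0);
  }
}
{code}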
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: Screen_Shot_v3.png) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v3.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, YARN-2360-v1.txt, YARN-2360-v2.txt Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v3.patch Update a patch after YARN-2393. The Screen_Shot_v3.png is the fair scheduler web page. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v3.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107221#comment-14107221 ] Wei Yan commented on YARN-810: -- bq. With your current implementation, on a machine with 4 cores(and 4 vcores), a container which requests 2 vcores will have cfs_period_us set to 4096 and cfs_quota_us set to 2048 which will end up limiting it to 50% of one CPU. Is my understanding wrong? Thanks, [~vvasudev]. I mentioned this problem after reading your YARN-2420 patch. I'll double check this problem, and will update the patch. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. 
I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quote: {noformat} echo 1000 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107244#comment-14107244 ] Karthik Kambatla commented on YARN-2360: I would rename the legend to Steady fairshare and Instantaneous fairshare. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107252#comment-14107252 ] Wei Yan commented on YARN-2360: --- Thanks, Karthik. Will update the patch with these changes, and also fix another problem in FairSchedulerQueueInfo. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated YARN-2444: - Attachment: ats.java Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Attachments: ats.java See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2444) Primary filters added after first submission not indexed, cause exceptions in logs.
[ https://issues.apache.org/jira/browse/YARN-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107269#comment-14107269 ] Marcelo Vanzin commented on YARN-2444: -- The following search causes the problem described above: {noformat}/ws/v1/timeline/test?primaryFilter=prop2:val2{noformat} The following one works as expected: {noformat}/ws/v1/timeline/test?primaryFilter=prop1:val1{noformat} Primary filters added after first submission not indexed, cause exceptions in logs. --- Key: YARN-2444 URL: https://issues.apache.org/jira/browse/YARN-2444 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.5.0 Reporter: Marcelo Vanzin Attachments: ats.java See attached code for an example. The code creates an entity with a primary filter, submits it to the ATS. After that, a new primary filter value is added and the entity is resubmitted. At that point two things can be seen: - Searching for the new primary filter value does not return the entity - The following exception shows up in the logs: {noformat} 14/08/22 11:33:42 ERROR webapp.TimelineWebServices: Error when verifying access for user dr.who (auth:SIMPLE) on the events of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } org.apache.hadoop.yarn.exceptions.YarnException: Owner information of the timeline entity { id: testid-48625678-9cbb-4e71-87de-93c50be51d1a, type: test } is corrupted. at org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:67) at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getEntities(TimelineWebServices.java:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
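For reference, a rough approximation of the reproduction steps described in this issue, written against the TimelineClient API. This is not the attached ats.java; the entity id and filter names here are placeholders.
{code}
// Rough approximation of the repro described above; not the attached ats.java.
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrimaryFilterRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();

    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("test");
    entity.setEntityId("testid-placeholder");        // unique id in the real repro
    entity.setStartTime(System.currentTimeMillis());
    entity.addPrimaryFilter("prop1", "val1");
    client.putEntities(entity);                      // first put: prop1 is indexed

    entity.addPrimaryFilter("prop2", "val2");        // filter added after first put
    client.putEntities(entity);                      // second put: querying by
                                                     // prop2:val2 finds nothing and
                                                     // the ACL check logs the error
    client.stop();
  }
}
{code}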
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107275#comment-14107275 ] Varun Vasudev commented on YARN-2440: - It might make things easier to go with [~sandyr]'s idea to add a config which expresses a % of the node's CPU that is used by YARN. [~jlowe] would that address your concerns? Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
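For illustration, a minimal yarn-site.xml sketch of the setting being discussed; the property name below is hypothetical, since no name had been settled on at this point in the discussion:
{code:xml}
<!-- Hypothetical property name: cap YARN containers (enforced via cgroups)
     at 80% of the node's physical CPU. -->
<property>
  <name>yarn.nodemanager.resource.cpu-limit-percent</name>
  <value>80</value>
</property>
{code}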
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v4.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v4.patch Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107331#comment-14107331 ] Jason Lowe commented on YARN-2440: -- Sure for this JIRA we can go with a percent of total CPU to limit YARN. For something like YARN-160 we'd need the user to specify some kind of relationship between vcores and physical cores. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
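To make the distinction concrete with made-up numbers: if a node has 16 physical cores and yarn.nodemanager.resource.cpu-vcores is set to 32, the percentage-based limit discussed here only caps YARN's aggregate CPU usage on the node, whereas YARN-160-style per-container enforcement would also need the implied ratio of 2 vcores per physical core so that, for example, a container requesting 4 vcores is confined to roughly 2 cores' worth of CPU time.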
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107351#comment-14107351 ] Ashwin Shankar commented on YARN-2360: -- [~ywskycn], patch looks good. Should we explain what Instantaneous and Steady fair share mean in the fair scheduler doc, i.e. the apt.vm file, so that users know what these terms mean? I'm also torn on whether we should define these terms on the UI as part of the legend tooltip or in some other way? Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
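As a concrete illustration of the two terms (the numbers are made up): in a cluster with 90 vcores and three equally weighted queues, each queue's steady fair share is 30 vcores regardless of whether it has any applications, while the instantaneous fair share is computed only over queues with running or pending applications, so it rises to 45 vcores per queue when only two of the three queues are active.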
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107379#comment-14107379 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663617/YARN-1458.002.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4695//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: (was: YARN-2408-2.patch) Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: YARN-2408-3.patch Bug fix Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
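For orientation, a hypothetical request against such an endpoint; the path below is made up for illustration, and the real path is whatever the patch registers under the RM web services:
{noformat}
GET http://<rm-host>:8088/ws/v1/cluster/resourcerequests
{noformat}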
[jira] [Commented] (YARN-2440) Cgroups should limit YARN containers to cores allocated in yarn-site.xml
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107429#comment-14107429 ] Hadoop QA commented on YARN-2440: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663704/apache-yarn-2440.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4694//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4694//console This message is automatically generated. Cgroups should limit YARN containers to cores allocated in yarn-site.xml Key: YARN-2440 URL: https://issues.apache.org/jira/browse/YARN-2440 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, screenshot-current-implementation.jpg The current cgroups implementation does not limit YARN containers to the cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.003.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107471#comment-14107471 ] zhihai xu commented on YARN-1458: - I uploaded a new patch YARN-1458.003.patch to resolve merge conflict after rebase to latest code. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107479#comment-14107479 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663715/YARN-2360-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4696//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4696//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v5.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: Screen_Shot_v5.png) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v5.patch A new patch that adds the description to the fair scheduler .apt.vm file and also shows it in the web UI when the mouse hovers over the steady fair share or instantaneous fair share label. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107563#comment-14107563 ] Hadoop QA commented on YARN-2408: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663726/YARN-2408-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4697//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4697//console This message is automatically generated. Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-3.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API:
{code:xml}
<resourceRequests>
  <MB>96256</MB>
  <VCores>94</VCores>
  <appMaster>
    <applicationId>application_</applicationId>
    <applicationAttemptId>appattempt_</applicationAttemptId>
    <queueName>default</queueName>
    <totalPendingMB>96256</totalPendingMB>
    <totalPendingVCores>94</totalPendingVCores>
    <numResourceRequests>3</numResourceRequests>
    <resourceRequests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>/default-rack</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>*</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <resourceName>master</resourceName>
        <numContainers>94</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
      </request>
    </resourceRequests>
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: (was: YARN-2360-v5.patch) Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: Screen_Shot_v5.png Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2360: -- Attachment: YARN-2360-v5.patch Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
Marcelo Vanzin created YARN-2445: Summary: ATS does not reflect changes to uploaded TimelineEntity Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
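For context, a minimal sketch approximating the update attempt described above (this is not the attached ats2.java; the ids and filter names are illustrative and mirror the oldprop value shown in the query output), using the Hadoop 2.5 TimelineClient API:
{code:java}
import java.util.UUID;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AtsUpdateRepro {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();

    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("test");
    entity.setEntityId("testid-" + UUID.randomUUID());
    entity.setStartTime(System.currentTimeMillis());
    entity.addPrimaryFilter("oldprop", "val");
    client.putEntities(entity);

    // Attempted update: drop the old primary filter, add a new one, and re-put.
    // The stored entity still shows only oldprop, as described above.
    entity.getPrimaryFilters().remove("oldprop");
    entity.addPrimaryFilter("newprop", "val");
    client.putEntities(entity);

    client.stop();
  }
}
{code}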
[jira] [Updated] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
[ https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated YARN-2445: - Attachment: ats2.java ATS does not reflect changes to uploaded TimelineEntity --- Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Gao reassigned YARN-321: --- Assignee: Yu Gao Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Assignee: Yu Gao Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107721#comment-14107721 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663743/YARN-1458.003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerFairShare {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4698//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4698//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. 
The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at
[jira] [Updated] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2395: -- Attachment: YARN-2395-2.patch Uploaded a new patch which addresses Karthik's latest comments, and also adds per-job preemption timeout configuration for min share. FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch Currently in fair scheduler, the preemption logic considers fair share starvation only at leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, such that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
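For illustration, a sketch of what a per-queue setting in the fair scheduler allocation file could look like under this proposal; the element name and placement are assumptions here, and the actual syntax is whatever the patch defines:
{code:xml}
<allocations>
  <queue name="parentA">
    <!-- Assumed per-queue override of the fair-share preemption timeout, in seconds. -->
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
    <queue name="childA1"/>
  </queue>
</allocations>
{code}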
[jira] [Updated] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-321: - Assignee: (was: Yu Gao) Generic application history service --- Key: YARN-321 URL: https://issues.apache.org/jira/browse/YARN-321 Project: Hadoop YARN Issue Type: Improvement Reporter: Luke Lu Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java The mapreduce job history server currently needs to be deployed as a trusted server in sync with the mapreduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) (where T is number of type of application, V is number of version of application) trusted servers is clearly not scalable. Job history storage handling itself is pretty generic: move the logs and history data into a particular directory for later serving. Job history data is already stored as json (or binary avro). I propose that we create only one trusted application history server, which can have a generic UI (display json as a tree of strings) as well. Specific application/version can deploy untrusted webapps (a la AMs) to query the application history server and interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107754#comment-14107754 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663761/YARN-2360-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4699//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4699//console This message is automatically generated. Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1326: - Attachment: YARN-1326.4.patch Fixed failures of TestRMWebServices. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It's useful to log this information at the RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
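For illustration, the kind of startup log line this asks for might look like the following sketch (not the actual patch; the helper method and its placement are made up for the example):
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;

public class RMStoreStartupLogSketch {
  private static final Log LOG = LogFactory.getLog(RMStoreStartupLogSketch.class);

  // Illustrative only: log which RMStateStore implementation the RM ended up using.
  static void logStoreClass(RMStateStore store) {
    LOG.info("Using RMStateStore implementation: " + store.getClass().getName());
  }
}
{code}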
[jira] [Commented] (YARN-2395) FairScheduler: Preemption timeout should be configurable per queue
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107785#comment-14107785 ] Hadoop QA commented on YARN-2395: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663799/YARN-2395-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4700//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4700//console This message is automatically generated. FairScheduler: Preemption timeout should be configurable per queue -- Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2395-1.patch, YARN-2395-2.patch Currently in fair scheduler, the preemption logic considers fair share starvation only at leaf queue level. This jira is created to implement it at the parent queue as well. It involves : 1. Making check for fair share starvation and amount of resource to preempt recursive such that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per queue basis,so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.004.patch In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of jstack command on resourcemanager pid: {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107807#comment-14107807 ] zhihai xu commented on YARN-1458: - I uploaded a new patch, YARN-1458.004.patch, to fix the test failure. The test failure is the following: parent queue root.parentB has a steady fair share of one Vcore, but root.parentB has two child queues, root.parentB.childB1 and root.parentB.childB2, and we can't split one Vcore between two child queues. The new patch calculates conservatively and assigns 0 Vcores to each child queue. The old code assigned 1 Vcore to each child queue, which exceeds the total resource limit. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
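The share calculation zhihai xu describes above comes down to how a parent queue's steady fair share is divided among its children when it is not evenly divisible. Below is a minimal, self-contained Java sketch of that arithmetic (not the actual ComputeFairShares code from the patch; the class and method names are made up for illustration): rounding up over-allocates beyond the parent's share, while rounding down, the conservative calculation, stays within it.
{code}
// Sketch of the rounding behaviour discussed above; not taken from the YARN-1458 patch.
public class ConservativeShareSketch {

  // Split parentShare Vcores evenly across numChildren child queues.
  static int perChildShare(int parentShare, int numChildren, boolean roundUp) {
    double exact = (double) parentShare / numChildren;
    return roundUp ? (int) Math.ceil(exact) : (int) Math.floor(exact);
  }

  public static void main(String[] args) {
    // root.parentB has a steady fair share of 1 Vcore and two child queues.
    // Rounding up gives each child 1 Vcore, 2 in total, exceeding the parent's share.
    System.out.println(perChildShare(1, 2, true));  // prints 1
    // Rounding down (the conservative calculation) gives each child 0 Vcores.
    System.out.println(perChildShare(1, 2, false)); // prints 0
  }
}
{code}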
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107806#comment-14107806 ] Hadoop QA commented on YARN-1326: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663809/YARN-1326.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4701//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4701//console This message is automatically generated. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It would be useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
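The request in YARN-1326 is simply to surface which state-store implementation the RM ended up with at startup. A minimal sketch of that kind of log line is below (not the actual YARN-1326 patch; the class and method names are illustrative, and java.util.logging stands in for the RM's real logger):
{code}
import java.util.logging.Logger;

// Illustrative sketch, not the YARN-1326 patch: log the concrete class of the
// state-store object the ResourceManager constructed, so operators can tell
// from the startup log which store is in use.
public class StoreStartupLogSketch {
  private static final Logger LOG = Logger.getLogger(StoreStartupLogSketch.class.getName());

  // In the real RM this would receive the RMStateStore instance built from
  // yarn.resourcemanager.store.class; any object works for the demonstration.
  static void logStoreClass(Object stateStore) {
    LOG.info("Using state store implementation: " + stateStore.getClass().getName());
  }

  public static void main(String[] args) {
    logStoreClass(new java.util.Properties()); // placeholder for the real store object
  }
}
{code}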
[jira] [Commented] (YARN-2445) ATS does not reflect changes to uploaded TimelineEntity
[ https://issues.apache.org/jira/browse/YARN-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107808#comment-14107808 ] Billie Rinaldi commented on YARN-2445: -- ATS is only designed to support aggregation. In other words, each new primary filter or related entity is added to what is already there for the entity. You cannot remove previously put information. In this example, I would expect oldprop and newprop both to appear. ATS does not reflect changes to uploaded TimelineEntity --- Key: YARN-2445 URL: https://issues.apache.org/jira/browse/YARN-2445 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Marcelo Vanzin Priority: Minor Attachments: ats2.java If you make a change to the TimelineEntity and send it to the ATS, that change is not reflected in the stored data. For example, in the attached code, an existing primary filter is removed and a new one is added. When you retrieve the entity from the ATS, it only contains the old value: {noformat} {entities:[{events:[],entitytype:test,entity:testid-ad5380c0-090e-4982-8da8-21676fe4e9f4,starttime:1408746026958,relatedentities:{},primaryfilters:{oldprop:[val]},otherinfo:{}}]} {noformat} Perhaps this is what the design wanted, but from an API user standpoint, it's really confusing, since to upload events I have to upload the entity itself, and the changes are not reflected. -- This message was sent by Atlassian JIRA (v6.2#6252)
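Billie Rinaldi's point above is that a second put of the same entity merges into what the server already stores rather than replacing it. A rough sketch of the client-side flow is below (assuming a reachable timeline server and the Hadoop 2.x timeline client API; the entity id and filter names are illustrative): after the second put, a query for the entity is expected to show both oldprop and newprop.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

// Sketch of the aggregation behaviour described above; assumes a running
// timeline server reachable via the default Configuration.
public class AtsAggregationSketch {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new Configuration());
    client.start();
    try {
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("test");
      entity.setEntityId("testid-example");             // illustrative id
      entity.setStartTime(System.currentTimeMillis());
      entity.addPrimaryFilter("oldprop", "val");
      client.putEntities(entity);                       // first put stores oldprop

      // Dropping oldprop client-side and putting again does NOT remove it on the
      // server; the second put only adds newprop alongside the stored oldprop.
      entity.getPrimaryFilters().clear();
      entity.addPrimaryFilter("newprop", "val");
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}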
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107822#comment-14107822 ] Hadoop QA commented on YARN-1458: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663814/YARN-1458.004.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4702//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4702//console This message is automatically generated. In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107823#comment-14107823 ] Tsuyoshi OZAWA commented on YARN-1326: -- A patch is ready for review. [~kkambatl], could you check it? RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch, YARN-1326.2.patch, YARN-1326.3.patch, YARN-1326.4.patch, demo.png Original Estimate: 3h Remaining Estimate: 3h Currently there is no way to know which RMStore the RM uses. It would be useful to log this information at RM startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2035: -- Attachment: YARN-2035-v3.patch Addressed failing tests with last patch. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)
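For context on the YARN-2035 failure mode above: while the NameNode is in safemode, HDFS write operations issued during service init throw, which can keep the RM and AHS from coming up. Below is a rough sketch of one way to tolerate that at startup (not necessarily what the YARN-2035 patch does; the path handling and retry policy here are assumptions): retry the history-store root-directory creation with a bounded backoff instead of failing on the first exception.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: retry directory creation while the NameNode may still be in safemode.
public class SafemodeTolerantInit {

  static void ensureRootDir(Configuration conf, Path rootDir)
      throws IOException, InterruptedException {
    FileSystem fs = rootDir.getFileSystem(conf);
    int attempts = 0;
    while (true) {
      try {
        fs.mkdirs(rootDir);   // no-op if the directory already exists
        return;
      } catch (IOException e) {
        // Typically a SafeModeException (wrapped in a RemoteException) while the NN is in safemode.
        if (++attempts >= 10) {
          throw e;            // give up after a bounded number of retries
        }
        Thread.sleep(5000L);  // back off before retrying
      }
    }
  }
}
{code}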
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107851#comment-14107851 ] zhihai xu commented on YARN-1458: - The test failure is not related to my change. TestAMRestart passes in my local build: T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 89.639 sec - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Results: Tests run: 5, Failures: 0, Errors: 0, Skipped: 0 In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely -- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Fix For: 2.2.1 Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocked when clients submitted lots of jobs; the issue is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the ResourceManager pid:
{code}
ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at java.lang.Thread.run(Thread.java:744)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2035) FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode
[ https://issues.apache.org/jira/browse/YARN-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14107852#comment-14107852 ] Hadoop QA commented on YARN-2035: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663820/YARN-2035-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4703//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4703//console This message is automatically generated. FileSystemApplicationHistoryStore blocks RM and AHS while NN is in safemode --- Key: YARN-2035 URL: https://issues.apache.org/jira/browse/YARN-2035 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2035-v2.patch, YARN-2035-v3.patch, YARN-2035.patch Small bug that prevents ResourceManager and ApplicationHistoryService from coming up while Namenode is in safemode. -- This message was sent by Atlassian JIRA (v6.2#6252)