[jira] [Created] (YARN-2304) TestRMWebServices* fails intermittently
Tsuyoshi OZAWA created YARN-2304: Summary: TestRMWebServices* fails intermittently Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2304) TestRMWebServices* fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2304: - Attachment: test-failure-log-RMWeb.txt TestRMWebServices* fails intermittently --- Key: YARN-2304 URL: https://issues.apache.org/jira/browse/YARN-2304 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Attachments: test-failure-log-RMWeb.txt The test fails intermittently because of a bind exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-1050: -- Attachment: YARN-1050-3.patch Updated to remove whitespace, but the missing '[' is not fixed yet. The JSON response example was generated automatically, e.g. wget -O - http://RM:8088/ws/v1/cluster/scheduler | python -m json.tool. I think the missing '[' is a bug (or is it the spec?), so we should discuss it in a separate JIRA ticket. How about it, [~ajisakaa]? Document the Fair Scheduler REST API Key: YARN-1050 URL: https://issues.apache.org/jira/browse/YARN-1050 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sandy Ryza Assignee: Kenji Kikushima Attachments: YARN-1050-2.patch, YARN-1050-3.patch, YARN-1050.patch The documentation should be placed here along with the Capacity Scheduler documentation: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1050) Document the Fair Scheduler REST API
[ https://issues.apache.org/jira/browse/YARN-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064682#comment-14064682 ] Hadoop QA commented on YARN-1050: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656239/YARN-1050-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4341//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4341//console This message is automatically generated. Document the Fair Scheduler REST API Key: YARN-1050 URL: https://issues.apache.org/jira/browse/YARN-1050 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Sandy Ryza Assignee: Kenji Kikushima Attachments: YARN-1050-2.patch, YARN-1050-3.patch, YARN-1050.patch The documentation should be placed here along with the Capacity Scheduler documentation: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
J.Andreina created YARN-2305: Summary: When a container is in reserved state then total cluster memory is displayed wrongly. Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina ENV Details: = 3 queues: a(50%), b(25%), c(25%) --- All max utilization is set to 100. 2-node cluster with total memory of 16GB. TestSteps: = Execute the following 3 jobs with different memory configurations for the Map, Reduce and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = When 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while the actual total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2299) inconsistency at identifying node
[ https://issues.apache.org/jira/browse/YARN-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064711#comment-14064711 ] Hong Zhiguo commented on YARN-2299: --- Or make use of the existing config property yarn.scheduler.include-port-in-node-name when differentiating nodes. inconsistency at identifying node - Key: YARN-2299 URL: https://issues.apache.org/jira/browse/YARN-2299 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Critical If the port of yarn.nodemanager.address is not specified on the NM, the NM will choose a random port. If the NM dies ungracefully (OOM kill, kill -9, or OS restart) and is then restarted within yarn.nm.liveness-monitor.expiry-interval-ms, host:port1 and host:port2 will both be present in Active Nodes on the WebUI for a while, and after host:port1 expires, we get host:port1 in Lost Nodes and host:port2 in Active Nodes. If the NM dies ungracefully again, we get only host:port1 in Lost Nodes; host:port2 is neither in Active Nodes nor in Lost Nodes. In another case, two NMs are running on the same host (miniYarnCluster or other test purposes); if both of them are lost, we get only one Lost Node in the WebUI. In both cases, the sum of Active Nodes and Lost Nodes is not the number of nodes we expect. The root cause is an inconsistency in how we decide whether two nodes are identical. When we manage active nodes (RMContextImpl.nodes), we use NodeId, which contains the port; two nodes with the same host but different ports are considered different nodes. But when we manage inactive nodes (RMContextImpl.inactiveNodes), we only use the host; two nodes with the same host but different ports are considered identical. To fix the inconsistency, we should differentiate the 2 cases below and be consistent for both of them: - intentionally running multiple NMs per host - NM instances running one after another on the same host Two possible solutions: 1) Introduce a boolean config like one-node-per-host (default true), and use the host to differentiate nodes on the RM if it's true. 2) Make it mandatory to have a valid port in the yarn.nodemanager.address config. In this situation, NM instances running one after another on the same host will have the same NodeId, while intentional multiple NMs per host will have different NodeIds. Personally I prefer option 1 because it's easier for users. -- This message was sent by Atlassian JIRA (v6.2#6252)
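A minimal sketch of what option 1 could look like on the RM side (the config flag and helper below are illustrative assumptions, not the actual change): derive the key used for both the active map (RMContextImpl.nodes) and the inactive map (RMContextImpl.inactiveNodes) by one rule, either host-only or host:port, so a restarted NM on the same host is treated consistently in both places.
{code}
import org.apache.hadoop.yarn.api.records.NodeId;

// Sketch only: one consistent identity rule for active and inactive node maps.
// "oneNodePerHost" stands for the hypothetical one-node-per-host setting.
public final class NodeKey {
  public static String of(NodeId nodeId, boolean oneNodePerHost) {
    // host-only: a restarted NM replaces the old entry;
    // host:port: two NMs on one host stay distinct.
    return oneNodePerHost
        ? nodeId.getHost()
        : nodeId.getHost() + ":" + nodeId.getPort();
  }

  private NodeKey() {
  }
}
{code}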
[jira] [Assigned] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo reassigned YARN-2305: - Assignee: Hong Zhiguo When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064732#comment-14064732 ] Hong Zhiguo commented on YARN-2305: --- Are you using the fair scheduler? If yes, then I think it's due to the same reason as YARN-2306. When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina ENV Details: = 3 queues: a(50%), b(25%), c(25%) --- All max utilization is set to 100. 2-node cluster with total memory of 16GB. TestSteps: = Execute the following 3 jobs with different memory configurations for the Map, Reduce and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = When 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while the actual total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J.Andreina updated YARN-2305: - Attachment: Capture.jpg When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064735#comment-14064735 ] J.Andreina commented on YARN-2305: -- I am using the Capacity Scheduler. When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo Attachments: Capture.jpg ENV Details: = 3 queues: a(50%), b(25%), c(25%) --- All max utilization is set to 100. 2-node cluster with total memory of 16GB. TestSteps: = Execute the following 3 jobs with different memory configurations for the Map, Reduce and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = When 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while the actual total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064736#comment-14064736 ] Sunil G commented on YARN-2305: --- No, it's the Capacity Scheduler. I could collect logs from Andreina for the same scenario. I will take over and analyze it for the Capacity Scheduler.
{noformat}
2014-07-15 16:56:50,720 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1405414066690_0023_01_000129 Container Transitioned from NEW to RESERVED
2014-07-15 16:56:50,720 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application attempt=appattempt_1405414066690_0023_01 resource=memory:2048, vCores:1 queue=a: capacity=0.5, absoluteCapacity=0.5, usedResources=memory:7168, vCores:4, usedCapacity=0.875, absoluteUsedCapacity=0.4375, numApps=1, numContainers=4 node=host: host-10-18-40-14:45026 #containers=4 available=1024 used=7168 clusterResource=memory:16384, vCores:16
2014-07-15 16:56:50,720 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.a stats: a: capacity=0.5, absoluteCapacity=0.5, usedResources=memory:9216, vCores:5, usedCapacity=1.125, absoluteUsedCapacity=0.5625, numApps=1, numContainers=5
2014-07-15 16:56:50,720 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0625 absoluteUsedCapacity=1.0625 used=memory:17408, vCores:10 cluster=memory:16384, vCores:16
{noformat}
When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo Attachments: Capture.jpg ENV Details: = 3 queues: a(50%), b(25%), c(25%) --- All max utilization is set to 100. 2-node cluster with total memory of 16GB. TestSteps: = Execute the following 3 jobs with different memory configurations for the Map, Reduce and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = When 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while the actual total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064737#comment-14064737 ] Sunil G commented on YARN-2305: --- Hi [~zhiguohong], Could I take over the issue.. When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-2306: -- Summary: leak of reservation metrics (fair scheduler) (was: leak of reservation metrics) leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and the wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.2#6252)
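A minimal sketch of the kind of cleanup the description implies for the fair scheduler, assuming the removal path walks the attempt's outstanding reservations; the helper method below is illustrative and not taken from YARN-2306.patch.
{code}
// Sketch only: when an app attempt (or a node) is removed, hand its reserved
// resources back to the queue metrics, mirroring what already happens when a
// reservation is fulfilled or explicitly cancelled.
private void releaseReservationMetrics(FSQueue queue,
    SchedulerApplicationAttempt attempt) {
  for (RMContainer reserved : attempt.getReservedContainers()) {
    queue.getMetrics().unreserveResource(
        attempt.getUser(), reserved.getReservedResource());
  }
}
{code}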
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064738#comment-14064738 ] Hong Zhiguo commented on YARN-2305: --- OK When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Hong Zhiguo Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2305: - Assignee: Sunil G (was: Hong Zhiguo) When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2307) Capacity scheduler user only ADMINISTER_QUEUE also can submit app
tangjunjie created YARN-2307: Summary: Capacity scheduler user only ADMINISTER_QUEUE also can submit app Key: YARN-2307 URL: https://issues.apache.org/jira/browse/YARN-2307 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Environment: hadoop 2.3.0 centos6.5 jdk1.7 Reporter: tangjunjie Priority: Minor
Queue acls for user : root
Queue  Operations
=====================
root
default
china  ADMINISTER_QUEUE
unfunded
User root only has ADMINISTER_QUEUE, but user root can still submit apps to the china queue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064749#comment-14064749 ] Wangda Tan commented on YARN-2305: -- Hi [~sunilg], Thanks for taking this issue. I think there are two issues in your screenshot: 1) Root queue usage above 100%: It is possible that a queue's used resource is larger than its guaranteed resource because of container reservation. We may need to show reserved resource and used resource separately in our web UI. I encountered a similar problem in YARN-2285 too. 2) Total cluster memory shown on the web UI is different from CapacityScheduler.clusterResource: This seems a new issue to me; the memory shown on the web UI is usedMemory+availableMemory of the root queue. I feel like CSQueueUtils.updateQueueStatistics has some issues when we reserve a container in LeafQueue. Hope to get more thoughts from your side. Thanks, Wangda When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues: a(50%), b(25%), c(25%) --- All max utilization is set to 100. 2-node cluster with total memory of 16GB. TestSteps: = Execute the following 3 jobs with different memory configurations for the Map, Reduce and AM tasks: ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = When 2GB of memory is in the reserved state, total memory is shown as 15GB and used as 15GB (while the actual total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
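To make the second point concrete, a small illustration of the arithmetic using the numbers reported in this issue; that the UI total is usedMemory + availableMemory of the root queue comes from the comment above, while how the reservation perturbs the two terms is exactly the open question.
{code}
// Illustration only, not CSQueueUtils code.
int clusterMB   = 16 * 1024;  // CapacityScheduler.clusterResource
int usedMB      = 15 * 1024;  // "Memory Used" shown on the web UI
int availableMB = 0;          // what availableMemory must have been reported as
int displayedTotalMB = usedMB + availableMB;  // 15 GB, not the real 16 GB
// The suspicion is that the 2 GB reservation makes updateQueueStatistics
// under-report one of the two terms, so their sum drifts away from clusterMB.
{code}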
[jira] [Commented] (YARN-2307) Capacity scheduler user only ADMINISTER_QUEUE also can submit app
[ https://issues.apache.org/jira/browse/YARN-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064760#comment-14064760 ] Sunil G commented on YARN-2307: --- By default, SUBMIT_APPLICATIONS at the root queue level takes * as the default [if not configured], so any user's job submission to a child queue will pass. I think it's a config issue (if so, we can mark this as invalid); if not, please share the config details. Capacity scheduler user only ADMINISTER_QUEUE also can submit app -- Key: YARN-2307 URL: https://issues.apache.org/jira/browse/YARN-2307 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Environment: hadoop 2.3.0 centos6.5 jdk1.7 Reporter: tangjunjie Priority: Minor Queue acls for user : root Queue Operations = root default china ADMINISTER_QUEUE unfunded User root only has ADMINISTER_QUEUE, but user root can still submit apps to the china queue. -- This message was sent by Atlassian JIRA (v6.2#6252)
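For context on why the root-level default matters: capacity-scheduler queue ACLs are evaluated hierarchically, so a user may submit to a leaf queue if the submit ACL of that queue or of any ancestor (including root, which defaults to *) allows it. A rough sketch of that evaluation, illustrative rather than the actual CapacityScheduler code:
{code}
// Sketch only: with the default root ACL of "*", this returns true for every
// user regardless of the leaf queue's own ACL, matching the behaviour
// reported in this issue.
boolean canSubmit(CSQueue queue, UserGroupInformation user) {
  for (CSQueue q = queue; q != null; q = q.getParent()) {
    if (q.hasAccess(QueueACL.SUBMIT_APPLICATIONS, user)) {
      return true;
    }
  }
  return false;
}
{code}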
[jira] [Updated] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Zhiguo updated YARN-2306: -- Attachment: YARN-2306.patch leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306.patch This only applies to fair scheduler. Capacity scheduler is OK. When appAttempt or node is removed, the metrics for reservation(reservedContainers, reservedMB, reservedVCores) is not reduced back. These are important metrics for administrator. The wrong metrics confuses may confuse them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
Wangda Tan created YARN-2308: Summary: NPE happened when RM restart after CapacityScheduler queue configuration changed Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan I encountered an NPE when the RM restarted:
{code}
2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:744)
{code}
And the RM fails to restart. This is caused by a queue configuration change: I removed some queues and added new queues. So when the RM restarts, it tries to recover historical applications, and when the queue of any of these applications has been removed, an NPE is raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
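A minimal sketch of the kind of guard that would avoid the NPE during recovery, assuming the failure happens where the recovered attempt's queue is looked up; the snippet is illustrative, not a proposed patch.
{code}
// Sketch only: while recovering APP_ATTEMPT_ADDED, check whether the
// application's configured queue still exists before dereferencing it, and
// fail the recovered application cleanly if the queue was removed.
CSQueue queue = getQueue(application.getQueue());
if (queue == null) {
  LOG.error("Queue " + application.getQueue() + " no longer exists;"
      + " cannot recover application " + application.getApplicationId());
  // reject / fail the application instead of hitting a NullPointerException
  return;
}
{code}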
[jira] [Commented] (YARN-2264) Race in DrainDispatcher can cause random test failures
[ https://issues.apache.org/jira/browse/YARN-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064797#comment-14064797 ] Hudson commented on YARN-2264: -- FAILURE: Integrated in Hadoop-Yarn-trunk #615 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/615/]) YARN-2264. Fixed a race condition in DrainDispatcher which may cause random test failures. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611126) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java Race in DrainDispatcher can cause random test failures -- Key: YARN-2264 URL: https://issues.apache.org/jira/browse/YARN-2264 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Seth Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2264-070814.patch This is the potential race. DrainDispatcher is started via serviceStart(). As a last step, this starts the actual dispatcher thread (eventHandlingThread.start()) and returns immediately, which means the thread may or may not have started up by the time start returns. Event sequence: UserThread: calls dispatcher.getEventHandler().handle(). This sets drained = false, and a context switch happens. DispatcherThread: starts running. DispatcherThread: drained = queue.isEmpty(); - this sets drained to true, since the user thread yielded before putting anything into the queue. UserThread: actual.handle(event) - this puts the event in the queue for the dispatcher thread to process, and returns control. UserThread: dispatcher.await() - since drained is true, this returns immediately, even though there is a pending event to process. -- This message was sent by Atlassian JIRA (v6.2#6252)
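For reference, a minimal sketch of the race described above and one way to close it: update the drained flag and the event queue under the same lock, so the handle()/await() interleaving can never observe a stale drained=true while an event is pending. This is a sketch of the idea only, not the committed change.
{code}
// Sketch only: make "drained" and the queue contents change atomically.
private final Object lock = new Object();
private final BlockingQueue<Event> queue = new LinkedBlockingQueue<Event>();
private volatile boolean drained = true;

void handleSketch(Event event) {
  synchronized (lock) {
    drained = false;       // flipped together with the enqueue, so the
    queue.add(event);      // dispatcher loop can never see drained=true
  }                        // while this event is still pending
}

void dispatchOneSketch() throws InterruptedException {
  Event event = queue.take();
  // ... dispatch the event ...
  synchronized (lock) {
    drained = queue.isEmpty();
  }
}
{code}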
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064813#comment-14064813 ] Hadoop QA commented on YARN-2306: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656254/YARN-2306.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4342//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4342//console This message is automatically generated. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306.patch This only applies to fair scheduler. Capacity scheduler is OK. When appAttempt or node is removed, the metrics for reservation(reservedContainers, reservedMB, reservedVCores) is not reduced back. These are important metrics for administrator. The wrong metrics confuses may confuse them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2309) NPE during RM-Restart test scenario
Nishan Shetty created YARN-2309: --- Summary: NPE during RM-Restart test scenario Key: YARN-2309 URL: https://issues.apache.org/jira/browse/YARN-2309 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor During RM restart test scenarios, we encountered the exception below. A point to note here is that ZooKeeper was also not stable during this testing; we could see many ZooKeeper exceptions before getting this NPE.
{code}
2014-07-10 10:49:46,817 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108)
at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:125)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1039)
{code}
ZooKeeper exception:
{code}
2014-07-10 10:49:46,816 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService failed in state INITED; cause: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.waitForZKConnectionEvent(ActiveStandbyElector.java:1046)
at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.access$400(ActiveStandbyElector.java:1017)
at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:632)
at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:766)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
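Given that the ZooKeeper connection loss makes serviceInit fail before the service is fully constructed, the NPE in serviceStop looks like a missing null guard on state created during init. A minimal sketch of the defensive pattern (the field name is an assumption used for illustration):
{code}
// Sketch only: serviceStop can run against a partially-initialized service,
// because CompositeService stops already-inited children when init fails, so
// anything created in serviceInit must be null-checked before use.
@Override
protected synchronized void serviceStop() throws Exception {
  if (elector != null) {            // may be null if serviceInit failed early
    elector.quitElection(false);
    elector.terminateConnection();
  }
  super.serviceStop();
}
{code}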
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064849#comment-14064849 ] Hadoop QA commented on YARN-2045: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656193/YARN-2045-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4343//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4343//console This message is automatically generated. Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064870#comment-14064870 ] Hadoop QA commented on YARN-1341: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656064/YARN-1341v7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4344//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4344//console This message is automatically generated. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064887#comment-14064887 ] Naganarasimha G R commented on YARN-2301: - Hi [~jianhe], could I work on this issue? Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Labels: usability While running the yarn container -list <Application Attempt ID> command, some observations: 1) the scheme (e.g. http/https) before LOG-URL is missing 2) the start-time is printed as milliseconds (e.g. 1405540544844); better to print it in a time format 3) finish-time is 0 if the container is not yet finished; maybe print N/A instead 4) it may be useful to also have an option to run yarn container -list <appId> OR yarn application -list-containers <appId>. As the attempt Id is not shown on the console, it is easier for the user to just copy the appId and run it; this may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
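On point 2, a small sketch of the conversion meant there: turning the raw epoch milliseconds (e.g. 1405540544844) into a readable timestamp before printing. The class name and format string are illustrative only.
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

public class StartTimeFormat {
  public static void main(String[] args) {
    long startTimeMillis = 1405540544844L;  // value as currently printed
    // Print a human-readable time instead of the raw millisecond value.
    String readable = new SimpleDateFormat("EEE dd MMM yyyy HH:mm:ss Z")
        .format(new Date(startTimeMillis));
    System.out.println(readable);
  }
}
{code}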
[jira] [Commented] (YARN-2219) AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts
[ https://issues.apache.org/jira/browse/YARN-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064902#comment-14064902 ] Hudson commented on YARN-2219: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1834 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1834/]) YARN-2219. Addendum patch for YARN-2219 (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611240) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java YARN-2219. Changed ResourceManager to avoid AMs and NMs getting exceptions after RM recovery but before scheduler learns about apps and app-attempts. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611222) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAddedSchedulerEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAttemptAddedSchedulerEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts --- Key: YARN-2219 URL: https://issues.apache.org/jira/browse/YARN-2219 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Ashwin Shankar Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2219-fix-compilation-failure.txt, YARN-2219.1.patch, YARN-2219.2.patch {code} org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testAppReregisterOnRMWorkPreservingRestart[0](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 4.335 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:91) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:113) at
[jira] [Commented] (YARN-2264) Race in DrainDispatcher can cause random test failures
[ https://issues.apache.org/jira/browse/YARN-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064904#comment-14064904 ] Hudson commented on YARN-2264: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1834 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1834/]) YARN-2264. Fixed a race condition in DrainDispatcher which may cause random test failures. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611126) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java Race in DrainDispatcher can cause random test failures -- Key: YARN-2264 URL: https://issues.apache.org/jira/browse/YARN-2264 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Seth Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2264-070814.patch This is what can happen. This is the potential race. DrainDispatcher is started via serviceStart() . As a last step, this starts the actual dispatcher thread (eventHandlingThread.start() - and returns immediately - which means the thread may or may not have started up by the time start returns. Event sequence: UserThread: calls dispatcher.getEventHandler().handle() This sets drained = false, and a context switch happens. DispatcherThread: starts running DispatcherThread drained = queue.isEmpty(); - This sets drained to true, since Thread1 yielded before putting anything into the queue. UserThread: actual.handle(event) - which puts the event in the queue for the dispatcher thread to process, and returns control. UserThread: dispatcher.await() - Since drained is true, this returns immediately - even though there is a pending event to process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2264) Race in DrainDispatcher can cause random test failures
[ https://issues.apache.org/jira/browse/YARN-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064925#comment-14064925 ] Hudson commented on YARN-2264: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1807 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1807/]) YARN-2264. Fixed a race condition in DrainDispatcher which may cause random test failures. Contributed by Li Lu (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611126) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/event/DrainDispatcher.java Race in DrainDispatcher can cause random test failures -- Key: YARN-2264 URL: https://issues.apache.org/jira/browse/YARN-2264 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Seth Assignee: Li Lu Fix For: 2.6.0 Attachments: YARN-2264-070814.patch This is what can happen. This is the potential race. DrainDispatcher is started via serviceStart() . As a last step, this starts the actual dispatcher thread (eventHandlingThread.start() - and returns immediately - which means the thread may or may not have started up by the time start returns. Event sequence: UserThread: calls dispatcher.getEventHandler().handle() This sets drained = false, and a context switch happens. DispatcherThread: starts running DispatcherThread drained = queue.isEmpty(); - This sets drained to true, since Thread1 yielded before putting anything into the queue. UserThread: actual.handle(event) - which puts the event in the queue for the dispatcher thread to process, and returns control. UserThread: dispatcher.await() - Since drained is true, this returns immediately - even though there is a pending event to process. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2219) AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts
[ https://issues.apache.org/jira/browse/YARN-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064923#comment-14064923 ] Hudson commented on YARN-2219: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1807 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1807/]) YARN-2219. Addendum patch for YARN-2219 (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611240) * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java YARN-2219. Changed ResourceManager to avoid AMs and NMs getting exceptions after RM recovery but before scheduler learns about apps and app-attempts. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611222) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAddedSchedulerEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/AppAttemptAddedSchedulerEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestFifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java AMs and NMs can get exceptions after recovery but before scheduler knowns apps and app-attempts --- Key: YARN-2219 URL: https://issues.apache.org/jira/browse/YARN-2219 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Ashwin Shankar Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2219-fix-compilation-failure.txt, YARN-2219.1.patch, YARN-2219.2.patch {code} org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testAppReregisterOnRMWorkPreservingRestart[0](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 4.335 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getTransferredContainers(AbstractYarnScheduler.java:91) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.registerApplicationMaster(ApplicationMasterService.java:297) at org.apache.hadoop.yarn.server.resourcemanager.MockAM$1.run(MockAM.java:113) at
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065002#comment-14065002 ] Zhijie Shen commented on YARN-2247: --- bq. The current implementation uses the standard http authentication for hadoop. Users can set it to simple if they choose. I was trying to make the point that when kerberos authentication is not used, simple authentication is not implicitly set, is it? In this case, without the authentication filter, we cannot identify the user via the HTTP interface, so we cannot behave correctly for operations that require knowledge of the user, such as submitting/killing an application. Taking one step back, let's look at the analogous RPC interfaces. By default, the authentication is SIMPLE, and on the server side we can still identify who the user is, so features such as ACLs still work in the SIMPLE case. bq. For now I'd like to use the same configs as the standard hadoop http auth. I'm open to changing them if we feel strongly about it in the future. It's okay to keep the configuration the same. Just thinking out loud: if so, you may not want to add RM_WEBAPP_USE_YARN_AUTH_FILTER at all, and not load YarnAuthenticationFilterInitializer programmatically. The rationales behind them are similar. Previously, I tried to add TimelineAuthenticationFilterInitializer programmatically because I found that the same http auth config applies to different daemons, and it's annoying that on a single-node cluster, configuring something only for the timeline server affects the other daemons. Afterwards, I made the timeline server use a set of configs with the timeline-service prefix. This is what we did for the RPC interface configurations. bq. I didn't understand - can you explain further? Let's take RMWebServices#getApp as an example. Previously we didn't have (or at least didn't know about) the auth filter, so we could not detect the user. Therefore, we didn't check the ACLs, and simply got the application from RMContext and returned it to the user. Now that we have the auth filter, we can identify the user. Hence, it's possible for us to fix this API to only return the application information to a user that has access. This is another reason why I suggest always having the authentication filter on, whether it is simple or kerberos. bq. Am I looking at the wrong file? This is the right file, but I'm afraid it is not the correct logic. AuthenticationFilter accepts a null secret file. However, if we use AuthenticationFilterInitializer to construct AuthenticationFilter, the null case is rejected. I previously opened a ticket for this issue (HADOOP-10600). Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
Zhijie Shen created YARN-2310: - Summary: Revisit the APIs in RM web services where user information can make difference Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
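For illustration, a hedged sketch of the kind of check being proposed, assuming the caller's identity is made available by the authentication filter; the helper names here (checkAccess and the generic app object) are hypothetical, not the actual RMWebServices API:

{code}
// Illustrative sketch only: return application information only if the
// caller passes the application's view-ACL check.
import org.apache.hadoop.security.UserGroupInformation;

class AppInfoSketch {
  Object getApp(String remoteUser, Object app) {
    // remoteUser is whatever the authentication filter established for this request.
    UserGroupInformation callerUGI = UserGroupInformation.createRemoteUser(remoteUser);
    if (!checkAccess(callerUGI, app)) {
      // Without access, do not leak application details to the caller.
      throw new SecurityException(
          "User " + remoteUser + " is not authorized to view this application");
    }
    return app;
  }

  // Placeholder for an ACL check (e.g. an ApplicationAccessType.VIEW_APP style check).
  private boolean checkAccess(UserGroupInformation caller, Object app) {
    return true;
  }
}
{code}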
[jira] [Created] (YARN-2311) Revisit RM web pages where user information may make difference.
Zhijie Shen created YARN-2311: - Summary: Revisit RM web pages where user information may make difference. Key: YARN-2311 URL: https://issues.apache.org/jira/browse/YARN-2311 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen Similar to YARN-2310, the RM web pages list information without considering whether the user has access to it. This could be fixed after YARN-2247, via which we can use the authentication filter to identify the user of the incoming request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065027#comment-14065027 ] Zhijie Shen commented on YARN-2247: --- I filed YARN-2310 and YARN-2311 for the third point. Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2312) Marking ContainerId#getId as deprecated
Tsuyoshi OZAWA created YARN-2312: Summary: Marking ContainerId#getId as deprecated Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA After YARN-2229, {{ContainerId#getId}} will only return a partial value of the container id: the sequence number without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065045#comment-14065045 ] Tsuyoshi OZAWA commented on YARN-2229: -- Created YARN-2312 to address marking getId as a deprecated method. ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch On YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch and the lower 22 bits are for the sequence number of the id. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after the RM restarts 1024 times. To avoid the problem, it's better to make containerId a long. We need to define the new format of the container id while preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
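To make the layout concrete, here is a small worked sketch of the 32-bit format described above (10 epoch bits, 22 sequence bits) and why 1024 restarts overflow it; the class and method names are illustrative, not the actual ContainerId code. It also shows why the int-valued getId loses the epoch, which is what YARN-2312 proposes to deprecate in favour of a long-valued getContainerId:

{code}
// Illustrative sketch of the container id layout discussed above.
// 32-bit form: [ 10 bits epoch | 22 bits sequence ] -> the epoch wraps
// after 2^10 = 1024 RM restarts; widening the id to a long removes the limit.
public class ContainerIdLayoutSketch {
  static final int EPOCH_BITS = 10;
  static final int SEQUENCE_BITS = 22;

  // Old-style 32-bit id: epoch packed into the upper bits.
  static int packInt(int epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & ((1 << SEQUENCE_BITS) - 1));
  }

  // A "getId()"-style accessor: only the sequence part survives, the epoch is lost,
  // which is why an int-returning getId() is being deprecated.
  static int sequenceOf(int packedId) {
    return packedId & ((1 << SEQUENCE_BITS) - 1);
  }

  public static void main(String[] args) {
    int id = packInt(3, 42);               // epoch 3, 42nd container
    System.out.println(sequenceOf(id));    // prints 42 -> the epoch information is gone
    System.out.println(1 << EPOCH_BITS);   // prints 1024 -> epoch overflows after 1024 restarts
  }
}
{code}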
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201407171553.txt Thanks [~leftnoteasy]] {quote} No, you can update current trunk code, and check RMContainerImpl#FinishedTransition#updateMetricsIfPreempted, you can change the updateMetricsIfPreempted to something like updateAttemptMetrics. And create a new method in RMAppAttemptMetrics, like updateResourceUtilization. The benefit of doing this is you don need send payload to RMAppAttempt, all you needed information should be existed in RMContainer. {quote} updateMetricsIfPreempted() is using the current attempt. Is there a way to get the RMAppAttempt object for completed attempts. IIUC, there are races where there is no running attempt, such as when an attempt dies after other containers have started and then the app itself dies or is killed. Also, in the case of work-preserving restart, the appattempt could die and it's child containers could be assigned to the second appattempt, I have included a new patch that adds the payload to the CONTAINER_FINISHED event, which is sent to the appropriate RMAppAttempt. The RMAppAttempt then will keep track of it's own stats, even after the container for that appattempt has finished. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
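As a quick worked example of the proposed metric (numbers purely illustrative): an app that held one 1024 MB AM container for 600 s and two 2048 MB task containers for 300 s each would be charged 1024*600 + 2048*300 + 2048*300 = 1,843,200 MB-seconds. A minimal sketch of the accumulation:

{code}
// Illustrative sketch: accumulate memory-seconds per application from the
// reserved memory and lifetime of each finished container.
public class MemorySecondsSketch {
  private long memorySeconds = 0;

  // Called when a container finishes; memoryMb is the memory reserved for it.
  void containerFinished(long memoryMb, long startMillis, long finishMillis) {
    long lifetimeSeconds = (finishMillis - startMillis) / 1000;
    memorySeconds += memoryMb * lifetimeSeconds;
  }

  long getMemorySeconds() {
    return memorySeconds;
  }

  public static void main(String[] args) {
    MemorySecondsSketch app = new MemorySecondsSketch();
    app.containerFinished(1024, 0, 600_000);    // AM container, 600 s
    app.containerFinished(2048, 0, 300_000);    // task container, 300 s
    app.containerFinished(2048, 0, 300_000);    // task container, 300 s
    System.out.println(app.getMemorySeconds()); // prints 1843200
  }
}
{code}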
[jira] [Commented] (YARN-2301) Improve yarn container command
[ https://issues.apache.org/jira/browse/YARN-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065144#comment-14065144 ] Jian He commented on YARN-2301: --- [~Naganarasimha], sure, thanks for working it! Improve yarn container command -- Key: YARN-2301 URL: https://issues.apache.org/jira/browse/YARN-2301 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Labels: usability While running yarn container -list Application Attempt ID command, some observations: 1) the scheme (e.g. http/https ) before LOG-URL is missing 2) the start-time is printed as milli seconds (e.g. 1405540544844). Better to print as time format. 3) finish-time is 0 if container is not yet finished. May be N/A 4) May have an option to run as yarn container -list appId OR yarn application -list-containers appId also. As attempt Id is not shown on console, this is easier for user to just copy the appId and run it, may also be useful for container-preserving AM restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2306) leak of reservation metrics (fair scheduler)
[ https://issues.apache.org/jira/browse/YARN-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065170#comment-14065170 ] Wei Yan commented on YARN-2306: --- Thanks, [~zhiguohong], the patch looks good to me. leak of reservation metrics (fair scheduler) Key: YARN-2306 URL: https://issues.apache.org/jira/browse/YARN-2306 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Hong Zhiguo Assignee: Hong Zhiguo Priority: Minor Attachments: YARN-2306.patch This only applies to the fair scheduler; the capacity scheduler is OK. When an appAttempt or node is removed, the reservation metrics (reservedContainers, reservedMB, reservedVCores) are not reduced back. These are important metrics for administrators, and wrong values may confuse them. -- This message was sent by Atlassian JIRA (v6.2#6252)
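For illustration, a hedged sketch of the symmetry the fix needs (the method names are illustrative, not the exact QueueMetrics API): whatever increments the reservation counters when a container is reserved must be paired with a decrement on every path that drops the reservation, including appAttempt removal and node removal:

{code}
// Illustrative sketch only: reservation counters that are incremented on
// reserve and must be decremented again on every unreserve/cleanup path.
class ReservationMetricsSketch {
  private int reservedContainers;
  private long reservedMB;
  private long reservedVCores;

  void reserveResource(long memoryMB, long vcores) {
    reservedContainers++;
    reservedMB += memoryMB;
    reservedVCores += vcores;
  }

  // Must be called on *every* path that drops a reservation, including
  // appAttempt removal and node removal, or the counters leak upward.
  void unreserveResource(long memoryMB, long vcores) {
    reservedContainers--;
    reservedMB -= memoryMB;
    reservedVCores -= vcores;
  }

  int getReservedContainers() { return reservedContainers; }
  long getReservedMB()        { return reservedMB; }
  long getReservedVCores()    { return reservedVCores; }
}
{code}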
[jira] [Created] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
Tsuyoshi OZAWA created YARN-2313: Summary: Livelock can occur on FairScheduler when there are lots entry in queue Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.1.patch Ideally, UPDATE_INTERVAL should be calculated based on the current number of entries in the queue. Another workaround is making UPDATE_INTERVAL configurable. The attached patch takes the latter approach, because it's easier to implement. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
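A minimal sketch of the configurable-interval workaround, assuming the interval is read once at init time (the config key shown is for illustration; the actual key is whatever the attached patch defines):

{code}
// Illustrative sketch only: an update thread whose interval comes from
// configuration instead of a hard-coded 500 ms constant.
import org.apache.hadoop.conf.Configuration;

class UpdateThreadSketch implements Runnable {
  // Hypothetical key and default for illustration; not necessarily what the patch uses.
  static final String UPDATE_INTERVAL_KEY = "yarn.scheduler.fair.update-interval-ms";
  static final long DEFAULT_UPDATE_INTERVAL_MS = 500;

  private final long updateIntervalMs;

  UpdateThreadSketch(Configuration conf) {
    updateIntervalMs = conf.getLong(UPDATE_INTERVAL_KEY, DEFAULT_UPDATE_INTERVAL_MS);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        Thread.sleep(updateIntervalMs);  // back off for the configured interval
        update();                        // recompute fair shares, demand, etc.
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private void update() { /* scheduler update pass */ }
}
{code}

Raising the interval lets an administrator keep the update pass from consuming the whole period on clusters with very many queues, at the cost of slightly staler fair shares.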
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065203#comment-14065203 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656291/YARN-415.201407171553.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4345//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4345//console This message is automatically generated. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: rm-stack-trace.txt Attached stack trace when we faced the problem. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2305) When a container is in reserved state then total cluster memory is displayed wrongly.
[ https://issues.apache.org/jira/browse/YARN-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065238#comment-14065238 ] Sunil G commented on YARN-2305: --- 1. Yes [~leftnoteasy], GUI display of 106% is similar to YARN-2285. It can be tackled there. 2. As mentioned in earlier comment, Total MB in GUI is internally sum of availableMB+allottedMB. a. *LeafQueue#usedResources* is sum of used and reserved memory. But *CSQueueUtils#updateQueueStatistics()* code may give a -ve value in case of reservation which sets availableMB in QueueMetrics. {code} Resource available = Resources.subtract(queueLimit, usedResources); {code} If this comes as -ve, then *availableMB* is set as 0. b. *allocatedMB*: This is set when a container is really allocated. This is the real queue usage. In above scenario, it should have come as {noformat}15(availableMb)+1(allocatedMB)=16{noformat} But due to reservation, allocatedMB became 0. Hence total shown as 15. I feel instead of showing Total as *allocated+available*, we can show *clusterResource* here. Any particular reason why we need like allocated+available, thoughts? When a container is in reserved state then total cluster memory is displayed wrongly. - Key: YARN-2305 URL: https://issues.apache.org/jira/browse/YARN-2305 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.1 Reporter: J.Andreina Assignee: Sunil G Attachments: Capture.jpg ENV Details: = 3 queues : a(50%),b(25%),c(25%) --- All max utilization is set to 100 2 Node cluster with total memory as 16GB TestSteps: = Execute following 3 jobs with different memory configurations for Map , reducer and AM task ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=a -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=2048 /dir8 /preempt_85 (application_1405414066690_0023) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=b -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=2048 -Dyarn.app.mapreduce.am.resource.mb=2048 -Dmapreduce.reduce.memory.mb=2048 /dir2 /preempt_86 (application_1405414066690_0025) ./yarn jar wordcount-sleep.jar -Dmapreduce.job.queuename=c -Dwordcount.map.sleep.time=2000 -Dmapreduce.map.memory.mb=1024 -Dyarn.app.mapreduce.am.resource.mb=1024 -Dmapreduce.reduce.memory.mb=1024 /dir2 /preempt_62 Issue = when 2GB memory is in reserved state totoal memory is shown as 15GB and used as 15GB ( while total memory is 16GB) -- This message was sent by Atlassian JIRA (v6.2#6252)
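A small worked sketch of the clamping being described, with illustrative numbers for a 16 GB cluster: once the reservation pushes usedResources past the queue limit, the subtraction goes negative, availableMB is clamped to 0, and the Total = available + allocated figure drops below the real cluster size:

{code}
// Illustrative arithmetic only, mirroring the mechanism described above.
public class TotalMemorySketch {
  public static void main(String[] args) {
    int clusterMB    = 16 * 1024;   // real cluster size: 16 GB
    int queueLimitMB = 16 * 1024;   // queue limit equal to the cluster here
    int allocatedMB  = 15 * 1024;   // memory of containers actually allocated
    int reservedMB   = 2 * 1024;    // the reserved (not yet allocated) container
    int usedMB       = allocatedMB + reservedMB;   // used + reserved, as in LeafQueue#usedResources

    int availableMB = queueLimitMB - usedMB;       // goes negative once reservations overshoot
    if (availableMB < 0) {
      availableMB = 0;                             // clamped, as in CSQueueUtils#updateQueueStatistics()
    }

    int totalShown = availableMB + allocatedMB;    // what the web UI sums up today
    System.out.println("Total shown = " + totalShown + " MB, cluster = " + clusterMB + " MB");
    // Prints a total below the real cluster size whenever the clamp kicks in,
    // which is the mismatch reported here; showing clusterResource would avoid it.
  }
}
{code}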
[jira] [Created] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
Jason Lowe created YARN-2314: Summary: ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065294#comment-14065294 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656305/rm-stack-trace.txt against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4347//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: (was: YARN-2313.1.patch) Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.1.patch Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065316#comment-14065316 ] Jason Lowe commented on YARN-2314: -- The problem is that the cache doesn't try very hard to remove proxies when the cache is at or beyond the maximum configured size. When adding a new proxy to the cache and it should remove an entry, it simply grabs the least-recently-used proxy and tries to close it. If the entry is currently in use then an entry isn't immediately removed and that means we're running with a cache larger than configured. This can get far worse on a big cluster. For example, if the least-recently-used proxy is currently performing a call that is stuck on socket connection retries, the LRU entry could take quite a while before it closes. During that time each new proxy created will make the same attempt to close that proxy and fail to do so. That means that the cache size is now N-1 larger than it should be when it finally does close where N is the number of proxies created while the LRU entry was busy. On a large cluster with thousands of nodes a proxy hanging on one node could allow the cache to have thousands of more proxies in it than configured. Since each proxy is a thread, that's thousands of threads, and all those thread stacks can blow container limits on the AM (or address limits if it's a 32-bit AM). ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Priority: Critical ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
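For illustration, a minimal sketch of the eviction pattern being described (the names are illustrative, not the actual ContainerManagementProtocolProxy code): eviction only ever considers the single least-recently-used entry, and skips it when it is in use, so every insertion made while that entry stays busy pushes the cache further past its configured limit:

{code}
// Illustrative sketch of the problematic pattern: an LRU cache that only
// ever tries to evict its single oldest entry, and skips eviction when
// that entry is in use.
import java.util.LinkedHashMap;
import java.util.Map;

class ProxyCacheSketch {
  static class Proxy {
    boolean inUse;          // e.g. a call stuck in connection retries
  }

  private final int maxSize;
  private final LinkedHashMap<String, Proxy> cache =
      new LinkedHashMap<>(16, 0.75f, true);   // access-ordered = LRU

  ProxyCacheSketch(int maxSize) { this.maxSize = maxSize; }

  Proxy getProxy(String nodeAddress) {
    Proxy p = cache.get(nodeAddress);
    if (p == null) {
      if (cache.size() >= maxSize) {
        // Only the LRU entry is considered; a busy LRU entry means
        // nothing is evicted and the cache silently grows.
        Map.Entry<String, Proxy> lru = cache.entrySet().iterator().next();
        if (!lru.getValue().inUse) {
          cache.remove(lru.getKey());
        }
      }
      p = new Proxy();       // in the real client this also starts an IPC thread
      cache.put(nodeAddress, p);
    }
    return p;
  }

  int size() { return cache.size(); }
}
{code}

On a cluster with thousands of nodes, a single stuck LRU proxy therefore lets the cache (and the per-proxy IPC threads behind it) drift far beyond maxSize, which is the thread explosion described above.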
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065312#comment-14065312 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656304/YARN-2313.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4346//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4346//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4346//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065363#comment-14065363 ] Sunil G commented on YARN-2308: --- During *RMAppRecoveredTransition* in RMAppImpl, may be we can check recovered app queue (can get this from submission context) is still a valid queue? If this queue not present, recovery for that app can be made failed, and may be need to do some more RMApp clean up. Sounds doable? NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Priority: Critical I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.2#6252)
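A hedged sketch of the validation being suggested (the recovery hook and queue lookup are illustrative, not the exact RMAppImpl/scheduler API): look the queue up before replaying the attempt, and fail that app's recovery cleanly instead of letting addApplicationAttempt hit an NPE:

{code}
// Illustrative sketch only: validate that a recovered application's queue
// still exists before the scheduler replays its attempts.
import java.util.Map;

class RecoveryQueueCheckSketch {
  private final Map<String, Object> queues;   // queue name -> queue object

  RecoveryQueueCheckSketch(Map<String, Object> queues) {
    this.queues = queues;
  }

  void recoverApplication(String appId, String queueName) {
    Object queue = queues.get(queueName);
    if (queue == null) {
      // The queue was removed from the configuration between restarts:
      // fail this app's recovery cleanly rather than NPE in addApplicationAttempt().
      throw new IllegalStateException(
          "Queue " + queueName + " no longer exists; failing recovery of " + appId);
    }
    // ... proceed with the normal APP_ADDED / APP_ATTEMPT_ADDED handling ...
  }
}
{code}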
[jira] [Updated] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2208: Attachment: YARN-2208.9.patch AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065414#comment-14065414 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656316/YARN-2313.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4348//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4348//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4348//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, rm-stack-trace.txt Observed livelock on FairScheduler when there are lots entry in queue. After my investigating code, following case can occur: 1. {{update()}} called by UpdateThread takes longer times than UPDATE_INTERVAL(500ms) if there are lots queue. 2. UpdateThread goes busy loop. 3. Other threads(AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065451#comment-14065451 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656318/YARN-2208.9.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4349//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4349//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2208: Attachment: YARN-2208.9.patch Try again with the same patch AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065525#comment-14065525 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656332/YARN-2208.9.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4350//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4350//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.8.patch, YARN-2208.9.patch, YARN-2208.9.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201407172144.txt Thank you, [~leftnoteasy] {quote} No, you can update current trunk code, and check RMContainerImpl#FinishedTransition#updateMetricsIfPreempted, you can change the updateMetricsIfPreempted to something like updateAttemptMetrics. And create a new method in RMAppAttemptMetrics, like updateResourceUtilization. The benefit of doing this is you don need send payload to RMAppAttempt, all you needed information should be existed in RMContainer. {quote} I see that I can use RMApp#getRMAppAttempt to get the attempt that belongs to the container, so your suggestion will work for this use case. This is a cleaner solution. I have updated the patch with your suggestions. I am still looking into the test problems. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065669#comment-14065669 ] Jason Lowe commented on YARN-2045: -- Thanks for updating the patch! bq. Also, I like suggestion to make unit test as a black box which may only handle NMStateStore's start and stop. However, in this case, it could means we need extra API to update CURRENT_VERSION_INFO which is a constant now (but could be changed to different values across different YARN versions) What I meant is instead of using checkVersion to verify the version we would instead stop and start the state store to see if it accepted the version. We would still need to use the storeVersion(NMDBSchemaVersion) package-private method to store a custom version after it starts then restart the state store to verify it either started up or failed due to an incompatible version. It's not a big deal if you'd rather keep it as-is. Otherwise latest patch looks good. Will give it a closer look tomorrow for what I think will be final review/commit, and that will also give [~vvasudev] a chance to take another look. Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM related data, include: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
zhihai xu created YARN-2315: --- Summary: Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In the getQueueInfo function of FSQueue.java, we call setCapacity twice with different parameters, so the first call is overridden by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
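Since both calls are quoted above, the fix is mechanical; a sketch of the corrected lines, assuming the surrounding FSQueue#getQueueInfo code stays as-is:

{code}
// Before (both values are written into "capacity"; the first is overwritten):
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());

// After (sketch of the proposed change: the second value is the *current*,
// i.e. used, capacity):
queueInfo.setCapacity((float) getFairShare().getMemory() /
    scheduler.getClusterResource().getMemory());
queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory() /
    scheduler.getClusterResource().getMemory());
{code}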
[jira] [Updated] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2315: Attachment: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In function getQueueInfo of FSQueue.java, we call setCapacity twice with different parameters so the first call is overrode by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2316) TestNMWebServices* get failed on trunk
Junping Du created YARN-2316: Summary: TestNMWebServices* get failed on trunk Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From the Jenkins runs in YARN-2045 and YARN-1341, these tests fail with a bind exception (address already bound). A similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2316) TestNMWebServices* get failed on trunk
[ https://issues.apache.org/jira/browse/YARN-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065705#comment-14065705 ] Junping Du commented on YARN-2316: -- I suspect some recently added Web-Services tests didn't do proper cleanup. TestNMWebServices* get failed on trunk -- Key: YARN-2316 URL: https://issues.apache.org/jira/browse/YARN-2316 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du From the Jenkins runs in YARN-2045 and YARN-1341, these tests fail with a bind exception (address already bound). A similar issue happens at RMWebService (YARN-2304) and AMWebService (MAPREDUCE-5973) as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065716#comment-14065716 ] Junping Du commented on YARN-1341: -- +1. Patch looks good. Will commit it shortly. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065715#comment-14065715 ] Junping Du commented on YARN-1341: -- I can confirm the test failure is not related to the patch, as it also shows up in YARN-2045. A similar issue happens in the AM WebServices (MAPREDUCE-5973) and RM WebServices (YARN-2304) as well. Already filed YARN-2316 to track these failures. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065743#comment-14065743 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656361/YARN-415.201407172144.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4351//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4351//console This message is automatically generated. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2045: - Attachment: YARN-2045-v6.patch Addressed [~jlowe]'s comments in the v6 patch. Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM-related data, including: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
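As an illustration of what versioning the persisted NM state enables, here is a minimal compatibility-check sketch. It is an assumption-laden example, not the actual NM state-store code; the class and constants are hypothetical.
{code}
// Hypothetical sketch: compare a persisted state version against the version the
// current NodeManager code writes, and decide whether the old state is loadable.
public class NMStateVersionCheck {

  static final int CURRENT_MAJOR = 1;
  static final int CURRENT_MINOR = 0;

  // Same major version: state is compatible (minor bumps are assumed additive).
  // Different major version: the on-disk layout changed incompatibly.
  static boolean isCompatible(int storedMajor, int storedMinor) {
    return storedMajor == CURRENT_MAJOR;
  }

  public static void main(String[] args) {
    System.out.println(isCompatible(1, 2));  // true: newer minor version, still loadable
    System.out.println(isCompatible(2, 0));  // false: incompatible major version
  }
}
{code}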
[jira] [Commented] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065751#comment-14065751 ] Hadoop QA commented on YARN-2315: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656373/YARN-2315.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4352//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4352//console This message is automatically generated. Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In the getQueueInfo method of FSQueue.java, we call setCapacity twice with different parameters, so the first call is overridden by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
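For clarity, a rough sketch of the proposed change (illustrative only, not the exact FSQueue source): the fair-share fraction keeps going through setCapacity, while the in-use fraction goes through setCurrentCapacity instead of silently overwriting the first value.
{code}
// Sketch of the proposed fix in FSQueue#getQueueInfo (assumes the surrounding
// FSQueue context: queueInfo, scheduler, getFairShare(), getResourceUsage()).
float clusterMemory = scheduler.getClusterResource().getMemory();

// Configured (fair-share) capacity as a fraction of the cluster.
queueInfo.setCapacity((float) getFairShare().getMemory() / clusterMemory);

// Currently used capacity as a fraction of the cluster. Previously this was a
// second setCapacity call, which overwrote the value set just above.
queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory() / clusterMemory);
{code}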
[jira] [Created] (YARN-2317) Update documentation about how to write YARN applications
Li Lu created YARN-2317: --- Summary: Update documentation about how to write YARN applications Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Some information in WritingYarnApplications webpage is out-dated. Need some refresh work on this document to reflect the most recent changes in YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2317) Update documentation about how to write YARN applications
[ https://issues.apache.org/jira/browse/YARN-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2317: Attachment: YARN-2317-071714.patch Hi folks, I've refreshed the WritingYarnApplications webpage to keep it consistent with some API changes. I'm using the YARN distributed shell as a sample and explained how the new (especially asynchronous) APIs work. This is my first draft of it. I would definitely appreciate comments/suggestions from the whole community on this. Thank you! Update documentation about how to write YARN applications - Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2317-071714.patch Some information in WritingYarnApplications webpage is out-dated. Need some refresh work on this document to reflect the most recent changes in YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
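For readers unfamiliar with the asynchronous client APIs that the refreshed page covers, here is a minimal sketch of registering an ApplicationMaster with AMRMClientAsync. It follows the Hadoop 2.x client library as best understood here and omits container requests, error handling, and shutdown; treat the details as approximate rather than as the documented example itself.
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AsyncAmSketch {
  public static void main(String[] args) throws Exception {
    AMRMClientAsync.CallbackHandler handler = new AMRMClientAsync.CallbackHandler() {
      @Override public void onContainersAllocated(List<Container> containers) {
        // Launch work on the newly allocated containers here.
      }
      @Override public void onContainersCompleted(List<ContainerStatus> statuses) {
        // Track completion / retries here.
      }
      @Override public void onNodesUpdated(List<NodeReport> updated) { }
      @Override public void onShutdownRequest() { }
      @Override public void onError(Throwable e) { }
      @Override public float getProgress() { return 0.0f; }
    };

    // Heartbeat every 1000 ms; the callbacks above are invoked asynchronously.
    AMRMClientAsync<AMRMClient.ContainerRequest> amRMClient =
        AMRMClientAsync.createAMRMClientAsync(1000, handler);
    amRMClient.init(new YarnConfiguration());
    amRMClient.start();
    amRMClient.registerApplicationMaster("appmaster-host", 0, "");
    // ... add ContainerRequests, do work, then unregister and stop ...
  }
}
{code}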
[jira] [Resolved] (YARN-2275) When log aggregation not enabled, message should point to NM HTTP port, not IPC port
[ https://issues.apache.org/jira/browse/YARN-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang resolved YARN-2275. -- Resolution: Won't Fix Unable to fix this using a single Configuration property. A patch that hacks around this by using two properties was considered not acceptable. Closing this bug as Won't Fix. When log aggregation not enabled, message should point to NM HTTP port, not IPC port - Key: YARN-2275 URL: https://issues.apache.org/jira/browse/YARN-2275 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Ray Chiang Labels: usability Attachments: MAPREDUCE5185-01.patch When I try to get a container's logs in the JHS without log aggregation enabled, I get a message that looks like this: Aggregation is not enabled. Try the nodemanager at sandy-ThinkPad-T530:33224 This could be a lot more helpful by actually pointing to the URL that would show the container logs on the NM. -- This message was sent by Atlassian JIRA (v6.2#6252)
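The two NodeManager address settings involved are presumably the RPC address and the webapp address. The sketch below only illustrates reading both from a Configuration; it is not the rejected patch and makes no claim about how the message would have been rewritten.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmAddressExample {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // IPC (RPC) address: the port the unhelpful message currently points at.
    String rpcAddress = conf.get(YarnConfiguration.NM_ADDRESS);
    // HTTP (webapp) address: the one a log-viewing URL would actually need.
    String httpAddress = conf.get(YarnConfiguration.NM_WEBAPP_ADDRESS);
    System.out.println("NM RPC address:  " + rpcAddress);
    System.out.println("NM HTTP address: " + httpAddress);
  }
}
{code}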
[jira] [Commented] (YARN-2045) Data persisted in NM should be versioned
[ https://issues.apache.org/jira/browse/YARN-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065791#comment-14065791 ] Hadoop QA commented on YARN-2045: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656386/YARN-2045-v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4353//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4353//console This message is automatically generated. Data persisted in NM should be versioned Key: YARN-2045 URL: https://issues.apache.org/jira/browse/YARN-2045 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.4.1 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-2045-v2.patch, YARN-2045-v3.patch, YARN-2045-v4.patch, YARN-2045-v5.patch, YARN-2045-v6.patch, YARN-2045.patch As a split task from YARN-667, we want to add version info to NM-related data, including: - NodeManager local LevelDB state - NodeManager directory structure -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2317) Update documentation about how to write YARN applications
[ https://issues.apache.org/jira/browse/YARN-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065793#comment-14065793 ] Hadoop QA commented on YARN-2317: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656393/YARN-2317-071714.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4354//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4354//console This message is automatically generated. Update documentation about how to write YARN applications - Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2317-071714.patch Some information in WritingYarnApplications webpage is out-dated. Need some refresh work on this document to reflect the most recent changes in YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065795#comment-14065795 ] Hudson commented on YARN-1341: -- FAILURE: Integrated in Hadoop-trunk-Commit #5906 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5906/]) YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1611512) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/BaseNMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/security/NMTokenSecretManagerInNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security/TestNMTokenSecretManagerInNM.java Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2307) Capacity scheduler user only ADMINISTER_QUEUE also can submit app
[ https://issues.apache.org/jira/browse/YARN-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065840#comment-14065840 ] tangjunjie commented on YARN-2307: -- Queue acls for user : root Queue Operations = root default china ADMINISTER_QUEUE unfunded I think that if the root user could submit jobs, hadoop queue -showacls would display the following: Queue acls for user : root Queue Operations = root default china ADMINISTER_QUEUE,SUBMIT_APPLICATIONS unfunded Here is my configuration in detail: <configuration> <property> <name>yarn.scheduler.capacity.root.queues</name> <value>unfunded,china,default</value> </property> <property> <name>yarn.scheduler.capacity.root.capacity</name> <value>100</value> </property> <property> <name>yarn.scheduler.capacity.root.acl_submit_applications</name> <value>jj</value> </property> <property> <name>yarn.scheduler.capacity.root.acl_administer_queue</name> <value>jj</value> </property> <property> <name>yarn.scheduler.capacity.root.unfunded.acl_submit_applications</name> <value>xjj</value> </property> <property> <name>yarn.scheduler.capacity.root.unfunded.acl_administer_queue</name> <value>xjj</value> </property> <property> <name>yarn.scheduler.capacity.root.china.acl_submit_applications</name> <value>china1</value> </property> <property> <name>yarn.scheduler.capacity.root.china.acl_administer_queue</name> <value>china,root</value> </property> <property> <name>yarn.scheduler.capacity.root.unfunded.capacity</name> <value>40</value> </property> <property> <name>yarn.scheduler.capacity.root.china.capacity</name> <value>50</value> </property> <property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>10</value> </property> </configuration> Capacity scheduler user only ADMINISTER_QUEUE also can submit app -- Key: YARN-2307 URL: https://issues.apache.org/jira/browse/YARN-2307 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Environment: hadoop 2.3.0 centos6.5 jdk1.7 Reporter: tangjunjie Priority: Minor Queue acls for user : root Queue Operations = root default china ADMINISTER_QUEUE unfunded User root only has ADMINISTER_QUEUE, but user root can submit an app to the china queue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2317) Update documentation about how to write YARN applications
[ https://issues.apache.org/jira/browse/YARN-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2317: Component/s: documentation Update documentation about how to write YARN applications - Key: YARN-2317 URL: https://issues.apache.org/jira/browse/YARN-2317 Project: Hadoop YARN Issue Type: Sub-task Components: documentation Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2317-071714.patch Some information in WritingYarnApplications webpage is out-dated. Need some refresh work on this document to reflect the most recent changes in YARN APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065868#comment-14065868 ] Junping Du commented on YARN-1342: -- Yes. The patch has not been kept in sync for some time. [~jlowe], would you mind updating the patch against the latest trunk? Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065910#comment-14065910 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], Thanks for updating your patch, the failed test case should be irrelevant to your changes, it is tracked by YARN-2270. Reviewing.. Thanks, Wangda Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065942#comment-14065942 ] Tsuyoshi OZAWA commented on YARN-2229: -- [~jianhe], I'd appreciate it if you could take a look. ContainerId can overflow with RM restart Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.10.patch, YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, YARN-2229.9.patch In YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch and the lower 22 bits are for the sequence number of the ids. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after the RM restarts 1024 times. To avoid the problem, it's better to make containerId a long. We need to define the new containerId format on this JIRA while preserving backward compatibility. -- This message was sent by Atlassian JIRA (v6.2#6252)
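To illustrate the overflow concern, here is a minimal sketch of packing a 10-bit epoch and a 22-bit sequence number into a 32-bit id. It is plain arithmetic for illustration, not the actual ContainerId implementation.
{code}
public class ContainerIdPackingExample {

  static final int SEQUENCE_BITS = 22;       // lower 22 bits: per-epoch sequence number
  static final int MAX_EPOCH = 1 << 10;      // upper 10 bits: only 1024 distinct epochs fit

  // Pack epoch and sequence into a single 32-bit id (illustrative only).
  static int pack(int epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & ((1 << SEQUENCE_BITS) - 1));
  }

  public static void main(String[] args) {
    int a = pack(1, 7);
    int b = pack(1 + MAX_EPOCH, 7);
    // After 1024 RM restarts the epoch field wraps around, so two different
    // epochs produce the same packed id -- the overflow this JIRA describes.
    System.out.println(a == b);  // true: epoch 1 and epoch 1025 collide
  }
}
{code}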
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.2.patch Fixed the warning reported by findbugs. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
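The sketch below shows the general kind of update loop that can degenerate into a busy loop as described in the steps above; it is illustrative only and does not reproduce the actual FairScheduler UpdateThread. Once a single update() pass exceeds the interval, the leftover sleep is never positive, so the thread stops sleeping and keeps re-acquiring the lock that the other threads need.
{code}
// Illustrative sketch, not the actual FairScheduler code.
public class BusyUpdateLoopSketch {

  static final long UPDATE_INTERVAL_MS = 500;
  private final Object schedulerLock = new Object();

  void updateLoop() throws InterruptedException {
    while (true) {
      long start = System.currentTimeMillis();
      synchronized (schedulerLock) {
        update();                          // may take > 500 ms with many queues
      }
      long elapsed = System.currentTimeMillis() - start;
      long sleepMs = UPDATE_INTERVAL_MS - elapsed;
      if (sleepMs > 0) {
        Thread.sleep(sleepMs);             // never reached once update() is too slow,
      }                                    // so the lock is immediately re-acquired
    }
  }

  void update() { /* recompute fair shares for every queue */ }
}
{code}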
[jira] [Created] (YARN-2318) hadoop configuraion checker
tangjunjie created YARN-2318: Summary: hadoop configuraion checker Key: YARN-2318 URL: https://issues.apache.org/jira/browse/YARN-2318 Project: Hadoop YARN Issue Type: New Feature Reporter: tangjunjie Hadoop has a lot of config properties, and people make mistakes when modifying configuration files, so Hadoop could provide a config check tool. This tool could find mistakes such as the following misspelled property: <property> <name>mapreduce.tasktracker.reduce.tasks.maximu</name> <!-- should be mapreduce.tasktracker.reduce.tasks.maximum --> <value>9</value> <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> The tool could also warn about the use of deprecated property names and suggest the correct ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2318) hadoop configuraion checker
[ https://issues.apache.org/jira/browse/YARN-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tangjunjie updated YARN-2318: - Description: Hadoop has a lot of config properties, and people make mistakes when modifying configuration files, so Hadoop could provide a config check tool. This tool could find mistakes such as the following misspelled property: <property> <name>mapreduce.tasktracker.reduce.tasks.maximu</name> (should be mapreduce.tasktracker.reduce.tasks.maximum) <value>9</value> <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> The tool could also warn about the use of deprecated property names and suggest the correct ones. was: Hadoop has a lot of config properties, and people make mistakes when modifying configuration files, so Hadoop could provide a config check tool. This tool could find mistakes such as the following misspelled property: <property> <name>mapreduce.tasktracker.reduce.tasks.maximu</name> <!-- should be mapreduce.tasktracker.reduce.tasks.maximum --> <value>9</value> <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> The tool could also warn about the use of deprecated property names and suggest the correct ones. hadoop configuraion checker --- Key: YARN-2318 URL: https://issues.apache.org/jira/browse/YARN-2318 Project: Hadoop YARN Issue Type: New Feature Reporter: tangjunjie Hadoop has a lot of config properties, and people make mistakes when modifying configuration files, so Hadoop could provide a config check tool. This tool could find mistakes such as the following misspelled property: <property> <name>mapreduce.tasktracker.reduce.tasks.maximu</name> (should be mapreduce.tasktracker.reduce.tasks.maximum) <value>9</value> <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> The tool could also warn about the use of deprecated property names and suggest the correct ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
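A minimal sketch of the kind of checker being proposed (purely illustrative; no such tool exists in Hadoop, and the class name, resource names, and file paths below are assumptions): load the defaults shipped with Hadoop, treat their keys as "known", then flag user-configured keys that are deprecated or unknown.
{code}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch of the proposed checker, not an existing Hadoop tool.
public class ConfigChecker {
  public static void main(String[] args) {
    // Keys defined in the bundled defaults are treated as "known" (assumes the
    // relevant *-default.xml resources are on the classpath).
    Configuration defaults = new Configuration(false);
    defaults.addResource("core-default.xml");
    defaults.addResource("yarn-default.xml");
    defaults.addResource("mapred-default.xml");
    Set<String> knownKeys = new HashSet<>();
    for (Map.Entry<String, String> e : defaults) {
      knownKeys.add(e.getKey());
    }

    // The user's configuration to validate (resource name is an example).
    Configuration userConf = new Configuration(false);
    userConf.addResource("mapred-site.xml");
    for (Map.Entry<String, String> e : userConf) {
      String key = e.getKey();
      if (Configuration.isDeprecated(key)) {
        System.out.println("Deprecated property: " + key);
      } else if (!knownKeys.contains(key)) {
        System.out.println("Unknown property (possible typo): " + key);
      }
    }
  }
}
{code}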
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065991#comment-14065991 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656429/YARN-2313.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4355//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4355//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4355//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2313: - Attachment: YARN-2313.3.patch Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2319) Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java
Wenwu Peng created YARN-2319: Summary: Fix MiniKdc not close in TestRMWebServicesDelegationTokens.java Key: YARN-2319 URL: https://issues.apache.org/jira/browse/YARN-2319 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wenwu Peng Assignee: Wenwu Peng MiniKdc only invokes the start method, never stop, in TestRMWebServicesDelegationTokens.java: {code} testMiniKDC.start(); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
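A minimal sketch of the usual pattern for fixing this kind of leak in a JUnit test (illustrative only; the field names and work directory are assumptions, not the actual TestRMWebServicesDelegationTokens code): start the KDC in setup and stop it in teardown so its port and threads are released.
{code}
import java.io.File;
import org.apache.hadoop.minikdc.MiniKdc;
import org.junit.After;
import org.junit.Before;

// Illustrative pattern only -- not the actual TestRMWebServicesDelegationTokens code.
public class MiniKdcLifecycleSketch {

  private MiniKdc testMiniKDC;
  private final File workDir = new File("target", "minikdc-work");

  @Before
  public void setUp() throws Exception {
    workDir.mkdirs();
    testMiniKDC = new MiniKdc(MiniKdc.createConf(), workDir);
    testMiniKDC.start();
  }

  @After
  public void tearDown() {
    if (testMiniKDC != null) {
      testMiniKDC.stop();   // the missing call: shut the KDC down after each test
    }
  }
}
{code}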
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066089#comment-14066089 ] Karthik Kambatla commented on YARN-2244: Latest patch looks good. A couple of nits: # The 80 chars limit doesn't apply to imports - can we get them one per line? {code} +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity +.CapacityScheduler; +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica +.FiCaSchedulerApp; {code} # Limit the following line to 80 chars. {code} +SchedulerApplicationAttempt application = getCurrentAttemptForContainer(containerId); {code} # Unused imports in CapacityScheduler, FairScheduler, FifoScheduler. # A few comments on the do-while loop in the test: {code} int waitCount = 0; dispatcher.await(); List<ContainerId> contsToClean; int cleanedConts = 0; do { contsToClean = resp.getContainersToCleanup(); cleanedConts += contsToClean.size(); if (cleanedConts >= 1) { break; } Thread.sleep(100); resp = nm.nodeHeartbeat(true); dispatcher.await(); } while (waitCount++ < 200); {code} ## Define waitCount and cleanedConts on the same line? ## while should be on the same line as the closing brace. ## Remove dispatcher.await() outside the loop and have it as the first statement in the do-block? FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch, YARN-2244.003.patch, YARN-2244.004.patch We are missing the changes from MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler-specific fixes added to handle containers for unknown application attempts. Without these, the fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
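For reference, a sketch of what the loop might look like after applying the three nits above. This is illustrative only, not the actual test code; resp, nm, and dispatcher are assumed to come from the surrounding test harness.
{code}
// Sketch of the loop after the review suggestions (not the actual patch).
int waitCount = 0, cleanedConts = 0;           // nit 1: declared on one line
List<ContainerId> contsToClean;
do {                                           // nit 3: await() moved inside the loop
  dispatcher.await();
  contsToClean = resp.getContainersToCleanup();
  cleanedConts += contsToClean.size();
  if (cleanedConts >= 1) {
    break;
  }
  Thread.sleep(100);
  resp = nm.nodeHeartbeat(true);
} while (waitCount++ < 200);                   // nit 2: while on the closing-brace line
{code}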
[jira] [Commented] (YARN-2313) Livelock can occur on FairScheduler when there are lots entry in queue
[ https://issues.apache.org/jira/browse/YARN-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066091#comment-14066091 ] Hadoop QA commented on YARN-2313: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656445/YARN-2313.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4356//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4356//console This message is automatically generated. Livelock can occur on FairScheduler when there are lots entry in queue -- Key: YARN-2313 URL: https://issues.apache.org/jira/browse/YARN-2313 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2313.1.patch, YARN-2313.2.patch, YARN-2313.3.patch, rm-stack-trace.txt Observed a livelock on FairScheduler when there are lots of entries in the queue. After investigating the code, the following case can occur: 1. {{update()}} called by UpdateThread takes longer than UPDATE_INTERVAL (500ms) when there are lots of queues. 2. UpdateThread goes into a busy loop. 3. Other threads (AllocationFileReloader, ResourceManager$SchedulerEventDispatcher) can wait forever. -- This message was sent by Atlassian JIRA (v6.2#6252)