[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126621#comment-14126621 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Thanks for updating the patch. Sorry for the delay; I've just taken a look at your patch. The newly added test looks good to me, with only a couple of minor comments:
1.
{code}
+// verify capacity taken from queueB, not queueE despite queueE being far
+// over its absolute guaranteed capacity
{code}
queueE isn't preempted because its parent queue is still under-satisfied; as you know, that is an internal mechanism of the preemption policy. I think it's better to add that to the comment, which can save some time for people looking at the test.
2.
{code}
+ApplicationAttemptId expectedAttemptOnQueueB =
+    ApplicationAttemptId.newInstance(
+        appA.getApplicationId(), appA.getAttemptId());
+assertTrue("appA should be running on queueB",
+    mCS.getAppsInQueue("queueB").contains(expectedAttemptOnQueueB));
{code}
It's better to remove such assertions; they're unrelated to the preemption policy. I guess you added it here because you want to check that the mockQueue/mockApp is correct. I suggest adding a separate test to verify the mocked nested queue/app. The same goes for the similar checks on queueC/E.
3. The failed test TestCapacitySchedulerQueueACLs should not be related to this change, but it's better to re-kick Jenkins to make sure. Thanks, Wangda
> Disable preemption at Queue level > - > > Key: YARN-2056 > URL: https://issues.apache.org/jira/browse/YARN-2056 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal >Assignee: Eric Payne > Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, > YARN-2056.201408310117.txt, YARN-2056.201409022208.txt > > > We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126591#comment-14126591 ] Hadoop QA commented on YARN-1458: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667336/YARN-1458.alternative2.patch against trunk revision 7498dd7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4854//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4854//console This message is automatically generated. 
> In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126580#comment-14126580 ] Rohith commented on YARN-2523: -- The decommissioned-node metric is set by NodesListManager. If a decommissioned node rejoins, RMNodeImpl#updateMetricsForRejoinedNode() decrements the metric by 1 again, which causes the negative value. There should be a check in RMNodeImpl#updateMetricsForRejoinedNode() for the decommissioned state:
{code}
if (!excludedHosts.contains(hostName)
    && !excludedHosts.contains(NetUtils.normalizeHostName(hostName))) {
  metrics.decrDecommisionedNMs();
}
{code}
> ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
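The double decrement Rohith describes can be illustrated with a minimal, self-contained sketch. The class and method names below are purely illustrative (not the real ResourceManager, NodesListManager, or ClusterMetrics code), and the guard in rejoinGuarded merely stands in for the exclude-host check he proposes:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the "Decommissioned Nodes" counter (hypothetical names).
final class DecommissionCounterSketch {
    int decommissionedNMs = 0;
    private final Set<String> decommissionedHosts = new HashSet<>();

    // Host added to the exclude file and -refreshNodes executed: counter goes up.
    void decommission(String host) {
        decommissionedHosts.add(host);
        decommissionedNMs++;
    }

    // Host removed from the exclude file and -refreshNodes executed again:
    // the node-list refresh already decrements the counter once here.
    void refreshNodesRemoved(String host) {
        if (decommissionedHosts.remove(host)) {
            decommissionedNMs--;
        }
    }

    // Unguarded rejoin handling (the reported bug): decrements unconditionally,
    // so after the refresh above the counter goes negative.
    void rejoinUnguarded(String host) {
        decommissionedNMs--;
    }

    // Guarded rejoin, in the spirit of the suggested fix: only decrement if the
    // host is still tracked as decommissioned.
    void rejoinGuarded(String host) {
        if (decommissionedHosts.remove(host)) {
            decommissionedNMs--;
        }
    }
}
```

Running the reported sequence (decommission, remove from excludes and refresh, restart the node) through rejoinUnguarded leaves the counter at -1, matching the negative UI value; the guarded variant leaves it at 0.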
[jira] [Updated] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2523: Affects Version/s: (was: 2.4.1) 3.0.0 > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 3.0.0 >Reporter: Nishan Shetty >Assignee: Rohith > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2523: Priority: Major (was: Minor) > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.4.1 >Reporter: Nishan Shetty >Assignee: Rohith > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-2523: Assignee: Rohith > ResourceManager UI showing negative value for "Decommissioned Nodes" field > -- > > Key: YARN-2523 > URL: https://issues.apache.org/jira/browse/YARN-2523 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, webapp >Affects Versions: 2.4.1 >Reporter: Nishan Shetty >Assignee: Rohith >Priority: Minor > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2524) ResourceManager UI shows negative value for "Decommissioned Nodes" field
[ https://issues.apache.org/jira/browse/YARN-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty resolved YARN-2524. - Resolution: Invalid Two issues were created by mistake. > ResourceManager UI shows negative value for "Decommissioned Nodes" field > > > Key: YARN-2524 > URL: https://issues.apache.org/jira/browse/YARN-2524 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Nishan Shetty > > 1. Decommission one NodeManager by configuring ip in excludehost file > 2. Remove ip from excludehost file > 3. Execute -refreshNodes command and restart Decommissioned NodeManager > Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2524) ResourceManager UI shows negative value for "Decommissioned Nodes" field
Nishan Shetty created YARN-2524: --- Summary: ResourceManager UI shows negative value for "Decommissioned Nodes" field Key: YARN-2524 URL: https://issues.apache.org/jira/browse/YARN-2524 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Nishan Shetty 1. Decommission one NodeManager by configuring ip in excludehost file 2. Remove ip from excludehost file 3. Execute -refreshNodes command and restart Decommissioned NodeManager Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2523) ResourceManager UI showing negative value for "Decommissioned Nodes" field
Nishan Shetty created YARN-2523: --- Summary: ResourceManager UI showing negative value for "Decommissioned Nodes" field Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.1 Reporter: Nishan Shetty Priority: Minor 1. Decommission one NodeManager by configuring ip in excludehost file 2. Remove ip from excludehost file 3. Execute -refreshNodes command and restart Decommissioned NodeManager Observe that in RM UI negative value for "Decommissioned Nodes" field is shown -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2463) Add total cluster capacity to AllocateResponse
[ https://issues.apache.org/jira/browse/YARN-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev resolved YARN-2463. - Resolution: Invalid > Add total cluster capacity to AllocateResponse > -- > > Key: YARN-2463 > URL: https://issues.apache.org/jira/browse/YARN-2463 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Varun Vasudev >Assignee: Varun Vasudev > > YARN-2448 exposes the ResourceCalculator being used by the scheduler so that > AMs can make better decisions when scheduling tasks. The > DominantResourceCalculator needs the total cluster capacity to function > correctly. We should add this information to the AllocateResponse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2463) Add total cluster capacity to AllocateResponse
[ https://issues.apache.org/jira/browse/YARN-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126558#comment-14126558 ] Varun Vasudev commented on YARN-2463: - Not required anymore since we don't expose the resource calculator. > Add total cluster capacity to AllocateResponse > -- > > Key: YARN-2463 > URL: https://issues.apache.org/jira/browse/YARN-2463 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Varun Vasudev >Assignee: Varun Vasudev > > YARN-2448 exposes the ResourceCalculator being used by the scheduler so that > AMs can make better decisions when scheduling tasks. The > DominantResourceCalculator needs the total cluster capacity to function > correctly. We should add this information to the AllocateResponse. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126557#comment-14126557 ] Tsuyoshi OZAWA commented on YARN-2517: -- As Zhijie mentioned, we should have a callback if we need to check errors. IMHO, if we have a thread for the "onError" callback, we should also have "onEntitiesPut", since the complexity doesn't increase much and it's useful to distinguish connection-level exceptions from entity-level errors. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
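The callback split being discussed can be sketched as follows. This is a hypothetical shape for such an async client, not the actual TimelineClientAsync API: onEntitiesPut fires when the put reached the server (its response may still carry entity-level errors), while onError fires on connection-level failures.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical callback pair: E = entity type, R = put-response type.
interface TimelineCallbackSketch<E, R> {
    // The put reached the server; the response may carry per-entity errors.
    void onEntitiesPut(List<E> entities, R response);

    // The put never completed, e.g. a connection-level exception.
    void onError(List<E> entities, Exception cause);
}

final class AsyncTimelinePutterSketch<E, R> {
    // Whatever performs the blocking put; the real client would do this itself.
    interface Putter<E, R> {
        R put(List<E> entities) throws Exception;
    }

    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final Putter<E, R> putter;

    AsyncTimelinePutterSketch(Putter<E, R> putter) {
        this.putter = putter;
    }

    // Hand the entities to a background thread so the caller is never blocked.
    void putEntities(List<E> entities, TimelineCallbackSketch<E, R> callback) {
        dispatcher.submit(() -> {
            try {
                callback.onEntitiesPut(entities, putter.put(entities));
            } catch (Exception e) {
                callback.onError(entities, e);
            }
        });
    }

    void stop() {
        dispatcher.shutdown();
    }
}
```

With both callbacks, a caller can retry only the entities the server rejected (reported through the response in onEntitiesPut) while treating onError as a signal that the whole batch needs to be resent.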
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126533#comment-14126533 ] zhihai xu commented on YARN-1458: - I uploaded a new patch, "YARN-1458.alternative2.patch", which adds a new test case in which all queues have a non-zero minShare: queueA and queueB each have weight 0.5 and minShare 1024, and the cluster has 8192 resources, so each queue should have a fair share of 4096. > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-1458: Attachment: YARN-1458.alternative2.patch > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, > yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126498#comment-14126498 ] zhihai xu commented on YARN-1458: - Hi [~kasha], I just found an example showing that the first approach doesn't work when minShare is not zero (all queues have a non-zero minShare). The example: we have 4 queues A, B, C and D; each has weight 0.25 and minShare 1024, and the cluster has 6144 (6*1024) resources. Using the first approach (comparing with the previous result), we exit the loop early with each queue's fair share at 1024. The reason is that computeShare will return the minShare value 1024 when rMax <= 2048, in the following code:
{code}
private static int computeShare(Schedulable sched, double w2rRatio,
    ResourceType type) {
  double share = sched.getWeights().getWeight(type) * w2rRatio;
  share = Math.max(share, getResourceValue(sched.getMinShare(), type));
  share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
  return (int) share;
}
{code}
So for the first 12 iterations, the currentRU does not change; it is the sum of all the queues' minShares (4096). If we use the second approach, we get the correct result: each queue's fair share is 1536. In this case the second approach is definitely better than the first; the first approach can't handle the case where all queues have a non-zero minShare. I will create a new test case in the second-approach patch. 
> In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked 
<0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourceman
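The behaviour zhihai describes can be reproduced with a small, self-contained model of the fair-share computation. computeShare below mirrors the clamping logic quoted in his comment, while computeShares is a simplified stand-in (a plain binary search over the weight-to-resource ratio, not the actual ComputeFairShares code):

```java
// Toy model of fair-share computation with minShare clamping (simplified,
// not the real ComputeFairShares implementation).
final class FairShareSketch {

    // Mirrors the quoted computeShare: weight * ratio, clamped to [minShare, maxShare].
    static int computeShare(double weight, int minShare, int maxShare, double w2rRatio) {
        double share = weight * w2rRatio;
        share = Math.max(share, minShare);
        share = Math.min(share, maxShare);
        return (int) share;
    }

    // Total resource consumed at a given ratio. For small ratios every queue is
    // clamped to its minShare, so this sum stays constant across many ratios --
    // which is why "stop when the sum stops changing" terminates too early.
    static int resourceUsedWithRatio(double[] weights, int[] minShares, int maxShare,
            double ratio) {
        int used = 0;
        for (int i = 0; i < weights.length; i++) {
            used += computeShare(weights[i], minShares[i], maxShare, ratio);
        }
        return used;
    }

    // Binary-search the weight-to-resource ratio until the summed shares meet
    // the cluster total, then read each queue's share off the final ratio.
    static int[] computeShares(double[] weights, int[] minShares, int totalResources) {
        double rMax = 1.0;
        while (resourceUsedWithRatio(weights, minShares, totalResources, rMax)
                < totalResources) {
            rMax *= 2.0;
        }
        double left = 0.0, right = rMax;
        for (int i = 0; i < 25; i++) {
            double mid = (left + right) / 2.0;
            if (resourceUsedWithRatio(weights, minShares, totalResources, mid)
                    < totalResources) {
                left = mid;
            } else {
                right = mid;
            }
        }
        int[] shares = new int[weights.length];
        for (int i = 0; i < weights.length; i++) {
            shares[i] = computeShare(weights[i], minShares[i], totalResources, right);
        }
        return shares;
    }
}
```

With four queues of weight 0.25 and minShare 1024 on a 6144 cluster, the summed usage is pinned at 4096 for every small ratio, so an early exit on an unchanged sum would wrongly settle at 1024 per queue, while the full search converges to 1536 per queue; the two-queue case from the new test converges to 4096 per queue.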
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126493#comment-14126493 ] Hadoop QA commented on YARN-2494: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667321/YARN-2494.patch against trunk revision 7498dd7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4853//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4853//console This message is automatically generated. 
> [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126481#comment-14126481 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667301/YARN-2033.9.patch against trunk revision 7498dd7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4852//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4852//console This message is automatically generated. 
> Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2494: - Attachment: YARN-2494.patch > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch, YARN-2494.patch > > > This JIRA includes APIs and storage implementations of node label manager, > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster, it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when RM restart > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
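The NodeLabelManager API enumerated in the description above (query nodes by label, labels by hostname, add/remove labels, set labels on nodes) can be illustrated with a minimal in-memory sketch. All class and method names below are hypothetical stand-ins, not the actual YARN-2494 code:

```java
import java.util.*;

// Illustrative in-memory analogue of the node label manager API described
// above; names are hypothetical, not the actual Hadoop classes.
class InMemoryNodeLabelManager {
    private final Set<String> clusterLabels = new HashSet<>();
    private final Map<String, Set<String>> labelsOnNode = new HashMap<>();

    // Add labels known to the cluster
    public void addClusterLabels(Collection<String> labels) {
        clusterLabels.addAll(labels);
    }

    // Remove labels from the cluster, and from every node carrying them
    public void removeClusterLabels(Collection<String> labels) {
        clusterLabels.removeAll(labels);
        for (Set<String> nodeLabels : labelsOnNode.values()) {
            nodeLabels.removeAll(labels);
        }
    }

    // Set labels of a node; labels must already be known to the cluster
    public void setLabelsOnNode(String host, Collection<String> labels) {
        if (!clusterLabels.containsAll(labels)) {
            throw new IllegalArgumentException("unknown label for " + host);
        }
        labelsOnNode.put(host, new HashSet<>(labels));
    }

    // Query labels according to a given hostname
    public Set<String> getLabelsOnNode(String host) {
        return labelsOnNode.getOrDefault(host, Collections.<String>emptySet());
    }

    // Query nodes according to a given label
    public Set<String> getNodesWithLabel(String label) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : labelsOnNode.entrySet()) {
            if (e.getValue().contains(label)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A persistent (e.g. FileSystem-backed) variant would additionally journal each mutation so labels and labels-on-nodes can be recovered after an RM restart, as the description outlines; the memory-based variant above simply loses them.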
[jira] [Updated] (YARN-2522) AHSClient may not be necessary
[ https://issues.apache.org/jira/browse/YARN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2522: -- Issue Type: Bug (was: Sub-task) Parent: (was: YARN-1530) > AHSClient may not be necessary > -- > > Key: YARN-2522 > URL: https://issues.apache.org/jira/browse/YARN-2522 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Per discussion in > [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], > it may not be necessary to have a separate AHSClient. The methods can be > incorporated into TimelineClient. APPLICATION_HISTORY_ENABLED also becomes useless > then. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2522) AHSClient may not be necessary
[ https://issues.apache.org/jira/browse/YARN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2522: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 > AHSClient may not be necessary > -- > > Key: YARN-2522 > URL: https://issues.apache.org/jira/browse/YARN-2522 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Per discussion in > [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], > it may not be necessary to have a separate AHSClient. The methods can be > incorporated into TimelineClient. APPLICATION_HISTORY_ENABLED also becomes useless > then. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2522) AHSClient may not be necessary
Zhijie Shen created YARN-2522: - Summary: AHSClient may not be necessary Key: YARN-2522 URL: https://issues.apache.org/jira/browse/YARN-2522 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Per discussion in [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], it may not be necessary to have a separate AHSClient. The methods can be incorporated into TimelineClient. APPLICATION_HISTORY_ENABLED also becomes useless then. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1250) Generic history service should support application-acls
[ https://issues.apache.org/jira/browse/YARN-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1250: -- Attachment: YARN-1250.2.patch Updated the patch according to the latest patch of YARN-2033. > Generic history service should support application-acls > --- > > Key: YARN-1250 > URL: https://issues.apache.org/jira/browse/YARN-1250 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: GenericHistoryACLs.pdf, YARN-1250.1.patch, > YARN-1250.2.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126397#comment-14126397 ] Jian He commented on YARN-1712: --- Thanks Subra and Carlo for working on the patch. Some comments and questions on the patch: - I think the default queue can be initialized upfront when PlanQueue is initialized in CapacityScheduler {code} // Add default queue if it doesnt exist if (scheduler.getQueue(defPlanQName) == null) { {code} - Consolidate the comments into 2 lines {code} // identify the reservations that have expired and new reservations that // have to // be activated {code} - Exceptions like the following are ignored. Is this intentional? {code} } catch (YarnException e) { LOG.warn( "Exception while trying to release default queue capacity for plan: {}", planQueueName, e); } {code} - Maybe create a common method to calculate lhsRes and rhsRes {code} CSQueue lhsQueue = scheduler.getQueue(lhs.getReservationId().toString()); if (lhsQueue != null) { lhsRes = Resources.subtract( lhs.getResourcesAtTime(now), Resources.multiply(clusterResource, lhsQueue.getAbsoluteCapacity())); } else { lhsRes = lhs.getResourcesAtTime(now); } {code} - allocatedCapacity: maybe rename to reservedResources {code} Resource allocatedCapacity = Resource.newInstance(0, 0); {code} - Instead of doing the following: {code} for (CSQueue resQueue : resQueues) { previousReservations.add(resQueue.getQueueName()); } Set expired = Sets.difference(previousReservations, curReservationNames); Set toAdd = Sets.difference(curReservationNames, previousReservations); {code} we can do something like this to save some time cost: {code} for queue in previousReservations: if (queue not in curReservationNames) expired.add(queue) else: curReservationNames.remove(queue) // curReservationNames contains the ToAdd queues in the end {code} - Not sure if this method is only used by PlanFollower. 
If it is, we can change the return value to be a set of reservation names so that we don't need to loop later to get all the reservation names. {code} Set currentReservations = plan.getReservationsAtTime(now); {code} - rename defPlanQName to defReservationQueue {code} String defPlanQName = planQueueName + PlanQueue.DEFAULT_QUEUE_SUFFIX; {code} - The apps are already in the current planQueue; IIUC, this is the defaultReservationQueue? If so, I think we may change the queueName parameter to the proper defaultReservationQueue name. Also, AbstractYarnScheduler#moveAllApps is actually expecting the queue to be a leafQueue(ReservationQueue), not a planQueue(parentQueue). {code} // Move all the apps in these queues to the PlanQueue moveAppsInQueues(toMove, planQueueName); {code} - I wonder whether we can make PlanFollower synchronously move apps to the defaultQueue, for the following reasons: {code} 1. IIUC, the logic for moveAll and killAll is: the first time synchronizePlan is called, it will try to move all expired apps; the next time synchronizePlan is called, it will kill all the previously not-yet-moved apps. If the synchronizePlan interval is very small, it's likely to kill most apps that are being moved. 2. Exceptions from CapacityScheduler#moveApplication are currently just ignored if done asynchronously. 3. PlanFollower is now anyway locking the whole scheduler in the synchronizePlan method (though I'm still considering whether we need to lock the whole scheduler, as this is kind of costly.) 4. In AbstractYarnScheduler#moveAllApps, we can do the moveApp synchronously, and still send events to RMApp to update its bookkeeping if needed. (But I don't think we need to send the event now.) 5. 
PlanFollower move logic should be much simpler if done synchronously {code} > Admission Control: plan follower > > > Key: YARN-1712 > URL: https://issues.apache.org/jira/browse/YARN-1712 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: reservations, scheduler > Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.patch > > > This JIRA tracks a thread that continuously propagates the current state of > an inventory subsystem to the scheduler. As the inventory subsystem stores the > "plan" of how the resources should be subdivided, the work we propose in this > JIRA realizes such a plan by dynamically instructing the CapacityScheduler to > add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
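The single-pass alternative to the two `Sets.difference` calls that Jian He sketches in pseudocode above can be written out as follows. This is a minimal standalone sketch: the method and map keys are illustrative, not the actual YARN-1712 patch code.

```java
import java.util.*;

// Single-pass classification of reservation queues, as suggested above:
// each previous queue is either still current or expired, and whatever is
// left unmatched in the copy of curReservationNames is the set to add.
class ReservationDiff {
    static Map<String, Set<String>> diff(Set<String> previousReservations,
                                         Set<String> curReservationNames) {
        Set<String> expired = new HashSet<>();
        Set<String> toAdd = new HashSet<>(curReservationNames); // work on a copy
        for (String queue : previousReservations) {
            if (!toAdd.remove(queue)) {
                // queue no longer appears in the current plan -> expired
                expired.add(queue);
            }
        }
        // toAdd now contains exactly the reservations with no existing queue
        Map<String, Set<String>> result = new HashMap<>();
        result.put("expired", expired);
        result.put("toAdd", toAdd);
        return result;
    }
}
```

Working on a copy of `curReservationNames` keeps the caller's set intact, which the original pseudocode (which mutates it in place) does not; either way, one pass over the previous reservations replaces two full set-difference computations.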
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.9.patch Fix one typo in the class name > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126340#comment-14126340 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667269/YARN-2033.8.patch against trunk revision d989ac0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4851//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4851//console This message is automatically generated. 
> Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, > YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126284#comment-14126284 ] Xuan Gong commented on YARN-2308: - bq. We can check if the queue exists on recovery. If not, directly return FAILED state and no need to add the attempts anymore. Thoughts? If we do this, the RMAppAttempt will show the *incorrect* state in ApplicationHistoryStore > NPE happened when RM restart after CapacityScheduler queue configuration > changed > - > > Key: YARN-2308 > URL: https://issues.apache.org/jira/browse/YARN-2308 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Wangda Tan >Assignee: chang li >Priority: Critical > Attachments: jira2308.patch, jira2308.patch, jira2308.patch > > > I encountered an NPE when RM restarted > {code} > 2014-07-16 07:22:46,957 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_ADDED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) > at > 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > And RM will fail to restart. > This was caused by a queue configuration change: I removed some queues and > added new queues. So when RM restarts, it tries to recover history > applications, and when any queue of these applications has been removed, an NPE will > be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
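The guard proposed in the comments above — checking on recovery whether the application's queue still exists, and failing the app instead of hitting an NPE — could look roughly like this. It is a standalone sketch with stand-in types, not the actual CapacityScheduler code:

```java
import java.util.*;

// Stand-in for the recovery-time guard discussed above: if the queue an
// application was submitted to was removed from the configuration while
// the RM was down, fail the recovered app instead of dereferencing a
// null queue (the reported NPE in addApplicationAttempt).
class RecoveryGuard {
    enum RecoveredState { RUNNING, FAILED }

    private final Map<String, Object> queues = new HashMap<>();

    void addQueue(String name) {
        queues.put(name, new Object()); // stand-in for a real queue object
    }

    RecoveredState recoverApplication(String appId, String queueName) {
        Object queue = queues.get(queueName);
        if (queue == null) {
            // Queue no longer exists: reject the recovered app up front
            // rather than throwing during APP_ATTEMPT_ADDED handling.
            return RecoveredState.FAILED;
        }
        // ... normal recovery path would re-add the attempt to the queue ...
        return RecoveredState.RUNNING;
    }
}
```

As Xuan Gong notes, the trade-off is that an app failed this way would record a state in the history store that does not reflect how it originally ran.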
[jira] [Commented] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126272#comment-14126272 ] Zhijie Shen commented on YARN-2320: --- According to Vinod's comments: https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073 We may think of removing the old store stack directly. > Removing old application history store after we store the history data to > timeline store > > > Key: YARN-2320 > URL: https://issues.apache.org/jira/browse/YARN-2320 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > After YARN-2033, we should deprecate the application history store set. There's > no need to maintain two sets of store interfaces. In addition, we should > conclude the outstanding JIRAs under YARN-321 about the application history > store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2320) Removing old application history store after we store the history data to timeline store
[ https://issues.apache.org/jira/browse/YARN-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2320: -- Summary: Removing old application history store after we store the history data to timeline store (was: Deprecate existing application history store after we store the history data to timeline store) > Removing old application history store after we store the history data to > timeline store > > > Key: YARN-2320 > URL: https://issues.apache.org/jira/browse/YARN-2320 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > After YARN-2033, we should deprecate the application history store set. There's > no need to maintain two sets of store interfaces. In addition, we should > conclude the outstanding JIRAs under YARN-321 about the application history > store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126259#comment-14126259 ] Hadoop QA commented on YARN-2459: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667254/YARN-2459.6.patch against trunk revision d989ac0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4850//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4850//console This message is automatically generated. > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. 
> If for any reason an app gets rejected and goes directly from NEW to FAILED, > then the final transition adds it to the RMApps and completed-apps memory > structures, but it never makes it to the state store. > Now when the RMApps default limit is reached, RM starts deleting apps from memory and > the store. In that case it tries to delete this app from the store and fails, which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
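One way to avoid the crash described in YARN-2459 is to make the state-store removal tolerant of an app that was never persisted (because it was rejected before being stored). The sketch below uses stand-in types and an in-memory "store"; the real fix would apply the same idea to the ZooKeeper-backed store, and names here are illustrative only:

```java
import java.util.*;

// Sketch: an app that went NEW -> FAILED was never written to the store,
// so a later eviction-triggered removal must treat "not found" as already
// gone instead of letting the exception take down the RM.
class TolerantStateStore {
    private final Set<String> storedApps = new HashSet<>();

    void storeApplication(String appId) {
        storedApps.add(appId);
    }

    // Underlying store semantics: throws when the node does not exist
    // (analogous to a ZooKeeper NoNode error).
    private void removeFromStore(String appId) {
        if (!storedApps.remove(appId)) {
            throw new NoSuchElementException(appId);
        }
    }

    // Tolerant wrapper: returns whether anything was actually removed,
    // and never propagates the "not found" case as a fatal error.
    boolean removeApplication(String appId) {
        try {
            removeFromStore(appId);
            return true;
        } catch (NoSuchElementException e) {
            // Nothing to remove; log-and-continue rather than crash the RM.
            return false;
        }
    }
}
```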
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126249#comment-14126249 ] zhihai xu commented on YARN-1458: - Yes, it works: comparing with the previous result can fix the zero-weight with non-zero minShare case. But the alternative approach will be a little faster compared to the first approach (less computation and fewer schedulables in the calculation after filtering out fixed-share schedulables). Either approach is OK for me. I will submit a patch on the first approach that compares with the previous result. > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster > for days to reproduce it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
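The fix being discussed in this thread — stopping the fair-share ratio search once the computed resource usage stops changing (the zero-weight case that otherwise loops forever) — can be sketched in a simplified, self-contained form. The computation below is deliberately reduced to scalars, and all names are illustrative rather than the actual ComputeFairShares code:

```java
// Sketch of the termination guard discussed in this thread: the ratio
// search bails out as soon as resource usage stops making progress
// (e.g. when every remaining schedulable has zero weight), instead of
// doubling the ratio forever while holding the scheduler lock.
class FairShareGuard {
    // Simplified analogue of resourceUsedWithWeightToResourceRatio:
    // each schedulable uses max(weight * ratio, minShare).
    static double usedWithRatio(double ratio, double[] weights, double[] minShares) {
        double total = 0;
        for (int i = 0; i < weights.length; i++) {
            total += Math.max(weights[i] * ratio, minShares[i]);
        }
        return total;
    }

    // Double the ratio until usage reaches totalResource, bailing out when
    // usage stops increasing (the zero-weight, non-zero minShare case).
    static double findRatio(double[] weights, double[] minShares, double totalResource) {
        double ratio = 1.0;
        double previous = -1.0;
        while (true) {
            double current = usedWithRatio(ratio, weights, minShares);
            if (current >= totalResource) {
                return ratio;
            }
            if (current - previous < 1e-9) {
                return ratio; // no progress: further doubling cannot help
            }
            previous = current;
            ratio *= 2;
        }
    }
}
```

With all-zero weights the usage is constant at the sum of minShares, so the progress check fires on the second iteration and the loop terminates — which is the failure mode the jstack output above shows the unguarded loop never escaping.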
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126240#comment-14126240 ] Hadoop QA commented on YARN-1458: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667252/yarn-1458-5.patch against trunk revision d989ac0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4849//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4849//console This message is automatically generated. 
> In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked 
<0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQu
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.8.patch > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, > YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126236#comment-14126236 ] Zhijie Shen commented on YARN-2033: --- [~vinodkv], thanks for your comments. I've updated the patch accordingly. bq. RMApplicationHistoryWriter is not really needed anymore. We did document it to be unstable/alpha too. We can remove it directly instead of deprecating it. It's a burden to support two interface hierarchies. I'm okay doing it separately though. That seems to make sense. Previously I created a ticket for deprecating the old history store stack. Let me update that jira. bq. YarnClientImpl: Calls using AHSClient shouldn't rely on timeline-publisher yet, we should continue to use APPLICATION_HISTORY_ENABLED for that till we get rid of AHSClient altogether. We should file a ticket for this too. In the newer patch, I reverted the change in YarnClientImpl, making it use APPLICATION_HISTORY_ENABLED. And ApplicationHistoryServer checks APPLICATION_HISTORY_STORE for backward compatibility. This can be simplified once the old history store stack is removed. Also, I simplified the configuration check in SystemMetricsPublisher. I'll create a jira for getting rid of AHSClient. bq. You removed the unstable annotations from ApplicationContext APIs. We should retain them, this stuff isn't stable yet. ApplicationContext is for internal usage only, not a user-facing interface. So I think the annotations should be removed so as not to confuse people. bq. Rename YarnMetricsPublisher -> {Platform|System} MetricsPublisher to avoid confusing it with host/daemon metrics that exist outside today? Renamed all yarnmetrics -> systemmetrics. 
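The backward-compatibility check described above (use APPLICATION_HISTORY_ENABLED, but keep honoring a legacy APPLICATION_HISTORY_STORE setting) can be sketched as follows. This is an illustration only: the class and key names are modeled after YARN's configuration keys, not taken from the actual ApplicationHistoryServer code.

```java
import java.util.Map;

// Hypothetical sketch of a backward-compatible configuration check:
// history is on if explicitly enabled, or if a legacy store class is
// still configured from before the timeline-store migration.
public class HistoryConfigSketch {
    // Names modeled after YARN's keys; treat them as illustrative.
    static final String HISTORY_ENABLED =
        "yarn.timeline-service.generic-application-history.enabled";
    static final String HISTORY_STORE =
        "yarn.timeline-service.generic-application-history.store-class";

    static boolean historyEnabled(Map<String, String> conf) {
        if (Boolean.parseBoolean(conf.getOrDefault(HISTORY_ENABLED, "false"))) {
            return true;
        }
        // Backward compatibility: an explicitly configured legacy store
        // implies the feature was enabled under the old scheme.
        String store = conf.get(HISTORY_STORE);
        return store != null && !store.isEmpty();
    }
}
```

Once the old history store stack is removed, the second branch can be dropped, which is exactly the simplification the comment anticipates.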
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126210#comment-14126210 ] Karthik Kambatla commented on YARN-1458: bq. the alternative approach can fix zero weight with non-zero minShare but the first approach can't I see. Good point. I was wondering if there were cases we might want to check for {{if (currentRU - previousRU < epsilon || currentRU > totalResource)}}. The zero weight and non-zero minshare should be handled by such a check, no? > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
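The progress check being discussed in this thread — bail out of the fair-share search loop when {{currentRU - previousRU < epsilon || currentRU > totalResource}} — can be sketched as below. This is a hedged illustration of the idea, not the actual ComputeFairShares patch; the class, interface, and method names are hypothetical.

```java
// Sketch of terminating the weight-to-resource-ratio search when it stops
// making progress, instead of looping forever while holding the scheduler
// lock (the hang shown in the jstack above).
public class FairShareLoopSketch {
    interface UsageFn {
        // Stand-in for resourceUsedWithWeightToResourceRatio(ratio).
        double usedWithRatio(double ratio);
    }

    static double findRatio(UsageFn f, double totalResource,
                            double epsilon, int maxIter) {
        double ratio = 1.0;
        double previousRU = -1.0;
        for (int i = 0; i < maxIter; i++) {
            double currentRU = f.usedWithRatio(ratio);
            // Terminate once usage stops changing between iterations
            // (e.g. tiny size-based weights rounding to 0) or overshoots
            // the cluster total.
            if (Math.abs(currentRU - previousRU) < epsilon
                || currentRU > totalResource) {
                break;
            }
            previousRU = currentRU;
            ratio *= 2.0; // the real code grows the ratio, then binary-searches
        }
        return ratio;
    }
}
```

With usage growing linearly in the ratio, the loop exits as soon as the projected usage exceeds the cluster size, rather than iterating indefinitely.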
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126204#comment-14126204 ] zhihai xu commented on YARN-1458: - Hi [~kasha], thanks for the review. The first approach has the advantage of simplicity and readability, but it can't cover all the corner cases: the alternative approach can fix zero weight with non-zero minShare, while the first approach can't. Both approaches can fix zero weight with zero minShare. Keeping track of the resource-usage from the previous iteration to see if we are making progress may have limitations; for example, with a very small weight, resourceUsedWithWeightToResourceRatio may return 0 even after multiple iterations. thanks zhihai > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126202#comment-14126202 ] Jian He commented on YARN-415: -- Looks good to me. Just one more question: I've kind of lost context on why we need this check; it seems we don't need it, because the returned ApplicationResourceUsageReport for a non-active attempt is null anyway. {code} // Only add in the running containers if this is the active attempt. RMAppAttempt currentAttempt = rmContext.getRMApps() .get(attemptId.getApplicationId()).getCurrentAppAttempt(); if (currentAttempt != null && currentAttempt.getAppAttemptId().equals(attemptId)) { {code} > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. 
The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
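The chargeback formula in the description — the sum over containers of (reserved memory × lifetime) — can be sketched as follows. The Container class here is a minimal stand-in for illustration, not YARN's actual Container API.

```java
import java.util.List;

// Sketch of MB-seconds aggregation for chargeback: charge for reserved
// memory over each container's lifetime, regardless of actual usage,
// since no one else could use the reservation.
public class MbSecondsSketch {
    static class Container {
        final long reservedMb;       // memory reserved for the container
        final long startMs, finishMs; // container lifetime endpoints
        Container(long reservedMb, long startMs, long finishMs) {
            this.reservedMb = reservedMb;
            this.startMs = startMs;
            this.finishMs = finishMs;
        }
    }

    /** Total MB-seconds across an application's containers. */
    static long mbSeconds(List<Container> containers) {
        long total = 0;
        for (Container c : containers) {
            total += c.reservedMb * ((c.finishMs - c.startMs) / 1000);
        }
        return total;
    }
}
```

Because the metric is per-container rather than per-task, it works equally for crashed jobs and for non-MapReduce frameworks, which is the motivation stated above.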
[jira] [Commented] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126193#comment-14126193 ] Hadoop QA commented on YARN-1709: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667251/YARN-1709.patch against trunk revision d989ac0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4848//console This message is automatically generated. > Admission Control: Reservation subsystem > > > Key: YARN-1709 > URL: https://issues.apache.org/jira/browse/YARN-1709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, > YARN-1709.patch > > > This JIRA is about the key data structure used to track resources over time > to enable YARN-1051. The Reservation subsystem is conceptually a "plan" of > how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126172#comment-14126172 ] Karthik Kambatla commented on YARN-1458: By the way, I like the first approach mainly because of its simplicity and readability. In the while loop that was running forever, we could optionally keep track of the resource-usage from the previous iteration and see if we are making progress. > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
[jira] [Commented] (YARN-2448) RM should expose the resource types considered during scheduling when AMs register
[ https://issues.apache.org/jira/browse/YARN-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126162#comment-14126162 ] Vinod Kumar Vavilapalli commented on YARN-2448: --- +1, this looks good. Checking this in.. > RM should expose the resource types considered during scheduling when AMs > register > -- > > Key: YARN-2448 > URL: https://issues.apache.org/jira/browse/YARN-2448 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2448.0.patch, apache-yarn-2448.1.patch, > apache-yarn-2448.2.patch > > > The RM should expose the name of the ResourceCalculator being used when AMs > register, as part of the RegisterApplicationMasterResponse. > This will allow applications to make better decisions when scheduling. > MapReduce for example, only looks at memory when deciding its scheduling, > even though the RM could potentially be using the DominantResourceCalculator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2459: -- Attachment: YARN-2459.6.patch > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch > > > If RM HA is enabled and the Zookeeper store is used for the RM State Store, > and an app gets rejected for any reason and goes directly from NEW to FAILED, > the final transition adds it to the RMApps and Completed Apps memory > structures, but it never makes it to the state store. > Now when the RMApps default limit is reached, the RM starts deleting apps from > memory and the store. In that case it tries to delete this app from the store > and fails, which causes the RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126158#comment-14126158 ] Karthik Kambatla commented on YARN-1458: Thanks Zhihai for working on this. I like the first approach: uploading a patch with minor nit fixes. Let me know if this looks good to you. > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1458: --- Attachment: yarn-1458-5.patch > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Fix For: 2.2.1 > > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.alternative0.patch, > YARN-1458.alternative1.patch, YARN-1458.patch, yarn-1458-5.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. 
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126156#comment-14126156 ] Jian He commented on YARN-2308: --- Looked at this again, I think the solution mentioned by [~sunilg] is reasonable: bq. During RMAppRecoveredTransition in RMAppImpl, may be we can check recovered app queue (can get this from submission context) is still a valid queue? If this queue not present, recovery for that app can be made failed, and may be need to do some more RMApp clean up. Sounds doable? We can check if the queue exists on recovery. If not, directly return FAILED state and no need to add the attempts anymore. Thoughts ? > NPE happened when RM restart after CapacityScheduler queue configuration > changed > - > > Key: YARN-2308 > URL: https://issues.apache.org/jira/browse/YARN-2308 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.6.0 >Reporter: Wangda Tan >Assignee: chang li >Priority: Critical > Attachments: jira2308.patch, jira2308.patch, jira2308.patch > > > I encountered a NPE when RM restart > {code} > 2014-07-16 07:22:46,957 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type APP_ATTEMPT_ADDED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > {code} > And RM will be failed to restart. > This is caused by queue configuration changed, I removed some queues and > added new queues. So when RM restarts, it tries to recover history > applications, and when any of queues of these applications removed, NPE will > be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
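The recovery-time check proposed above — validate the recovered app's queue before replaying its attempts — can be sketched as below. The names are illustrative stand-ins, not the actual RMAppImpl transition code.

```java
import java.util.Set;

// Sketch of failing recovery for apps whose queue was removed while the
// RM was down, instead of letting APP_ATTEMPT_ADDED hit a missing queue
// and raise the NPE shown in the stack trace above.
public class RecoveryQueueCheckSketch {
    enum RMAppState { ACCEPTED, FAILED }

    static RMAppState recoverApp(String queueFromSubmissionContext,
                                 Set<String> existingQueues) {
        if (!existingQueues.contains(queueFromSubmissionContext)) {
            // Queue no longer exists after the configuration change:
            // transition the app directly to FAILED and skip re-adding
            // its attempts to the scheduler.
            return RMAppState.FAILED;
        }
        return RMAppState.ACCEPTED;
    }
}
```

This mirrors the suggestion in the comment: check queue existence during app recovery and return FAILED without adding attempts, rather than crashing the RM.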
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1709: --- Attachment: YARN-1709.patch Thanks [~chris.douglas] for your exhaustive review. I am uploading a patch that has the following fixes: * Cloned _ZERO_RESOURCE_, _minimumAllocation_ and _maximumAllocation_ to prevent leaking of mutable data * Removed MessageFormat. I had to concatenate strings in a few cases where they are both logged and included as part of the exception message * Fixed the code readability and lock scope in _addReservation()_ * Added assertions for _isWriteLockedByCurrentThread()_ in private methods that assume locks * Removed redundant _this_ in get methods * toString uses StringBuilder instead of StringBuffer now * Fixed Javadoc - content (_getEarliestStartTime()_) and whitespace * Made _ReservationInterval_ immutable, good catch The ReservationSystem uses UTCClock (added as part of YARN-1708) to enforce UTC times. > Admission Control: Reservation subsystem > > > Key: YARN-1709 > URL: https://issues.apache.org/jira/browse/YARN-1709 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Subramaniam Krishnan > Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, > YARN-1709.patch > > > This JIRA is about the key data structure used to track resources over time > to enable YARN-1051. The Reservation subsystem is conceptually a "plan" of > how the scheduler will allocate resources over-time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
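The first fix in the list above — cloning mutable Resource fields to prevent leaking mutable data — is the standard defensive-copy pattern: never keep a caller's reference to a mutable object, and never hand out your internal one. A minimal sketch, with Res as a simplified stand-in for YARN's Resource:

```java
// Sketch of defensive copying for mutable allocation limits, in the
// spirit of the ZERO_RESOURCE / minimumAllocation / maximumAllocation fix.
public class DefensiveCopySketch {
    static class Res {
        long memory;
        int vcores;
        Res(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
        Res copy() { return new Res(memory, vcores); }
    }

    private final Res minimumAllocation;

    DefensiveCopySketch(Res min) {
        // Copy on the way in: later mutation of the caller's object
        // cannot corrupt our state.
        this.minimumAllocation = min.copy();
    }

    Res getMinimumAllocation() {
        // Copy on the way out: callers cannot mutate our internal state.
        return minimumAllocation.copy();
    }
}
```

Without the copies, a caller mutating the Resource it passed in (or the one it got back) would silently change the plan's allocation limits.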
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126140#comment-14126140 ] Vinod Kumar Vavilapalli commented on YARN-2440: --- Just caught up with the discussion. I can get behind an absolute limit too, specifically in the context of heterogeneous clusters, where uniform % configurations can go really bad and the only resort is then per-node configuration - not ideal. Would that be a valid use-case for putting in the absolute limit? [~jlowe]? Even if it were, I am okay punting that off to a separate JIRA. Comments on the patch: - containers-limit-cpu-percentage -> {{yarn.nodemanager.resource.percentage-cpu-limit}} to be consistent? Similarly NM_CONTAINERS_CPU_PERC? I don't like the tag 'resource' - it should have been 'resources' - but it is what it is. - You still have refs to YarnConfiguration.NM_CONTAINERS_CPU_ABSOLUTE in the patch. Similarly, the javadoc in NodeManagerHardwareUtils needs to be updated if we are not adding the absolute cpu config; it should no longer refer to "number of cores that should be used for YARN containers". - TestCgroupsLCEResourcesHandler: You can use mockito if you only want to override num-processors in TestResourceCalculatorPlugin. Similarly in TestNodeManagerHardwareUtils. - The tests may fail on a machine with > 4 cores? :) > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
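The heterogeneous-cluster concern is easy to see with a little arithmetic: a single percentage maps to different absolute core counts on differently sized nodes. A hedged sketch in the spirit of NodeManagerHardwareUtils (method name and rounding behavior are illustrative assumptions, not the actual YARN-2440 code):

```java
// Sketch of a percentage-based CPU limit: how many of a node's processors
// YARN containers may use, given a configured percentage such as the
// proposed yarn.nodemanager.resource.percentage-cpu-limit.
public class CpuLimitSketch {
    /**
     * Integer-truncating percentage calculation; always leaves at least
     * one core usable so containers can make progress.
     */
    static int containerCores(int nodeProcessors, int percentageLimit) {
        int cores = (nodeProcessors * percentageLimit) / 100;
        return Math.max(1, cores);
    }

    public static void main(String[] args) {
        // 8-core node, 80% limit: 8 * 80 / 100 = 6 (truncated from 6.4).
        System.out.println(containerCores(8, 80));
        // Same 80% on a 32-core node: 25 cores.
        System.out.println(containerCores(32, 80));
    }
}
```

The same uniform 80% setting yields 6 cores on an 8-core node but 25 on a 32-core node, which is exactly why a percentage can go bad on mixed hardware and an absolute limit may be worth its own JIRA.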
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126135#comment-14126135 ] Hadoop QA commented on YARN-2459: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667234/YARN-2459.5.patch against trunk revision d989ac0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4847//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4847//console This message is automatically generated. 
> RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126119#comment-14126119 ] Varun Saxena commented on YARN-2256: bq. Someone correct me if I'm wrong, but I'm fairly certain that the intent of this information is to be the equivalent of the HDFS audit log. In other words, setting these to debug completely defeats the purpose. Instead, I suspect the real culprit is that the log4j settings are wrong for the node manager process. [~aw], the issue raised was basically for both NM and RM; I have updated the description to reflect that. The issue here is that some of the container-related audit logs in both NM and RM are too frequent and too numerous, which may impact performance as well. Now, there are 2 possible solutions: either remove these logs or change their log level, so that they do not appear in a live environment and can be enabled only when required. As I wasn't sure whether these audit logs should be removed, I changed the log level for some of them in RM and all of them in NM. To support this, I added printing of audit logs at different levels, as is done in HBase (as per my info). This is handled as part of YARN-2287. Now for NM, you are correct: the log level can be changed in the log4j properties to suppress these logs if required. But for RM, as not all logs have to be suppressed, this can't be done. So, to be consistent, I added log levels for both NM and RM. If it is agreeable to remove these audit logs, that can be a possible solution as well. Please suggest. 
> Too many nodemanager and resourcemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
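The level-aware audit logging Varun describes (handled in YARN-2287) can be sketched as follows; the enum, method names, and key=value format below are illustrative only, not the actual NMAuditLogger/RMAuditLogger API:

```java
// Sketch of level-aware audit logging: high-frequency per-container
// success events go to DEBUG and are suppressed unless DEBUG is enabled,
// while coarser events stay at INFO.
public class AuditLogSketch {
    enum Level { DEBUG, INFO }

    static Level configuredLevel = Level.INFO;           // as in a live cluster
    static final StringBuilder out = new StringBuilder(); // stand-in for the log appender

    /** Record a successful operation at the given level. */
    static void logSuccess(Level level, String user, String operation, String target) {
        if (level == Level.DEBUG && configuredLevel != Level.DEBUG) {
            return; // per-container chatter stays quiet by default
        }
        out.append("USER=").append(user)
           .append(" OPERATION=").append(operation)
           .append(" TARGET=").append(target).append('\n');
    }

    public static void main(String[] args) {
        logSuccess(Level.DEBUG, "alice", "Start Container", "container_01"); // suppressed
        logSuccess(Level.INFO, "alice", "Submit Application", "app_01");     // kept
        System.out.print(out);
    }
}
```

This keeps the RM behavior Varun wants (only some audit logs demoted) without relying on a blanket log4j change, which can only suppress everything from a logger at once.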
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126073#comment-14126073 ] Vinod Kumar Vavilapalli commented on YARN-2033: --- Mostly looks fine, this is a rapidly changing part of the code-base! I get a feeling we need some umbrella cleanup effort to make consistent usage w.r.t history-service/timeline-service. Anyways, some comments - RMApplicationHistoryWriter is not really needed anymore. We did document it to be unstable/alpha too. We can remove it directly instead of deprecating it. It's a burden to support two interface hierarchies. I'm okay doing it separately though. - YarnClientImpl: Calls using AHSClient shouldn't rely on timeline-publisher yet, we should continue to use APPLICATION_HISTORY_ENABLED for that till we get rid of AHSClient altogether. We should file a ticket for this too. - You removed the unstable annotations from ApplicationContext APIs. We should retain them, this stuff isn't stable yet. - Rename YarnMetricsPublisher -> {Platform|System}MetricsPublisher to avoid confusing it with host/daemon metrics that exist outside today? > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, > YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, > YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. 
> One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2459: -- Attachment: YARN-2459.5.patch > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126037#comment-14126037 ] Jian He commented on YARN-2459: --- New patch added some comments in the test case > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126036#comment-14126036 ] Hadoop QA commented on YARN-2459: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667215/YARN-2459.4.patch against trunk revision df8c84c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4846//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4846//console This message is automatically generated. > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. 
> If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2256) Too many nodemanager and resourcemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2256: --- Summary: Too many nodemanager and resourcemanager audit logs are generated (was: Too many nodemanager audit logs are generated) > Too many nodemanager and resourcemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126005#comment-14126005 ] Sandy Ryza commented on YARN-2154: -- I'd like to add another constraint that I've been thinking about into the mix. We don't necessarily need to implement it in this JIRA, but I think it's worth considering how it would affect the approach. A queue should only be able to preempt a container from another queue if every queue between the starved queue and their least common ancestor is starved. This essentially means that we consider preemption and fairness hierarchically. If the "marketing" and "engineering" queues are square in terms of resources, starved teams in engineering shouldn't be able to take resources from queues in marketing - they should only be able to preempt from queues within engineering. > FairScheduler: Improve preemption to preempt only those containers that would > satisfy the incoming request > -- > > Key: YARN-2154 > URL: https://issues.apache.org/jira/browse/YARN-2154 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > > Today, FairScheduler uses a spray-gun approach to preemption. Instead, it > should only preempt resources that would satisfy the incoming request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
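Sandy's hierarchical constraint can be sketched as a walk from the starved queue toward the root: preemption from another queue is allowed only if every queue strictly between the starved queue and their least common ancestor is itself starved. A toy model assuming a simple parent map (names and structure are hypothetical, not FairScheduler code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy check for hierarchical preemption: walk up from the starved queue;
// hitting the least common ancestor before any unstarved ancestor means
// preemption is allowed.
public class HierarchicalPreemptionSketch {
    static final Map<String, String> parent = new HashMap<>();
    static final Map<String, Boolean> starved = new HashMap<>();

    static List<String> pathToRoot(String q) {
        List<String> path = new ArrayList<>();
        for (String cur = q; cur != null; cur = parent.get(cur)) path.add(cur);
        return path;
    }

    /** True if 'candidate' may be preempted on behalf of 'starvedQueue'. */
    static boolean mayPreempt(String starvedQueue, String candidate) {
        List<String> otherPath = pathToRoot(candidate);
        for (String q : pathToRoot(starvedQueue)) {
            if (otherPath.contains(q)) {
                return true; // reached the least common ancestor, every hop starved
            }
            if (!starved.getOrDefault(q, false)) {
                return false; // an unstarved queue on the path blocks preemption
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // root -> { engineering -> { adhoc, prod }, marketing }
        parent.put("engineering", "root");
        parent.put("marketing", "root");
        parent.put("adhoc", "engineering");
        parent.put("prod", "engineering");
        starved.put("adhoc", true); // engineering as a whole is square
        System.out.println(mayPreempt("adhoc", "prod"));      // within engineering: allowed
        System.out.println(mayPreempt("adhoc", "marketing")); // crosses unstarved engineering: blocked
    }
}
```

In the example, adhoc may preempt from its sibling prod (the LCA is engineering, and only adhoc itself must be starved), but not from marketing, because the LCA is root and engineering, which sits between adhoc and root, is not starved.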
[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125994#comment-14125994 ] Vinod Kumar Vavilapalli commented on YARN-2080: --- Some comments on the patch: - Configuration -- admission.enable -> Rename to reservations.enable? -- RM_SCHEDULER_ENABLE_RESERVATIONS -> RM_RESERVATIONS_ENABLE, DEFAULT_RM_SCHEDULER_ENABLE_RESERVATIONS -> DEFAULT_RM_RESERVATIONS_ENABLE -- reservation.planfollower.time-step -> reservation-system.plan-follower.time-step -- RM_PLANFOLLOWER_TIME_STEP, DEFAULT_RM_PLANFOLLOWER_TIME_STEP -> RM_RESERVATION_SYSTEM_PLAN_FOLLOWER_TIME_STEP, DEFAULT_RM_RESERVATION_SYSTEM_PLAN_FOLLOWER_TIME_STEP - A meta question about configuration: It seems like if I pick up a scheduler and enable reservations, the system-class and the plan-follower should be picked up automatically instead of being standalone configs. Can we do that? Otherwise, the following: -- reservation.class -> reservation-system.class? -- RM_RESERVATION, DEFAULT_RM_RESERVATION -> RM_RESERVATION_SYSTEM_CLASS, DEFAULT_RM_RESERVATION_SYSTEM_CLASS -- reservation.plan.follower -> reservation-system.plan-follower -- RM_RESERVATION_PLAN_FOLLOWER, DEFAULT_RM_RESERVATION_PLAN_FOLLOWER -> RM_RESERVATION_SYSTEM_PLAN_FOLLOWER, DEFAULT_RM_RESERVATION_SYSTEM_PLAN_FOLLOWER - YarnClient.submitReservation(): We don't return a queue-name anymore after the latest YARN-1708? There are javadoc refs to the queue-name being returned. - ClientRMService -- If reservations are not enabled, we get a host of "Reservation is not enabled. Please enable & try again" every time, which is not desirable. See checkReservationSystem(). This log and a bunch of similar logs in ReservationInputValidator may either be (1) deleted or (2) actually belong to the audit-log (RMAuditLogger) - we don't need to double-log -- checkReservationACLs: Today anyone who can submit applications can also submit reservations. 
We may want to separate them; if you agree, I'll file a ticket for future separation of these ACLs. - AbstractReservationSystem -- getPlanFollower() -> createPlanFollower() -- create and init plan-follower should be in serviceInit()? -- getNewReservationId(): Use ReservationId.newInstance() - ReservationInputValidator: Deleting a request shouldn't need validateReservationUpdateRequest->validateReservationDefinition. We only need the ID validation. - CapacitySchedulerConfiguration: I don't yet understand the semantics of the configs - average-capacity, reservable.queue, reservation-window, reservation-enforcement-window, instantaneous-max-capacity - as they are not used in this patch. Can we drop them (and their setters/getters) here and move them to the JIRA that actually uses them? Tests - TestYarnClient: You can use the newInstance methods and avoid using pb implementations and the setters directly (e.g. {{new ReservationDeleteRequestPBImpl()}}) - TestClientRMService: -- ReservationRequest.setLeaseDuration() was renamed to be simply setDuration() in YARN-1708. Seems like there are other such occurrences in the patch. -- Similarly to TestYarnClient, use record.newInstance methods instead of directly invoking PBImpls. I can't understand CapacityReservationSystem yet, as I have to dig into the details of YARN-1709. > Admission Control: Integrate Reservation subsystem with ResourceManager > --- > > Key: YARN-2080 > URL: https://issues.apache.org/jira/browse/YARN-2080 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Subramaniam Krishnan >Assignee: Subramaniam Krishnan > Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch > > > This JIRA tracks the integration of Reservation subsystem data structures > introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring > of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125990#comment-14125990 ] Zhijie Shen commented on YARN-1530: --- [~sjlee0], thanks for your feedback. Here are some additional thoughts and clarifications on your comments. bq. This option would make sense only if the imports are less frequent. To be more specific, I mean that sending the same amount of entities (not too big; if too big, the HTTP REST request has to be chunked into several consecutive HTTP requests of reasonable size) via HTTP REST or HDFS should perform similarly. HTTP REST may be better because of less secondary storage I/O (Ethernet should be faster than disk). HTTP REST doesn't prevent the user from batching the entities and putting them at once, and the current API supports it. It's up to the user to put the entity immediately for realtime/near-realtime inquiry, or to batch entities if they can tolerate some delay. However, I agree HDFS or some other single-node storage technique is an interesting way to avoid losing the entities when they have not been published to the timeline server yet, in particular when we are batching them. bq. Regarding option (2), I think your point is valid that it would be a transition from a thin client to a fat client. bq. However, I'm not too sure if it would make changing the data store much more complicated than other scenarios. I'm also not very sure about the necessary changes. As I mentioned before, the timeline server doesn't simply put the entities into the data store. One immediate problem I can come up with is authorization. I'm not sure it's going to be logically correct to check the user's access in the client at the user's side. If we move authorization to the data store, HBase supports access control, but Leveldb seems not to. And I'm not sure HBase access control is enough for the timeline server's specific logic. I still need to think more about it. 
As the client grows fatter, it becomes difficult to maintain different versions of clients. For example, if we make some incompatible optimization to the storage schema, only the new client can write into it, while the old client will no longer work. Moreover, as most writing logic runs in user land, which is not predictable, it is more likely to raise unexpected failures than a well-set-up server. In general, I prefer to keep the client simple, so that future client distribution and maintenance take less effort. bq. But then again, if we consider a scenario such as a cluster of ATS instances, the same problem exists there. Right, the same problem will exist at the server side, but the web front end has isolated it from the users. Compared to the clients in the applications, the ATS instances are a relatively small, controllable set that we can pause and upgrade properly. What do you think? > [Umbrella] Store, manage and serve per-framework application-timeline data > -- > > Key: YARN-1530 > URL: https://issues.apache.org/jira/browse/YARN-1530 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli > Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, > ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, > application timeline design-20140116.pdf, application timeline > design-20140130.pdf, application timeline design-20140210.pdf > > > This is a sibling JIRA for YARN-321. > Today, each application/framework has to do store, and serve per-framework > data all by itself as YARN doesn't have a common solution. This JIRA attempts > to solve the storage, management and serving of per-framework data from > various applications, both running and finished. The aim is to change YARN to > collect and store data in a generic manner with plugin points for frameworks > to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125959#comment-14125959 ] Jian He commented on YARN-2459: --- bq. Add one in TestRMRestart to get an app rejected and make sure that the final-status gets recorded Added. bq. Another one in RMStateStoreTestBase to ensure it is okay to have an updateApp call without a storeApp call like in this case. Turns out RMStateStoreTestBase already has this test. {code} // test updating the state of an app/attempt whose initial state was not // saved. {code} > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
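The scenario that RMStateStoreTestBase comment covers (and that crashed the RM with the ZooKeeper store) is an update for an app whose initial state was never saved. A toy in-memory store illustrating the tolerant behavior; all names are hypothetical and the real ZK-backed store's semantics differ:

```java
import java.util.HashMap;
import java.util.Map;

// Toy state store: an update for an app that was never stored (e.g. one
// that went NEW -> FAILED and skipped the store step) falls back to a
// create instead of failing, so cleanup later cannot crash the RM.
public class StateStoreSketch {
    private final Map<String, String> appState = new HashMap<>();

    void storeApp(String appId, String state) {
        appState.put(appId, state);
    }

    void updateApp(String appId, String state) {
        if (!appState.containsKey(appId)) {
            // Before the fix, a ZK update of a non-existent node threw and
            // brought the RM down; treating it as a create is the tolerant path.
            storeApp(appId, state);
            return;
        }
        appState.put(appId, state);
    }

    String getState(String appId) {
        return appState.get(appId);
    }

    public static void main(String[] args) {
        StateStoreSketch store = new StateStoreSketch();
        store.updateApp("app_1", "FAILED"); // no prior storeApp call
        System.out.println(store.getState("app_1"));
    }
}
```

This is exactly the shape of the existing test case quoted above: update the state of an app whose initial state was not saved, and assert the store survives.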
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2459: -- Attachment: YARN-2459.4.patch > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2154) FairScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request
[ https://issues.apache.org/jira/browse/YARN-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125950#comment-14125950 ] Karthik Kambatla commented on YARN-2154: Just discussed this with [~ashwinshankar77] offline. He rightly pointed out that the sort order should take usage into account. I'll post what the order should be as soon as I get to consult my notes. > FairScheduler: Improve preemption to preempt only those containers that would > satisfy the incoming request > -- > > Key: YARN-2154 > URL: https://issues.apache.org/jira/browse/YARN-2154 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 2.4.0 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla >Priority: Critical > > Today, FairScheduler uses a spray-gun approach to preemption. Instead, it > should only preempt resources that would satisfy the incoming request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2518) Support in-process container executor
[ https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125907#comment-14125907 ] BoYang commented on YARN-2518: -- Yeah, there might be some issues with this, which need to be figured out. Thanks, Allen, for bringing it up. I came to YARN only recently and cannot clearly identify all the potential issues yet. My point is that an in-process container executor seems to be a generic need for different people; I have seen several discussions about this in my searches. Some use a dummy process (for example, Impala?) as a proxy to relay the task to the long-running process for further processing. So if the YARN community can recognize the need for this common scenario, bring it up for further discussion, and explore the possibilities of supporting it natively, that would be really appreciated. It would probably benefit a lot of other people and projects as well, and make YARN an even more generic framework that can be adopted more broadly. > Support in-process container executor > - > > Key: YARN-2518 > URL: https://issues.apache.org/jira/browse/YARN-2518 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.0 > Environment: Linux, Windows >Reporter: BoYang >Priority: Minor > Labels: container, dispatch, in-process, job, node > > Node Manager always creates a new process for a new application. We have hit a > scenario where we want the node manager to execute the application inside its > own process, so we get fast response time. It would be nice if Node Manager > or YARN can provide native support for that. > In general, the scenario is that we have a long running process which can > accept requests and process the requests inside its own process. Since YARN > is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. > requests in JSON) to the long running process. 
In that case, we do not want > YARN container to spin up a new process for each request. Instead, we want > YARN container to send the request to the long running process for further > processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125908#comment-14125908 ] Sangjin Lee commented on YARN-1530: --- {quote} The bottleneck is still there. Essentially I don’t see any difference between publishing entities via HTTP REST interface and via HDFS in terms of scalability. {quote} IMO, option (1) necessarily entails less frequent imports into the store by ATS. Obviously, if ATS still imports the HDFS files at the same speed as the timeline entries are generated, there would be no difference in scalability. This option would make sense only if the imports are less frequent. It also would mean that as a trade-off reads would be more stale. I believe Robert's document points out all those points. Regarding option (2), I think your point is valid that it would be a transition from a thin client to a fat client. And along with that would be some complications as you point out. However, I'm not too sure if it would make changing the data store much more complicated than other scenarios. I think the main problem of switching the data store is when not all writers are updated to point to the new data store. If writes are in progress, and the clients are being upgraded, there would be some inconsistencies between clients that were already upgraded and started writing to the new store and those that are not upgraded yet and still writing to the old store. If you have a single writer (such as the current ATS design), then it would be simpler. But then again, if we consider a scenario such as a cluster of ATS instances, the same problem exists there. I think that specific problem could be solved by holding the writes in some sort of a backup area (e.g. hdfs) before the switch starts, and recovering/re-enabling once all the writers are upgraded. The idea of a cluster of ATS instances (multiple write/read instances) sounds interesting. 
It might be able to address the scalability/reliability problem at hand. We'd need to think through and poke holes to see if the idea holds up well, however. It would need to address how load balancing would be done and whether it would be left up to the user, for example. > [Umbrella] Store, manage and serve per-framework application-timeline data > -- > > Key: YARN-1530 > URL: https://issues.apache.org/jira/browse/YARN-1530 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli > Attachments: ATS-Write-Pipeline-Design-Proposal.pdf, > ATS-meet-up-8-28-2014-notes.pdf, application timeline design-20140108.pdf, > application timeline design-20140116.pdf, application timeline > design-20140130.pdf, application timeline design-20140210.pdf > > > This is a sibling JIRA for YARN-321. > Today, each application/framework has to do store, and serve per-framework > data all by itself as YARN doesn't have a common solution. This JIRA attempts > to solve the storage, management and serving of per-framework data from > various applications, both running and finished. The aim is to change YARN to > collect and store data in a generic manner with plugin points for frameworks > to do their own thing w.r.t interpretation and serving. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch
[ https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125884#comment-14125884 ] Hudson commented on YARN-2515: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/]) YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java * hadoop-yarn-project/CHANGES.txt > Update ConverterUtils#toContainerId to parse epoch > -- > > Key: YARN-2515 > URL: https://issues.apache.org/jira/browse/YARN-2515 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2515.1.patch, YARN-2515.2.patch > > > ContaienrId#toString was updated on YARN-2182. We should also update > ConverterUtils#toContainerId to parse epoch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
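The epoch-aware ContainerId string format introduced by YARN-2182 inserts an optional "e&lt;epoch&gt;" token after the "container_" prefix, and YARN-2515 teaches ConverterUtils#toContainerId to accept both forms. As a rough illustration only (a stand-alone sketch with hypothetical helper names, not the actual Hadoop ConverterUtils code), parsing the old and new forms can look like:

```java
// Illustrative stand-alone parser for ContainerId strings, which since
// YARN-2182 may carry an optional epoch token:
//   container_<clusterTs>_<appId>_<attemptId>_<containerId>              (epoch 0)
//   container_e<epoch>_<clusterTs>_<appId>_<attemptId>_<containerId>
// This is a sketch, not the actual ConverterUtils#toContainerId implementation.
public class ContainerIdParse {
    /** Returns {epoch, clusterTimestamp, appId, attemptId, containerId}. */
    static long[] parse(String s) {
        String[] parts = s.split("_");
        if (parts.length < 5 || !"container".equals(parts[0])) {
            throw new IllegalArgumentException("Invalid ContainerId: " + s);
        }
        int i = 1;
        long epoch = 0;
        if (parts[i].startsWith("e")) {        // optional epoch token, e.g. "e17"
            epoch = Long.parseLong(parts[i].substring(1));
            i++;
        }
        long clusterTs = Long.parseLong(parts[i++]);
        long appId = Long.parseLong(parts[i++]);
        long attemptId = Long.parseLong(parts[i++]);
        long containerId = Long.parseLong(parts[i]);
        return new long[] {epoch, clusterTs, appId, attemptId, containerId};
    }

    public static void main(String[] args) {
        long[] old = parse("container_1410901177871_0001_01_000005");
        long[] neu = parse("container_e17_1410901177871_0001_01_000005");
        System.out.println(old[0] + " " + neu[0] + " " + neu[4]); // prints 0 17 5
    }
}
```

IDs without the epoch token parse with epoch 0, which keeps the pre-YARN-2182 format working unchanged.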
[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter
[ https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125886#comment-14125886 ] Hudson commented on YARN-2512: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/]) YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. Contributed by Jonathan Eagles. (zjshen: rev a092cdf32de4d752456286a9f4dda533d8a62bca) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java > Allow for origin pattern matching in cross origin filter > > > Key: YARN-2512 > URL: https://issues.apache.org/jira/browse/YARN-2512 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2512-v1.patch > > > Extending the feature set of allowed origins. Now a "*" in a pattern > indicates this allowed origin is a pattern and will be matched including > multiple sub-domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
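As the YARN-2512 description puts it, a "*" in an allowed-origin entry turns it into a pattern that also matches multiple nested sub-domains. A minimal sketch of that matching rule follows; it is illustrative only (the actual CrossOriginFilter compiles such entries into regular expressions and differs in details):

```java
// Sketch of the wildcard-origin rule: an allowed-origin entry containing "*"
// is treated as a pattern that also matches nested sub-domains.
// Illustrative only -- not the actual CrossOriginFilter implementation.
public class OriginMatch {
    static boolean matches(String allowed, String origin) {
        int star = allowed.indexOf('*');
        if (star < 0) {
            return allowed.equals(origin);   // plain entry: exact match only
        }
        String prefix = allowed.substring(0, star);
        String suffix = allowed.substring(star + 1);
        // "http://*.example.com" accepts "http://a.example.com" and
        // "http://a.b.example.com", but not "http://example.com.evil.org".
        return origin.startsWith(prefix) && origin.endsWith(suffix)
            && origin.length() > prefix.length() + suffix.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("http://*.example.com", "http://a.b.example.com")); // true
        System.out.println(matches("http://*.example.com", "http://evil.org"));        // false
    }
}
```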
[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS
[ https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125885#comment-14125885 ] Hudson commented on YARN-2507: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1890 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1890/]) YARN-2507. Documented CrossOriginFilter configurations for the timeline server. Contributed by Jonathan Eagles. (zjshen: rev 56dc496a1031621d2b701801de4ec29179d75f2e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm * hadoop-yarn-project/CHANGES.txt > Document Cross Origin Filter Configuration for ATS > -- > > Key: YARN-2507 > URL: https://issues.apache.org/jira/browse/YARN-2507 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation, timelineserver >Affects Versions: 2.6.0 >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2507-v1.patch > > > CORS support was added for ATS as part of YARN-2277. This jira is to document > configuration for ATS CORS support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2097) Documentation: health check return status
[ https://issues.apache.org/jira/browse/YARN-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125852#comment-14125852 ] Hadoop QA commented on YARN-2097: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646615/YARN-2097.1.patch against trunk revision 302d9a0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4845//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4845//console This message is automatically generated. > Documentation: health check return status > - > > Key: YARN-2097 > URL: https://issues.apache.org/jira/browse/YARN-2097 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Allen Wittenauer >Assignee: Rekha Joshi > Labels: newbie > Attachments: YARN-2097.1.patch > > > We need to document that the output of the health check script is ignored on > non-0 exit status. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125846#comment-14125846 ] Gera Shegalov commented on YARN-2377: - [~kasha], do you agree with the points above? > Localization exception stack traces are not passed as diagnostic info > - > > Key: YARN-2377 > URL: https://issues.apache.org/jira/browse/YARN-2377 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: YARN-2377.v01.patch > > > In the Localizer log one can only see this kind of message > {code} > 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { > hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, > 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos > tException: ha-nn-uri-0 > {code} > And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is > propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2518) Support in-process container executor
[ https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125838#comment-14125838 ] Allen Wittenauer commented on YARN-2518: Sorry, I wasn't clear: if this feature goes in, it must fail the nodemanager process when security is enabled, because running tasks as the yarn user is extremely insecure. > Support in-process container executor > - > > Key: YARN-2518 > URL: https://issues.apache.org/jira/browse/YARN-2518 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.0 > Environment: Linux, Windows >Reporter: BoYang >Priority: Minor > Labels: container, dispatch, in-process, job, node > > Node Manager always creates a new process for a new application. We have hit a > scenario where we want the node manager to execute the application inside its > own process, so we get fast response times. It would be nice if the Node Manager > or YARN could provide native support for that. > In general, the scenario is that we have a long-running process which can > accept requests and process the requests inside its own process. Since YARN > is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. > requests in JSON) to the long-running process. In that case, we do not want > the YARN container to spin up a new process for each request. Instead, we want > the YARN container to send the request to the long-running process for further > processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2518) Support in-process container executor
[ https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125834#comment-14125834 ] BoYang commented on YARN-2518: -- In my rough testing, it did not fail the node manager process. In my Container Executor implementation (launchContainer method), I register a new application master, send a message to another long running process, and unregister the application master. I can see the application finished successfully. Of course, that was my very draft initial testing. We could fine-tune the code to make it work better. But technically it seems doable now. Thus I am curious whether the YARN community could take this feature and provide official support. > Support in-process container executor > - > > Key: YARN-2518 > URL: https://issues.apache.org/jira/browse/YARN-2518 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.0 > Environment: Linux, Windows >Reporter: BoYang >Priority: Minor > Labels: container, dispatch, in-process, job, node > > Node Manage always creates a new process for a new application. We have hit a > scenario where we want the node manager to execute the application inside its > own process, so we get fast response time. It would be nice if Node Manager > or YARN can provide native support for that. > In general, the scenario is that we have a long running process which can > accept requests and process the requests inside its own process. Since YARN > is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. > requests in JSON) to the long running process. In that case, we do not want > YARN container to spin up a new process for each request. Instead, we want > YARN container to send the request to the long running process for further > processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2097) Documentation: health check return status
[ https://issues.apache.org/jira/browse/YARN-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2097: --- Assignee: Rekha Joshi > Documentation: health check return status > - > > Key: YARN-2097 > URL: https://issues.apache.org/jira/browse/YARN-2097 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Allen Wittenauer >Assignee: Rekha Joshi > Labels: newbie > Attachments: YARN-2097.1.patch > > > We need to document that the output of the health check script is ignored on > non-0 exit status. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2422) yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2422: --- Assignee: Gopal V > yarn.scheduler.maximum-allocation-mb should not be hard-coded in > yarn-default.xml > - > > Key: YARN-2422 > URL: https://issues.apache.org/jira/browse/YARN-2422 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Gopal V >Priority: Minor > Attachments: YARN-2422.1.patch > > > Cluster with 40Gb NM refuses to run containers >8Gb. > It was finally tracked down to yarn-default.xml hard-coding it to 8Gb. > In case of lack of a better override, it should default to - > ${yarn.nodemanager.resource.memory-mb} instead of a hard-coded 8Gb. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
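The fix suggested in the YARN-2422 description, expressed as a yarn-default.xml fragment (a sketch of the proposal, not the committed change), would substitute the NodeManager's resource size for the hard-coded 8192:

```xml
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <!-- Proposed: follow the NodeManager's memory instead of a fixed 8192 -->
  <value>${yarn.nodemanager.resource.memory-mb}</value>
</property>
```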
[jira] [Updated] (YARN-2348) ResourceManager web UI should display server-side time instead of UTC time
[ https://issues.apache.org/jira/browse/YARN-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2348: --- Assignee: Leitao Guo > ResourceManager web UI should display server-side time instead of UTC time > -- > > Key: YARN-2348 > URL: https://issues.apache.org/jira/browse/YARN-2348 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo >Assignee: Leitao Guo > Attachments: 3.before-patch.JPG, 4.after-patch.JPG, YARN-2348.2.patch > > > ResourceManager web UI, including application list and scheduler, displays > UTC time in default, this will confuse users who do not use UTC time. This > web UI should display server-side time in default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2256: --- Assignee: Varun Saxena > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2256) Too many nodemanager audit logs are generated
[ https://issues.apache.org/jira/browse/YARN-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125808#comment-14125808 ] Allen Wittenauer commented on YARN-2256: Someone correct me if I'm wrong, but I'm fairly certain that the intent of this information is to be the equivalent of the HDFS audit log. In other words, setting these to debug completely defeats the purpose. Instead, I suspect the real culprit is that the log4j settings are wrong for the node manager process. > Too many nodemanager audit logs are generated > - > > Key: YARN-2256 > URL: https://issues.apache.org/jira/browse/YARN-2256 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 2.4.0 >Reporter: Varun Saxena > Attachments: YARN-2256.patch > > > Following audit logs are generated too many times(due to the possibility of a > large number of containers) : > 1. In NM - Audit logs corresponding to Starting, Stopping and finishing of a > container > 2. In RM - Audit logs corresponding to AM allocating a container and AM > releasing a container > We can have different log levels even for NM and RM audit logs and move these > successful container related logs to DEBUG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2461) Fix PROCFS_USE_SMAPS_BASED_RSS_ENABLED property in YarnConfiguration
[ https://issues.apache.org/jira/browse/YARN-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125809#comment-14125809 ] Ray Chiang commented on YARN-2461: -- Same observation as before. No need for a new unit test for fixed property value. > Fix PROCFS_USE_SMAPS_BASED_RSS_ENABLED property in YarnConfiguration > > > Key: YARN-2461 > URL: https://issues.apache.org/jira/browse/YARN-2461 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.5.0 >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Minor > Labels: newbie > Attachments: YARN-2461-01.patch > > > The property PROCFS_USE_SMAPS_BASED_RSS_ENABLED has an extra period. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2518) Support in-process container executor
[ https://issues.apache.org/jira/browse/YARN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125797#comment-14125797 ] Allen Wittenauer commented on YARN-2518: This is pretty much incompatible with security. So it should probably fail the nodemanager process under that condition. > Support in-process container executor > - > > Key: YARN-2518 > URL: https://issues.apache.org/jira/browse/YARN-2518 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Affects Versions: 2.5.0 > Environment: Linux, Windows >Reporter: BoYang >Priority: Minor > Labels: container, dispatch, in-process, job, node > > Node Manage always creates a new process for a new application. We have hit a > scenario where we want the node manager to execute the application inside its > own process, so we get fast response time. It would be nice if Node Manager > or YARN can provide native support for that. > In general, the scenario is that we have a long running process which can > accept requests and process the requests inside its own process. Since YARN > is good at scheduling jobs, we want to use YARN to dispatch jobs (e.g. > requests in JSON) to the long running process. In that case, we do not want > YARN container to spin up a new process for each request. Instead, we want > YARN container to send the request to the long running process for further > processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125777#comment-14125777 ] Zhijie Shen commented on YARN-2517: --- [~vinodkv], thanks for your feedback. The benefit of an async client (or an async HTTP REST call) is that it unblocks the current thread when that thread is doing important management logic. For example, in YARN-2033 we have a bunch of logic to dispatch the entity-putting action onto a separate thread, so that application life-cycle management can move on. With an async client, that could be far simpler. I think from the user's point of view it may be a useful feature as well. I'm fine with either two classes, one sync and one async, or one class supporting both modes, though the former option is more consistent with the previous client design. I think the callback is necessary, at least "onError". TimelinePutResponse gives the user a summary of why an uploaded entity was not accepted by the timeline server. Based on the response, the user can determine whether the app should ignore the problem and move on, or stop immediately. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread so as not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125760#comment-14125760 ] Tsuyoshi OZAWA commented on YARN-2517: -- Thanks for your comment, Vinod. {quote} an asynchronous write, the end of which they don't care about. I think we should simply have a mode in the existing client to post events asynchronously without any further need for call-back handlers. {quote} Makes sense. We can ensure at-most-once semantics without any callbacks. How about adding a {{flush()}} API to TimelineClient for the asynchronous mode? It would let users know whether the contents of the current buffer have been written to the Timeline Server or not. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread so as not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
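The buffered asynchronous mode with a blocking {{flush()}} discussed in this thread can be sketched as follows. All class and method names here are hypothetical (this is not the TimelineClient API), and a plain String stands in for a timeline entity:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a buffered asynchronous mode with a blocking flush(). In the
// real client, the background thread would issue HTTP PUTs to the timeline
// server; here it simply records the entity strings it drains.
public class AsyncPoster {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    private final AtomicInteger submitted = new AtomicInteger();
    private final List<String> posted = Collections.synchronizedList(new ArrayList<>());
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public AsyncPoster() {
        worker.submit(() -> {
            try {
                while (true) {
                    posted.add(buffer.take());   // stand-in for the HTTP PUT
                }
            } catch (InterruptedException e) {
                // stop() interrupts us; fall through and exit
            }
        });
    }

    /** Queue an entity and return immediately (the async put). */
    public void putEntityAsync(String entity) {
        submitted.incrementAndGet();
        buffer.add(entity);
    }

    /** Block until everything queued so far has been written. */
    public void flush() throws InterruptedException {
        while (posted.size() < submitted.get()) {
            Thread.sleep(5);
        }
    }

    public int postedCount() { return posted.size(); }

    public void stop() { worker.shutdownNow(); }

    public static void main(String[] args) throws InterruptedException {
        AsyncPoster client = new AsyncPoster();
        client.putEntityAsync("entity-1");
        client.putEntityAsync("entity-2");
        client.flush();                     // returns once both are "posted"
        System.out.println(client.postedCount()); // prints 2
        client.stop();
    }
}
```

After flush() returns, a caller knows the buffer queued so far has drained, which is the visibility guarantee the comment above asks for, without any per-entity callbacks.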
[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch
[ https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125744#comment-14125744 ] Tsuyoshi OZAWA commented on YARN-2515: -- Thanks for your review, Jian! > Update ConverterUtils#toContainerId to parse epoch > -- > > Key: YARN-2515 > URL: https://issues.apache.org/jira/browse/YARN-2515 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2515.1.patch, YARN-2515.2.patch > > > ContainerId#toString was updated in YARN-2182. We should also update > ConverterUtils#toContainerId to parse epoch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125714#comment-14125714 ] Vinod Kumar Vavilapalli commented on YARN-2517: --- I am not entirely sure we need a parallel client for this. The other clients needed async clients because - they had loads of functionality that made sense in the blocking and non-blocking modes - the client code really needed call-back hooks to act on the results. Timeline Client's only responsibility is to post events. There are only two use-cases: Clients need a sync write through, or an asynchronous write, the end of which they don't care about. I think we should simply have a mode in the existing client to post events asynchronously without any further need for call-back handlers. What do others think? > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread no to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125706#comment-14125706 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667181/YARN-913-002.patch against trunk revision 0974f43. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 25 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4844//console This message is automatically generated. > Add a way to register long-lived services in a YARN cluster > --- > > Key: YARN-913 > URL: https://issues.apache.org/jira/browse/YARN-913 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.5.0, 2.4.1 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, > 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, > YARN-913-001.patch, YARN-913-002.patch, yarnregistry.pdf, yarnregistry.tla > > > In a YARN cluster you can't predict where services will come up -or on what > ports. The services need to work those things out as they come up and then > publish them somewhere. > Applications need to be able to find the service instance they are to bond to > -and not any others in the cluster. > Some kind of service registry -in the RM, in ZK, could do this. If the RM > held the write access to the ZK nodes, it would be more secure than having > apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-002.patch Patch -002 # adds persistence policy # {{RegistryOperationsService}} implements callbacks for various RM events, and implements the setup/purge behaviour underneath. # adds a new class in the resource manager, {{RegistryService}}. This bridges from YARN to the registry by subscribing to application and container events, translating and forwarding to the {{RegistryOperationsService}} where they may trigger setup/purge operations # Hooks this up to the RM # Extends the DistributedShell by enabling it to register service records with the different persistence options. # Adds a test to verify the distributed shell does register the entries, and that the purgeable ones are purged after the application completes. This means the {{TestDistributedShell}} test is now capable of verifying that YARN applications can register themselves, that they can then be discovered, and that the RM cleans up after they terminate. > Add a way to register long-lived services in a YARN cluster > --- > > Key: YARN-913 > URL: https://issues.apache.org/jira/browse/YARN-913 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.5.0, 2.4.1 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, > 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, > YARN-913-001.patch, YARN-913-002.patch, yarnregistry.pdf, yarnregistry.tla > > > In a YARN cluster you can't predict where services will come up -or on what > ports. The services need to work those things out as they come up and then > publish them somewhere. > Applications need to be able to find the service instance they are to bond to > -and not any others in the cluster. > Some kind of service registry -in the RM, in ZK, could do this. 
If the RM > held the write access to the ZK nodes, it would be more secure than having > apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: yarnregistry.tla yarnregistry.pdf 2014-09-08_YARN_Service_Registry.pdf h3. Updated YARN service registry description This adds a {{persistence}} field to service records, enabling the records to be automatically deleted (along with all child entries) when the application, app attempt or container is terminated. h3. TLA+ service registry specification. This is my initial attempt to define the expected behaviour of a service registry built atop zookeeper. Corrections welcome. > Add a way to register long-lived services in a YARN cluster > --- > > Key: YARN-913 > URL: https://issues.apache.org/jira/browse/YARN-913 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.5.0, 2.4.1 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, > 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, > YARN-913-001.patch, yarnregistry.pdf, yarnregistry.tla > > > In a YARN cluster you can't predict where services will come up -or on what > ports. The services need to work those things out as they come up and then > publish them somewhere. > Applications need to be able to find the service instance they are to bond to > -and not any others in the cluster. > Some kind of service registry -in the RM, in ZK, could do this. If the RM > held the write access to the ZK nodes, it would be more secure than having > apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125673#comment-14125673 ] Hadoop QA commented on YARN-2494: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667170/YARN-2494.patch against trunk revision 0974f43. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.label.TestFileSystemNodeLabelManager org.apache.hadoop.yarn.label.TestNodeLabelManager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. 
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4843//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4843//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4843//console This message is automatically generated. > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch > > > This JIRA includes the APIs and storage implementations of the node label manager. > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster; it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when the RM restarts > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125664#comment-14125664 ] Hadoop QA commented on YARN-2517: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667168/YARN-2517.1.patch against trunk revision 0974f43. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4842//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4842//console This message is automatically generated. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callbacks to handle the responses. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2492) (Clone of YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125659#comment-14125659 ] Wangda Tan commented on YARN-2492: -- Uploaded the 1st version of the patch for the NodeLabelManager API and implementation to YARN-2494. It doesn't rely on YARN-2493, so it can be applied to current trunk directly. > (Clone of YARN-796) Allow for (admin) labels on nodes and resource-requests > > > Key: YARN-2492 > URL: https://issues.apache.org/jira/browse/YARN-2492 > Project: Hadoop YARN > Issue Type: Task > Components: api, client, resourcemanager >Reporter: Wangda Tan > > Since YARN-796 is a sub-JIRA of YARN-397, this JIRA is used to create and > track sub-tasks and attach split patches for YARN-796. > *Let's still keep over-all discussions on YARN-796.* -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2494: - Attachment: YARN-2494.patch Attached a patch with the NodeLabelManager API and storage implementation, plus some PB-related changes (more than half of the patch). Please kindly review. Thanks! > [YARN-796] Node label manager API and storage implementations > - > > Key: YARN-2494 > URL: https://issues.apache.org/jira/browse/YARN-2494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2494.patch > > > This JIRA includes the APIs and storage implementations of the node label manager. > NodeLabelManager is an abstract class used to manage labels of nodes in the > cluster; it has APIs to query/modify > - Nodes according to given label > - Labels according to given hostname > - Add/remove labels > - Set labels of nodes in the cluster > - Persist/recover changes of labels/labels-on-nodes to/from storage > And it has two implementations to store modifications > - Memory based storage: It will not persist changes, so all labels will be > lost when the RM restarts > - FileSystem based storage: It will persist/recover to/from FileSystem (like > HDFS), and all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
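The API surface listed in the YARN-2494 description can be sketched as a rough Java outline. This is an illustrative sketch only — the method names, signatures, and storage hooks here are assumptions, not the actual patch:

```java
import java.util.*;

// Illustrative outline of the abstract NodeLabelManager described above.
abstract class NodeLabelManager {
    protected final Map<String, Set<String>> labelsOnNode = new HashMap<>(); // host -> labels
    protected final Set<String> clusterLabels = new HashSet<>();

    // Query nodes according to a given label.
    public Set<String> getNodesWithLabel(String label) {
        Set<String> nodes = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : labelsOnNode.entrySet()) {
            if (e.getValue().contains(label)) {
                nodes.add(e.getKey());
            }
        }
        return nodes;
    }

    // Query labels according to a given hostname.
    public Set<String> getLabelsOnNode(String host) {
        return labelsOnNode.getOrDefault(host, Collections.<String>emptySet());
    }

    // Add labels to the cluster, then persist the change.
    public void addLabels(Collection<String> labels) {
        clusterLabels.addAll(labels);
        persistAddLabels(labels);
    }

    // Set labels of a node in the cluster, then persist the change.
    public void setLabelsOnNode(String host, Set<String> labels) {
        labelsOnNode.put(host, new HashSet<>(labels));
        persistNodeToLabels(host, labels);
    }

    // Storage hooks: a memory-based impl leaves these empty, while a
    // FileSystem-based impl would write changes to (e.g.) HDFS and
    // replay them on RM restart.
    protected abstract void persistAddLabels(Collection<String> labels);
    protected abstract void persistNodeToLabels(String host, Set<String> labels);
}

// Memory-based storage: no persistence, so all labels are lost on RM restart.
class MemoryNodeLabelManager extends NodeLabelManager {
    @Override protected void persistAddLabels(Collection<String> labels) { /* no-op */ }
    @Override protected void persistNodeToLabels(String host, Set<String> labels) { /* no-op */ }
}
```

The split between in-memory state and abstract persistence hooks is what lets the two storage implementations share all query/modify logic.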
[jira] [Updated] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2517: - Attachment: YARN-2517.1.patch Attached a first patch for review. The differences between TimelineClientAsync and TimelineClient are as follows: * TimelineClientAsyncImpl has 2 blocking queues and 2 threads: {{requestQueue}} is for queuing requests from {{TimelineClientAsync#putEntities}}. {{responseQueue}} is for queuing responses and errors from {{TimelineClientImpl#putEntities}}. {{dispatcherThread}} dequeues requests from {{requestQueue}} and dispatches them to the TimelineServer. {{handlerThread}} dequeues results of {{TimelineClient#putEntities}} and calls back user-defined methods defined in CallbackHandler. * CallbackHandler has two APIs for users: onEntitiesPut is an API for receiving results of putEntities, and onError is an API for handling errors. If Configuration#TIMELINE_SERVICE_ENABLED is false, results of putEntities are returned via Callback#onEntitiesPut. * {{void TimelineClientAsync#putEntities}} can throw InterruptedException because it uses {{BlockingQueue#put}} in {{TimelineClientAsyncImpl}}, though I think it basically won't block because the queue length is configured as Integer.MAX_VALUE. We can add a configuration for controlling memory consumption of the queues. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callbacks to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
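The two-queue/two-thread design described in the comment above can be sketched roughly as follows. The class and method names mirror the comment, but the code is an illustrative simulation (the "server call" is faked, entities are plain strings), not the actual patch:

```java
import java.util.concurrent.*;

// Illustrative sketch of TimelineClientAsyncImpl's two-queue design.
class TimelineClientAsyncSketch {
    interface CallbackHandler {
        void onEntitiesPut(String response);  // receives results of putEntities
        void onError(Exception e);            // handles errors
    }

    private final BlockingQueue<String> requestQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Object> responseQueue = new LinkedBlockingQueue<>();
    private volatile boolean running = true;

    TimelineClientAsyncSketch(final CallbackHandler handler) {
        // dispatcherThread: dequeues requests and dispatches them to the
        // (here simulated) timeline server, queuing the result or error.
        Thread dispatcher = new Thread(() -> {
            try {
                while (running) {
                    String entity = requestQueue.take();
                    try {
                        responseQueue.put("PUT:" + entity); // simulated server call
                    } catch (RuntimeException e) {
                        responseQueue.put(e);
                    }
                }
            } catch (InterruptedException ignored) { }
        });
        // handlerThread: dequeues results and calls back the user handler.
        Thread handlerThread = new Thread(() -> {
            try {
                while (running) {
                    Object result = responseQueue.take();
                    if (result instanceof Exception) {
                        handler.onError((Exception) result);
                    } else {
                        handler.onEntitiesPut((String) result);
                    }
                }
            } catch (InterruptedException ignored) { }
        });
        dispatcher.setDaemon(true);
        handlerThread.setDaemon(true);
        dispatcher.start();
        handlerThread.start();
    }

    // Can block (and throw InterruptedException) because it uses
    // BlockingQueue#put, though with an unbounded queue it rarely does.
    void putEntities(String entity) throws InterruptedException {
        requestQueue.put(entity);
    }
}
```

Separating the request and response queues is what keeps slow user callbacks from stalling dispatch to the server, at the cost of unbounded memory unless a queue-length configuration is added.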
[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter
[ https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125511#comment-14125511 ] Hudson commented on YARN-2512: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/]) YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. Contributed by Jonathan Eagles. (zjshen: rev a092cdf32de4d752456286a9f4dda533d8a62bca) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java > Allow for origin pattern matching in cross origin filter > > > Key: YARN-2512 > URL: https://issues.apache.org/jira/browse/YARN-2512 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2512-v1.patch > > > This extends the feature set of allowed origins. A "*" in an allowed origin > now marks it as a pattern that can match across > multiple sub-domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
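A minimal sketch of what such wildcard origin matching can look like, assuming the "*" is expanded into a regex wildcard; the real CrossOriginFilter logic may differ in detail, and the helper class here is hypothetical:

```java
import java.util.regex.Pattern;

// Illustrative origin matcher: a "*" in the allowed origin marks it as a
// pattern; otherwise an exact string comparison is used.
class OriginMatcher {
    static boolean matches(String allowedOrigin, String origin) {
        if (!allowedOrigin.contains("*")) {
            return allowedOrigin.equals(origin);   // non-pattern: exact match
        }
        // Escape literal dots, then expand "*" so it can match one or more
        // sub-domain components.
        String regex = allowedOrigin.replace(".", "\\.").replace("*", ".*");
        return Pattern.matches(regex, origin);
    }
}
```

Under this sketch, an allowed origin like `http://*.example.com` matches `http://foo.example.com` as well as `http://a.b.example.com`, but not `http://example.org`.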
[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS
[ https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125510#comment-14125510 ] Hudson commented on YARN-2507: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/]) YARN-2507. Documented CrossOriginFilter configurations for the timeline server. Contributed by Jonathan Eagles. (zjshen: rev 56dc496a1031621d2b701801de4ec29179d75f2e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm * hadoop-yarn-project/CHANGES.txt > Document Cross Origin Filter Configuration for ATS > -- > > Key: YARN-2507 > URL: https://issues.apache.org/jira/browse/YARN-2507 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation, timelineserver >Affects Versions: 2.6.0 >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2507-v1.patch > > > CORS support was added for ATS as part of YARN-2277. This jira is to document > configuration for ATS CORS support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch
[ https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125509#comment-14125509 ] Hudson commented on YARN-2515: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1865 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1865/]) YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java > Update ConverterUtils#toContainerId to parse epoch > -- > > Key: YARN-2515 > URL: https://issues.apache.org/jira/browse/YARN-2515 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2515.1.patch, YARN-2515.2.patch > > > ContainerId#toString was updated in YARN-2182. We should also update > ConverterUtils#toContainerId to parse epoch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
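For illustration, the epoch parsing can be sketched as below, assuming the epoch-bearing string form `container_e<epoch>_<clusterTs>_<appId>_<attemptId>_<containerId>` alongside the older epoch-less form. The parser class is a hypothetical helper for explanation; the real ConverterUtils#toContainerId builds YARN record objects instead of returning a long:

```java
// Illustrative epoch extraction from a container-id string.
class ContainerIdEpochParser {
    static long parseEpoch(String containerIdStr) {
        String[] parts = containerIdStr.split("_");
        if (parts.length < 2 || !"container".equals(parts[0])) {
            throw new IllegalArgumentException("Invalid ContainerId: " + containerIdStr);
        }
        // An "e<epoch>" token right after the prefix carries the epoch;
        // if absent (pre-YARN-2182 form), the epoch defaults to 0.
        if (parts[1].startsWith("e")) {
            return Long.parseLong(parts[1].substring(1));
        }
        return 0L;
    }
}
```

The epoch distinguishes container ids allocated across RM restarts, which is why the string round-trip has to carry it.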
[jira] [Commented] (YARN-2507) Document Cross Origin Filter Configuration for ATS
[ https://issues.apache.org/jira/browse/YARN-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125419#comment-14125419 ] Hudson commented on YARN-2507: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/674/]) YARN-2507. Documented CrossOriginFilter configurations for the timeline server. Contributed by Jonathan Eagles. (zjshen: rev 56dc496a1031621d2b701801de4ec29179d75f2e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/TimelineServer.apt.vm * hadoop-yarn-project/CHANGES.txt > Document Cross Origin Filter Configuration for ATS > -- > > Key: YARN-2507 > URL: https://issues.apache.org/jira/browse/YARN-2507 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation, timelineserver >Affects Versions: 2.6.0 >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2507-v1.patch > > > CORS support was added for ATS as part of YARN-2277. This jira is to document > configuration for ATS CORS support. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2515) Update ConverterUtils#toContainerId to parse epoch
[ https://issues.apache.org/jira/browse/YARN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125418#comment-14125418 ] Hudson commented on YARN-2515: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/674/]) YARN-2515. Updated ConverterUtils#toContainerId to parse epoch. Contributed by Tsuyoshi OZAWA (jianhe: rev 0974f434c47ffbf4b77a8478937fd99106c8ddbd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerId.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestConverterUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestContainerId.java > Update ConverterUtils#toContainerId to parse epoch > -- > > Key: YARN-2515 > URL: https://issues.apache.org/jira/browse/YARN-2515 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Fix For: 2.6.0 > > Attachments: YARN-2515.1.patch, YARN-2515.2.patch > > > ContainerId#toString was updated in YARN-2182. We should also update > ConverterUtils#toContainerId to parse epoch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2512) Allow for origin pattern matching in cross origin filter
[ https://issues.apache.org/jira/browse/YARN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125420#comment-14125420 ] Hudson commented on YARN-2512: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #674 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/674/]) YARN-2512. Allowed pattern matching for origins in CrossOriginFilter. Contributed by Jonathan Eagles. (zjshen: rev a092cdf32de4d752456286a9f4dda533d8a62bca) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestCrossOriginFilter.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/webapp/CrossOriginFilter.java * hadoop-yarn-project/CHANGES.txt > Allow for origin pattern matching in cross origin filter > > > Key: YARN-2512 > URL: https://issues.apache.org/jira/browse/YARN-2512 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Jonathan Eagles > Fix For: 2.6.0 > > Attachments: YARN-2512-v1.patch > > > This extends the feature set of allowed origins. A "*" in an allowed origin > now marks it as a pattern that can match across > multiple sub-domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)