[jira] [Assigned] (YARN-5531) UnmanagedAM pool manager for federating application across clusters
[ https://issues.apache.org/jira/browse/YARN-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peng Zhang reassigned YARN-5531:
Assignee: Sarvesh Sakalanaga (was: Peng Zhang)

> UnmanagedAM pool manager for federating application across clusters
> ---
>
> Key: YARN-5531
> URL: https://issues.apache.org/jira/browse/YARN-5531
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Reporter: Subru Krishnan
> Assignee: Sarvesh Sakalanaga
>
> One of the main tenets of YARN Federation is to *transparently* scale
> applications across multiple clusters. This is achieved by running UAMs on
> behalf of the application on other clusters. This JIRA tracks the addition of
> an UnmanagedAM pool manager for federating applications across clusters, which
> will be used by the FederationInterceptor (YARN-3666), itself part of the
> AMRMProxy pipeline introduced in YARN-2884.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-5531) UnmanagedAM pool manager for federating application across clusters
[ https://issues.apache.org/jira/browse/YARN-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peng Zhang reassigned YARN-5531:
Assignee: Peng Zhang (was: Sarvesh Sakalanaga)
[jira] [Commented] (YARN-3054) Preempt policy in FairScheduler may cause mapreduce job never finish
[ https://issues.apache.org/jira/browse/YARN-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192683#comment-15192683 ]

Peng Zhang commented on YARN-3054:
--

Thanks [~kasha]. I agree with having "a preemption priority or even a preemption cost per container". In my temporary fix, I preempt the latest-scheduled container instead of selecting by low or high priority. I think this keeps a stable subset of containers holding a fixed amount of resources (at least the steady fair share) running, so MapReduce job progress can proceed.

> Preempt policy in FairScheduler may cause mapreduce job never finish
> ---
>
> Key: YARN-3054
> URL: https://issues.apache.org/jira/browse/YARN-3054
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: fairscheduler
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
>
> The preemption policy is currently tied to the scheduling policy. Using the
> scheduling policy's comparator to find preemption candidates cannot guarantee
> that some subset of containers is never preempted, so tasks may be preempted
> periodically before they finish and the job makes no progress.
> I think preemption in YARN should provide the following guarantees:
> 1. MapReduce jobs can get additional resources when others are idle;
> 2. MapReduce jobs for one user in one queue can still progress with their min
> share when others preempt resources back.
> Maybe always preempting the latest app and container can achieve this?
[jira] [Commented] (YARN-3054) Preempt policy in FairScheduler may cause mapreduce job never finish
[ https://issues.apache.org/jira/browse/YARN-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186492#comment-15186492 ]

Peng Zhang commented on YARN-3054:
--

Preemption happens on the low-priority container. In MapReduce, reduce tasks get a higher priority than map tasks so that they are scheduled first, but they have a data dependency on the map tasks. So preempting the map tasks, which have the lower priority, can prevent the job from ever making progress. The detailed scenario:
1. Assume 10 resources in the cluster (map and reduce tasks request the same amount of memory and CPU, 1 resource per task) and two queues (q1 and q2).
2. q1 has one job and gets all resources while q2 is idle.
3. The job in q1 has 5 map tasks and 5 reduce tasks.
4. When q2 gets a new job, the job in q1 is preempted and 5 containers are reclaimed.
5. According to the container preemption policy, all map tasks, having the lower priority, are preempted (all progress of those tasks is lost).
6. After the preemption, the job in q1 sees new resource headroom, decides a new ratio between map and reduce tasks, and the AM preempts reduce tasks to run map tasks. So the 5 reduce tasks are killed and 5 new map tasks start.
7. When q2 goes idle again, the job in q1 gets 5 extra resources and 5 new reduce tasks start. This is back to step 3.
This can repeat periodically (for example when the map tasks of the job in q1 run for a long time): map tasks never finish before their containers are preempted, so the job makes no progress.
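The "preempt the latest-scheduled container" idea proposed in this thread can be sketched as below. This is a minimal illustration, not YARN's actual preemption code; the `ContainerInfo` class, its fields, and `selectVictims` are hypothetical names. The point is that victims are chosen by most-recent allocation time, so older containers (e.g. long-running map tasks near completion) survive regardless of scheduling priority.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LatestFirstPreemption {
    // Hypothetical container record; YARN's real RMContainer carries much more.
    public static final class ContainerInfo {
        final String id;
        final int priority;      // lower value = higher scheduling priority
        final long allocatedAt;  // millis when the container was allocated
        public ContainerInfo(String id, int priority, long allocatedAt) {
            this.id = id; this.priority = priority; this.allocatedAt = allocatedAt;
        }
    }

    // Pick 'needed' victims, newest allocations first, ignoring priority.
    public static List<String> selectVictims(List<ContainerInfo> running, int needed) {
        List<ContainerInfo> sorted = new ArrayList<>(running);
        sorted.sort(Comparator.comparingLong((ContainerInfo c) -> c.allocatedAt).reversed());
        List<String> victims = new ArrayList<>();
        for (ContainerInfo c : sorted) {
            if (victims.size() == needed) break;
            victims.add(c.id);
        }
        return victims;
    }

    public static void main(String[] args) {
        List<ContainerInfo> running = List.of(
            new ContainerInfo("map_1", 20, 1000L),    // oldest: survives
            new ContainerInfo("reduce_1", 10, 2000L), // higher priority but newer
            new ContainerInfo("map_2", 20, 3000L));   // newest: first victim
        System.out.println(selectVictims(running, 2)); // [map_2, reduce_1]
    }
}
```

Sorting by allocation time rather than priority is exactly what keeps a steady subset of containers safe from repeated preemption in the scenario described above.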
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629369#comment-14629369 ]

Peng Zhang commented on YARN-3535:
--

bq. there are chances that recoverResourceRequest may not be correct.
Sorry, I didn't catch this; maybe I missed something? I think {{recoverResourceRequest}} is not affected by whether the container-finished event is processed faster, because {{recoverResourceRequest}} only processes the ResourceRequest carried in the container and does not care about its status.

> ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
> ---
>
> Key: YARN-3535
> URL: https://issues.apache.org/jira/browse/YARN-3535
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Priority: Critical
> Attachments: 0003-YARN-3535.patch, 0004-YARN-3535.patch, 0005-YARN-3535.patch, YARN-3535-001.patch, YARN-3535-002.patch, syslog.tgz, yarn-app.log
>
> During a rolling update of NMs, the AM's start of a container on an NM failed, and the job then hung. AM logs attached.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14629208#comment-14629208 ]

Peng Zhang commented on YARN-3535:
--

Thanks [~rohithsharma] for updating the patch. Patch LGTM.
bq. One point to be clear that, here the assumption made is if RMContainer is ALLOCATED then only recover ResourceRequest. If RMContainer is in RUNNING, then completed container will go to AM in allocate response and AM will ask new ResourceRequest.
While running on our large cluster with FairScheduler and preemption enabled, MapReduce apps work well under this assumption. Basically, I think this assumption also makes sense for other types of apps.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14626207#comment-14626207 ]

Peng Zhang commented on YARN-3535:
--

[~rohithsharma] Thanks for the rebase and for adding tests.
As for removing {{recoverResourceRequestForContainer}}: in my notes, it caused the test {{CapacityScheduler#testRecoverRequestAfterPreemption}} to fail, but I cannot remember my original reasoning:
bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR.
I applied my patch {{YARN-3535-002.patch}} on our production cluster, and preemption works well with FairScheduler.
As for the failure of {{TestAMRestart.testAMRestartWithExistingContainers}}, I met it before, and I think it is because:
bq. Changing TestAMRestart.java is because that case testAMRestartWithExistingContainers will trigger this logic. After this patch, one more container may be scheduled, and attempt.getJustFinishedContainers().size() may be bigger than expectedNum and loop never ends. So I simply change the situation.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14626244#comment-14626244 ]

Peng Zhang commented on YARN-3535:
--

bq. Remove call of recoverResourceRequestForContainer from preemption to avoid duplication of recover RR.
I remembered the reason. For preemption, a killed container falls into two cases: the container was already pulled by the AM, or it was not. In the first case, the AM knows the container was killed and will re-ask for a container for the task. In the second case, the preemption kill leads to the same situation as this issue, so the request should not be recovered twice.
[jira] [Commented] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14621883#comment-14621883 ]

Peng Zhang commented on YARN-3453:
--

I understood your thought. My suggestion is based on our practice: I found it confusing to use different policies in the queue configuration. For example, if a parent queue uses fair while a child uses drf, the child queue may end up with no resources on the CPU dimension, so jobs hang there. We therefore use only drf in one cluster, and changed the code to support setting the calculator class at scheduler scope.
Reviewing the comments above, I am reminded that the (0 GB, non-zero cores) case, like (non-zero GB, 0 cores), will also cause more resources to be preempted than necessary. As I mentioned before:
bq. To decrease this kind of waste, I want to found what's the ratio of demand can be fulfilled by resourceUpperBound, and use this ratio * resourceUpperBound to be targetResource.
Actually, the current implementation ignores the resource boundary of each requested container, so even with the above logic some waste remains.
As for YARN-2154: if we want to preempt only containers that can satisfy incoming requests, IMHO we should run preemption per incoming request instead of summing them up into {{resourceDeficit}}.

> Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
> ---
>
> Key: YARN-3453
> URL: https://issues.apache.org/jira/browse/YARN-3453
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.6.0
> Reporter: Ashwin Shankar
> Assignee: Arun Suresh
> Attachments: YARN-3453.1.patch, YARN-3453.2.patch, YARN-3453.3.patch, YARN-3453.4.patch, YARN-3453.5.patch
>
> There are two places in the preemption code flow where DefaultResourceCalculator is used, even in DRF mode. This results in more resources being preempted than needed, and those extra preempted containers don't even reach the "starved" queue, since the scheduling logic is based on DRF's calculator. The two places are:
> 1. {code:title=FSLeafQueue.java|borderStyle=solid}
> private boolean isStarved(Resource share)
> {code}
> A queue shouldn't be marked as "starved" if the dominant resource usage is >= fair/min share.
> 2. {code:title=FairScheduler.java|borderStyle=solid}
> protected Resource resToPreempt(FSLeafQueue sched, long curTime)
> {code}
> One more thing that I believe needs to change in DRF mode: during a preemption round, if preempting a few containers satisfies the need on one resource type, we should exit that round, since the containers just preempted should bring the dominant resource usage up to min/fair share.
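The isStarved fix suggested in this report can be sketched as follows. This is a minimal illustration under assumed numbers, not the real FSLeafQueue code; the class and method names are hypothetical. It contrasts a memory-only starvation test with a DRF-style test that compares dominant shares, showing a queue the memory-only check would wrongly flag as starved.

```java
public class DominantStarvationCheck {
    // Dominant share of (mem, cores) relative to cluster capacity.
    public static double dominantShare(long mem, long cores,
                                       long clusterMem, long clusterCores) {
        return Math.max((double) mem / clusterMem, (double) cores / clusterCores);
    }

    // Memory-only check (what DefaultResourceCalculator effectively does).
    public static boolean starvedByMemoryOnly(long usedMem, long shareMem) {
        return usedMem < shareMem;
    }

    // DRF-style check: starved only if the dominant usage share is below the
    // dominant share of the fair/min share.
    public static boolean starvedByDominantShare(long usedMem, long usedCores,
                                                 long shareMem, long shareCores,
                                                 long clusterMem, long clusterCores) {
        return dominantShare(usedMem, usedCores, clusterMem, clusterCores)
             < dominantShare(shareMem, shareCores, clusterMem, clusterCores);
    }

    public static void main(String[] args) {
        // Assumed cluster: 100 GB / 100 cores; queue share: 50 GB / 10 cores.
        // Usage 20 GB / 60 cores: dominant usage 0.6 >= dominant share 0.5,
        // so DRF says "not starved", while the memory-only check says "starved"
        // (20 < 50) and would trigger needless preemption.
        System.out.println(starvedByMemoryOnly(20, 50));                      // true
        System.out.println(starvedByDominantShare(20, 60, 50, 10, 100, 100)); // false
    }
}
```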
[jira] [Commented] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14621631#comment-14621631 ]

Peng Zhang commented on YARN-3453:
--

Thanks [~asuresh] for working on this. Comments:
# Why not change all calculator usage in FairScheduler to be policy-related? In the code below, RESOURCE_CALCULATOR only considers memory, so it may return false when resToPreempt is (0, non-zero) under the DRF policy:
{code:title=FairScheduler.java|borderStyle=solid}
if (Resources.greaterThan(RESOURCE_CALCULATOR, clusterResource,
    resToPreempt, Resources.none())) {
{code}
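The comparison pitfall flagged in the comment above can be reduced to a few lines. This is an illustrative sketch, not Hadoop's real Resources/ResourceCalculator API: a memory-only "greater than none" test reports nothing to preempt when the deficit is (0 MB, N vcores), while a component-wise test catches it.

```java
public class ResourceComparisonSketch {
    // DefaultResourceCalculator-style comparison: only memory counts,
    // the vcore dimension is ignored entirely.
    public static boolean greaterThanMemoryOnly(long mem, long cores) {
        return mem > 0;
    }

    // Component-wise comparison: any non-zero dimension means there is
    // still a deficit to preempt for.
    public static boolean greaterThanAnyComponent(long mem, long cores) {
        return mem > 0 || cores > 0;
    }

    public static void main(String[] args) {
        long deficitMem = 0, deficitCores = 8; // CPU-starved, memory satisfied
        System.out.println(greaterThanMemoryOnly(deficitMem, deficitCores));   // false: preemption skipped
        System.out.println(greaterThanAnyComponent(deficitMem, deficitCores)); // true: deficit detected
    }
}
```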
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14576970#comment-14576970 ]

Peng Zhang commented on YARN-3535:
--

Sorry for the late reply, and thanks for your comments.
bq. 1. I think the method recoverResourceRequestForContainer should be synchronized, any thought? I notice it's not with synchronized originally.
I checked this method and found that only {{applications}} needs protection (obtained by calling getCurrentAttemptForContainer()). {{applications}} is instantiated as a ConcurrentHashMap in the derived schedulers, so I think there is no need to add synchronized.
The other three comments are all test-related:
# Changing TestAMRestart.java is because the case testAMRestartWithExistingContainers triggers this logic. After this patch, one more container may be scheduled, attempt.getJustFinishedContainers().size() may grow bigger than expectedNum, and the loop never ends. So I simply changed the condition.
# I agree that this issue exists in all schedulers and should be tested generically, but I didn't find a good way to reproduce it. I'll try with ParameterizedSchedulerTestBase.
# I changed RMContextImpl.java to get the schedulerDispatcher and start it in TestFairScheduler; otherwise the event handler cannot be triggered. I'll check whether this can also be solved with ParameterizedSchedulerTestBase.
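The thread-safety point in the comment above can be illustrated as follows: reads on a ConcurrentHashMap are safe without an external synchronized block, so a method that only looks up the map does not need to be synchronized. This is a standalone sketch; the `applications` map and method names here are hypothetical stand-ins, not the scheduler's actual fields.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentLookupSketch {
    // ConcurrentHashMap provides thread-safe get/put without external locking.
    private static final ConcurrentHashMap<String, String> applications =
        new ConcurrentHashMap<>();

    public static void register(String appId, String attemptId) {
        applications.put(appId, attemptId); // thread-safe write
    }

    // Analogous to a lookup like getCurrentAttemptForContainer(): a plain
    // read, needing no synchronized keyword.
    public static String getCurrentAttempt(String appId) {
        return applications.get(appId);
    }

    public static void main(String[] args) throws InterruptedException {
        register("app_1", "attempt_1");
        // A concurrent writer and reader race without corrupting the map.
        Thread writer = new Thread(() -> register("app_2", "attempt_2"));
        Thread reader = new Thread(() -> getCurrentAttempt("app_1"));
        writer.start(); reader.start();
        writer.join(); reader.join();
        System.out.println(getCurrentAttempt("app_1")); // attempt_1
    }
}
```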
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14541751#comment-14541751 ]

Peng Zhang commented on YARN-3640:
--

I used pstack to get it.

> NodeManager JVM continues to run after SHUTDOWN event is triggered.
> ---
>
> Key: YARN-3640
> URL: https://issues.apache.org/jira/browse/YARN-3640
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: Rohith
> Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out
>
> We faced a strange issue in the cluster: the NodeManager did not exit when the SHUTDOWN event was raised from NodeStatusUpdaterImpl. We took a thread dump and examined it, but did not get much of an idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three thread dumps look similar.
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14541733#comment-14541733 ]

Peng Zhang commented on YARN-3640:
--

I've encountered the same problem and filed YARN-3585. I think it's related to the leveldb thread; I also see it in your thread dump.
[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14541812#comment-14541812 ]

Peng Zhang commented on YARN-3585:
--

In YARN-3640, Rohith encountered the same problem, and we both see the leveldb thread in the thread stack. I think it's probably related to NM recovery; decommission is not the key factor. [~devaraj.k] Do you have NM recovery enabled in your environment?

> Nodemanager cannot exit when decommission with NM recovery enabled
> ---
>
> Key: YARN-3585
> URL: https://issues.apache.org/jira/browse/YARN-3585
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Peng Zhang
>
> With NM recovery enabled, after decommission the NodeManager log shows the stop, but the process does not end.
> Non-daemon threads:
> {noformat}
> DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x]
> leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x]
> VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable
> Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable
> Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable
> Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable
> Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable
> Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable
> Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable
> Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable
> Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable
> Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable
> Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable
> Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable
> VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition
> {noformat}
> And the JNI leveldb thread stack:
> {noformat}
> Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
> #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8
> #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0
> #3 0x003d830e811d in clone () from /lib64/libc.so.6
> {noformat}
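The symptom in the thread dump above comes down to JVM shutdown semantics: the JVM exits only when every non-daemon thread has finished, so a lingering non-daemon background thread (like the JNI leveldb worker shown) keeps the process alive even after the service has logged its stop. A minimal sketch, with a hypothetical worker standing in for the leveldb background thread:

```java
public class NonDaemonExitSketch {
    // A background worker that never winds down on its own, standing in for
    // the leveldb BGThread in the dump. Whether it blocks JVM exit depends
    // entirely on the daemon flag.
    public static Thread backgroundWorker(boolean daemon) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000); // simulates a long-lived background loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "background-worker");
        t.setDaemon(daemon); // daemon threads do not block JVM exit
        return t;
    }

    public static void main(String[] args) {
        Thread t = backgroundWorker(true);
        t.start();
        // With daemon=true, main returning ends the process immediately.
        // With daemon=false, the JVM would hang until the worker finished,
        // mirroring the stuck NodeManager after decommission.
        System.out.println(t.isDaemon());
    }
}
```

The corresponding fixes are the usual ones: mark such threads as daemon, or shut the native library down (so its threads exit) as part of the service's stop path.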
[jira] [Created] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
Peng Zhang created YARN-3585:
Summary: Nodemanager cannot exit when decommission with NM recovery enabled
Key: YARN-3585
URL: https://issues.apache.org/jira/browse/YARN-3585
Project: Hadoop YARN
Issue Type: Bug
Reporter: Peng Zhang
[jira] [Updated] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peng Zhang updated YARN-3585:
Affects Version/s: 2.6.0
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peng Zhang updated YARN-3535:
Attachment: YARN-3535-002.patch

# Remove the call of recoverResourceRequestForContainer from preemption to avoid duplicate recovery of the ResourceRequest.
# Fix broken tests.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14519368#comment-14519368 ]

Peng Zhang commented on YARN-3535:
--

I think the TestAMRestart failure is not related to this patch; I found that YARN-2483 is meant to resolve it.
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanelsfocusedCommentId=14517022#comment-14517022 ]

Peng Zhang commented on YARN-3535:
--

Attached a patch that restores the ResourceRequest on the transition from ALLOCATED to KILLED. Added a test case for FairScheduler; I added a getter for SchedulerDispatcher in RMContextImpl so the test can start it.
I've tested a rolling-update operation on a small cluster: the issue's transition is triggered and the MR job works well.
[jira] [Updated] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peng Zhang updated YARN-3535:
Attachment: YARN-3535-001.patch
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517272#comment-14517272 ] Peng Zhang commented on YARN-3535: -- Sorry, I only ran the tests in the FairScheduler package; I'll fix the others tomorrow. Also, how can I find the specific checkstyle errors? I am using the Cloudera code formatter in IntelliJ. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: YARN-3535-001.patch, syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509256#comment-14509256 ] Peng Zhang commented on YARN-3535: -- As per [~jlowe]'s thoughts, I understand there are two separate things: # During NM reconnection, the RM and NM should sync at the container level. In this issue's scenario, container 04 should not be killed and rescheduled, so the AM can acquire and launch it on the NM after the NM re-registers. # A fix is still needed in RMContainerImpl: restore the request during the transition from ALLOCATED to KILLED, because a real NM loss may, with very small probability, cause a transition from ALLOCATED to KILLED (the AM may heartbeat and acquire the container after the NM's heartbeats time out). I think the first item is an improvement to save time and preserve scheduling work already done. Or am I mistaken? ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508536#comment-14508536 ] Peng Zhang commented on YARN-3405: -- Updated the patch: only preempt from children when the queue itself is not starved, and added a test case. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508519#comment-14508519 ] Peng Zhang commented on YARN-3453: -- Updated the code snippet for the calculation of resDueToFairShare: {code}
Resource target;
if (resourceCalculator instanceof DominantResourceCalculator) {
  float targetRatio = Math.min(1, ((DominantResourceCalculator) resourceCalculator)
      .getResourceAsValue(sched.getDemand(), resourceUpperBound, false));
  target = Resources.multiply(sched.getDemand(), targetRatio);
} else {
  target = resourceUpperBound;
}
resDueToFairShare = Resources.max(resourceCalculator, clusterResource,
    Resources.none(), Resources.subtract(target, sched.getResourceUsage()));
{code} Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing Key: YARN-3453 URL: https://issues.apache.org/jira/browse/YARN-3453 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Ashwin Shankar There are two places in preemption code flow where DefaultResourceCalculator is used, even in DRF mode. Which basically results in more resources getting preempted than needed, and those extra preempted containers aren’t even getting to the “starved” queue since scheduling logic is based on DRF's Calculator. Following are the two places : 1. {code:title=FSLeafQueue.java|borderStyle=solid} private boolean isStarved(Resource share) {code} A queue shouldn’t be marked as “starved” if the dominant resource usage is = fair/minshare. 2. 
{code:title=FairScheduler.java|borderStyle=solid} protected Resource resToPreempt(FSLeafQueue sched, long curTime) {code} -- One more thing that I believe needs to change in DRF mode is : during a preemption round,if preempting a few containers results in satisfying needs of a resource type, then we should exit that preemption round, since the containers that we just preempted should bring the dominant resource usage to min/fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
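The arithmetic in the updated snippet can be sketched self-contained; the Resource type and the helpers below are minimal stand-ins for YARN's Resources and DominantResourceCalculator APIs, not the real classes:

```java
// Self-contained rendering of the resDueToFairShare snippet above, using a
// minimal stand-in Resource type; only the arithmetic is reproduced, not the
// real YARN API.
public class ResToPreemptSketch {
  public static class Resource {
    public final long memory; // e.g. in GB
    public final long vcores;
    public Resource(long memory, long vcores) {
      this.memory = memory;
      this.vcores = vcores;
    }
  }

  // Stand-in for Resources.componentwiseMin(a, b).
  public static Resource componentwiseMin(Resource a, Resource b) {
    return new Resource(Math.min(a.memory, b.memory),
                        Math.min(a.vcores, b.vcores));
  }

  // Stand-in for DominantResourceCalculator.getResourceAsValue(demand, bound,
  // false): the smaller of the per-dimension ratios bound / demand.
  public static float minRatio(Resource demand, Resource bound) {
    return Math.min((float) bound.memory / demand.memory,
                    (float) bound.vcores / demand.vcores);
  }

  // Mirrors the DRF branch of the snippet: scale the demand by the ratio that
  // fits under the upper bound, then subtract current usage (clamped at zero).
  public static Resource resDueToFairShare(Resource demand, Resource fairShare,
                                           Resource usage) {
    Resource upperBound = componentwiseMin(fairShare, demand);
    float targetRatio = Math.min(1f, minRatio(demand, upperBound));
    long targetMem = (long) (demand.memory * targetRatio);
    long targetCores = (long) (demand.vcores * targetRatio);
    return new Resource(Math.max(0, targetMem - usage.memory),
                        Math.max(0, targetCores - usage.vcores));
  }

  public static void main(String[] args) {
    // fair share (100g, 3core), demand (10g, 10core), usage (1g, 1core)
    Resource r = resDueToFairShare(new Resource(10, 10),
                                   new Resource(100, 3),
                                   new Resource(1, 1));
    System.out.println(r.memory + "g," + r.vcores + "core"); // 2g,2core
  }
}
```

With fair share (100g, 3core) and demand (10g, 10core), the upper bound is (10g, 3core), the ratio is 0.3, so the target is (3g, 3core) rather than the full upper bound.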
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405: - Attachment: YARN-3405.02.patch FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405: - Attachment: (was: YARN-3405.02.patch) FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405: - Attachment: YARN-3405.02.patch FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch, YARN-3405.02.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3535) ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508901#comment-14508901 ] Peng Zhang commented on YARN-3535: -- Thanks [~rohithsharma] for help. ResourceRequest should be restored back to scheduler when RMContainer is killed at ALLOCATED - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3535) ResourceRequest should be restored back to scheduler when container is killed at
[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang reassigned YARN-3535: Assignee: Peng Zhang ResourceRequest should be restored back to scheduler when container is killed at - Key: YARN-3535 URL: https://issues.apache.org/jira/browse/YARN-3535 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Attachments: syslog.tgz, yarn-app.log During rolling update of NM, AM start of container on NM failed. And then job hang there. Attach AM logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495742#comment-14495742 ] Peng Zhang commented on YARN-3485: -- max -memory- *cpu* is usually set arbitrarily, without real meaning. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495573#comment-14495573 ] Peng Zhang commented on YARN-3485: -- {code}
+maxAvailableResource.setMemory(
+    Math.min(maxAvailableResource.getMemory(),
+        queue.getMaxShare().getMemory()));
+maxAvailableResource.setVirtualCores(
+    Math.min(maxAvailableResource.getVirtualCores(),
+        queue.getMaxShare().getVirtualCores()));
{code} Using Resources.componentwiseMin() would be better here. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
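For illustration, a minimal sketch of what the suggested componentwise-min call computes; Resource below is a stand-in type, not YARN's:

```java
// Minimal sketch of the componentwise-min suggestion above: one call that
// takes the per-dimension minimum, replacing the paired setMemory /
// setVirtualCores calls in the diff. Resource is a stand-in, not YARN's.
public class ComponentwiseMinSketch {
  public static class Resource {
    public final int memory;
    public final int vcores;
    public Resource(int memory, int vcores) {
      this.memory = memory;
      this.vcores = vcores;
    }
  }

  // Analogous in spirit to Resources.componentwiseMin(a, b).
  public static Resource componentwiseMin(Resource a, Resource b) {
    return new Resource(Math.min(a.memory, b.memory),
                        Math.min(a.vcores, b.vcores));
  }

  public static void main(String[] args) {
    Resource headroom = new Resource(8192, 4);
    Resource maxShare = new Resource(4096, 8);
    Resource capped = componentwiseMin(headroom, maxShare);
    System.out.println(capped.memory + "," + capped.vcores); // 4096,4
  }
}
```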
[jira] [Commented] (YARN-3485) FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies
[ https://issues.apache.org/jira/browse/YARN-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14495593#comment-14495593 ] Peng Zhang commented on YARN-3485: -- Some thoughts on this issue: # I think for the fair and fifo policies, the queue's cpu fair share should already be 0, so there is no need to use min(maxAvailableResource, queue.getMaxShare()) in the first place. And normally, for a non-DRF policy, max memory is usually set arbitrarily, without real meaning. # I think even if the fair and fifo headroom may get a wrong value in the cpu dimension, MapReduce will not request a wrong number of maps and reduces, because MapReduce uses ResourceCalculatorUtils.computeAvailableContainers() to compute how many containers to ask for; it takes the minimum of the container counts computed from the two dimensions. As for MAPREDUCE-6302, I think if it is related to headroom, the headroom is probably wrong in both dimensions. FairScheduler headroom calculation doesn't consider maxResources for Fifo and FairShare policies Key: YARN-3485 URL: https://issues.apache.org/jira/browse/YARN-3485 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-3485-prelim.patch FairScheduler's headroom calculations consider the fairshare and cluster-available-resources, and the fairshare has maxResources. However, for Fifo and Fairshare policies, the fairshare is used only for memory and not cpu. So, the scheduler ends up showing a higher headroom than is actually available. This could lead to applications waiting for resources far longer than they intend to. e.g. MAPREDUCE-6302. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
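The per-dimension minimum mentioned in item 2 can be sketched as follows; the Resource type and the helper are stand-ins in the spirit of ResourceCalculatorUtils.computeAvailableContainers, not the real implementation:

```java
// Sketch of item #2 above: the number of containers the AM asks for comes
// from a per-dimension minimum, so a cpu headroom that is too high does not
// by itself inflate the container count. Resource is a stand-in type.
public class AvailableContainersSketch {
  public static class Resource {
    public final int memory;
    public final int vcores;
    public Resource(int memory, int vcores) {
      this.memory = memory;
      this.vcores = vcores;
    }
  }

  // Minimum of the container counts computed from each dimension.
  public static int computeAvailableContainers(Resource available,
                                               Resource required) {
    return Math.min(available.memory / required.memory,
                    available.vcores / required.vcores);
  }

  public static void main(String[] args) {
    // cpu headroom over-stated (100 cores), memory headroom correct (8 GB):
    Resource headroom = new Resource(8192, 100);
    Resource perContainer = new Resource(1024, 1);
    // memory still bounds the count at 8 containers
    System.out.println(computeAvailableContainers(headroom, perContainer)); // 8
  }
}
```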
[jira] [Commented] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493529#comment-14493529 ] Peng Zhang commented on YARN-3453: -- [~ashwinshankar77] bq. Would it help if we have a separate method or modify DominantResourceCalculator.compare() (in a backward compatible way), such that if the dominant resource is equal, then return -1. Changing compare() to fix this corner case may conflict with the dominant notion. I found that some operations in DominantResourceCalculator are already not consistent with the dominant notion; I think they are more related to multi-resource calculation in general. So maybe we'd better extract a new layer for these operations: one layer for the resource dimension (memory only, or multiple resources), and the other for the policy (fair or dominant, and maybe other policies). bq. Doing componentwiseMin makes sense. But why should we calculate targetRatio ? Shouldnt we just preempt (resourceUpperBound - usage) ? Can you pls give an example ? e.g.: fair share (100g, 3core), demand (10g, 10core); componentwiseMin returns (10g, 3core). When doing preemption, assume each preempted container is (1g, 1core). After preempting 3 containers, toPreempt is (7g, 0core), still bigger than none (0g, 0core) even using the dominant calculator. So it will still preempt another 7 containers (7g, 7core) which cannot be scheduled. This causes wasted preemption. To reduce this kind of waste, I want to find what ratio of the demand can be fulfilled by resourceUpperBound, and use this ratio times the demand as the targetResource. Actually, the current implementation ignores the resource size of each requested container, so even with the above logic there will still be some waste. 
Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing Key: YARN-3453 URL: https://issues.apache.org/jira/browse/YARN-3453 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Ashwin Shankar There are two places in preemption code flow where DefaultResourceCalculator is used, even in DRF mode. Which basically results in more resources getting preempted than needed, and those extra preempted containers aren’t even getting to the “starved” queue since scheduling logic is based on DRF's Calculator. Following are the two places : 1. {code:title=FSLeafQueue.java|borderStyle=solid} private boolean isStarved(Resource share) {code} A queue shouldn’t be marked as “starved” if the dominant resource usage is = fair/minshare. 2. {code:title=FairScheduler.java|borderStyle=solid} protected Resource resToPreempt(FSLeafQueue sched, long curTime) {code} -- One more thing that I believe needs to change in DRF mode is : during a preemption round,if preempting a few containers results in satisfying needs of a resource type, then we should exit that preemption round, since the containers that we just preempted should bring the dominant resource usage to min/fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
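The waste example in the comment above can be checked with a few lines of arithmetic (toy scalar resources only, no YARN types):

```java
// Arithmetic walk-through of the example above: fair share (100g, 3core),
// demand (10g, 10core), each preempted container (1g, 1core). Counts how many
// containers are preempted before toPreempt drops to none in every dimension,
// with and without the targetRatio scaling. Toy numbers only, no YARN types.
public class PreemptionWasteDemo {
  public static int containersPreempted(int memTarget, int coreTarget) {
    int preempted = 0;
    int mem = memTarget;
    int cores = coreTarget;
    // keep preempting while toPreempt is "still bigger than none"
    while (mem > 0 || cores > 0) {
      mem -= 1;   // each container frees 1g
      cores -= 1; // and 1 core
      preempted++;
    }
    return preempted;
  }

  public static void main(String[] args) {
    // without scaling: target = componentwiseMin(fairShare, demand) = (10g, 3core)
    System.out.println(containersPreempted(10, 3)); // 10 (7 of them wasted)
    // with targetRatio = min(10/10, 3/10) = 0.3: target = 0.3 * demand = (3g, 3core)
    System.out.println(containersPreempted(3, 3));  // 3
  }
}
```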
[jira] [Commented] (YARN-3453) Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing
[ https://issues.apache.org/jira/browse/YARN-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492440#comment-14492440 ] Peng Zhang commented on YARN-3453: -- [~ashwinshankar77] I've hit the same problem in our cluster. I think two changes are needed to make it work: # The calculator should be configurable, so isStarved() can use DominantResourceCalculator for the test. But this still has a corner case: with cluster (40960, 8), expected (16384, 4), real (4096, 4), the queue still comes out as starved. My idea is to lower the starvation threshold to ignore this. Any good suggestions? # In resToPreempt(), we should change the calculation logic; changing the calculator is not enough. My hacked code is like below: {code}
- Resource target = Resources.min(getResourceCalculator(), clusterResource,
-     sched.getFairShare(), sched.getDemand());
+ // min of both cpu and memory as the upper bound of the requested resource
+ Resource resourceUpperBound = Resources.componentwiseMin(
+     sched.getFairShare(), sched.getDemand());
+ float targetRatio = 0;
+ // getResourceAsValue is not a method on ResourceCalculator, and I cannot
+ // figure out one name for this calculation logic, so hack like this
+ if (resourceCalculator instanceof DominantResourceCalculator) {
+   // the ratio of the demand that can be got under resourceUpperBound
+   // (min of the ratios for cpu and memory)
+   targetRatio = ((DominantResourceCalculator) resourceCalculator)
+       .getResourceAsValue(sched.getDemand(), resourceUpperBound, false);
+ } else {
+   targetRatio = Resources.ratio(resourceCalculator, sched.getDemand(),
+       resourceUpperBound);
+ }
+ // the demand resource that can be fulfilled under the fair share
+ Resource target = Resources.multiply(sched.getDemand(), targetRatio);
{code} Besides the DRF-policy preemption problem, I also filed YARN-3405 to describe some common problems with preemption. 
Fair Scheduler : Parts of preemption logic uses DefaultResourceCalculator even in DRF mode causing thrashing Key: YARN-3453 URL: https://issues.apache.org/jira/browse/YARN-3453 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Ashwin Shankar There are two places in preemption code flow where DefaultResourceCalculator is used, even in DRF mode. Which basically results in more resources getting preempted than needed, and those extra preempted containers aren’t even getting to the “starved” queue since scheduling logic is based on DRF's Calculator. Following are the two places : 1. {code:title=FSLeafQueue.java|borderStyle=solid} private boolean isStarved(Resource share) {code} A queue shouldn’t be marked as “starved” if the dominant resource usage is = fair/minshare. 2. {code:title=FairScheduler.java|borderStyle=solid} protected Resource resToPreempt(FSLeafQueue sched, long curTime) {code} -- One more thing that I believe needs to change in DRF mode is : during a preemption round,if preempting a few containers results in satisfying needs of a resource type, then we should exit that preemption round, since the containers that we just preempted should bring the dominant resource usage to min/fair share. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492467#comment-14492467 ] Peng Zhang commented on YARN-3405: -- Other preemption issues found during development, needing confirmation: # Jobs in the same queue will not trigger preemption, because resToPreempt() only considers unfairness between queues. # A MapReduce map task will cause an unneeded preemption request, because FSAppAttempt.updateDemand() counts all of the ANY, rack, and host requests, so the preemption demand is tripled for one map task. I want to change it to count only the ANY request, but I do not know whether that will affect non-MapReduce frameworks. # The notion of minResources is confusing and easy to misconfigure. Because the fair-share calculation considers min, max, and weight, when one queue's min is above the cluster resources (or its parent queue's), the other queues' fair share is 0; I also found that sometimes the sum of the children's fair shares can be larger than the parent queue's fair share. I have some suggestions for these notions: * max resources means the maximum resources that one queue can get * min resources means the threshold under which the queue cannot be preempted * the weight notion is changed to an expected fair share, like 10240mb 10cores (I see the weight implementation has memory and cpu, but we use only memory now), making the expected fair share the only element considered during the fair-share calculation. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. 
# When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3414) FairScheduler's preemption may cause livelock
[ https://issues.apache.org/jira/browse/YARN-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492405#comment-14492405 ] Peng Zhang commented on YARN-3414: -- It has the same root cause as YARN-3405. FairScheduler's preemption may cause livelock - Key: YARN-3414 URL: https://issues.apache.org/jira/browse/YARN-3414 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang I met this problem in our cluster; it causes a livelock between preemption and scheduling. Queue hierarchy described as below: {noformat} root / | \ queue-1 queue-2 queue-3 / \ queue-1-1 queue-1-2 {noformat} # Assume cluster resource is 100G in memory # Assume queue-1 has max resource limit 20G # queue-1-1 is active and it will get max 20G memory (equal to its fairshare) # queue-2 is active then, and it requires 30G memory (less than its fairshare) # queue-3 is active, and it can be assigned all other resources, 50G memory (larger than its fairshare). At this point the three queues' fair shares are (20, 40, 40), and usage is (20, 30, 50) # queue-1-2 is active; it will cause a new preemption request (10G memory, and intuitively it can only preempt from its sibling queue-1-1) # Actually preemption starts from root, and it will find queue-3 is most over fairshare, and preempt some resources from queue-3. # But during scheduling, it will find queue-1 itself has arrived at its max fairshare, and cannot assign the resource to it. Then the resource is again assigned to queue-3. And then it repeats between the last two steps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492402#comment-14492402 ] Peng Zhang commented on YARN-3405: -- I uploaded a patch for this. The patch updates TestFairSchedulerPreemption to test the preemption process and the final usage share of the queues. It can also resolve YARN-3414, as tested by TestFairSchedulerPreemption#testPreemptionWithFreeResources. It now works well for the fair policy, but for the drf policy there is still some work to do related to YARN-3453; I'll fix that in that issue after finishing this one. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405: - Attachment: YARN-3405.01.patch FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Attachments: YARN-3405.01.patch Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390471#comment-14390471 ] Peng Zhang commented on YARN-3405: -- bq. 2. if parent's usage reached its fair share, it will not propagate preemption request upside again. So preemption request in parent queue means preemption needed between its children. To make the above statement clearer: if the request from the children, added to the current usage, is less than the fair share, the parent queue propagates the request upward. This means the current queue is under its fair share and needs to preempt from a sibling that is over-scheduled. Once the amount reaches the current queue's fair share, the remaining request amount is stored on the current queue. This means that amount must be preempted among the current queue's own children. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390455#comment-14390455 ] Peng Zhang commented on YARN-3405: -- I have a rough idea to fix this and YARN-3414 under the current preemption architecture. 1. When calculating a preemption request, update the parent's preemption request. 2. If the parent's usage has reached its fair share, it does not propagate the preemption request upward again; so a preemption request on a parent queue means preemption is needed between its children. 3. During the preempting phase, walk from the root downward: a. if a parent queue has a preemption request, it preempts between its children for that request (the process is like today: find the child most over its fair share and preempt recursively); b. then (both after doing 3.a and in the case where no preemption between children is needed), traverse its children and repeat 3.a. This process introduces a traversal of the tree, but I think it will not affect performance severely because there is usually a small number of queues. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release form queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
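The propagation idea in the comment above can be sketched as follows. This is a minimal toy model, not code from any YARN patch; the `Queue` class, `file_request`, and `preempt_walk` are all hypothetical names invented for illustration, using the numbers from the YARN-3405 scenario.

```python
# Toy sketch of the proposed scheme: park a leaf's preemption request on the
# lowest ancestor that is at/over its fair share (steps 1-2), then walk the
# tree from the root and serve each parked request among that queue's
# children (step 3). All names are hypothetical.

class Queue:
    def __init__(self, name, fair_share, usage=0):
        self.name, self.fair_share, self.usage = name, fair_share, usage
        self.children, self.parent = [], None
        self.request = 0  # preemption needed among this queue's children

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return self

def file_request(leaf, demand):
    # Steps 1-2: climb only while the ancestor is still under its fair
    # share; once an ancestor has reached its fair share, the request stays
    # there, meaning "preempt between this queue's children".
    node = leaf.parent
    while node.usage < node.fair_share and node.parent is not None:
        node = node.parent
    node.request += demand
    return node

def preempt_walk(queue, log):
    # Step 3: from the root downward, serve each parked request by taking
    # from the child most over its fair share, then recurse into children.
    if queue.request > 0:
        victim = max(queue.children, key=lambda c: c.usage - c.fair_share)
        taken = min(queue.request, max(0, victim.usage - victim.fair_share))
        victim.usage -= taken
        queue.request -= taken
        log.append((victim.name, taken))
    for child in queue.children:
        preempt_walk(child, log)

# YARN-3405 shape: queue-1-1 and queue-2 each hold 50 of a 100 cluster,
# then queue-1-2 (fair share 25) becomes active and demands 25.
root = Queue("root", 100, 100)
q1, q2 = Queue("queue-1", 50, 50), Queue("queue-2", 50, 50)
q11, q12 = Queue("queue-1-1", 25, 50), Queue("queue-1-2", 25, 0)
root.add(q1).add(q2)
q1.add(q11).add(q12)

parked = file_request(q12, demand=25)  # parks on queue-1, not root
log = []
preempt_walk(root, log)                # preempts 25 from queue-1-1
```

Under this model the request never reaches the root, so queue-2 (exactly at its fair share) is never considered as a victim, which is the behavior the comment argues for.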
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388503#comment-14388503 ] Peng Zhang commented on YARN-3405:
--
I tested this case in a cluster: queue-2 is preempted until its resource usage (consumption minus preempted resources) equals its fair share, and it is not over-preempted. After that there is no preemption from the sibling queue-1-1, and queue-1-2 hangs there.
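The stopping condition observed in this test can be restated as a one-line predicate. This is a hypothetical restatement for illustration, not the actual preemptContainerPreCheck code:

```python
# A queue may only be preempted while its usage, net of resources already
# marked for preemption, stays above its fair share. A queue sitting exactly
# at its fair share (queue-2 in the test above) is therefore never a victim,
# so queue-1-2's demand is never served.

def may_preempt(consumption, already_preempted, fair_share):
    return consumption - already_preempted > fair_share
```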
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385154#comment-14385154 ] Peng Zhang commented on YARN-3405:
--
Yes, changing the comparator may solve this specific case, but what if queue-2 has the same sub-queue hierarchy as queue-1, and the second sub-queue of each becomes active in the same period? A recursive compare would still return equal, and both of the later sub-queues would be left waiting.
As for this issue and YARN-3414, IMHO we should combine the calculation of preemption requests with preemption itself. For each preemption request of a leaf queue, start preempting upward: if the parent queue is under its fair share, find the most over-fair-share sibling to preempt from; otherwise go up again. The process finally ends at the root, which is by definition under its fair share. This idea would also solve YARN-3414: once a parent is found to have already got its fair share (limited by its max), we preempt from its sibling.
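The upward walk proposed here can be sketched on the YARN-3414 numbers. This is one possible reading of the proposal, with invented names (`Q`, `preempt_upward`) and simplified accounting, not an actual implementation:

```python
# Toy sketch: starting at the demanding leaf, climb toward the root; whenever
# the parent cannot grow from outside (it has already got its fair share,
# capped by its max), resolve the demand among the node's siblings by taking
# from the sibling most over its fair share. All names are hypothetical.

class Q:
    def __init__(self, name, fair_share, usage=0, max_share=float("inf")):
        self.name, self.fair_share = name, fair_share
        self.usage, self.max_share = usage, max_share
        self.children, self.parent = [], None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return self

def preempt_upward(leaf, demand):
    node, victims = leaf, []
    while demand > 0 and node.parent is not None:
        parent = node.parent
        if parent.usage >= min(parent.fair_share, parent.max_share):
            # Parent can't take more from outside: preempt among siblings.
            sib = max((c for c in parent.children if c is not node),
                      key=lambda c: c.usage - c.fair_share, default=None)
            if sib is not None and sib.usage - sib.fair_share > 0:
                take = min(demand, sib.usage - sib.fair_share)
                sib.usage -= take
                demand -= take
                victims.append((sib.name, take))
        node = parent
    return victims

# YARN-3414 shape: cluster 100G; queue-1 capped at 20G and full; queue-3
# holds 50G (over its 40G fair share); queue-1-2 demands 10G.
root = Q("root", 100, 100)
q1 = Q("queue-1", 20, 20, max_share=20)
q2, q3 = Q("queue-2", 40, 30), Q("queue-3", 40, 50)
q11, q12 = Q("queue-1-1", 10, 20), Q("queue-1-2", 10, 0)
root.add(q1).add(q2).add(q3)
q1.add(q11).add(q12)

victims = preempt_upward(q12, 10)  # takes 10G from queue-1-1, not queue-3
```

Because queue-1 is already at its (max-limited) fair share, the walk stops at the first level and takes the 10G from queue-1-1, rather than preempting queue-3 resources that queue-1 could never be assigned anyway.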
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
        root
          |
       queue-1
       /     \
queue-1-1   queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
was:
(same text, with "root / queue-1" at the top of the hierarchy diagram instead of "root | queue-1")
[jira] [Created] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
Peng Zhang created YARN-3405:
Summary: FairScheduler's preemption cannot happen between sibling in some case
Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Priority: Critical
Queue hierarchy described as below:
{noformat}
      root
       /
   queue-1
   /     \
queue-1-1   queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383615#comment-14383615 ] Peng Zhang commented on YARN-3405:
--
[~zxu] I have verified that there is no problem in the first scenario; the problem in the second scenario still exists.
bq. And if the queue-1 level has some other sibling queue (like queue-2) that equals queue-1's usage/fairshare, candidateQueue may still not be queue-1 itself, because they compare as equal and the result depends on queue order. Then queue-1-2 still cannot preempt its sibling, causing a live-lock issue like the second scenario above.
I think in that scenario it may just mean preemptContainerPreCheck() fails for queue-2 (a leaf queue), so queue-1-2 cannot preempt any resources; a live lock will not happen. I'll update the description once you have confirmed the bad cases above. Thanks.
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383469#comment-14383469 ] Peng Zhang commented on YARN-3405:
--
And there is another related case which causes a live lock between preemption and scheduling. If necessary, I will create a separate issue for it.
Queue hierarchy described as below:
{noformat}
              root
            /  |  \
     queue-1 queue-2 queue-3
      /    \
queue-1-1  queue-1-2
{noformat}
# Assume cluster resource is 100G of memory.
# Assume queue-1 has a max resource limit of 20G.
# queue-1-1 is active and gets at most 20G of memory (equal to its fair share).
# queue-2 becomes active and requires 30G of memory (less than its fair share).
# queue-3 becomes active and is assigned all remaining resources, 50G of memory (more than its fair share).
# queue-1-2 becomes active, causing a new preemption request (10G of memory; intuitively it can only preempt from its sibling queue-1-1).
# But preemption actually starts from the root, finds queue-3 most over its fair share, and preempts some resources from queue-3.
# During scheduling, queue-1 is found to have reached its max share, so nothing can be assigned to it, and the resources are assigned to queue-3 again.
It then repeats between the last two steps.
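The preempt/assign loop in the last two steps can be reproduced with the numbers above. This is a toy simplification for illustration, not FairScheduler code; the queue table, `preempt_once`, and `schedule_once` are invented:

```python
# name: usage / fair share / max share / outstanding demand (GB of memory).
# queue-1 wants 10G for queue-1-2 but is capped at 20G; queue-2's 30G demand
# is already met; queue-3 will take anything it can get.
queues = {
    "queue-1": {"usage": 20, "fair": 20, "cap": 20,  "wants": 10},
    "queue-2": {"usage": 30, "fair": 40, "cap": 100, "wants": 0},
    "queue-3": {"usage": 50, "fair": 40, "cap": 100, "wants": 999},
}

def preempt_once():
    # Preemption walks from the root and picks the queue most over its fair
    # share -- queue-3 -- even though the unmet demand sits inside queue-1.
    name = max(queues, key=lambda q: queues[q]["usage"] - queues[q]["fair"])
    queues[name]["usage"] -= 10
    return name

def schedule_once():
    # The freed 10G can only go to a queue with outstanding demand that is
    # below its max share; queue-1 is at its 20G cap, so queue-3 gets it back.
    for name, q in queues.items():
        if q["wants"] > 0 and q["usage"] < q["cap"]:
            q["usage"] += 10
            return name

# Three rounds of preempt-then-schedule: the state never changes.
history = [(preempt_once(), schedule_once()) for _ in range(3)]
```

Every round preempts 10G from queue-3 and hands the same 10G straight back to queue-3, while queue-1-2's demand stays unmet: the live lock.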
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383546#comment-14383546 ] Peng Zhang commented on YARN-3405:
--
Thanks, my mistake on the code detail; I will verify that it works now.
And if the queue-1 level has some other sibling queue (like queue-2) that equals queue-1's usage/fairshare, candidateQueue may still not be queue-1 itself, because they compare as equal and the result depends on queue order. Then queue-1-2 still cannot preempt its sibling, causing a live-lock issue like the second scenario above.
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385099#comment-14385099 ] Peng Zhang commented on YARN-3405:
--
bq. There is a possibility for the first scenario. If we have another queue queue-2 which is queue-1's sibling and queue-2 is greater than queue-1 when comparing queue-1 and queue-2, then queue-2 will always be picked for preemption and queue-1 won't have a chance to be preempted.
For the case where queue-2 is greater than queue-1 in the comparison: I think preempting first from queue-2 (if it is a LeafQueue) or from queue-2's child queue is reasonable. Then, once preemption and scheduling make queue-1 greater than queue-2, it should ideally preempt from queue-1-1. (From checking the code, I think this may not happen in time, and queue-1 may be over-preempted. But even if queue-1 is over-preempted, during scheduling the preempted containers will not all be assigned to queue-1, because queue-2 itself is under its fair share. Eventually it reaches a balanced fair state.)
[jira] [Created] (YARN-3414) FairScheduler's preemption may cause livelock
Peng Zhang created YARN-3414:
Summary: FairScheduler's preemption may cause livelock
Key: YARN-3414 URL: https://issues.apache.org/jira/browse/YARN-3414 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang
I met this problem in our cluster; it causes a live lock between preemption and scheduling.
Queue hierarchy described as below:
{noformat}
              root
            /  |  \
     queue-1 queue-2 queue-3
      /    \
queue-1-1  queue-1-2
{noformat}
# Assume cluster resource is 100G of memory.
# Assume queue-1 has a max resource limit of 20G.
# queue-1-1 is active and gets at most 20G of memory (equal to its fair share).
# queue-2 becomes active and requires 30G of memory (less than its fair share).
# queue-3 becomes active and is assigned all remaining resources, 50G of memory (more than its fair share). At this point the three queues' fair shares are (20, 40, 40) and their usages are (20, 30, 50).
# queue-1-2 becomes active, causing a new preemption request (10G of memory; intuitively it can only preempt from its sibling queue-1-1).
# But preemption actually starts from the root, finds queue-3 most over its fair share, and preempts some resources from queue-3.
# During scheduling, queue-1 is found to have reached its max share, so nothing can be assigned to it, and the resources are assigned to queue-3 again.
It then repeats between the last two steps.
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385124#comment-14385124 ] Peng Zhang commented on YARN-3405:
--
[~kasha] The real problem I met in our cluster is the live lock. After checking the code, I think there are some other bad cases like the ones we discussed above. I list them together because I think they share the same root cause: the calculation of preemption requests and the preemption of containers are separated into two phases, and a lot of necessary information is lost between them.
To reduce confusion, I created YARN-3414 to discuss the live-lock problem, and I'll update this description to show the non-live-lock case.
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
Assume cluster resource is 100.
# queue-1-1 and queue-2 each have an app; each has 50 usage and 50 fair share.
# queue-1-2 becomes active and causes a new preemption request for its fair share of 25.
# When preempting from the root, the preemption candidate found may be queue-2. If so, preemptContainerPreCheck for queue-2 returns false because queue-2 is exactly at its fair share.
# Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
What I expect here is that queue-1-2 preempts from queue-1-1.
was:
(same text; only the whitespace in the hierarchy diagram changed)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
Assume cluster resource is 100.
# queue-1-1 and queue-2 each have an app; each has 50 usage and 50 fair share.
# queue-1-2 becomes active and causes a new preemption request for its fair share of 25.
# When preempting from the root, the preemption candidate found may be queue-2. If so, preemptContainerPreCheck for queue-2 returns false because queue-2 is exactly at its fair share.
# Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
What I expect here is that queue-1-2 preempts from queue-1-1.
was:
(same text, except the third step was truncated after "If so" and the closing expectation sentence was missing)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
Assume cluster resource is 100.
# queue-1-1 and queue-2 each have an app; each has 50 usage and 50 fair share.
# queue-1-2 becomes active and causes a new preemption request for its fair share of 25.
# When preempting from the root, the preemption candidate found may be queue-2. If so
#. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
was:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
was:
(same text, with the top of the hierarchy diagram drawn as "|  \" instead of "/  \")
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
         root
        /    \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
was:
(same text; only whitespace changed)
[jira] [Updated] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3405:
- Description:
Queue hierarchy described as below:
{noformat}
       root
       |   \
  queue-1   queue-2
   /    \
queue-1-1  queue-1-2
{noformat}
1. queue-1-1 is active and has been assigned all resources.
2. queue-1-2 becomes active and causes a new preemption request.
3. But preemption now starts from the root, finds that queue-1 is not over its fair share, and so never recurses down into queue-1-1.
4. Finally queue-1-2 ends up waiting for queue-1-1 itself to release resources.
was:
Queue hierarchy described as below:
{noformat}
        root
          |
       queue-1
       /     \
queue-1-1   queue-1-2
{noformat}
(same steps 1-4 as above)
[jira] [Updated] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3111:
- Attachment: YARN-3111.v2.patch

Fix ratio problem on FairScheduler page
Key: YARN-3111 URL: https://issues.apache.org/jira/browse/YARN-3111 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Minor Attachments: YARN-3111.1.patch, YARN-3111.png, YARN-3111.v2.patch, parenttooltip.png
Found 3 problems on the FairScheduler page:
1. The ratio is computed from memory only, even when the queue's schedulingPolicy is DRF.
2. When min resources is configured larger than the real resources, the steady fair share bar is so long that it runs off the page.
3. When cluster resources are 0 (no NodeManager started), the ratio is displayed as "NaN% used".
The attached image shows a snapshot of the above problems.
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368731#comment-14368731 ] Peng Zhang commented on YARN-3111:
--
bq. 1. represent steady/instant/max of only the dominant resource in the bar.
Is the dominant resource type decided by the queue's used resource relative to the cluster, with the type name then displayed in the tooltip? I think this is good: all percentage values for one bar are then in the same dimension.
bq. Parent queues (except root) have a tooltip, I just checked in trunk. Can you check again?
Is the tooltip the block displayed as "xxx Queue Status", with content like "Used Resources" and "Num Active Applications"? If so, I checked the code and ran a local cluster against trunk: for a non-leaf queue it doesn't exist.
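Problems 1 and 3 from the issue can be illustrated with a small sketch. This is a hypothetical fix outline in Python, not the FairSchedulerPage.java patch; the function names and resource-type keys are invented:

```python
def share(used, capacity):
    # Problem 3: guard against a 0-capacity cluster (no NodeManager has
    # registered yet), which otherwise renders as "NaN% used" on the page.
    return 0.0 if capacity <= 0 else 100.0 * used / capacity

def dominant_share(used, capacity):
    # Problem 1 / the dominant-resource suggestion above: take the max share
    # across resource types instead of computing the bar from memory alone.
    return max(share(used[r], capacity[r]) for r in capacity)
```

For a queue using 10 GB of a 100 GB cluster and 8 of 10 vcores, the memory-only bar would show 10% while the dominant share is 80%, which is what a DRF queue's bar should reflect.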
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368859#comment-14368859 ] Peng Zhang commented on YARN-3111: -- I edited FairSchedulerPage.java and added queue status info for non-leaf queues (including root). If needed, I can file a new issue for this.
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370625#comment-14370625 ] Peng Zhang commented on YARN-3111: -- [~ashwinshankar77] Thanks, I got it. I'll update the patch to implement 1 and 3 from your suggestions.
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366621#comment-14366621 ] Peng Zhang commented on YARN-3111: -- Thanks for your advice. For the 4 proposals listed above: 1 and 2 are already done in the patch. 3 is good, but one question: the parent queue has no tooltip now, even though it has its own bar. Considering 3 and 4 together, what about listing every resource's usage percentage in the text to the right of each bar? Maybe color the dominant resource red, or just let readers judge by comparing the percentage numbers? Also, what do you think of the issue I mentioned above? I think it can still happen after 1 and 2, because for one queue the steady, fair, max, and usage resources may have different dominant resource types. If I am making a mistake here, please let me know. bq. queue's bar width is decided by (queue steady resource / cluster resource), and queue's usage width is decided by (queue's usage resource / cluster resource). For the above two percentage computations, the dominant resource may differ, so the two percentage values are still in different dimensions, which causes confusion.
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364815#comment-14364815 ] Peng Zhang commented on YARN-3111: -- I think overlay is not a good choice. Currently the scheduler bar is already an overlay of steady share, instantaneous share, and max resources. Overlaying two resource dimensions would generate 2 * 3 elements, which would be too cluttered even before new resource types are added. When testing this patch on our cluster, I found a new issue with an abnormal configuration: the queue's bar width is decided by (queue steady resource / cluster resource), and the queue's usage width is decided by (queue's usage resource / cluster resource). For these two percentage computations, the dominant resource may differ, so the two percentage values are still in different dimensions, which causes confusion. To work around this, we make each queue's steady share proportional to the root queue's share in every resource dimension, so that the first percentage value (queue steady resource / cluster resource) is the same across resource types and causes no confusion. I think the deeper problem is that FairScheduler can configure CPU and memory separately (e.g. min resource, max resource), which makes resources non-proportional between queues while the UI still needs a percentage view.
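The dominant-share computation discussed above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual FairSchedulerPage code: under DRF, the share shown for a bar would be the larger of the queue's memory and vcore fractions of cluster capacity, so all percentages for one bar live in the same dimension.

```java
// Hypothetical sketch: compute a queue's dominant-resource share,
// i.e. the larger of its memory and vcore fractions of cluster capacity.
public class DominantShare {
    public static double dominantShare(long queueMem, long queueVcores,
                                       long clusterMem, long clusterVcores) {
        // Guard against an empty cluster (no NodeManager registered yet),
        // which would otherwise divide by zero.
        if (clusterMem <= 0 || clusterVcores <= 0) {
            return 0.0;
        }
        double memShare = (double) queueMem / clusterMem;
        double cpuShare = (double) queueVcores / clusterVcores;
        // Under DRF the dominant resource is the one with the larger share.
        return Math.max(memShare, cpuShare);
    }
}
```

For example, a queue holding 4096 MB of an 8192 MB cluster but all 8 of its vcores is CPU-dominant, and its bar would be drawn at 100%.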
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348213#comment-14348213 ] Peng Zhang commented on YARN-3111: -- Thanks [~kasha] for your advice. I think: 1. When cluster capacity is 0, the display is not that important, so showing either 1 or 0 is acceptable to me. 2. Displaying two percentage values for the resources is good. The follow-up questions are what to do about the usage bar, and what happens when more resource types are supported in YARN?
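Point 1 can be sketched as a small guard in the percentage formatting; this is a hypothetical helper, not the actual web-UI code, showing how the "NaN% used" display is avoided when cluster capacity is 0:

```java
import java.util.Locale;

// Hypothetical sketch: format "X.X% used" while avoiding "NaN% used"
// when cluster capacity is 0 (no NodeManager has registered yet).
public class UsedPercent {
    public static String usedPercent(long used, long capacity) {
        // Treat an empty cluster as 0% rather than dividing by zero.
        double pct = (capacity > 0) ? 100.0 * used / capacity : 0.0;
        // Locale.ROOT keeps the decimal separator stable across locales.
        return String.format(Locale.ROOT, "%.1f%% used", pct);
    }
}
```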
[jira] [Commented] (YARN-3190) NM can't aggregate logs: token can't be found in cache
[ https://issues.apache.org/jira/browse/YARN-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342999#comment-14342999 ] Peng Zhang commented on YARN-3190: -- I met the same issue with a Spark Streaming job running on YARN in Hadoop 2.4 with security enabled. I found the failure period is related to 7 days, and I think it may be caused by dfs.namenode.delegation.token.max-lifetime, whose default value is 7 days. After 7 days the application's token is removed from the NameNode, so the application cannot access HDFS any more. If so, how can a long-running service work on a secure cluster? NM can't aggregate logs: token can't be found in cache --- Key: YARN-3190 URL: https://issues.apache.org/jira/browse/YARN-3190 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Environment: CDH 5.3.1 HA HDFS Kerberos Reporter: Andrejs Dubovskis Priority: Minor In rare cases the node manager cannot aggregate logs, generating an exception: {code} 2015-02-12 13:04:03,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Starting aggregate log-file for app application_1423661043235_2150 at /tmp/logs/catalyst/logs/application_1423661043235_2150/catdn001.intrum.net_8041.tmp 2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data5/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data6/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 2015-02-12 13:04:03,707 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data7/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150/container_1423661043235_2150_01_000442 2015-02-12 13:04:03,709 INFO
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /data1/yarn/nm/usercache/catalyst/appcache/application_1423661043235_2150 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache 2015-02-12 13:04:03,709 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache 2015-02-12 13:04:03,709 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache 2015-02-12 13:04:03,712 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:catalyst (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache 2015-02-12 13:04:03,712 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Cannot create writer for app application_1423661043235_2150. Disabling log-aggregation for this app. 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 2334644 for catalyst) can't be found in cache at org.apache.hadoop.ipc.Client.call(Client.java:1411) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy19.getServerDefaults(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getServerDefaults(ClientNamenodeProtocolTranslatorPB.java:259) at sun.reflect.GeneratedMethodAccessor114.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at
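The 7-day periodicity noted in the comment matches the HDFS default for delegation-token maximum lifetime. For reference, this is the property in question (shown with its default of 604800000 ms = 7 days); note that raising it only postpones expiry for long-running services rather than fixing the underlying renewal problem:

```xml
<!-- hdfs-site.xml: default value shown; 604800000 ms = 7 days -->
<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>604800000</value>
</property>
```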
[jira] [Updated] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3111: - Attachment: YARN-3111.png
[jira] [Created] (YARN-3111) Fix ratio problem on FairScheduler page
Peng Zhang created YARN-3111: Summary: Fix ratio problem on FairScheduler page Key: YARN-3111 URL: https://issues.apache.org/jira/browse/YARN-3111 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Minor
[jira] [Updated] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Zhang updated YARN-3111: - Attachment: YARN-3111.1.patch
[jira] [Commented] (YARN-3111) Fix ratio problem on FairScheduler page
[ https://issues.apache.org/jira/browse/YARN-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298157#comment-14298157 ] Peng Zhang commented on YARN-3111: -- I think the test failure of org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore.testFSRMStateStoreClientRetry is not related to this patch.
[jira] [Created] (YARN-3054) Preempt policy in FairScheduler may cause mapreduce job never finish
Peng Zhang created YARN-3054: Summary: Preempt policy in FairScheduler may cause mapreduce job never finish Key: YARN-3054 URL: https://issues.apache.org/jira/browse/YARN-3054 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Peng Zhang The preemption policy is tied to the schedule policy now. Using the schedule policy's comparator to find preemption candidates cannot guarantee that some subset of containers is never preempted, so tasks may be preempted periodically before they finish and the job cannot make any progress. I think preemption in YARN should provide the following assurances: 1. MapReduce jobs can get additional resources when others are idle; 2. MapReduce jobs for one user in one queue can still make progress with their min share when others preempt resources back. Maybe always preempting the latest app and container can achieve this?
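The "always preempt the latest" suggestion above can be sketched as an ordering over candidate containers; the types and names here are hypothetical, not FairScheduler's actual preemption code. Preempting the most recently started container first means older containers survive, so long-running tasks get a chance to finish:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of "preempt the most recently started container first":
// order candidates newest-first, so older (further-along) tasks are spared.
public class PreemptionOrder {
    static class ContainerInfo {
        final String id;
        final long startTime; // container launch timestamp (ms)
        ContainerInfo(String id, long startTime) {
            this.id = id;
            this.startTime = startTime;
        }
    }

    // Returns candidate containers ordered newest-first.
    public static List<ContainerInfo> preemptionOrder(List<ContainerInfo> running) {
        List<ContainerInfo> order = new ArrayList<>(running);
        order.sort(Comparator.comparingLong((ContainerInfo c) -> c.startTime)
                             .reversed());
        return order;
    }
}
```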
[jira] [Commented] (YARN-2360) Fair Scheduler: Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264304#comment-14264304 ] Peng Zhang commented on YARN-2360: -- [~kasha] It seems this is missing in branch-2 and also not included in release 2.6. Fair Scheduler: Display dynamic fair share for queues on the scheduler page --- Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.6.0 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch, yarn-2360-6.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page.
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on the machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247851#comment-14247851 ] Peng Zhang commented on YARN-2965: -- {quote} we were thinking of tracking the resource usage per container as well {quote} Is there an issue tracking this now? I think it would be very useful for audit and accounting. I also wonder whether the monitor in this issue should distinguish YARN services from other services (such as HDFS, or Storm not running on YARN)? IMHO, if the machine's resources are isolated between YARN and non-YARN services by cgroups (described in the Cloudera Manager docs as static resource pools), the monitor here should only track each container's resources and then aggregate them for the RT Master. Enhance Node Managers to monitor and report the resource usage on the machines -- Key: YARN-2965 URL: https://issues.apache.org/jira/browse/YARN-2965 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Robert Grandl Attachments: ddoc_RT.docx This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM.
[jira] [Commented] (YARN-1810) YARN RM Webapp Application page Issue
[ https://issues.apache.org/jira/browse/YARN-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045679#comment-14045679 ] Peng Zhang commented on YARN-1810: -- I updated the field number in $('#apps').dataTable().fnFilter(q, 3, true); from 3 to 4; after clicking the "default" queue bar, applications no longer disappear. But I found that this fnFilter query is carried over to the Applications page. As we have multiple queues, if I click one of them on the scheduler page and then go to the Applications page, only applications of the clicked queue are shown; the others are filtered out. Since no filter query is shown on the page, this may cause confusion. YARN RM Webapp Application page Issue - Key: YARN-1810 URL: https://issues.apache.org/jira/browse/YARN-1810 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.3.0 Reporter: Ethan Setnik Attachments: Screen Shot 2014-03-10 at 3.59.54 PM.png, Screen Shot 2014-03-11 at 1.40.12 PM.png When browsing the ResourceManager's web interface I am presented with the attached screenshot. I can't understand why it does not show the applications, even though there is no search text. The application counts show the correct values relative to the submissions, successes, and failures. Also see the text in the screenshot: Showing 0 to 0 of 0 entries (filtered from 19 total entries) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2104) Scheduler queue filter failed to work because index of queue column changed
[ https://issues.apache.org/jira/browse/YARN-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045708#comment-14045708 ] Peng Zhang commented on YARN-2104: -- Looks good to me. Scheduler queue filter failed to work because index of queue column changed --- Key: YARN-2104 URL: https://issues.apache.org/jira/browse/YARN-2104 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2104.patch YARN-563 added {code} + th(.type, "Application Type") {code} to the application table, which moves the queue column index from 3 to 4. And on the scheduler page, the queue column index is hard-coded to 3 when filtering applications by queue name: {code} if (q == 'root') q = ''; else q = '^' + q.substr(q.lastIndexOf('.') + 1) + '$'; $('#apps').dataTable().fnFilter(q, 3, true); {code} So the queue filter will not work on the application page. Reproduce steps: (Thanks Bo Yang for pointing this out) {code} 1) In the default setup, there's a "default" queue under the root queue 2) Run an arbitrary application; you can find it on the "Applications" page 3) Click the "Default" queue on the scheduler page 4) Click "Applications"; no application will show 5) Click the "Root" queue on the scheduler page 6) Click "Applications"; the application will show again {code}
[jira] [Created] (YARN-2221) WebUI: RM scheduler page's queue filter status will affect application page
Peng Zhang created YARN-2221: Summary: WebUI: RM scheduler page's queue filter status will affect application page Key: YARN-2221 URL: https://issues.apache.org/jira/browse/YARN-2221 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Peng Zhang Priority: Minor An apps queue filter added by clicking a queue bar on the scheduler page affects the display of the Applications page. No filter query is shown on the Applications page, which causes confusion. Also, we cannot reset the filter query on the Applications page; we must go back to the scheduler page and click the root queue to reset it. Reproduce steps: {code} 1) Configure two queues under root (A, B) 2) Run some apps using queues A and B respectively 3) Click the "A" queue on the scheduler page 4) Click "Applications"; only apps of queue A show 5) Click the "B" queue on the scheduler page 6) Click "Applications"; only apps of queue B show {code}
[jira] [Commented] (YARN-1810) YARN RM Webapp Application page Issue
[ https://issues.apache.org/jira/browse/YARN-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045717#comment-14045717 ] Peng Zhang commented on YARN-1810: -- OK, I created JIRA: https://issues.apache.org/jira/browse/YARN-2221
[jira] [Commented] (YARN-1810) YARN RM Webapp Application page Issue
[ https://issues.apache.org/jira/browse/YARN-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044737#comment-14044737 ] Peng Zhang commented on YARN-1810: -- I found that if I click the "default" queue bar, all running applications disappear, and if I click "root" they come back.