[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025156#comment-17025156 ]

Thomas Graves commented on YARN-8200:

Hey [~jhung], I am trying out the GPU scheduling in Hadoop 2.10, and the first thing I noticed is that it doesn't report an error properly if you ask for too many GPUs. It seems to happily say it gave them to me, although I think it's really giving me the configured maximum. Is this a known issue already, or did the configuration change? I have the GPU maximum configured at 4 and I try to allocate 8. On Hadoop 3 I get:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater than maximum allowed allocation. Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=

On Hadoop 2.10 I get a container allocated, but the logs and UI say it only has 4 GPUs.

> Backport resource types/GPU features to branch-3.0/branch-2
> ---
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Priority: Major
> Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch, YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, YARN-8200-branch-3.0.001.patch, counter.scheduler.operation.allocate.csv.defaultResources, counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
> Currently we have a need for GPU scheduling on our YARN clusters to support deep learning workloads. However, our main production clusters are running older versions of branch-2 (2.7 in our case). To prevent supporting too many very different hadoop versions across multiple clusters, we would like to backport the resource types/resource profiles feature to branch-2, as well as the GPU specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in YARN-7069 based on issues we uncovered, and the backport was fairly smooth. We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature branch and then merge to branch-2 when ready.
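[Editor's note: for readers following along, a minimal sketch of the kind of RM-side configuration being exercised here, using the property names from the upstream GPU scheduling docs and the setting quoted later in this thread; the values are illustrative, not taken from the actual cluster.]

{noformat}
<!-- resource-types.xml (or yarn-site.xml) on the ResourceManager: a sketch,
     assuming the stock yarn.io/gpu resource type; values are illustrative. -->
<configuration>
  <property>
    <!-- Register the GPU resource type. -->
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
  <property>
    <!-- Per-container GPU maximum; on Hadoop 3 a request above this is
         rejected with InvalidResourceRequestException as shown above. -->
    <name>yarn.resource-types.yarn.io/gpu.maximum-allocation</name>
    <value>4</value>
  </property>
</configuration>
{noformat}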
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025285#comment-17025285 ]

Thomas Graves commented on YARN-8200:

After messing with this a bit more, I removed the maximum allocation configurations after seeing that the documentation didn't have them in the 2.10 release. So I removed this setting:

yarn.resource-types.yarn.io/gpu.maximum-allocation = 4

And it appears YARN now doesn't allocate me a container unless it has fulfilled all of the GPUs I requested. So in this case my NodeManager has 4 GPUs, and if I request 5 it just hangs waiting to fulfill the request. This behavior is much better than giving me a container with fewer GPUs than I requested.
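[Editor's note: for completeness, the NodeManager side of the setup would look roughly like the following. This is a sketch using the property names from the upstream GPU docs; the auto-discovery value is an assumption about this cluster.]

{noformat}
<!-- yarn-site.xml on each NodeManager: a sketch. With no
     yarn.resource-types.yarn.io/gpu.maximum-allocation set, the largest
     satisfiable request is bounded by the GPUs a node actually reports. -->
<configuration>
  <property>
    <!-- Enable the built-in GPU resource plugin. -->
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
  <property>
    <!-- Auto-discover the GPUs on the node (4 in the test above). -->
    <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
    <value>auto</value>
  </property>
</configuration>
{noformat}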
[jira] [Commented] (YARN-9055) Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration
[ https://issues.apache.org/jira/browse/YARN-9055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700511#comment-16700511 ]

Thomas Graves commented on YARN-9055:

It would definitely be a change in behavior, which could surprise people with existing configurations. I do think it's easier this way, since you don't have to configure all the queues. I don't remember all the details on why I did it this way; I think it was mostly to not break the existing functionality of the cluster-level max.

> Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration
>
> Key: YARN-9055
> URL: https://issues.apache.org/jira/browse/YARN-9055
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Affects Versions: 2.7.0
> Reporter: Aihua Xu
> Assignee: Aihua Xu
> Priority: Major
> Attachments: YARN-9055.1.patch
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue. That feature gives the flexibility to give different memory requirements for different queues. Such patch adds the limitation that the queue level configuration can't exceed the cluster level default configuration, but I feel it may make more sense to remove such limitation to allow any overrides since
> # Such configuration is controlled by the admin so it shouldn't get abused;
> # It's common that typical queues require standard size containers while some jobs (queues) have requirements for larger containers. With the current limitation, we have to set a larger configuration on the cluster setting which will cause resource abuse unless we override it on all the queues.
> We can remove such limitation in CapacitySchedulerConfiguration.java so the cluster setting provides the default value and the queue setting can override it.
> {noformat}
>    if (maxAllocationMbPerQueue > clusterMax.getMemorySize()
>        || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) {
>      throw new IllegalArgumentException(
>          "Queue maximum allocation cannot be larger than the cluster setting"
>          + " for queue " + queue
>          + " max allocation per queue: " + result
>          + " cluster setting: " + clusterMax);
>    }
> {noformat}
> Let me know if it makes sense.
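[Editor's note: for readers less familiar with YARN-1582, a sketch of the per-queue override being discussed; the queue name root.large is hypothetical, and the property names are the existing cluster-wide and CapacityScheduler ones.]

{noformat}
<!-- Cluster default (yarn-site.xml) plus the YARN-1582 per-queue override
     (capacity-scheduler.xml); root.large is a made-up queue name. -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>
</property>
<property>
  <!-- Today this must stay at or below the cluster maximum above;
       YARN-9055 proposes letting it exceed the cluster value. -->
  <name>yarn.scheduler.capacity.root.large.maximum-allocation-mb</name>
  <value>8192</value>
</property>
{noformat}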
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737392#comment-16737392 ]

Thomas Graves commented on YARN-9116:

Yes, so you want to keep the behavior that the cluster-level maximum is the absolute maximum and no child queue can be larger than that; otherwise it breaks backwards compatibility.

> Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
>
> Key: YARN-9116
> URL: https://issues.apache.org/jira/browse/YARN-9116
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler
> Affects Versions: 2.7.0
> Reporter: Aihua Xu
> Assignee: Aihua Xu
> Priority: Major
> Attachments: YARN-9116.1.patch
>
> YARN-1582 adds the support of maximum-allocation-mb configuration per queue, which is targeting to support larger container features on dedicated queues (larger maximum-allocation-mb/maximum-allocation-vcores for such queues). To achieve the larger container configuration, we need to increase the global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then override those configurations with desired values on the queues, since the queue configuration can't be larger than the cluster configuration. There are many queues in the system, and if we forget to configure such values when adding a new queue, then such a queue gets the default 120G/256, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue configuration like 16G/8), so the leaf queues get such values by default.
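[Editor's note: a sketch of the pattern the description complains about, using the existing property names; the queue name is made up. Every ordinary queue needs an explicit override, which is what the proposed queue-level default would avoid.]

{noformat}
<!-- Global maximum raised for a few large-container queues (yarn-site.xml). -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>122880</value> <!-- 120G -->
</property>
<!-- Each normal queue must then be pulled back down by hand
     (capacity-scheduler.xml); forgetting one leaves it at 120G. -->
<property>
  <name>yarn.scheduler.capacity.root.etl.maximum-allocation-mb</name>
  <value>16384</value> <!-- 16G; root.etl is illustrative -->
</property>
{noformat}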
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108697#comment-15108697 ]

Thomas Graves commented on YARN-4610:

+1. Thanks for fixing this.

> Reservations continue looking for one app causes other apps to starve
>
> Key: YARN-4610
> URL: https://issues.apache.org/jira/browse/YARN-4610
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.7.1
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Blocker
> Attachments: YARN-4610.001.patch
>
> CapacityScheduler's LeafQueue has "reservations continue looking" logic that allows an application to unreserve elsewhere to fulfil a container request on a node that has available space. However in 2.7 that logic seems to break allocations for subsequent apps in the queue. Once a user hits its user limit, subsequent apps in the queue for other users receive containers at a significantly reduced rate.
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109330#comment-15109330 ]

Thomas Graves commented on YARN-4610:

Ok, thanks for investigating. +1 from me; feel free to commit.
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109516#comment-15109516 ]

Thomas Graves commented on YARN-4610:

Sorry, after looking at this some more I think there might be an issue with parent queue max capacities; still investigating.
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110778#comment-15110778 ]

Thomas Graves commented on YARN-4610:

+1 for branch-2.7. After investigating this some more, the original patch of setting it to none() works. The reason is that the parent's limit is passed down and would be taken into account in the leaf calculation. I think the latter patch is safer, but either is fine with me. For the trunk patch, I'm not sure how it takes max capacity into account, so I'll have to look at that more, but the unit tests are passing and that would be a separate issue from this fix. +1 on that patch as well.
[jira] [Created] (YARN-4641) CapacityScheduler Active Users Info table should be sortable
Thomas Graves created YARN-4641:
---

Summary: CapacityScheduler Active Users Info table should be sortable
Key: YARN-4641
URL: https://issues.apache.org/jira/browse/YARN-4641
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler
Affects Versions: 2.7.1
Reporter: Thomas Graves

The Scheduler page when using the Capacity Scheduler allows you to see all the Active Users Info. If you have lots of users this is a big table, and to see who is using the most it would be nice to have it sortable, or to show the % used like it used to.
[jira] [Commented] (YARN-5010) maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API
[ https://issues.apache.org/jira/browse/YARN-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262914#comment-15262914 ]

Thomas Graves commented on YARN-5010:

We shouldn't just remove them, as it's an API compatibility issue. I would say they should be added back and the definitions updated, or we should rev the REST API version.

> maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API
>
> Key: YARN-5010
> URL: https://issues.apache.org/jira/browse/YARN-5010
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.0
> Reporter: Jason Lowe
>
> The RM used to report maxActiveApplications and maxActiveApplicationsPerUser in the REST API for a queue, but these are missing in 2.7.0. It appears YARN-2637 replaced them with aMResourceLimit and userAMResourceLimit, respectively, which broke some internal tools that were expecting the max app fields to still be there. We should at least update the REST docs to reflect that change.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-3434:
Attachment: YARN-3434-branch2.7.patch

Attaching patch for branch-2.7. [~leftnoteasy] could you take a look when you have a chance?

> Interaction between reservations and userlimit can result in significant ULF violation
>
> Key: YARN-3434
> URL: https://issues.apache.org/jira/browse/YARN-3434
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 2.6.0
> Reporter: Thomas Graves
> Assignee: Thomas Graves
> Fix For: 2.8.0
> Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed.
[jira] [Updated] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-3600:
Labels: (was: BB2015-05-RFC)

> AM container link is broken (on a killed application, at least)
>
> Key: YARN-3600
> URL: https://issues.apache.org/jira/browse/YARN-3600
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.8.0
> Reporter: Sergey Shelukhin
> Assignee: Naganarasimha G R
> Attachments: YARN-3600.20150508-1.patch
>
> Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT.
> I have an application that ran fine for a while and then I yarn kill-ed it. Now when I go to the only app attempt URL (like so: http://(snip RM host name):8088/cluster/appattempt/appattempt_1429683757595_0795_01) I see:
> AM Container: container_1429683757595_0795_01_01
> Node: N/A
> and the container link goes to {noformat}http://(snip RM host name):8088/cluster/N/A{noformat} which obviously doesn't work
[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534621#comment-14534621 ]

Thomas Graves commented on YARN-3600:

Reviewing and kicking Jenkins.
[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534825#comment-14534825 ]

Thomas Graves commented on YARN-3600:

So the change does fix the broken link issue, but it seems to me other things are broken with this page. Obviously if it ran for a while it got an AM, and therefore should have a valid container. But I guess that link only works if it's actually running? The container table below that also confused me a bit. I thought at first it was a list of AM containers, but after playing with it, it's really a list of running containers. I think we should add a heading for that. I filed a separate JIRA for those things. Anyway, +1. Thanks!
[jira] [Created] (YARN-3603) Application Attempts page confusing
Thomas Graves created YARN-3603:
---

Summary: Application Attempts page confusing
Key: YARN-3603
URL: https://issues.apache.org/jira/browse/YARN-3603
Project: Hadoop YARN
Issue Type: Bug
Components: webapp
Affects Versions: 2.8.0
Reporter: Thomas Graves

The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing about what is going on. I think the table of containers there is only for running containers, and when the app is completed or killed it's empty. The table should have a label stating so.

Also, the "AM Container" field is a link when running but not when it's killed. That might be confusing.

There is no link to the logs on this page, but there is in the app attempt table when looking at http://rm:8088/cluster/app/application_1431101480046_0003
[jira] [Updated] (YARN-20) More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-20:
Labels: newbie (was: BB2015-05-RFC newbie)

> More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml
>
> Key: YARN-20
> URL: https://issues.apache.org/jira/browse/YARN-20
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: documentation, resourcemanager
> Affects Versions: 2.0.0-alpha
> Reporter: Nemon Lou
> Assignee: Bartosz Ługowski
> Priority: Trivial
> Labels: newbie
> Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is in "host:port" format, which is noted in the cluster set up guide (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html).
> When I read through the code, I find the "host" format is also supported. In "host" format, the port will be random.
> So we may add more documentation in yarn-default.xml to make it easier to understand. I will submit a patch if it's helpful.
[jira] [Commented] (YARN-20) More information for "yarn.resourcemanager.webapp.address" in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534967#comment-14534967 ]

Thomas Graves commented on YARN-20:

+1. Thanks!
[jira] [Commented] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535013#comment-14535013 ]

Thomas Graves commented on YARN-3603:

Go for it. Thanks!
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535397#comment-14535397 ]

Thomas Graves commented on YARN-3434:

I'm not sure Jenkins will work on this since it is for branch-2.7, unless we've hooked it up to run for specific branches other than trunk. The patch won't apply on trunk.
[jira] [Resolved] (YARN-172) AM logs link in RM ui redirects back to RM if AM not started
[ https://issues.apache.org/jira/browse/YARN-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved YARN-172.
Resolution: Invalid

Confirmed.

> AM logs link in RM ui redirects back to RM if AM not started
>
> Key: YARN-172
> URL: https://issues.apache.org/jira/browse/YARN-172
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 0.23.3
> Reporter: Thomas Graves
> Labels: usability
>
> I went to the RM UI app page for an application that failed to start with the error: org.apache.hadoop.security.AccessControlException: User user cannot submit applications to queue root.foo
> I tried to click on the AM logs link and it just redirected me back to the RM page. If the AM didn't start we shouldn't show an attempt there.
[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535500#comment-14535500 ]

Thomas Graves commented on YARN-3600:

Hey [~sseth], can you check to make sure YARN-3603 covers what you also think it should do? I added the bit about logs in there, but I may have forgotten something else.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538252#comment-14538252 ]

Thomas Graves commented on YARN-3434:

What's your question exactly? For branch patches Jenkins has never been hooked up. We generally download the patch, build, possibly run the tests that apply, and commit.
[jira] [Commented] (YARN-4045) Negative availableMB is being reported for root queue.
[ https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682115#comment-14682115 ]

Thomas Graves commented on YARN-4045:

I remember seeing that this was fixed in branch-2 by some of the capacity scheduler work for labels. I thought this might be fixed by https://issues.apache.org/jira/browse/YARN-3243, but that is already included. This might be fixed as part of https://issues.apache.org/jira/browse/YARN-3361, which is probably too big to backport entirely. [~leftnoteasy] Do you remember this issue? Note that it also shows up in the capacity scheduler UI as the root queue going over 100%. I remember when I was testing YARN-3434 it wasn't occurring for me on branch-2 (2.8), and I thought it was one of the above JIRAs that fixed it.

> Negative availableMB is being reported for root queue.
>
> Key: YARN-4045
> URL: https://issues.apache.org/jira/browse/YARN-4045
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our clusters.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> ...
> -163328
> ...
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used= cluster=
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used= cluster=
> 2015-08-10 14:42:44,886 [
[jira] [Commented] (YARN-656) In scheduler UI, including reserved memory in "Memory Total" can make it exceed cluster capacity.
[ https://issues.apache.org/jira/browse/YARN-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391472#comment-14391472 ]

Thomas Graves commented on YARN-656:

Note this broke the UI, at least for the capacity scheduler. It now displays a total that lacks the reserved memory. Perhaps this is a difference in how the fair scheduler and capacity scheduler keep track of allocations vs reservations.

> In scheduler UI, including reserved memory in "Memory Total" can make it exceed cluster capacity.
>
> Key: YARN-656
> URL: https://issues.apache.org/jira/browse/YARN-656
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, scheduler
> Affects Versions: 2.0.4-alpha
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Fix For: 2.1.0-beta
> Attachments: YARN-656-1.patch, YARN-656.patch
>
> "Memory Total" is currently a sum of availableMB, allocatedMB, and reservedMB. Including reservedMB in this sum can make the total exceed the capacity of the cluster.
[jira] [Created] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
Thomas Graves created YARN-3432:
---

Summary: Cluster metrics have wrong Total Memory when there is reserved memory on CS
Key: YARN-3432
URL: https://issues.apache.org/jira/browse/YARN-3432
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler, resourcemanager
Affects Versions: 2.6.0
Reporter: Thomas Graves

I noticed that when reservations happen when using the Capacity Scheduler, the UI and web services report the wrong total memory.

For example, I have 300GB of total memory in my cluster. I allocate 50GB and reserve 10GB. The cluster metrics for total memory get reported as 290GB; presumably the total is now computed as available + allocated, which leaves out the 10GB reserved (300 - 10 = 290).

This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps there is a difference between the fair scheduler and the capacity scheduler.
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392687#comment-14392687 ]

Thomas Graves commented on YARN-3432:

That will fix it for the capacity scheduler; we need to see if it breaks the FairScheduler though.
[jira] [Created] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
Thomas Graves created YARN-3434:
---

Summary: Interaction between reservations and userlimit can result in significant ULF violation
Key: YARN-3434
URL: https://issues.apache.org/jira/browse/YARN-3434
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 2.6.0
Reporter: Thomas Graves
Assignee: Thomas Graves

ULF was set to 1.0. A user was able to consume 1.4X the queue capacity.

It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392751#comment-14392751 ]

Thomas Graves commented on YARN-3434:

The issue here is that if we allow the user to continue past the user limit checks in assignContainers because they have reservations, then when it gets down into the assignContainer routine, is allowed to get a container, and the node has space, we don't double-check the user limit. We recheck in all other cases, but this one is missed.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485798#comment-14485798 ]

Thomas Graves commented on YARN-3434:

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly; I would appreciate it if you could take a look and give me feedback.
[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485798#comment-14485798 ]

Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM:

[~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again in assignContainer() if it skipped the checks in assignContainers() based on reservations but is then allowed by shouldAllocOrReserveNewContainer. I'll have a patch up for this shortly; I would appreciate it if you could take a look and give me feedback.

was (Author: tgraves): [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly I would appreciate it if you could take a look and give me feedback.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-3434:
Attachment: YARN-3434.patch
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485834#comment-14485834 ]

Thomas Graves commented on YARN-3434:

Note I had a reproducible test case for this. Set userlimit% to 100% and the user limit factor to 1, on 15 nodes with 20GB each, with one queue configured for capacity 70 and a 2nd queue configured for capacity 30. I started a sleep job needing ten 12GB containers in the first queue. I then started a second job in the 2nd queue that needed 25 12GB containers; the second job got some containers but then had to reserve others, waiting for the first job to release some. Without this change, when the first job started releasing containers the second job would grab them and go over the user limit. With this fix it stayed within the user limit.
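[Editor's note: as a rough illustration, the queue setup described above would look something like this in capacity-scheduler.xml; the queue names are made up, and the property names are the standard CapacityScheduler ones.]

{noformat}
<!-- Sketch of the test setup: two queues at 70/30 capacity,
     minimum-user-limit-percent = 100 and user-limit-factor = 1.
     Queue names "a" and "b" are illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.a.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.minimum-user-limit-percent</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.b.user-limit-factor</name>
  <value>1</value>
</property>
{noformat}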
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488011#comment-14488011 ]

Thomas Graves commented on YARN-3434:

The code you mention is in the else part of that check, where it would do a reservation. The situation I'm talking about actually allocates a container rather than reserving one. I'll try to explain better:

The application asks for lots of containers. It acquires some containers, then it reserves some. At this point it hits its normal user limit, which in my example = capacity. It hasn't hit the maximum amount it can allocate or reserve (shouldAllocOrReserveNewContainer()). The next node that heartbeats in isn't yet reserved and has enough space to place a container on. The first check is assignContainers -> canAssignToThisQueue. That passes since we haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. That passes, but only because used - reserved < the user limit. This allows it to continue down into assignContainer. In assignContainer the node has available space and we haven't hit shouldAllocOrReserveNewContainer(). reservationsContinueLooking is on and labels are empty, so it does the check:

{noformat}
if (!shouldAllocOrReserveNewContainer
    || Resources.greaterThan(resourceCalculator, clusterResource,
        minimumUnreservedResource, Resources.none()))
{noformat}

As I said before, it's allowed to allocate or reserve, so it passes that test. Then, since it hasn't hit its maximum capacity yet (capacity = 30% and max capacity = 100%), minimumUnreservedResource is none, so that check doesn't kick in and it doesn't go into the block to findNodeToUnreserve(). It then goes ahead and allocates when it should have been required to unreserve first. Basically we needed to do the user limit check again and force it to do the findNodeToUnreserve.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487416#comment-14487416 ]

Thomas Graves commented on YARN-3434:

[~wangda] I'm not sure I follow what you are saying. The reservations are already counted in the user's usage, and we do consider reserved when doing the user limit calculations. Look at LeafQueue.assignContainers: its call to allocateResource is where it ends up adding to the user's usage. canAssignToUser is where it does the user limit check and subtracts the reservations off to see if it can continue.

Note I do think we should just get rid of the config for reservationsContinueLooking, but that is a separate issue.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488061#comment-14488061 ]

Thomas Graves commented on YARN-3434:

{quote}
And I've a question about continous reservation checking behavior, may or may not related to this issue: Now it will try to unreserve all containers under a user, but actually it will only unreserve at most one container to allocate a new container. Do you think is it fine to change the logic to be: When (continousReservation-enabled) && (user.usage + required - min(max-allocation, user.total-reserved) <= user.limit), assignContainers will continue. This will prevent doing impossible allocation when user reserved lots of containers. (As same as queue reservation checking).
{quote}

I do think the reservation checking and unreserving can be improved. I basically started with a very simple thing and figured we could improve it. I'm not sure how much that check would help in practice. I guess it might help the case where you have one user in the queue and a second one shows up, and your user limit gets decreased by a lot. In that case it may prevent it from continuing when it can short-circuit here. So it would seem to be OK for that.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496239#comment-14496239 ]

Thomas Graves commented on YARN-3434:

So I had considered putting it in ResourceLimits, but ResourceLimits seems to be more of a queue-level thing to me (not a user-level one). For instance, ParentQueue passes it into LeafQueue, and ParentQueue cares nothing about user limits. If you stored it there you would either need to track which user it was for or track it for all users. ResourceLimits gets updated when nodes are added and removed; we don't need to compute a particular user limit when that happens. So it would either be out of date, or we change it to update whenever that happens, which to me is a fairly large change and not really needed. The user limit calculations are lower down and recomputed per user, per application, per current request regularly, and putting this into the global object, given how it is calculated and used, didn't make sense to me. All you would be using it for is passing it down to assignContainer, and then it would be out of date. If someone else started looking at that value assuming it was up to date, it would be wrong (unless of course we started updating it as stated above). And it would only be for a single user, not all users, unless again we changed it to calculate for every user whenever something changed. That seems a bit excessive.

You are correct that needToUnreserve could go away. I started out on 2.6, which didn't have our changes, and I could have removed it when I added amountNeededUnreserve. If we were to store it in the global ResourceLimits then yes, the entire LimitsInfo class can go away, including shouldContinue, as you would fall back to using the boolean return from each function. But again, based on my above comments, I'm not sure ResourceLimits is the correct place to put this.

I just noticed that we are already keeping the userLimit in the User class; that would be another option. But again I think we need to make it clear what it is. This particular check is done per application, per user, based on the currently requested Resource. The stored value wouldn't necessarily apply to all of the user's applications, since the resource request size could be different.

Thoughts, or is there something I'm missing about ResourceLimits?
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496735#comment-14496735 ] Thomas Graves commented on YARN-3434: - I am not saying the child needs to know how the parent calculates resource limits. I am saying the user limit, and whether it needs to unreserve in order to make another reservation, has nothing to do with the parent queue (i.e., it doesn't apply to the parent queue). Remember, I don't need to store the user limit; I need to store whether it needs to unreserve and, if it does, how much it needs to unreserve. When a node heartbeats, it goes through the regular assignments and updates the LeafQueue clusterResources based on what the parent passes in. When a node is removed or added, it updates the resource limits (none of these apply to the calculation of whether it needs to unreserve or not). Basically it comes down to: is this information useful outside of the small window between when it is calculated and when it is needed in assignContainer()? My thought is no, and you said it yourself in the last bullet above. Although we have been referring to the userLimit, and perhaps that is the problem: I don't need to store the userLimit, I need to store whether it needs to unreserve and, if so, how much. Therefore it fits better as a local transient variable rather than a globally stored one. If you store just the userLimit, then you need to recalculate things, which I'm trying to avoid. I understand why we are storing the current information in ResourceLimits, because it has to do with headroom and parent limits and is recalculated at various points, but the current implementation in canAssignToUser doesn't use headroom at all, and whether we need to unreserve or not on the last call to assignContainers doesn't affect the headroom calculation. Again, basically all we would be doing is placing an extra global variable (or variables) in the ResourceLimits class just to pass it on down a couple of functions. That to me is a parameter. Now if we had multiple things needing this or updating it, then to me it fits better in ResourceLimits. > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497055#comment-14497055 ] Thomas Graves commented on YARN-3434: - I agree with the Both section. I'm not sure I completely follow the Only section. Are you suggesting we change the patch to modify ResourceLimits and pass it down rather than using the LimitsInfo class? If so, that won't work, at least not without adding the shouldContinue flag to it. Unless you mean keep the LimitsInfo class for use locally in assignContainers, and then pass ResourceLimits down to assignContainer with the value of amountNeededUnreserve as the limit. That wouldn't really change much except the object we pass down through the functions. > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497076#comment-14497076 ] Thomas Graves commented on YARN-3434: - So you are saying add amountNeededUnreserve to ResourceLimits and then set the global currentResourceLimits.amountNeededUnreserve inside of canAssignToUser? This is what I was not in favor of above, and there would be no need to pass it down as a parameter. Or were you saying create a ResourceLimits instance, pass it as a parameter to canAssignToUser and canAssignToThisQueue, and modify that instance? That instance would then be passed down through to assignContainer()? I don't see how else you would set the ResourceLimits. > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499803#comment-14499803 ] Thomas Graves commented on YARN-3434: - Ok, I'll make the changes and post an updated patch > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Updated patch with review comments. > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Upmerged to latest > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503760#comment-14503760 ] Thomas Graves commented on YARN-3294: - [~xgong] [~vvasudev] I saw this show up in the UI on branch-2. I don't see any permission checks on this; am I perhaps missing them? We don't want arbitrary users to be able to change the log level on the RM. They could slow it down and cause disks to fill up. I also don't see an option to disable this; is there one? If not, I think we want one. Honestly, I don't really see a need for this button at all, since you can change the level via the logLevel app. But since it's in, we at least need to protect it and, in my opinion, disable it for normal users. > Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time > period > - > > Key: YARN-3294 > URL: https://issues.apache.org/jira/browse/YARN-3294 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.8.0 > > Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, > apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, > apache-yarn-3294.3.patch, apache-yarn-3294.4.patch > > > It would be nice to have a button on the web UI that would allow dumping of > debug logs for just the capacity scheduler for a fixed period of time(1 min, > 5 min or so) in a separate log file. It would be useful when debugging > scheduler behavior without affecting the rest of the resourcemanager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504954#comment-14504954 ] Thomas Graves commented on YARN-3517: - Thanks for following up on this. Could you also change it to not show the button if you aren't an admin? I don't want to confuse users by having a button there that doesn't do anything. One other thing: could you add some CSS or similar to make it look more like a button? Right now it just looks like text, and I didn't know it was clickable at first. The placement of it seems a bit weird to me also, but as long as it's only showing up for admins, that is less of an issue. I haven't looked at the patch in detail, but I see we are creating a new AdminACLsManager each time. It would be nice if we didn't have to do that. > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Affects Versions: 2.7.0 >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Labels: security > Attachments: YARN-3517.001.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch updated based on review comments > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, > YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Upmerged patch to latest > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, > YARN-3434.patch, YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Fixed the line length and the whitespace style issues. Other than that, I moved things around and it's just complaining about the same things more. > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, > YARN-3434.patch, YARN-3434.patch, YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Attaching exact same patch to kick jenkins again > Interaction between reservations and userlimit can result in significant ULF > violation > -- > > Key: YARN-3434 > URL: https://issues.apache.org/jira/browse/YARN-3434 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, > YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch > > > ULF was set to 1.0 > User was able to consume 1.4X queue capacity. > It looks like when this application launched, it reserved about 1000 > containers, each 8G each, within about 5 seconds. I think this allowed the > logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509152#comment-14509152 ] Thomas Graves commented on YARN-3517: - {code}
+ // non-secure mode with no acls enabled
+ if (!isAdmin && !UserGroupInformation.isSecurityEnabled()
+     && !adminACLsManager.areACLsEnabled()) {
+   isAdmin = true;
+ }
{code} We don't need the isSecurityEnabled check; just keep the one for areACLsEnabled. This could be combined with the previous if (make this the else-if part), but that isn't a big deal. In QueuesBlock we are creating the AdminACLsManager on every web page load. Perhaps a better way would be to use this.rm.getApplicationACLsManager() and extend the ApplicationACLsManager to expose isAdmin functionality. > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Affects Versions: 2.7.0 >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Labels: security > Attachments: YARN-3517.001.patch, YARN-3517.002.patch, > YARN-3517.003.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
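A minimal sketch of the simplified check being suggested, with AclsManager standing in for the real AdminACLsManager (illustrative only, not the final patch code):
{code:java}
// When ACLs are not enabled, treat every caller as an admin; no
// isSecurityEnabled check is needed.
public class AclCheckSketch {
  interface AclsManager {
    boolean areACLsEnabled();
  }

  static boolean effectiveIsAdmin(boolean isAdmin, AclsManager adminACLsManager) {
    if (!isAdmin && !adminACLsManager.areACLsEnabled()) {
      return true; // no ACLs configured: everyone is effectively an admin
    }
    return isAdmin;
  }

  public static void main(String[] args) {
    System.out.println(effectiveIsAdmin(false, () -> false)); // true
    System.out.println(effectiveIsAdmin(false, () -> true));  // false
  }
}
{code}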
[jira] [Assigned] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned YARN-3517: --- Assignee: Thomas Graves (was: Varun Vasudev) > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: Varun Vasudev >Assignee: Thomas Graves >Priority: Blocker > Labels: security > Attachments: YARN-3517.001.patch, YARN-3517.002.patch, > YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518054#comment-14518054 ] Thomas Graves commented on YARN-3517: - In RMWebServices.java we don't need the isSecurityEnabled check; just remove the entire check. My reasoning is that the logLevel app does not do those checks, it simply makes sure you are an admin. {code}
+    if (UserGroupInformation.isSecurityEnabled() && callerUGI == null) {
+      String msg = "Unable to obtain user name, user not authenticated";
+      throw new AuthorizationException(msg);
+    }
{code} In the test TestRMWebServices.java we aren't actually asserting anything; we should assert that the expected files exist. Personally, I would also like to see an assert that the expected exception occurred. > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: Varun Vasudev >Assignee: Thomas Graves >Priority: Blocker > Labels: security > Attachments: YARN-3517.001.patch, YARN-3517.002.patch, > YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
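For example, the kind of assertion being asked for might look like the following; the log directory and file-name prefix here are assumptions for illustration, not the actual test values:
{code:java}
import java.io.File;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;

public class SchedulerLogDumpAssertSketch {
  // Assert that dumping scheduler logs actually produced a file on disk.
  static void assertDumpExists(File logDir, String prefix) {
    File[] dumps = logDir.listFiles((dir, name) -> name.startsWith(prefix));
    assertNotNull("log dir should be listable: " + logDir, dumps);
    assertTrue("expected a scheduler debug log dump in " + logDir,
        dumps.length > 0);
  }
}
{code}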
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520252#comment-14520252 ] Thomas Graves commented on YARN-3517: - changes look good, +1. thanks [~vvasudev] > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: Varun Vasudev >Assignee: Thomas Graves >Priority: Blocker > Labels: security > Attachments: YARN-3517.001.patch, YARN-3517.002.patch, > YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, > YARN-3517.006.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520355#comment-14520355 ] Thomas Graves commented on YARN-3517: - thanks [~vinodkv] I missed that. > RM web ui for dumping scheduler logs should be for admins only > -- > > Key: YARN-3517 > URL: https://issues.apache.org/jira/browse/YARN-3517 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, security >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Labels: security > Fix For: 2.8.0 > > Attachments: YARN-3517.001.patch, YARN-3517.002.patch, > YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, > YARN-3517.006.patch > > > YARN-3294 allows users to dump scheduler logs from the web UI. This should be > for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521580#comment-14521580 ] Thomas Graves commented on YARN-3243: - [~leftnoteasy] Can we pull this back into the branch-2.7? > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522021#comment-14522021 ] Thomas Graves commented on YARN-3243: - I was wanting to pull YARN-3434 back into 2.7. It kind of depends on this one. At least I think it would merge cleanly if this one was there. This is also fixing a bug which I would like to see fixed in the 2.7 line if we are going to use it. It's not a blocker, since it exists in our 2.6, but it would be nice to have. If we decide it's too big, then I'll just port YARN-3434 back without it. > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066 ] Thomas Graves commented on YARN-3243: - It might not merge completely cleanly, but it wouldn't be required for functionality. It would be nice to have this in 2.7 either way though. > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066 ] Thomas Graves edited comment on YARN-3243 at 4/30/15 7:02 PM: -- It might not merge completely cleanly, but it wouldn't be required for functionality. It would be nice to have this in 2.7 either way though. I'll try it out later and see. was (Author: tgraves): It might to merge completely clean but it wouldn't require it for functionality. It would be nice to have this in 2.7 either way though. > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523187#comment-14523187 ] Thomas Graves commented on YARN-3243: - Thanks [~leftnoteasy]. I'll attempt to merge YARN-3434. If it's not clean, I'll put up a patch for it. > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3243: Fix Version/s: 2.7.1 > CapacityScheduler should pass headroom from parent to children to make sure > ParentQueue obey its capacity limits. > - > > Key: YARN-3243 > URL: https://issues.apache.org/jira/browse/YARN-3243 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0, 2.7.1 > > Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, > YARN-3243.4.patch, YARN-3243.5.patch > > > Now CapacityScheduler has some issues to make sure ParentQueue always obeys > its capacity limits, for example: > 1) When allocating container of a parent queue, it will only check > parentQueue.usage < parentQueue.max. If leaf queue allocated a container.size > > (parentQueue.max - parentQueue.usage), parent queue can excess its max > resource limit, as following example: > {code} > A (usage=54, max=55) >/ \ > A1 A2 (usage=1, max=55) > (usage=53, max=53) > {code} > Queue-A2 is able to allocate container since its usage < max, but if we do > that, A's usage can excess A.max. > 2) When doing continous reservation check, parent queue will only tell > children "you need unreserve *some* resource, so that I will less than my > maximum resource", but it will not tell how many resource need to be > unreserved. This may lead to parent queue excesses configured maximum > capacity as well. > With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, > *here is my proposal*: > - ParentQueue will set its children's ResourceUsage.headroom, which means, > *maximum resource its children can allocate*. > - ParentQueue will set its children's headroom to be (saying parent's name is > "qA"): min(qA.headroom, qA.max - qA.used). This will make sure qA's > ancestors' capacity will be enforced as well (qA.headroom is set by qA's > parent). > - {{needToUnReserve}} is not necessary, instead, children can get how much > resource need to be unreserved to keep its parent's resource limit. > - More over, with this, YARN-3026 will make a clear boundary between > LeafQueue and FiCaSchedulerApp, headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()
[ https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524042#comment-14524042 ] Thomas Graves commented on YARN-1631: - We need to be careful with this. You could end up starving out the first application, and it definitely changes current semantics. What version of Hadoop are you seeing this issue on? With my reservations continue-looking patch, it should actually look at node 2, take that one, and unreserve on node 1. There is the needsContainer logic that might be affecting this; I would have to look at it more. > Container allocation issue in Leafqueue assignContainers() > -- > > Key: YARN-1631 > URL: https://issues.apache.org/jira/browse/YARN-1631 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: SuSe 11 Linux >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch > > > Application1 has a demand of 8GB[Map Task Size as 8GB] which is more than > Node_1 can handle. > Node_1 has a size of 8GB and 2GB is used by Application1's AM. > Hence reservation happened for remaining 6GB in Node_1 by Application1. > A new job is submitted with 2GB AM size and 2GB task size with only 2 Maps to > run. > Node_2 also has 8GB capability. > But Application2's AM cannot be launched in Node_2. And Application2 waits > longer as only 2 Nodes are available in cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436295#comment-16436295 ] Thomas Graves commented on YARN-8149: - Are you going to do anything with starvation then, or with allocating a certain % more than what is required? I am hesitant to remove this without doing some major testing. I haven't had a chance to look at the latest code to investigate. It might be more fine now that we do continue looking at other nodes after a reservation, whereas originally that didn't happen. Is in-queue preemption on by default? > Revisit behavior of Re-Reservation in Capacity Scheduler > > > Key: YARN-8149 > URL: https://issues.apache.org/jira/browse/YARN-8149 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Priority: Critical > > Frankly speaking, I'm not sure why we need the re-reservation. The formula is > not that easy to understand: > Inside: > {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}} > {code:java} > starvation = re-reservation / (#reserved-container * > (1 - min(requested-resource / max-alloc, > max-alloc - min-alloc / max-alloc)) > should_allocate = starvation + requiredContainers - reservedContainers > > 0{code} > I think we should be able to remove the starvation computation, just to check > requiredContainers > reservedContainers should be enough. > In a large cluster, we can easily overflow re-reservation to MAX_INT, see > YARN-7636. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
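For reference, the formula quoted in the description restated as plain Java (toy arithmetic in doubles for clarity; this mirrors the description above, not the actual RegularContainerAllocator code):
{code:java}
public class StarvationSketch {
  static boolean shouldAllocOrReserve(long reReservations, int reservedContainers,
      int requiredContainers, double requested, double maxAlloc, double minAlloc) {
    double ratio = Math.min(requested / maxAlloc, (maxAlloc - minAlloc) / maxAlloc);
    double starvation = reservedContainers == 0
        ? 0 : reReservations / (reservedContainers * (1.0 - ratio));
    return starvation + requiredContainers - reservedContainers > 0;
  }

  public static void main(String[] args) {
    // Repeated re-reservations push should_allocate positive even when
    // required == reserved; that is the starvation escape hatch in question.
    System.out.println(shouldAllocOrReserve(100, 4, 4, 8, 16, 1)); // true
  }
}
{code}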
[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436366#comment-16436366 ] Thomas Graves commented on YARN-8149: - Thinking about this a little more: even with the current preemption on, I don't think preemption is smart enough to keep starvation from happening. If preemption were smart enough to kill enough containers on a reserved node so that the big container actually gets scheduled there, that might be OK. But last time I checked, it doesn't do that. Without that, or another way to prevent starvation, I wouldn't want to remove this. I think adding a config would be alright, but if anyone finds it useful we can't remove it, and it would just be an extra config. If we have other ideas to simplify this or make it better, great, we should look at them. Or if there is a way for us to get stats on whether this is useful, we could add those, run with them, and determine whether we should remove it. > Revisit behavior of Re-Reservation in Capacity Scheduler > > > Key: YARN-8149 > URL: https://issues.apache.org/jira/browse/YARN-8149 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Priority: Critical > > Frankly speaking, I'm not sure why we need the re-reservation. The formula is > not that easy to understand: > Inside: > {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}} > {code:java} > starvation = re-reservation / (#reserved-container * > (1 - min(requested-resource / max-alloc, > max-alloc - min-alloc / max-alloc)) > should_allocate = starvation + requiredContainers - reservedContainers > > 0{code} > I think we should be able to remove the starvation computation, just to check > requiredContainers > reservedContainers should be enough. > In a large cluster, we can easily overflow re-reservation to MAX_INT, see > YARN-7636. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container
[ https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373039#comment-16373039 ] Thomas Graves commented on YARN-7935: - [~mridulm80] what is the Spark JIRA for this? If this goes in, Spark will still have to grab this from the env to pass it in to the executorRunnable. > Expose container's hostname to applications running within the docker > container > --- > > Key: YARN-7935 > URL: https://issues.apache.org/jira/browse/YARN-7935 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-7935.1.patch, YARN-7935.2.patch > > > Some applications have a need to bind to the container's hostname (like > Spark) which is different from the NodeManager's hostname(NM_HOST which is > available as an env during container launch) when launched through Docker > runtime. The container's hostname can be exposed to applications via an env > CONTAINER_HOSTNAME. Another potential candidate is the container's IP but > this can be addressed in a separate jira. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container
[ https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374598#comment-16374598 ] Thomas Graves commented on YARN-7935: - Thanks for the explanation, Mridul. I'm fine with waiting on the Spark JIRA until you know the scope better. I'm currently not doing anything with bridge mode, so I won't be able to help there at this point. > Expose container's hostname to applications running within the docker > container > --- > > Key: YARN-7935 > URL: https://issues.apache.org/jira/browse/YARN-7935 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-7935.1.patch, YARN-7935.2.patch > > > Some applications have a need to bind to the container's hostname (like > Spark) which is different from the NodeManager's hostname(NM_HOST which is > available as an env during container launch) when launched through Docker > runtime. The container's hostname can be exposed to applications via an env > CONTAINER_HOSTNAME. Another potential candidate is the container's IP but > this can be addressed in a separate jira. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache
[ https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681525#comment-16681525 ] Thomas Graves commented on YARN-8991: - [~teonadi] can you clarify here: are you saying it's not getting cleaned up while the Spark application is still running, or that it's not getting cleaned up after the Spark application finishes? > nodemanager not cleaning blockmgr directories inside appcache > -- > > Key: YARN-8991 > URL: https://issues.apache.org/jira/browse/YARN-8991 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Hidayat Teonadi >Priority: Major > Attachments: yarn-nm-log.txt > > > Hi, I'm running spark on yarn and have enabled the Spark Shuffle Service. I'm > noticing that during the lifetime of my spark streaming application, the nm > appcache folder is building up with blockmgr directories (filled with > shuffle_*.data). > Looking into the nm logs, it seems like the blockmgr directories is not part > of the cleanup process of the application. Eventually disk will fill up and > app will crash. I have both > {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and > {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think its > a configuration issue. > What is stumping me is the executor ID listed by spark during the external > shuffle block registration doesn't match the executor ID listed in yarn's nm > log. Maybe this executorID disconnect explains why the cleanup is not done ? > I'm assuming that blockmgr directories are supposed to be cleaned up ? > > {noformat} > 2018-11-05 15:01:21,349 INFO > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered > executor AppExecId{appId=application_1541045942679_0193, execId=1299} with > ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42], > subDirsPerLocalDir=64, > shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager} > {noformat} > > seems similar to https://issues.apache.org/jira/browse/YARN-7070, although > I'm not sure if the behavior I'm seeing is spark use related. > [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files] > has a stop gap solution of cleaning up via cron. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache
[ https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683946#comment-16683946 ] Thomas Graves commented on YARN-8991: - If it's while it's running, then you should file this with Spark. It's very similar to https://issues.apache.org/jira/browse/SPARK-17233. The Spark external shuffle service doesn't support that at this point. The problem is that you may have a Spark executor running on one host that generates some map output data to shuffle, and then that executor exits because it's not needed anymore. When a reduce starts, it just talks to the YARN NodeManager and the external shuffle server to get the map output. Now there is no executor left on the node to clean up the shuffle output. Support would have to be added for, e.g., the driver to tell the Spark external shuffle service to clean up. If you don't use dynamic allocation and the external shuffle service, it should clean up properly. > nodemanager not cleaning blockmgr directories inside appcache > -- > > Key: YARN-8991 > URL: https://issues.apache.org/jira/browse/YARN-8991 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: Hidayat Teonadi >Priority: Major > Attachments: yarn-nm-log.txt > > > Hi, I'm running spark on yarn and have enabled the Spark Shuffle Service. I'm > noticing that during the lifetime of my spark streaming application, the nm > appcache folder is building up with blockmgr directories (filled with > shuffle_*.data). > Looking into the nm logs, it seems like the blockmgr directories is not part > of the cleanup process of the application. Eventually disk will fill up and > app will crash. I have both > {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and > {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think its > a configuration issue. > What is stumping me is the executor ID listed by spark during the external > shuffle block registration doesn't match the executor ID listed in yarn's nm > log. Maybe this executorID disconnect explains why the cleanup is not done ? > I'm assuming that blockmgr directories are supposed to be cleaned up ? > > {noformat} > 2018-11-05 15:01:21,349 INFO > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered > executor AppExecId{appId=application_1541045942679_0193, execId=1299} with > ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42], > subDirsPerLocalDir=64, > shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager} > {noformat} > > seems similar to https://issues.apache.org/jira/browse/YARN-7070, although > I'm not sure if the behavior I'm seeing is spark use related. > [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files] > has a stop gap solution of cleaning up via cron. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
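To make the last point concrete, the combination being described (no dynamic allocation, no external shuffle service, so executors clean up their own shuffle files) would look like this in spark-defaults.conf; both property names are standard Spark settings:
{noformat}
spark.dynamicAllocation.enabled false
spark.shuffle.service.enabled false
{noformat}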
[jira] [Created] (YARN-7204) Localizer errors on archive without any files
Thomas Graves created YARN-7204: --- Summary: Localizer errors on archive without any files Key: YARN-7204 URL: https://issues.apache.org/jira/browse/YARN-7204 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.8.1 Reporter: Thomas Graves If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.
Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000 Failing this attempt. Diagnostics: No such file or directory ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://axonitered-jt1.red.ygrid.yahoo.com:50508/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
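A directories-only archive that triggers this can be produced like the following sketch; the file and entry names are hypothetical, and any archive tool that can emit a zip containing only directory entries works just as well:
{code:java}
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class EmptyArchiveRepro {
  public static void main(String[] args) throws Exception {
    // Write a zip whose only entry is a directory (trailing slash marks it
    // as a directory entry), i.e. an archive with no regular files at all.
    try (ZipOutputStream zos =
        new ZipOutputStream(new FileOutputStream("dirs-only.zip"))) {
      zos.putNextEntry(new ZipEntry("somedir/"));
      zos.closeEntry();
    }
  }
}
{code}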
[jira] [Updated] (YARN-7204) Localizer errors on archive without any files
[ https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-7204: Description: If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.
Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000 Failing this attempt. Diagnostics: No such file or directory ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://rm.com:50508/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application.
was: If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.
Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000 Failing this attempt. Diagnostics: No such file or directory ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://axonitered-jt1.red.ygrid.yahoo.com:50508/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7204) Localizer errors on archive without any files
[ https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-7204: Description: If a user sends an archive without any files in it (only directories), yarn fails to localize it with the error below. I ran into this specifically while running a Spark job, but it looks generic to the localizer. Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000 Failing this attempt.Diagnostics: No such file or directory ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767) at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218) at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264) at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009) at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012) at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421) at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945) at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) For more detailed output, check the application tracking page: https://rm.com:50708/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. . Failing the application. was: If a user sends an archive without any files in it (only directories), yarn fails to localize it with the error below. I ran into this specifically while running a Spark job, but it looks generic to the localizer. 
Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000 Failing this attempt.Diagnostics: No such file or directory ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767) at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218) at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264) at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009) at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012) at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421) at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945) at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Attaching the same patch to kick Jenkins. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
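The proposal in the description above (keep examining incoming nodes and swap a reservation for a real allocation) can be illustrated with a toy model; the class, method names, and numbers below are illustrative stand-ins, not the actual CapacityScheduler types.
{code:java}
// Toy model of the reservation-swap idea: when a node heartbeats in with
// enough free space for the full request, unreserve elsewhere and allocate
// here instead of skipping the node because the reservation limit was hit.
public class ReservationSwap {
  static final int REQUEST_MB = 8192;

  static boolean shouldSwapReservation(int nodeAvailableMb,
      boolean hasOutstandingReservation) {
    return hasOutstandingReservation && nodeAvailableMb >= REQUEST_MB;
  }

  public static void main(String[] args) {
    System.out.println(shouldSwapReservation(10240, true)); // true: allocate here
    System.out.println(shouldSwapReservation(4096, true));  // false: keep waiting
  }
}
{code}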
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Updated tests to handle SystemMetricsPublisher. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Fixed the patch. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148098#comment-14148098 ] Thomas Graves commented on YARN-1769: - We've been running this on a cluster for quite a while now, and it's showing great improvements in the time to get larger containers. I would like to put this in. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149565#comment-14149565 ] Thomas Graves commented on YARN-1769: - Thanks for the review, Jason. I'll update the patch and remove some of the logging or make it truly debug. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Patch with log statements changed to debug. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit on the number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fulfill the request. > The other place for improvement is that currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to get its resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-443) allow OS scheduling priority of NM to be different than the containers it launches
[ https://issues.apache.org/jira/browse/YARN-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185578#comment-14185578 ] Thomas Graves commented on YARN-443: Can you be more specific about what is different and why it is a problem? The trunk patch shows that there was an existing getRunCommand() routine (before this change), whereas the other didn't have one before (it looks like it was for Windows support). > allow OS scheduling priority of NM to be different than the containers it > launches > -- > > Key: YARN-443 > URL: https://issues.apache.org/jira/browse/YARN-443 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.0.3-alpha, 0.23.6 >Reporter: Thomas Graves >Assignee: Thomas Graves > Fix For: 0.23.7, 2.0.4-alpha > > Attachments: YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, > YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, > YARN-443-branch-2.patch, YARN-443-branch-2.patch, YARN-443-branch-2.patch, > YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, > YARN-443.patch, YARN-443.patch, YARN-443.patch > > > It would be nice if we could have the nodemanager run at a different OS > scheduling priority than the containers so that you can still communicate > with the nodemanager if the containers are out of control. > On Linux we could launch the nodemanager at a higher priority, but then all > the containers it launches would also be at that higher priority, so we need > a way for the container executor to launch them at a lower priority. > I'm not sure how this applies to Windows, if at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
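The idea in the quoted description, launching containers at a lower OS priority than the NM, amounts to prefixing the container launch command with nice. A hedged sketch follows; the class, method, and niceness value are illustrative, not the actual executor code.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the YARN-443 idea: the NM stays at its own priority while the
// launched container script runs at a lower one via a nice prefix.
public class NicePrefix {
  static List<String> runCommand(String script, int containerNiceness) {
    List<String> cmd = new ArrayList<>();
    cmd.add("nice");
    cmd.add("-n");
    cmd.add(Integer.toString(containerNiceness)); // e.g. NM at 0, containers at +10
    cmd.add("bash");
    cmd.add(script);
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(runCommand("/tmp/launch_container.sh", 10));
  }
}
{code}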
[jira] [Commented] (YARN-2828) Enable auto refresh of web pages (using http parameter)
[ https://issues.apache.org/jira/browse/YARN-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202033#comment-14202033 ] Thomas Graves commented on YARN-2828: - Auto refresh was removed because some pages load a lot of data and you actually may not want them to update. It can make debugging harder if you are looking at a lot of data and the screen keeps refreshing on you. I think the only way to bring it back is to make it optional. > Enable auto refresh of web pages (using http parameter) > --- > > Key: YARN-2828 > URL: https://issues.apache.org/jira/browse/YARN-2828 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Tim Robertson >Priority: Minor > > The MR1 Job Tracker had a useful HTTP parameter of e.g. "&refresh=3" that > could be appended to URLs which enabled a page reload. This was very useful > when developing mapreduce jobs, especially to watch counters changing. This > is lost in the Yarn interface. > Could be implemented as a page element (e.g. drop down or so), but I'd > recommend that the page not be more cluttered, and simply bring back the > optional "refresh" HTTP param. It worked really nicely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
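The optional-refresh behavior both sides converge on here is small to express. A minimal sketch, assuming a hypothetical helper rather than the actual YARN web framework; the "refresh" parameter name mirrors the old JobTracker behavior described in the issue.
{code:java}
// Only emit a meta refresh tag when the caller asked for it via an HTTP
// parameter, so heavy pages stay static by default.
public class RefreshTag {
  static String metaRefresh(String refreshParam) {
    if (refreshParam == null) {
      return ""; // no auto refresh unless explicitly requested
    }
    int seconds = Integer.parseInt(refreshParam);
    return "<meta http-equiv=\"refresh\" content=\"" + seconds + "\">";
  }

  public static void main(String[] args) {
    System.out.println(metaRefresh("3"));  // page reloads every 3 seconds
    System.out.println(metaRefresh(null)); // debugging-friendly: no reload
  }
}
{code}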
[jira] [Commented] (YARN-563) Add application type to ApplicationReport
[ https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662015#comment-13662015 ] Thomas Graves commented on YARN-563: I agree with Hitesh; I think this should be in the web UI and webservices as well as the CLI. This could be very useful to anyone debugging their application through the web UI, SEs looking for patterns or issues with a particular type of application, tools using the webservices to aggregate info and create their own useful experiences, etc. Mayank, I'm not sure what you consider attributes. Are you referring just to the filtering part? The web UI and webservices print almost everything that is part of the application report. I'm OK with the web UI/webservices being added under a separate jira, but I would have rather seen them done here with the CLI part. > Add application type to ApplicationReport > -- > > Key: YARN-563 > URL: https://issues.apache.org/jira/browse/YARN-563 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Thomas Weise >Assignee: Mayank Bansal > Attachments: YARN-563-trunk-1.patch, YARN-563-trunk-2.patch, > YARN-563-trunk-3.patch, YARN-563-trunk-4.patch > > > This field is needed to distinguish different types of applications (app > master implementations). For example, we may run applications of type XYZ in > a cluster alongside MR and would like to filter applications by type. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-563) Add application type to ApplicationReport
[ https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662237#comment-13662237 ] Thomas Graves commented on YARN-563: Sorry if I wasn't clear; I think we might be mixing terms here. In my mind there is showing them at all in the web UI/webservices, and then there is the additional thing of supporting filtering on them. I agree with you that the filtering part is separate. Showing them in the web UI and webservices is, to me, the same thing as showing them in the output of the yarn application CLI command. > Add application type to ApplicationReport > -- > > Key: YARN-563 > URL: https://issues.apache.org/jira/browse/YARN-563 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Thomas Weise >Assignee: Mayank Bansal > Attachments: YARN-563-trunk-1.patch, YARN-563-trunk-2.patch, > YARN-563-trunk-3.patch, YARN-563-trunk-4.patch > > > This field is needed to distinguish different types of applications (app > master implementations). For example, we may run applications of type XYZ in > a cluster alongside MR and would like to filter applications by type. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
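From the client side, consuming the new field looks roughly like the sketch below. It assumes the getApplicationType() getter this jira adds and the type-filtering getApplications(Set) overload that later 2.x client releases expose; treat both as assumptions if you are on an older build.
{code:java}
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Lists applications of a given type, printing the new applicationType field.
public class ListByType {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    for (ApplicationReport report :
        client.getApplications(Collections.singleton("MAPREDUCE"))) {
      System.out.println(report.getApplicationId() + " type="
          + report.getApplicationType());
    }
    client.stop();
  }
}
{code}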
[jira] [Commented] (YARN-563) Add application type to ApplicationReport
[ https://issues.apache.org/jira/browse/YARN-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664068#comment-13664068 ] Thomas Graves commented on YARN-563: Thanks Mayank, can you please update the web services documentation also? Similar to http://hadoop.apache.org/docs/r2.0.4-alpha/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html it's in ./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Add application type to ApplicationReport > -- > > Key: YARN-563 > URL: https://issues.apache.org/jira/browse/YARN-563 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Thomas Weise >Assignee: Mayank Bansal > Attachments: YARN-563-trunk-10-jenkins.patch, > YARN-563-trunk-10-review.patch, YARN-563-trunk-1.patch, > YARN-563-trunk-2.patch, YARN-563-trunk-3.patch, YARN-563-trunk-4.patch, > YARN-563-trunk-5.patch, YARN-563-trunk-6.patch, YARN-563-trunk-7.patch, > YARN-563-trunk-8.patch, YARN-563-trunk-9-jenkins.patch, > YARN-563-trunk-9-review.patch > > > This field is needed to distinguish different types of applications (app > master implementations). For example, we may run applications of type XYZ in > a cluster alongside MR and would like to filter applications by type. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-126) yarn rmadmin help message contains reference to hadoop cli and JT
[ https://issues.apache.org/jira/browse/YARN-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-126: --- Target Version/s: 3.0.0, 2.0.5-beta, 0.23.9 (was: 3.0.0, 2.0.5-beta, 0.23.8) > yarn rmadmin help message contains reference to hadoop cli and JT > - > > Key: YARN-126 > URL: https://issues.apache.org/jira/browse/YARN-126 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.0.3-alpha >Reporter: Thomas Graves >Assignee: Rémy SAISSY > Labels: usability > Attachments: YARN-126.patch > > > has option to specify a job tracker and the last line for general command > line syntax had "bin/hadoop command [genericOptions] [commandOptions]" > ran "yarn rmadmin" to get usage: > RMAdmin > Usage: java RMAdmin >[-refreshQueues] >[-refreshNodes] >[-refreshUserToGroupsMappings] >[-refreshSuperUserGroupsConfiguration] >[-refreshAdminAcls] >[-refreshServiceAcl] >[-help [cmd]] > Generic options supported are > -conf specify an application configuration file > -D use value for given property > -fs specify a namenode > -jt specify a job tracker > -files specify comma separated files to be > copied to the map reduce cluster > -libjars specify comma separated jar files > to include in the classpath. > -archives specify comma separated > archives to be unarchived on the compute machines. > The general command line syntax is > bin/hadoop command [genericOptions] [commandOptions] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-459) DefaultContainerExecutor doesn't log stderr from container launch
[ https://issues.apache.org/jira/browse/YARN-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-459: --- Target Version/s: 2.0.5-beta, 0.23.9 (was: 2.0.4-alpha, 0.23.8) > DefaultContainerExecutor doesn't log stderr from container launch > - > > Key: YARN-459 > URL: https://issues.apache.org/jira/browse/YARN-459 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.0.3-alpha, 0.23.7 >Reporter: Thomas Graves >Assignee: Sandy Ryza > > The DefaultContainerExecutor does not log stderr or add it to the diagnostics > message if something fails during the container launch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673458#comment-13673458 ] Thomas Graves commented on YARN-276: Nemon, sorry, it appears this got lost in the shuffle and no longer applies. Could you update the patch for the current trunk/branch-2? > Capacity Scheduler can hang when submit many jobs concurrently > -- > > Key: YARN-276 > URL: https://issues.apache.org/jira/browse/YARN-276 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0, 2.0.1-alpha >Reporter: nemon lou >Assignee: nemon lou > Labels: incompatible > Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > In hadoop2.0.1, when I submit many jobs concurrently at the same time, Capacity > scheduler can hang with most resources taken up by AMs and not enough > resources left for tasks. And then all applications hang there. > The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not > checked directly. Instead, this property is only used for maxActiveApplications. > And maxActiveApplications is computed from minimumAllocation (not from what the AM > actually uses). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-750) Allow for black-listing resources in CS
[ https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674461#comment-13674461 ] Thomas Graves commented on YARN-750: Javadoc warnings are complaining about "+ * {@link ResourceRequest#ANY}" in InvalidBlacklistRequestException.java. > Allow for black-listing resources in CS > --- > > Key: YARN-750 > URL: https://issues.apache.org/jira/browse/YARN-750 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Arun C Murthy > Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, > YARN-750.patch, YARN-750.patch, YARN-750.patch > > > YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of > resources. > This jira is a companion to allow for black-listing (in CS). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (YARN-750) Allow for black-listing resources in CS
[ https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674461#comment-13674461 ] Thomas Graves edited comment on YARN-750 at 6/4/13 2:59 PM: Javadoc warnings are complaining about "+ * {@link ResourceRequest#ANY}" in InvalidBlacklistRequestException.java. Update: ignore; it looks like Arun updated the patch as I was commenting. was (Author: tgraves): Javadoc warnings are complaining about "+ * {@link ResourceRequest#ANY}" in InvalidBlacklistRequestException.java. > Allow for black-listing resources in CS > --- > > Key: YARN-750 > URL: https://issues.apache.org/jira/browse/YARN-750 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Arun C Murthy > Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, > YARN-750.patch, YARN-750.patch, YARN-750.patch > > > YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of > resources. > This jira is a companion to allow for black-listing (in CS). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-750) Allow for black-listing resources in CS
[ https://issues.apache.org/jira/browse/YARN-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13674527#comment-13674527 ] Thomas Graves commented on YARN-750: Can we make BlacklistRequestPBImpl immutable since we are changing that in other places (YARN-735)? > Allow for black-listing resources in CS > --- > > Key: YARN-750 > URL: https://issues.apache.org/jira/browse/YARN-750 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun C Murthy >Assignee: Arun C Murthy > Attachments: YARN-750.patch, YARN-750.patch, YARN-750.patch, > YARN-750.patch, YARN-750.patch, YARN-750.patch > > > YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of > resources. > This jira is a companion to allow for black-listing (in CS). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
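From an AM's point of view, consuming the blacklisting feature discussed in these comments would look roughly like the sketch below. It assumes the AMRMClient#updateBlacklist call that later 2.x client libraries provide; if your client library predates it, treat this as illustrative only.
{code:java}
import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Tells the scheduler to stop placing this app's containers on a bad host.
public class BlacklistNode {
  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new Configuration());
    amClient.start();
    // Additions blacklist a host; pass it in removals later to undo.
    amClient.updateBlacklist(Arrays.asList("badnode.example.com"),
        Collections.<String>emptyList());
    amClient.stop();
  }
}
{code}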
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675315#comment-13675315 ] Thomas Graves commented on YARN-276: Thanks Nemon, I'm still reviewing it; here are a couple of things so far. I hope to finish reviewing later tonight. - LeafQueue - please wrap at 80 characters - LeafQueue - please use the @VisibleForTesting annotation on setMaxAMResourcePerQueuePerUserPercent - FicaSchedulerApp - "for" is misspelled as "foe" - FicaSchedulerApp - please use the @VisibleForTesting annotation around setAMResource I ran a few tests and looked at the scheduler web UI for the queue I was running in, and the used resources and AM used resources showed up blank even though there were jobs running. Can you please take a look to see why? The REST web services calls were returning values for those fields. > Capacity Scheduler can hang when submit many jobs concurrently > -- > > Key: YARN-276 > URL: https://issues.apache.org/jira/browse/YARN-276 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0, 2.0.1-alpha >Reporter: nemon lou >Assignee: nemon lou > Labels: incompatible > Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > In hadoop2.0.1, when I submit many jobs concurrently at the same time, Capacity > scheduler can hang with most resources taken up by AMs and not enough > resources left for tasks. And then all applications hang there. > The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not > checked directly. Instead, this property is only used for maxActiveApplications. > And maxActiveApplications is computed from minimumAllocation (not from what the AM > actually uses). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675574#comment-13675574 ] Thomas Graves commented on YARN-276: I need to spend more time looking through the new logic; here are a few more comments for now. Remove the comment from overAMUsedPercent about max active applications since it's no longer present. FicaSchedulerApp getAMResource: change amRequedResource -> amRequiredResource. The max active applications per user used to use the absolute queue capacity instead of the absolute max queue capacity. It was changed to use the absolute capacity because it uses the userLimitFactor in the calculation, which should be applied to the capacity and not the max capacity (see MAPREDUCE-3897 for more details). We should change overAMUsedPercentPerUser similarly to use absolute capacity, not absolute max capacity. This can be filed as a separate jira since it was pre-existing, but a bad app that requests 0 for the memory could cause a divide-by-zero exception. > Capacity Scheduler can hang when submit many jobs concurrently > -- > > Key: YARN-276 > URL: https://issues.apache.org/jira/browse/YARN-276 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0, 2.0.1-alpha >Reporter: nemon lou >Assignee: nemon lou > Labels: incompatible > Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > In hadoop2.0.1, when I submit many jobs concurrently at the same time, Capacity > scheduler can hang with most resources taken up by AMs and not enough > resources left for tasks. And then all applications hang there. > The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not > checked directly. Instead, this property is only used for maxActiveApplications. > And maxActiveApplications is computed from minimumAllocation (not from what the AM > actually uses). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
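The limit arithmetic being debated in these review comments reduces to a percentage applied to the queue's share of the cluster, with the user limit factor folded into the per-user cap. A hedged toy calculation (illustrative numbers, not the actual LeafQueue code):
{code:java}
// The AM resource cap is maxAMResourcePercent applied to the queue's slice
// of the cluster; the per-user cap additionally multiplies in the user
// limit factor, as discussed above.
public class AmResourceLimit {
  public static void main(String[] args) {
    int clusterMb = 1024 * 100;          // 100 GB cluster
    double absoluteCapacity = 0.2;       // queue gets 20% of the cluster
    double maxAmResourcePercent = 0.1;   // 10% of that may be AMs
    double userLimitFactor = 1.0;

    double queueAmLimitMb = clusterMb * absoluteCapacity * maxAmResourcePercent;
    double perUserAmLimitMb = queueAmLimitMb * userLimitFactor;

    System.out.println("queue AM limit (MB): " + queueAmLimitMb);    // 2048.0
    System.out.println("per-user AM limit (MB): " + perUserAmLimitMb);
  }
}
{code}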
[jira] [Commented] (YARN-764) blank Used Resources on Capacity Scheduler page
[ https://issues.apache.org/jira/browse/YARN-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675917#comment-13675917 ] Thomas Graves commented on YARN-764: Nemon, thanks for looking at this. I guess it's because it uses the raw "<" character instead of the escaped "&lt;". Another option, which I would prefer, is just to escape the string using StringEscapeUtils.escapeHtml() in CapacitySchedulerPage.java. That way, if someone adds something else to the string in the future or it accidentally gets changed back, it will still work. > blank Used Resources on Capacity Scheduler page > > > Key: YARN-764 > URL: https://issues.apache.org/jira/browse/YARN-764 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.4-alpha >Reporter: nemon lou >Assignee: nemon lou > Attachments: YARN-764.patch > > > Even when there are jobs running, used resources is empty on the Capacity > Scheduler page for a leaf queue. (I use google-chrome on windows 7.) > After changing resource.java's toString method by replacing "<>" with > "{}", this bug gets fixed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
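For reference, the preferred fix sketched in the comment above is a one-liner with the commons-lang StringEscapeUtils that Hadoop of this era ships; the class name below is a hypothetical wrapper for demonstration.
{code:java}
import org.apache.commons.lang.StringEscapeUtils;

// Escape Resource#toString() output before writing it into the scheduler
// page, so "<memory:4096, vCores:2>" is not swallowed by the browser as an
// unknown HTML tag.
public class EscapeUsedResources {
  public static void main(String[] args) {
    String usedResources = "<memory:4096, vCores:2>";
    // Renders literally in the page instead of disappearing.
    System.out.println(StringEscapeUtils.escapeHtml(usedResources));
  }
}
{code}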
[jira] [Commented] (YARN-768) RM crashes due to DNS issue
[ https://issues.apache.org/jira/browse/YARN-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677026#comment-13677026 ] Thomas Graves commented on YARN-768: This is a dup of YARN-713. It's had more work done on it, so let's use that one. > RM crashes due to DNS issue > --- > > Key: YARN-768 > URL: https://issues.apache.org/jira/browse/YARN-768 > Project: Hadoop YARN > Issue Type: Bug >Reporter: PengZhang > Attachments: YARN-768_v1.patch > > > I encountered the problem described in MAPREDUCE-4295. And I think that patch has > been removed since YARN-39. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677027#comment-13677027 ] Thomas Graves commented on YARN-713: Note that this had been fixed at one time by MAPREDUCE-4295, but was lost. > ResourceManager can exit unexpectedly if DNS is unavailable > --- > > Key: YARN-713 > URL: https://issues.apache.org/jira/browse/YARN-713 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Priority: Critical > Attachments: YARN-713.patch, YARN-713.patch > > > As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could > lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and > that ultimately would cause the RM to exit. The RM should not exit during > DNS hiccups. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
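The failure mode here is an unhandled resolution error escaping the dispatcher thread. A hedged illustration with simplified stand-in types (not the actual AsyncDispatcher code): catch the lookup failure so a DNS hiccup is logged and retried instead of killing the process.
{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Simplified stand-in for an event handler that does a DNS lookup.
public class DnsSafeHandler {
  static void handleNodeEvent(String host) {
    try {
      InetAddress addr = InetAddress.getByName(host);
      System.out.println("resolved " + host + " -> " + addr.getHostAddress());
    } catch (UnknownHostException e) {
      // DNS hiccup: log and keep the dispatcher alive rather than rethrowing.
      System.err.println("DNS unavailable for " + host + ", will retry: " + e);
    }
  }

  public static void main(String[] args) {
    handleNodeEvent("localhost");
    handleNodeEvent("no-such-host.invalid");
  }
}
{code}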
[jira] [Updated] (YARN-713) ResourceManager can exit unexpectedly if DNS is unavailable
[ https://issues.apache.org/jira/browse/YARN-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-713: --- Assignee: Maysam Yabandeh > ResourceManager can exit unexpectedly if DNS is unavailable > --- > > Key: YARN-713 > URL: https://issues.apache.org/jira/browse/YARN-713 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.1.0-beta >Reporter: Jason Lowe >Assignee: Maysam Yabandeh >Priority: Critical > Attachments: YARN-713.patch, YARN-713.patch > > > As discussed in MAPREDUCE-5261, there's a possibility that a DNS outage could > lead to an unhandled exception in the ResourceManager's AsyncDispatcher, and > that ultimately would cause the RM to exit. The RM should not exit during > DNS hiccups. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-764) blank Used Resources on Capacity Scheduler page
[ https://issues.apache.org/jira/browse/YARN-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677340#comment-13677340 ] Thomas Graves commented on YARN-764: +1. Thanks Nemon, I'll commit this shortly. > blank Used Resources on Capacity Scheduler page > > > Key: YARN-764 > URL: https://issues.apache.org/jira/browse/YARN-764 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.4-alpha >Reporter: nemon lou >Assignee: nemon lou > Attachments: YARN-764.patch, YARN-764.patch > > > Even when there are jobs running, used resources is empty on the Capacity > Scheduler page for a leaf queue. (I use google-chrome on windows 7.) > After changing resource.java's toString method by replacing "<>" with > "{}", this bug gets fixed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677544#comment-13677544 ] Thomas Graves commented on YARN-276: Thanks for the updates, some comments: - We need to escapeHtml the AM used resources, similar to YARN-764. - I think you should put back maxAMResourcePerQueuePerUserPercent. The main reason being it's useful to show to users so that they know what limit they might be hitting. Otherwise their job could be waiting to activate and the UI doesn't show them any limits they might be hitting. The overAMUsedPercentPerUser should use the capacity, not maxCapacity. The per-user checks need to take into account the minimum user percent as well as the user limit factor (like it did in the previous version of the patch). Ideally this is dynamically figured out instead of being hardcoded like before, since you could have a user limit % at like 20%, but if there are only 2 users each user really gets 50%. That could be complicated based on the timing of things. The downside to the dynamic approach is that it makes it much harder for users to understand why their job might not be launched. It might make more sense to keep the formula similar to before, where it uses both user limit factor and user limit percent, for now, and file a separate jira to investigate making that more dynamic. That jira could also look into addressing the AM resource percent applying to the absolute max capacity. - Can you update the web services documentation (./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm)? - We can remove the "Per Queue" from the web UI: Max AM Resource Per Queue Percent. I think we can remove the "PerQueue" bit from the REST web services too: maxAMResourcePerQueuePercent -> maxAMResourcePercent. - We are keeping the AM used resource percent at the user level. It might be nice to output this at least through the REST webservices. It would be nice to have in the UI too, but I'm a bit afraid it's going to get too cluttered there. - The REST webservices printout of amUsedResources should be of type ResourceInfo so that you get it in separate fields like: 4096 2 (memory and vCores). The old format that we kept for backwards compatibility was: . We don't need that format since this is new. - TestApplicationLimits - remove the old comment - // set max active to 2 - TestApplicationLimits - why are you multiplying by the userLimitFactor? +Resource queueResource = Resources.multiply(clusterResources, +queue.getAbsoluteCapacity() * queue.getUserLimitFactor()); - What are the changes in TestClientTokens.java? - In the MiniYarnCluster why are we setting the AM resource percent to 100%? 
> Capacity Scheduler can hang when submit many jobs concurrently > -- > > Key: YARN-276 > URL: https://issues.apache.org/jira/browse/YARN-276 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.0.0, 2.0.1-alpha >Reporter: nemon lou >Assignee: nemon lou > Labels: incompatible > Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, > YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > In hadoop2.0.1, when I submit many jobs concurrently at the same time, Capacity > scheduler can hang with most resources taken up by AMs and not enough > resources left for tasks. And then all applications hang there. > The cause is that "yarn.scheduler.capacity.maximum-am-resource-percent" is not > checked directly. Instead, this property is only used for maxActiveApplications. > And maxActiveApplications is computed from minimumAllocation (not from what the AM > actually uses). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
[ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-862: --- Target Version/s: 0.23.10 (was: 0.23.9) > ResourceManager and NodeManager versions should match on node registration or > error out > --- > > Key: YARN-862 > URL: https://issues.apache.org/jira/browse/YARN-862 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager, resourcemanager >Affects Versions: 0.23.8 >Reporter: Robert Parker >Assignee: Robert Parker > Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch > > > For branch-0.23 the versions of the node manager and the resource manager > should match to complete a successful registration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-902) "Used Resources" field in Resourcemanager scheduler UI not displaying any values
[ https://issues.apache.org/jira/browse/YARN-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701988#comment-13701988 ] Thomas Graves commented on YARN-902: Are you using the latest branch-2 or the released 2.0.5-alpha? This might be a duplicate of YARN-764. > "Used Resources" field in Resourcemanager scheduler UI not displaying any > values > > > Key: YARN-902 > URL: https://issues.apache.org/jira/browse/YARN-902 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.0.5-alpha >Reporter: Nishan Shetty >Priority: Minor > > "Used Resources" field in Resourcemanager scheduler UI not displaying any > values -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-964) Give a parameter that can set AM retry interval
[ https://issues.apache.org/jira/browse/YARN-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718338#comment-13718338 ] Thomas Graves commented on YARN-964: How many NMs did you have? How would an AM retry interval have helped this? > Give a parameter that can set AM retry interval > > > Key: YARN-964 > URL: https://issues.apache.org/jira/browse/YARN-964 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: qus-jiawei > > Our AM retry number is 4. > As one nodemanager's disk was full, the AM container couldn't be allocated on > this nodemanager. But the RM tried this AM on the same NM every 3 seconds. > I think there should be a param to set the AM retry interval. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-589) Expose a REST API for monitoring the fair scheduler
[ https://issues.apache.org/jira/browse/YARN-589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734756#comment-13734756 ] Thomas Graves commented on YARN-589: Sorry for jumping in late on this. Do we have another jira for adding documentation? > Expose a REST API for monitoring the fair scheduler > --- > > Key: YARN-589 > URL: https://issues.apache.org/jira/browse/YARN-589 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > Fix For: 2.1.1-beta > > Attachments: fairscheduler.xml, YARN-589-1.patch, YARN-589-2.patch, > YARN-589.patch > > > The fair scheduler should have an HTTP interface that exposes information > such as applications per queue, fair shares, demands, current allocations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-337) RM handles killed application tracking URL poorly
[ https://issues.apache.org/jira/browse/YARN-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738968#comment-13738968 ] Thomas Graves commented on YARN-337: +1 looks good. Thanks Jason! Feel free to commit it. > RM handles killed application tracking URL poorly > - > > Key: YARN-337 > URL: https://issues.apache.org/jira/browse/YARN-337 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.2-alpha, 0.23.5 >Reporter: Jason Lowe >Assignee: Jason Lowe > Labels: usability > Attachments: YARN-337.patch > > > When the ResourceManager kills an application, it leaves the proxy URL > redirecting to the original tracking URL for the application even though the > ApplicationMaster is no longer there to service it. It should redirect it > somewhere more useful, like the RM's web page for the application, where the > user can find that the application was killed and links to the AM logs. > In addition, sometimes the AM during teardown from the kill can attempt to > unregister and provide an updated tracking URL, but unfortunately the RM has > "forgotten" the AM due to the kill and refuses to process the unregistration. > Instead it logs: > {noformat} > 2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: > AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_01 > {noformat} > It should go ahead and process the unregistration to update the tracking URL > since the application offered it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira