[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025285#comment-17025285 ]

Thomas Graves commented on YARN-8200:
--------------------------------------

After messing with this a bit more, I removed the maximum-allocation configurations after seeing the documentation didn't have them in the 2.10 release. So I removed this setting:

yarn.resource-types.yarn.io/gpu.maximum-allocation = 4

Now YARN doesn't allocate me a container unless it has fulfilled all of the GPUs I requested. In this case my NodeManager has 4 GPUs, so if I request 5 it just hangs waiting to fulfill the request. This behavior is much better than giving me a container with fewer GPUs than I requested.

> Backport resource types/GPU features to branch-3.0/branch-2
> -----------------------------------------------------------
>
>                 Key: YARN-8200
>                 URL: https://issues.apache.org/jira/browse/YARN-8200
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Major
>              Labels: release-blocker
>             Fix For: 2.10.0
>
>         Attachments: YARN-8200-branch-2.001.patch, YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, YARN-8200-branch-3.0.001.patch, counter.scheduler.operation.allocate.csv.defaultResources, counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support deep learning workloads. However, our main production clusters are running older versions of branch-2 (2.7 in our case). To prevent supporting too many very different Hadoop versions across multiple clusters, we would like to backport the resource types/resource profiles feature to branch-2, as well as the GPU-specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in YARN-7069 based on issues we uncovered, and the backport was fairly smooth. We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature branch and then merge to branch-2 when ready.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-3.0/branch-2
[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025156#comment-17025156 ]

Thomas Graves commented on YARN-8200:
--------------------------------------

Hey [~jhung], I am trying out the GPU scheduling in Hadoop 2.10, and the first thing I noticed is that it doesn't error properly if you ask for too many GPUs. It seems to happily say it gave them to me, although I think it's really giving me the configured max. Is this a known issue already, or did the configuration change? I have the GPU max configured at 4 and I try to allocate 8. On Hadoop 3 I get:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater than maximum allowed allocation. Requested resource=, maximum allowed allocation=, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=

On Hadoop 2.10 I get a container allocated, but the logs and UI say it only has 4 GPUs.

> Backport resource types/GPU features to branch-3.0/branch-2
> -----------------------------------------------------------
>
>                 Key: YARN-8200
>                 URL: https://issues.apache.org/jira/browse/YARN-8200
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Major
>              Labels: release-blocker
>             Fix For: 2.10.0
>
>         Attachments: YARN-8200-branch-2.001.patch, YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch, YARN-8200-branch-3.0.001.patch, counter.scheduler.operation.allocate.csv.defaultResources, counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support deep learning workloads. However, our main production clusters are running older versions of branch-2 (2.7 in our case). To prevent supporting too many very different Hadoop versions across multiple clusters, we would like to backport the resource types/resource profiles feature to branch-2, as well as the GPU-specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in YARN-7069 based on issues we uncovered, and the backport was fairly smooth. We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature branch and then merge to branch-2 when ready.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
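The difference between the two versions comes down to a bounds check on the requested resource value: Hadoop 3 rejects an over-sized request up front, while 2.10 (before the fix discussed above) silently capped it. A minimal sketch of that kind of fail-fast check, with illustrative names rather than Hadoop's actual code:

```java
// Illustrative sketch of the resource-request validation discussed above,
// not the actual Hadoop code: a request for more of a resource type than
// the maximum allowed allocation should fail fast instead of being capped.
public class GpuRequestCheck {
    public static void validate(long requestedGpus, long maxAllowedGpus) {
        if (requestedGpus < 0 || requestedGpus > maxAllowedGpus) {
            throw new IllegalArgumentException(
                "Invalid resource request, requested resource type=[yarn.io/gpu]"
                + " < 0 or greater than maximum allowed allocation: requested="
                + requestedGpus + ", maximum=" + maxAllowedGpus);
        }
    }

    // Convenience wrapper that reports validity instead of throwing.
    public static boolean isValid(long requestedGpus, long maxAllowedGpus) {
        try {
            validate(requestedGpus, maxAllowedGpus);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With this semantics, a request for 8 GPUs against a maximum of 4 throws rather than being silently trimmed to 4.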
[jira] [Commented] (YARN-9116) Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
[ https://issues.apache.org/jira/browse/YARN-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737392#comment-16737392 ]

Thomas Graves commented on YARN-9116:
--------------------------------------

Yes, so you want to keep the behavior that the cluster-level maximum is the absolute maximum and no child queue can be larger than that; otherwise it breaks backwards compatibility.

> Capacity Scheduler: add the default maximum-allocation-mb and maximum-allocation-vcores for the queues
> -------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9116
>                 URL: https://issues.apache.org/jira/browse/YARN-9116
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>    Affects Versions: 2.7.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: YARN-9116.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration, which targets supporting larger containers on dedicated queues (larger maximum-allocation-mb/maximum-allocation-vcores for such queues). To achieve a larger container configuration, we need to increase the global maximum-allocation-mb/maximum-allocation-vcores (e.g. 120G/256) and then override those configurations with desired values on the queues, since a queue configuration can't be larger than the cluster configuration. There are many queues in the system, and if we forget to configure such values when adding a new queue, then that queue gets the default 120G/256, which typically is not what we want.
> We can come up with a queue-default configuration (set to a normal queue configuration like 16G/8), so the leaf queues get such values by default.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
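The behavior being preserved can be sketched as: a leaf queue falls back to a queue-default value when it has no explicit setting, and whatever value results is still capped by the cluster-level maximum. The names below are illustrative, not the actual CapacitySchedulerConfiguration API:

```java
// Illustrative sketch (not CapacitySchedulerConfiguration itself): a queue's
// effective maximum-allocation-mb falls back to a queue-default when unset,
// and the cluster-level maximum remains the absolute ceiling.
public class QueueMaxAllocation {
    public static long effectiveMaxMb(Long queueValueMb, long queueDefaultMb,
                                      long clusterMaxMb) {
        long v = (queueValueMb != null) ? queueValueMb : queueDefaultMb;
        return Math.min(v, clusterMaxMb);  // cluster max stays absolute
    }
}
```

For example, with a 16G queue default and a 120G cluster maximum, an unconfigured queue gets 16G rather than inheriting the oversized 120G global value.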
[jira] [Commented] (YARN-9055) Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration
[ https://issues.apache.org/jira/browse/YARN-9055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700511#comment-16700511 ]

Thomas Graves commented on YARN-9055:
--------------------------------------

It would definitely be a change in behavior, which could surprise people with existing configurations. I do think it's easier this way, since you don't have to configure all the queues. I don't remember all the details of why I did it this way; I think it was mostly to not break the existing functionality of the cluster-level max.

> Capacity Scheduler: allow larger queue level maximum-allocation-mb to override the cluster configuration
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9055
>                 URL: https://issues.apache.org/jira/browse/YARN-9055
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>    Affects Versions: 2.7.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: YARN-9055.1.patch
>
>
> YARN-1582 adds support for a per-queue maximum-allocation-mb configuration. That feature gives the flexibility to set different memory requirements for different queues. The patch adds the limitation that the queue-level configuration can't exceed the cluster-level default configuration, but I feel it may make more sense to remove that limitation and allow any overrides, since:
> # Such configuration is controlled by the admin, so it shouldn't get abused;
> # It's common that typical queues require standard-size containers while some jobs (queues) have requirements for larger containers. With the current limitation, we have to set a larger configuration on the cluster setting, which will cause resource abuse unless we override it on all the queues.
> We can remove this limitation in CapacitySchedulerConfiguration.java so the cluster setting provides the default value and the queue setting can override it.
> {noformat}
>     if (maxAllocationMbPerQueue > clusterMax.getMemorySize()
>         || maxAllocationVcoresPerQueue > clusterMax.getVirtualCores()) {
>       throw new IllegalArgumentException(
>           "Queue maximum allocation cannot be larger than the cluster setting"
>           + " for queue " + queue
>           + " max allocation per queue: " + result
>           + " cluster setting: " + clusterMax);
>     }
> {noformat}
> Let me know if it makes sense.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
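The proposal above amounts to treating the cluster value as a default rather than a ceiling: instead of throwing when the queue value exceeds the cluster value, the queue value simply wins when present. A sketch of that semantics (illustrative names, not the actual patch):

```java
// Illustrative sketch of the proposed semantics, not the actual patch:
// the cluster-level maximum-allocation-mb provides the default, and a
// queue-level setting, when present, overrides it -- even upward.
public class QueueOverride {
    public static long maxAllocationMb(Long queueValueMb, long clusterDefaultMb) {
        return (queueValueMb != null) ? queueValueMb : clusterDefaultMb;
    }
}
```

Under this model, a single "large container" queue can be configured above the cluster default without inflating the default for every other queue.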
[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache
[ https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683946#comment-16683946 ]

Thomas Graves commented on YARN-8991:
--------------------------------------

If it's while it's running, then you should file this with Spark. It's very similar to https://issues.apache.org/jira/browse/SPARK-17233. The Spark external shuffle service doesn't support that at this point. The problem is that you may have a Spark executor running on one host that generates some map output data to shuffle, and then that executor exits because it's not needed anymore. When a reduce starts, it just talks to the YARN NodeManager and the external shuffle service to get the map output. Now there is no executor left on the node to clean up the shuffle output. Support would have to be added, for example for the driver to tell the Spark external shuffle service to clean up. If you don't use dynamic allocation and the external shuffle service, it should clean up properly.

> nodemanager not cleaning blockmgr directories inside appcache
> --------------------------------------------------------------
>
>                 Key: YARN-8991
>                 URL: https://issues.apache.org/jira/browse/YARN-8991
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Hidayat Teonadi
>            Priority: Major
>         Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running Spark on YARN and have enabled the Spark Shuffle Service. I'm noticing that during the lifetime of my Spark streaming application, the NM appcache folder is building up with blockmgr directories (filled with shuffle_*.data).
> Looking into the NM logs, it seems like the blockmgr directories are not part of the cleanup process of the application. Eventually the disk will fill up and the app will crash. I have both {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think it's a configuration issue.
> What is stumping me is that the executor ID listed by Spark during the external shuffle block registration doesn't match the executor ID listed in YARN's NM log. Maybe this executor ID disconnect explains why the cleanup is not done? I'm assuming that blockmgr directories are supposed to be cleaned up?
>
> {noformat}
> 2018-11-05 15:01:21,349 INFO org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=application_1541045942679_0193, execId=1299} with ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
> {noformat}
>
> This seems similar to https://issues.apache.org/jira/browse/YARN-7070, although I'm not sure if the behavior I'm seeing is Spark-use related. [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files] has a stop-gap solution of cleaning up via cron.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
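The cron-based stop-gap mentioned in the description boils down to periodically deleting blockmgr-* directories that have not been touched recently. A minimal Java sketch under assumed paths and an assumed age threshold (nothing here is part of YARN or Spark, and it would delete shuffle data still needed by a long-running job if the threshold is too aggressive):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.stream.Stream;

// Minimal sketch of the cron-style stop-gap discussed above: delete
// blockmgr-* directories under an application's appcache directory that
// have not been modified for maxAgeDays. The directory layout and the age
// threshold are assumptions, not YARN or Spark behavior.
public class BlockmgrCleaner {
    public static int cleanStale(Path appDir, int maxAgeDays) throws IOException {
        Instant cutoff = Instant.now().minus(maxAgeDays, ChronoUnit.DAYS);
        int deleted = 0;
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(appDir, "blockmgr-*")) {
            for (Path dir : dirs) {
                if (Files.getLastModifiedTime(dir).toInstant().isBefore(cutoff)) {
                    deleteRecursively(dir);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    // Delete a directory tree bottom-up (children before parents).
    private static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```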
[jira] [Commented] (YARN-8991) nodemanager not cleaning blockmgr directories inside appcache
[ https://issues.apache.org/jira/browse/YARN-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681525#comment-16681525 ]

Thomas Graves commented on YARN-8991:
--------------------------------------

[~teonadi], can you clarify here: are you saying it's not getting cleaned up while the Spark application is still running, or that it's not getting cleaned up after the Spark application finishes?

> nodemanager not cleaning blockmgr directories inside appcache
> --------------------------------------------------------------
>
>                 Key: YARN-8991
>                 URL: https://issues.apache.org/jira/browse/YARN-8991
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Hidayat Teonadi
>            Priority: Major
>         Attachments: yarn-nm-log.txt
>
>
> Hi, I'm running Spark on YARN and have enabled the Spark Shuffle Service. I'm noticing that during the lifetime of my Spark streaming application, the NM appcache folder is building up with blockmgr directories (filled with shuffle_*.data).
> Looking into the NM logs, it seems like the blockmgr directories are not part of the cleanup process of the application. Eventually the disk will fill up and the app will crash. I have both {{yarn.nodemanager.localizer.cache.cleanup.interval-ms}} and {{yarn.nodemanager.localizer.cache.target-size-mb}} set, so I don't think it's a configuration issue.
> What is stumping me is that the executor ID listed by Spark during the external shuffle block registration doesn't match the executor ID listed in YARN's NM log. Maybe this executor ID disconnect explains why the cleanup is not done? I'm assuming that blockmgr directories are supposed to be cleaned up?
>
> {noformat}
> 2018-11-05 15:01:21,349 INFO org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=application_1541045942679_0193, execId=1299} with ExecutorShuffleInfo{localDirs=[/mnt1/yarn/nm/usercache/auction_importer/appcache/application_1541045942679_0193/blockmgr-b9703ae3-722c-47d1-a374-abf1cc954f42], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
> {noformat}
>
> This seems similar to https://issues.apache.org/jira/browse/YARN-7070, although I'm not sure if the behavior I'm seeing is Spark-use related. [https://stackoverflow.com/questions/52923386/spark-streaming-job-doesnt-delete-shuffle-files] has a stop-gap solution of cleaning up via cron.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436366#comment-16436366 ]

Thomas Graves commented on YARN-8149:
--------------------------------------

Thinking about this a little more: even with the current preemption on, I don't think preemption is smart enough to keep starvation from happening. If preemption were smart enough to kill enough containers on a reserved node so that the big container actually gets scheduled there, that might be OK, but last time I checked it doesn't do that. Without that, or another way to prevent starvation, I wouldn't want to remove this. I think adding a config would be alright, but if anyone finds it useful you can't remove it, and it would just be an extra config. If we have other ideas to simplify this or make it better, great, we should look at them. Or if there is a way for us to get stats on whether this is useful, we could add those, run, and determine if we should remove it.

> Revisit behavior of Re-Reservation in Capacity Scheduler
> ---------------------------------------------------------
>
>                 Key: YARN-8149
>                 URL: https://issues.apache.org/jira/browse/YARN-8149
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is not that easy to understand:
> Inside: {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container *
>     (1 - min(requested-resource / max-alloc,
>              max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0
> {code}
> I think we should be able to remove the starvation computation; just checking requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see YARN-7636.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
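The simplification proposed in the quoted description, dropping the starvation term, reduces the allocation decision to a plain comparison. A sketch of that simplified check (illustrative, not the actual RegularContainerAllocator code):

```java
// Illustrative sketch of the simplification proposed in the description,
// not the actual RegularContainerAllocator code: allocate or reserve a new
// container only when more containers are still required than are already
// reserved, with no re-reservation-based starvation term.
public class AllocCheck {
    public static boolean shouldAllocOrReserve(int requiredContainers,
                                               int reservedContainers) {
        return requiredContainers - reservedContainers > 0;
    }
}
```

Because there is no accumulated re-reservation counter in this form, the MAX_INT overflow noted in YARN-7636 cannot occur, but it also drops whatever starvation protection the original term provided, which is the concern raised in the comments above.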
[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436295#comment-16436295 ]

Thomas Graves commented on YARN-8149:
--------------------------------------

Are you going to do anything with starvation then, or allocate a certain % more than what is required? I am hesitant to remove this without doing some major testing. I haven't had a chance to look at the latest code to investigate. It might be fine now that we continue looking at other nodes after reservation, whereas originally that didn't happen. Is in-queue preemption on by default?

> Revisit behavior of Re-Reservation in Capacity Scheduler
> ---------------------------------------------------------
>
>                 Key: YARN-8149
>                 URL: https://issues.apache.org/jira/browse/YARN-8149
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is not that easy to understand:
> Inside: {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = re-reservation / (#reserved-container *
>     (1 - min(requested-resource / max-alloc,
>              max-alloc - min-alloc / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0
> {code}
> I think we should be able to remove the starvation computation; just checking requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see YARN-7636.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container
[ https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374598#comment-16374598 ]

Thomas Graves commented on YARN-7935:
--------------------------------------

Thanks for the explanation, Mridul. I'm fine with waiting on the Spark JIRA until you know the scope better. I'm currently not doing anything with bridge mode, so I won't be able to help there at this point.

> Expose container's hostname to applications running within the docker container
> --------------------------------------------------------------------------------
>
>                 Key: YARN-7935
>                 URL: https://issues.apache.org/jira/browse/YARN-7935
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Suma Shivaprasad
>            Assignee: Suma Shivaprasad
>            Priority: Major
>         Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like Spark), which is different from the NodeManager's hostname (NM_HOST, which is available as an env var during container launch) when launched through the Docker runtime. The container's hostname can be exposed to applications via an env var CONTAINER_HOSTNAME. Another potential candidate is the container's IP, but this can be addressed in a separate JIRA.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (YARN-7935) Expose container's hostname to applications running within the docker container
[ https://issues.apache.org/jira/browse/YARN-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373039#comment-16373039 ]

Thomas Graves commented on YARN-7935:
--------------------------------------

[~mridulm80], what is the Spark JIRA for this? If this goes in, Spark will still have to grab this from the env to pass in to the executorRunnable.

> Expose container's hostname to applications running within the docker container
> --------------------------------------------------------------------------------
>
>                 Key: YARN-7935
>                 URL: https://issues.apache.org/jira/browse/YARN-7935
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Suma Shivaprasad
>            Assignee: Suma Shivaprasad
>            Priority: Major
>         Attachments: YARN-7935.1.patch, YARN-7935.2.patch
>
>
> Some applications have a need to bind to the container's hostname (like Spark), which is different from the NodeManager's hostname (NM_HOST, which is available as an env var during container launch) when launched through the Docker runtime. The container's hostname can be exposed to applications via an env var CONTAINER_HOSTNAME. Another potential candidate is the container's IP, but this can be addressed in a separate JIRA.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (YARN-7204) Localizer errors on archive without any files
[ https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-7204:
--------------------------------
Description:
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.

Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt. Diagnostics: No such file or directory
ENOENT: No such file or directory
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
	at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
	at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
	at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://rm.com:50708/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application.

was:
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.

Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt. Diagnostics: No such file or directory
ENOENT: No such file or directory
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
	at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
	at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
	at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[jira] [Updated] (YARN-7204) Localizer errors on archive without any files
[ https://issues.apache.org/jira/browse/YARN-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-7204:
--------------------------------
Description:
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.

Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt. Diagnostics: No such file or directory
ENOENT: No such file or directory
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
	at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
	at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
	at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://rm.com:50508/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application.

was:
If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.

Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000
Failing this attempt. Diagnostics: No such file or directory
ENOENT: No such file or directory
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
	at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
	at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
	at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
	at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
	at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[jira] [Created] (YARN-7204) Localizer errors on archive without any files
Thomas Graves created YARN-7204: --- Summary: Localizer errors on archive without any files Key: YARN-7204 URL: https://issues.apache.org/jira/browse/YARN-7204 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.8.1 Reporter: Thomas Graves

If a user sends an archive without any files in it (only directories), YARN fails to localize it with the error below. I ran into this specifically running a Spark job, but it looks generic to the localizer.

Application application_1505252418630_25423 failed 3 times due to AM Container for appattempt_1505252418630_25423_03 exited with exitCode: -1000. Failing this attempt. Diagnostics: No such file or directory
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:767)
at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:264)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1009)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1005)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1012)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:421)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:419)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:419)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:365)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:233)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:226)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:214)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
For more detailed output, check the application tracking page: https://axonitered-jt1.red.ygrid.yahoo.com:50508/applicationhistory/app/application_1505252418630_25423 Then click on links to logs of each attempt. Failing the application.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
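The trigger condition described above — an archive whose entries are all directories, with no regular files — is easy to reproduce outside YARN. The sketch below (hypothetical paths; this is not YARN code, just a way to build the offending input) creates such an archive, the kind of localized resource whose permission walk hits the ENOENT in FSDownload.changePermissions:

```python
# Build an archive containing only directories -- the input shape that
# trips the NodeManager localizer described in YARN-7204.
import os
import tarfile
import tempfile

def make_dirs_only_archive():
    work = tempfile.mkdtemp()
    # Nested directories, deliberately no regular files.
    os.makedirs(os.path.join(work, "conf", "site"))
    archive = os.path.join(work, "dirs-only.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(os.path.join(work, "conf"), arcname="conf")
    return archive

archive = make_dirs_only_archive()
with tarfile.open(archive) as tar:
    members = tar.getmembers()
# Every entry is a directory: nothing for the localizer to chmod as a file.
assert members and all(m.isdir() for m in members)
```

Shipping an archive like this as a YARN local resource (e.g. via `--archives` in Spark) is what exposed the bug.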
[jira] [Commented] (YARN-5010) maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API
[ https://issues.apache.org/jira/browse/YARN-5010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262914#comment-15262914 ] Thomas Graves commented on YARN-5010: - We shouldn't just remove them, as it's an API compatibility issue. I would say they should be added back and the definitions updated, or we should rev the REST API version. > maxActiveApplications and maxActiveApplicationsPerUser are missing from REST > API > > > Key: YARN-5010 > URL: https://issues.apache.org/jira/browse/YARN-5010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: Jason Lowe > > The RM used to report maxActiveApplications and maxActiveApplicationsPerUser > in the REST API for a queue, but these are missing in 2.7.0. It appears > YARN-2637 replaced them with aMResourceLimit and userAMResourceLimit, > respectively, which broke some internal tools that were expecting the max app > fields to still be there. We should at least update the REST docs to reflect > that change.
[jira] [Created] (YARN-4641) CapacityScheduler Active Users Info table should be sortable
Thomas Graves created YARN-4641: --- Summary: CapacityScheduler Active Users Info table should be sortable Key: YARN-4641 URL: https://issues.apache.org/jira/browse/YARN-4641 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Affects Versions: 2.7.1 Reporter: Thomas Graves The Scheduler page when using the Capacity scheduler allows you to see all the Active Users Info. If you have lots of users this is a big table and if you want to be able to see who is using the most it would be nice to have this sortable or show the %used like it used to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110778#comment-15110778 ] Thomas Graves commented on YARN-4610: - +1 for branch-2.7. After investigating this some more, the original patch of setting it to none() works. The reason is that the parent's limit is passed and is taken into account in the leaf calculation. I think the latter patch is safer, but either is fine with me. For the master patch, I'm not sure how it takes the max capacity into account, so I'll have to look at that more; but the unit tests are passing, and that would be a separate issue from this fix. +1 on that patch as well. > Reservations continue looking for one app causes other apps to starve > - > > Key: YARN-4610 > URL: https://issues.apache.org/jira/browse/YARN-4610 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4610-branch-2.7.002.patch, YARN-4610.001.patch, > YARN-4610.branch-2.7.001.patch > > > CapacityScheduler's LeafQueue has "reservations continue looking" logic that > allows an application to unreserve elsewhere to fulfil a container request on > a node that has available space. However in 2.7 that logic seems to break > allocations for subsequent apps in the queue. Once a user hits its user > limit, subsequent apps in the queue for other users receive containers at a > significantly reduced rate.
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108697#comment-15108697 ] Thomas Graves commented on YARN-4610: - +1. Thanks for fixing this. > Reservations continue looking for one app causes other apps to starve > - > > Key: YARN-4610 > URL: https://issues.apache.org/jira/browse/YARN-4610 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4610.001.patch > > > CapacityScheduler's LeafQueue has "reservations continue looking" logic that > allows an application to unreserve elsewhere to fulfil a container request on > a node that has available space. However in 2.7 that logic seems to break > allocations for subsequent apps in the queue. Once a user hits its user > limit, subsequent apps in the queue for other users receive containers at a > significantly reduced rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109330#comment-15109330 ] Thomas Graves commented on YARN-4610: - Ok thanks for investigating. +1 from me feel free to commit. > Reservations continue looking for one app causes other apps to starve > - > > Key: YARN-4610 > URL: https://issues.apache.org/jira/browse/YARN-4610 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4610.001.patch > > > CapacityScheduler's LeafQueue has "reservations continue looking" logic that > allows an application to unreserve elsewhere to fulfil a container request on > a node that has available space. However in 2.7 that logic seems to break > allocations for subsequent apps in the queue. Once a user hits its user > limit, subsequent apps in the queue for other users receive containers at a > significantly reduced rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4610) Reservations continue looking for one app causes other apps to starve
[ https://issues.apache.org/jira/browse/YARN-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15109516#comment-15109516 ] Thomas Graves commented on YARN-4610: - Sorry after looking some more I think there might be an issue with this for parent queue max capacities, looking some more. > Reservations continue looking for one app causes other apps to starve > - > > Key: YARN-4610 > URL: https://issues.apache.org/jira/browse/YARN-4610 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.7.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-4610.001.patch, YARN-4610.branch-2.7.001.patch > > > CapacityScheduler's LeafQueue has "reservations continue looking" logic that > allows an application to unreserve elsewhere to fulfil a container request on > a node that has available space. However in 2.7 that logic seems to break > allocations for subsequent apps in the queue. Once a user hits its user > limit, subsequent apps in the queue for other users receive containers at a > significantly reduced rate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4045) Negative avaialbleMB is being reported for root queue.
[ https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682115#comment-14682115 ] Thomas Graves commented on YARN-4045: - I remember seeing that this was fixed in branch-2 by some of the capacity scheduler work for labels. I thought this might be fixed by https://issues.apache.org/jira/browse/YARN-3243, but that is already included. This might be fixed as part of https://issues.apache.org/jira/browse/YARN-3361, which is probably too big to backport in its entirety. [~leftnoteasy] Do you remember this issue? Note that it also shows up in the capacity scheduler UI as the root queue going over 100%. I remember that when I was testing YARN-3434 it wasn't occurring for me on branch-2 (2.8), and I thought it was one of the above JIRAs that fixed it. Negative avaialbleMB is being reported for root queue. -- Key: YARN-4045 URL: https://issues.apache.org/jira/browse/YARN-4045 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.1 Reporter: Rushabh S Shah We recently deployed 2.7 in one of our clusters. We are seeing negative availableMB being reported for queue=root. This is from the jmx output:
{noformat}
<clusterMetrics> ... <availableMB>-163328</availableMB> ... </clusterMetrics>
{noformat}
The following is the RM log:
{noformat}
2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854 absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743 absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212> cluster=<memory:5316608, vCores:28320>
2015-08-10 14:42:43,056 [ResourceManager Event Processor]
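As a quick sanity check on the log above (plain arithmetic, not YARN code): the usedCapacity figures match used memory divided by cluster memory, confirming the root queue really is over 100% utilized, which is exactly how a negative availableMB arises (available = cluster - used goes below zero once usage exceeds the cluster total):

```python
# Verify the usedCapacity values in the RM log against the raw memory numbers.
cluster_mb = 5316608  # cluster=<memory:5316608, ...> from the log
for used_mb, reported_capacity in [(5332480, 1.0029854), (5334016, 1.0032743)]:
    # usedCapacity is used/cluster, and it exceeds 1.0 in both records.
    assert abs(used_mb / cluster_mb - reported_capacity) < 1e-6
    # Over-commitment means a negative "available" figure.
    assert cluster_mb - used_mb < 0
```

(The -163328 in the JMX snapshot is from a different moment than these log lines, so the exact values differ, but the sign follows from the same arithmetic.)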
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538252#comment-14538252 ] Thomas Graves commented on YARN-3434: - What's your question, exactly? For branch patches, Jenkins has never been hooked up. We generally download the patch, build it, possibly run the tests that apply, and commit. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 2.8.0 Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch ULF was set to 1.0. The user was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed.
[jira] [Updated] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3600: Labels: (was: BB2015-05-RFC) AM container link is broken (on a killed application, at least) --- Key: YARN-3600 URL: https://issues.apache.org/jira/browse/YARN-3600 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Sergey Shelukhin Assignee: Naganarasimha G R Attachments: YARN-3600.20150508-1.patch Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. I have an application that ran fine for a while and then I yarn kill-ed it. Now when I go to the only app attempt URL (like so: http://(snip RM host name):8088/cluster/appattempt/appattempt_1429683757595_0795_01) I see: AM Container: container_1429683757595_0795_01_01 Node: N/A and the container link goes to {noformat}http://(snip RM host name):8088/cluster/N/A {noformat} which obviously doesn't work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534621#comment-14534621 ] Thomas Graves commented on YARN-3600: - reviewing and kicking jenkins. AM container link is broken (on a killed application, at least) --- Key: YARN-3600 URL: https://issues.apache.org/jira/browse/YARN-3600 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Sergey Shelukhin Assignee: Naganarasimha G R Attachments: YARN-3600.20150508-1.patch Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. I have an application that ran fine for a while and then I yarn kill-ed it. Now when I go to the only app attempt URL (like so: http://(snip RM host name):8088/cluster/appattempt/appattempt_1429683757595_0795_01) I see: AM Container: container_1429683757595_0795_01_01 Node: N/A and the container link goes to {noformat}http://(snip RM host name):8088/cluster/N/A {noformat} which obviously doesn't work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3603) Application Attempts page confusing
[ https://issues.apache.org/jira/browse/YARN-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535013#comment-14535013 ] Thomas Graves commented on YARN-3603: - Go for it. Thanks! Application Attempts page confusing --- Key: YARN-3603 URL: https://issues.apache.org/jira/browse/YARN-3603 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Thomas Graves Assignee: Sunil G The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing about what is going on. I think the table of containers there is only for running containers, and when the app is completed or killed it's empty. The table should have a label stating so. Also, the AM Container field is a link when the app is running but not when it's killed, which might be confusing. There is no link to the logs on this page, but there is one in the app attempt table when looking at http://rm:8088/cluster/app/application_1431101480046_0003
[jira] [Created] (YARN-3603) Application Attempts page confusing
Thomas Graves created YARN-3603: --- Summary: Application Attempts page confusing Key: YARN-3603 URL: https://issues.apache.org/jira/browse/YARN-3603 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.8.0 Reporter: Thomas Graves The application attempts page (http://RM:8088/cluster/appattempt/appattempt_1431101480046_0003_01) is a bit confusing about what is going on. I think the table of containers there is only for running containers, and when the app is completed or killed it's empty. The table should have a label stating so. Also, the AM Container field is a link when the app is running but not when it's killed, which might be confusing. There is no link to the logs on this page, but there is one in the app attempt table when looking at http://rm:8088/cluster/app/application_1431101480046_0003
[jira] [Commented] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534967#comment-14534967 ] Thomas Graves commented on YARN-20: --- +1. Thanks! More information for yarn.resourcemanager.webapp.address in yarn-default.xml -- Key: YARN-20 URL: https://issues.apache.org/jira/browse/YARN-20 Project: Hadoop YARN Issue Type: Improvement Components: documentation, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Nemon Lou Assignee: Bartosz Ługowski Priority: Trivial Labels: newbie Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch Original Estimate: 1h Remaining Estimate: 1h The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is in host:port format, which is noted in the cluster setup guide (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html). When I read through the code, I found that the host-only format is also supported; in that format, the port will be random. So we may add more documentation in yarn-default.xml to make this easier to understand. I will submit a patch if it's helpful.
[jira] [Updated] (YARN-20) More information for yarn.resourcemanager.webapp.address in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-20: -- Labels: newbie (was: BB2015-05-RFC newbie) More information for yarn.resourcemanager.webapp.address in yarn-default.xml -- Key: YARN-20 URL: https://issues.apache.org/jira/browse/YARN-20 Project: Hadoop YARN Issue Type: Improvement Components: documentation, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Nemon Lou Assignee: Bartosz Ługowski Priority: Trivial Labels: newbie Attachments: YARN-20.1.patch, YARN-20.2.patch, YARN-20.patch Original Estimate: 1h Remaining Estimate: 1h The parameter yarn.resourcemanager.webapp.address in yarn-default.xml is in host:port format, which is noted in the cluster setup guide (http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html). When I read through the code, I found that the host-only format is also supported; in that format, the port will be random. So we may add more documentation in yarn-default.xml to make this easier to understand. I will submit a patch if it's helpful.
[jira] [Commented] (YARN-3600) AM container link is broken (on a killed application, at least)
[ https://issues.apache.org/jira/browse/YARN-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534825#comment-14534825 ] Thomas Graves commented on YARN-3600: - So the change does fix the broken link issue, but it seems to me other things are broken with this page. Obviously, if the app ran for a while it got an AM and therefore should have had a valid container. But I guess that link only works if it's actually running? The container table below that also confused me a bit: I thought at first it was a list of AM containers, but after playing with it, it's really a list of running containers. I think we should add a heading for that. I filed separate JIRAs for those things. Anyway, +1. Thanks! AM container link is broken (on a killed application, at least) --- Key: YARN-3600 URL: https://issues.apache.org/jira/browse/YARN-3600 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Sergey Shelukhin Assignee: Naganarasimha G R Attachments: YARN-3600.20150508-1.patch Running some fairly recent (couple weeks ago) version of 2.8.0-SNAPSHOT. I have an application that ran fine for a while and then I yarn kill-ed it. Now when I go to the only app attempt URL (like so: http://(snip RM host name):8088/cluster/appattempt/appattempt_1429683757595_0795_01) I see: AM Container: container_1429683757595_0795_01_01 Node: N/A and the container link goes to {noformat}http://(snip RM host name):8088/cluster/N/A {noformat} which obviously doesn't work
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434-branch2.7.patch Attaching patch for branch2.7. [~leftnoteasy] could you take a look when you have a chance? Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 2.8.0 Attachments: YARN-3434-branch2.7.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch ULF was set to 1.0 User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()
[ https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524042#comment-14524042 ] Thomas Graves commented on YARN-1631: - We need to be careful with this: you could end up starving out the first application, and it definitely changes the current semantics. What version of Hadoop are you seeing this issue on? With my patch for "reservations continue looking", it should actually look at Node_2, take that one, and unreserve Node_1. There is the needsContainer logic that might be affecting this, which I would have to look at more. Container allocation issue in Leafqueue assignContainers() -- Key: YARN-1631 URL: https://issues.apache.org/jira/browse/YARN-1631 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: SuSe 11 Linux Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch Application1 has a demand of 8GB [map task size is 8GB], which is more than Node_1 can handle. Node_1 has a size of 8GB, and 2GB is used by Application1's AM. Hence a reservation happened for the remaining 6GB in Node_1 by Application1. A new job is submitted with a 2GB AM size and 2GB task size, with only 2 maps to run. Node_2 also has 8GB capability, but Application2's AM cannot be launched in Node_2, and Application2 waits longer as only 2 nodes are available in the cluster.
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523187#comment-14523187 ] Thomas Graves commented on YARN-3243: - Thanks [~leftnoteasy]. I'll attempt to merge YARN-3434; if it's not clean, I'll put up a patch for it. CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits. - Key: YARN-3243 URL: https://issues.apache.org/jira/browse/YARN-3243 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.8.0 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, YARN-3243.4.patch, YARN-3243.5.patch Now CapacityScheduler has some issues making sure a ParentQueue always obeys its capacity limits, for example: 1) When allocating a container of a parent queue, it will only check parentQueue.usage < parentQueue.max. If a leaf queue allocated a container.size > (parentQueue.max - parentQueue.usage), the parent queue can exceed its max resource limit, as in the following example:
{code}
          A (usage=54, max=55)
         /                   \
A1 (usage=1, max=55)    A2 (usage=53, max=53)
{code}
Queue A2 is able to allocate a container since its usage < max, but if we do that, A's usage can exceed A.max. 2) When doing the continuous reservation check, the parent queue will only tell its children "you need to unreserve *some* resource, so that I will be less than my maximum resource", but it will not tell them how much resource needs to be unreserved. This may lead to the parent queue exceeding its configured maximum capacity as well. With YARN-3099/YARN-3124, we now have the {{ResourceUsage}} class in each queue. *Here is my proposal*: - ParentQueue will set its children's ResourceUsage.headroom, which means *the maximum resource its children can allocate*. - ParentQueue will set its children's headroom to (saying the parent's name is qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's ancestors' capacity is enforced as well (qA.headroom is set by qA's parent). - {{needToUnReserve}} is not necessary; instead, children can get how much resource needs to be unreserved to keep their parent's resource limit. - Moreover, with this, YARN-3026 will make a clear boundary between LeafQueue and FiCaSchedulerApp; headroom will consider user-limit, etc.
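The proposed headroom chain can be sketched in a few lines (a simplified illustration, not the actual ParentQueue/ResourceUsage code — the function name `child_headroom` is hypothetical): each parent caps its children's headroom at min(parent.headroom, parent.max - parent.used), so every ancestor's limit propagates down the queue tree.

```python
# Sketch of the headroom propagation proposed in YARN-3243 (simplified).
def child_headroom(parent_headroom, parent_max, parent_used):
    # A child may never allocate more than its parent's remaining room,
    # nor more than the headroom the parent itself was handed down.
    return min(parent_headroom, parent_max - parent_used)

# Example from the description: A (usage=54, max=55) with child A1 (usage=1, max=55).
a_headroom = child_headroom(float("inf"), 55, 54)  # A's children may take at most 1
assert a_headroom == 1

# A1's own limit (max=55, used=1) would allow 54 more, but the inherited
# headroom prevents an allocation that would push A past A.max.
assert min(a_headroom, 55 - 1) == 1
```

Without the inherited headroom term, A1 would see 54 units of room and could drive A over its 55 maximum, which is exactly case 1) above.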
[jira] [Updated] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3243: Fix Version/s: 2.7.1 CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits. - Key: YARN-3243 URL: https://issues.apache.org/jira/browse/YARN-3243 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.8.0, 2.7.1 Attachments: YARN-3243.1.patch, YARN-3243.2.patch, YARN-3243.3.patch, YARN-3243.4.patch, YARN-3243.5.patch Now CapacityScheduler has some issues to make sure ParentQueue always obeys its capacity limits, for example: 1) When allocating container of a parent queue, it will only check parentQueue.usage parentQueue.max. If leaf queue allocated a container.size (parentQueue.max - parentQueue.usage), parent queue can excess its max resource limit, as following example: {code} A (usage=54, max=55) / \ A1 A2 (usage=1, max=55) (usage=53, max=53) {code} Queue-A2 is able to allocate container since its usage max, but if we do that, A's usage can excess A.max. 2) When doing continous reservation check, parent queue will only tell children you need unreserve *some* resource, so that I will less than my maximum resource, but it will not tell how many resource need to be unreserved. This may lead to parent queue excesses configured maximum capacity as well. With YARN-3099/YARN-3124, now we have {{ResourceUsage}} class in each class, *here is my proposal*: - ParentQueue will set its children's ResourceUsage.headroom, which means, *maximum resource its children can allocate*. - ParentQueue will set its children's headroom to be (saying parent's name is qA): min(qA.headroom, qA.max - qA.used). This will make sure qA's ancestors' capacity will be enforced as well (qA.headroom is set by qA's parent). 
- {{needToUnReserve}} is not necessary; instead, children can get how much resource needs to be unreserved to keep their parent's resource limit. - Moreover, with this, YARN-3026 will make a clear boundary between LeafQueue and FiCaSchedulerApp; headroom will consider user-limit, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
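The headroom propagation proposed above can be sketched as follows. This is a minimal illustration, not the actual CapacityScheduler code: plain longs stand in for Hadoop's Resource/Resources types, and the class and method names are hypothetical.

```java
// Minimal sketch of the proposed headroom propagation. Plain longs are
// used in place of Hadoop's Resource types; names are illustrative.
public class HeadroomSketch {

    // Parent qA sets each child's headroom to
    // min(qA.headroom, qA.max - qA.used). Because qA.headroom was itself
    // set by qA's parent, every ancestor's limit is enforced transitively.
    static long childHeadroom(long parentHeadroom, long parentMax, long parentUsed) {
        return Math.min(parentHeadroom, parentMax - parentUsed);
    }

    public static void main(String[] args) {
        // Root queue A from the example: usage=54, max=55, unconstrained
        // from above. Its children may allocate at most 1 more unit,
        // regardless of their own (max - used) slack.
        System.out.println(childHeadroom(Long.MAX_VALUE, 55, 54)); // prints 1
    }
}
```

This makes the per-child limit a single value the child can consult directly, rather than each child re-deriving its ancestors' remaining capacity.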
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521580#comment-14521580 ] Thomas Graves commented on YARN-3243: - [~leftnoteasy] Can we pull this back into branch-2.7?
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522021#comment-14522021 ] Thomas Graves commented on YARN-3243: - I was wanting to pull YARN-3434 back into 2.7, and it kind of depends on this one. At least I think it would merge cleanly if this one were there. This is also fixing a bug which I would like to see fixed in the 2.7 line if we are going to use it. It's not a blocker, since it exists in our 2.6, but it would be nice to have. If we decide it's too big then I'll just port YARN-3434 back without it.
[jira] [Commented] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066 ] Thomas Graves commented on YARN-3243: - It might not merge completely cleanly, but it wouldn't be required for functionality. It would be nice to have this in 2.7 either way though.
[jira] [Comment Edited] (YARN-3243) CapacityScheduler should pass headroom from parent to children to make sure ParentQueue obey its capacity limits.
[ https://issues.apache.org/jira/browse/YARN-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522066#comment-14522066 ] Thomas Graves edited comment on YARN-3243 at 4/30/15 7:02 PM: -- It might not merge completely cleanly, but it wouldn't be required for functionality. It would be nice to have this in 2.7 either way though. I'll try it out later and see.
was (Author: tgraves): It might not merge completely cleanly, but it wouldn't be required for functionality. It would be nice to have this in 2.7 either way though.
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520252#comment-14520252 ] Thomas Graves commented on YARN-3517: - changes look good, +1. thanks [~vvasudev] RM web ui for dumping scheduler logs should be for admins only -- Key: YARN-3517 URL: https://issues.apache.org/jira/browse/YARN-3517 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, security Reporter: Varun Vasudev Assignee: Thomas Graves Priority: Blocker Labels: security Attachments: YARN-3517.001.patch, YARN-3517.002.patch, YARN-3517.003.patch, YARN-3517.004.patch, YARN-3517.005.patch, YARN-3517.006.patch YARN-3294 allows users to dump scheduler logs from the web UI. This should be for admins only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520355#comment-14520355 ] Thomas Graves commented on YARN-3517: - thanks [~vinodkv] I missed that.
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518054#comment-14518054 ] Thomas Graves commented on YARN-3517: - In RMWebServices.java we don't need the isSecurityEnabled check; just remove the entire check. My reasoning is that the logLevel app does not do those checks, it simply makes sure you are an admin.
{code}
+if (UserGroupInformation.isSecurityEnabled() && callerUGI == null) {
+  String msg = "Unable to obtain user name, user not authenticated";
+  throw new AuthorizationException(msg);
+}
{code}
In the test TestRMWebServices.java we aren't actually asserting anything; we should assert that the expected files exist. Personally I would also like to see an assert that the expected exception occurred.
[jira] [Assigned] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned YARN-3517: --- Assignee: Thomas Graves (was: Varun Vasudev)
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509152#comment-14509152 ] Thomas Graves commented on YARN-3517: -
{code}
+// non-secure mode with no acls enabled
+if (!isAdmin && !UserGroupInformation.isSecurityEnabled()
+    && !adminACLsManager.areACLsEnabled()) {
+  isAdmin = true;
+}
{code}
We don't need the isSecurityEnabled check; just keep the one for areACLsEnabled. This could be combined with the previous if (make this the else-if part), but that isn't a big deal. In QueuesBlock we are creating the AdminACLsManager on every web page load. Perhaps a better way would be to use this.rm.getApplicationACLsManager() and extend the ApplicationACLsManager to expose an isAdmin functionality.
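The review comments above reduce to one rule: whether Kerberos security is enabled is irrelevant to the admin decision; only whether ACLs are enforced, and whether the caller is in the admin ACL, matter. A minimal sketch of that rule, with hypothetical names rather than the actual patch's code:

```java
// Illustrative sketch of the admin check suggested in the review above.
// Class and parameter names are hypothetical, not from the actual patch.
public class AdminCheckSketch {

    /**
     * aclsEnabled      - whether ACLs are enforced (e.g. yarn.acl.enable)
     * callerInAdminAcl - whether the caller is in the admin ACL
     */
    static boolean isAdmin(boolean aclsEnabled, boolean callerInAdminAcl) {
        // With ACLs disabled the cluster is open, so everyone is
        // effectively an admin; with ACLs on, only admin-ACL members are.
        // Note there is deliberately no isSecurityEnabled() term.
        return !aclsEnabled || callerInAdminAcl;
    }
}
```

An isAdmin helper like this, exposed from a single shared ACLs manager, would also avoid constructing a new AdminACLsManager on every page load.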
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Upmerged patch to latest Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch, YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, each 8G, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Fixed the line length and the whitespace style issues. Other than that I moved things around and it's just complaining about the same things more.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Attaching the exact same patch to kick Jenkins again.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Updated based on review comments.
[jira] [Commented] (YARN-3517) RM web ui for dumping scheduler logs should be for admins only
[ https://issues.apache.org/jira/browse/YARN-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504954#comment-14504954 ] Thomas Graves commented on YARN-3517: - Thanks for following up on this. Could you also change it to not show the button if you aren't an admin? I don't want to confuse users by having a button there that doesn't do anything. One other thing: could you add some CSS or something to make it look more like a button? Right now it just looks like text, and I didn't know it was clickable at first. The placement of it seems a bit weird to me also, but as long as it's only showing up for admins that is less of an issue. I haven't looked at the patch in detail, but I see we are creating a new AdminACLsManager each time. It would be nice if we didn't have to do that.
[jira] [Commented] (YARN-3294) Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period
[ https://issues.apache.org/jira/browse/YARN-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503760#comment-14503760 ] Thomas Graves commented on YARN-3294: - [~xgong] [~vvasudev] I saw this show up in the UI on branch-2. I don't see any permissions checks on this; am I perhaps missing it? We don't want arbitrary users to be able to change the log level on the RM. They could slow it down and cause disks to fill up. I also don't see an option to disable this; is there one? If not, I think we want it. Honestly I don't really see a need for this button at all, as you can change it in the logLevel app. But since it's in, we at least need to protect it and, in my opinion, disable it for normal users. Allow dumping of Capacity Scheduler debug logs via web UI for a fixed time period - Key: YARN-3294 URL: https://issues.apache.org/jira/browse/YARN-3294 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: Screen Shot 2015-03-12 at 8.51.25 PM.png, apache-yarn-3294.0.patch, apache-yarn-3294.1.patch, apache-yarn-3294.2.patch, apache-yarn-3294.3.patch, apache-yarn-3294.4.patch It would be nice to have a button on the web UI that would allow dumping of debug logs for just the capacity scheduler for a fixed period of time (1 min, 5 min or so) in a separate log file. It would be useful when debugging scheduler behavior without affecting the rest of the ResourceManager.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Upmerged to latest.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch Updated patch with review comments.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499803#comment-14499803 ] Thomas Graves commented on YARN-3434: - Ok, I'll make the changes and post an updated patch.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496239#comment-14496239 ] Thomas Graves commented on YARN-3434: - So I had considered putting it in ResourceLimits, but ResourceLimits seems to be more of a queue-level thing to me (not a user-level one). For instance, ParentQueue passes this into LeafQueue, and ParentQueue cares nothing about user limits. If you stored it there, you would either need to track the user it was for or track it for all users. ResourceLimits gets updated when nodes are added and removed; we don't need to compute a particular user limit when that happens. So it would then be out of date, or we'd change it to be updated when that happens, but that to me is a fairly large change and not really needed. The user limit calculations are lower down and recomputed per user, per application, per current request regularly, and putting this into the global, given how it is being calculated and used, didn't make sense to me. All you would be using it for is passing it down to assignContainer, and then it would be out of date. If someone else started looking at that value assuming it was up to date, then it would be wrong (unless of course we started updating it as stated above). But it would only be for a single user, not all users, unless again we changed it to be calculated for every user whenever something changed. That seems a bit excessive.
You are correct that needToUnreserve could go away. I started out on 2.6, which didn't have our changes, and I could have removed it when I added in amountNeededUnreserve. If we were to store it in the global ResourceLimits, then yes, the entire LimitsInfo class can go away, including shouldContinue, as you would fall back to the boolean return from each function. But again, based on my above comments, I'm not sure ResourceLimits is the correct place to put this. I just noticed that we are already keeping the userLimit in the User class; that would be another option.
But again I think we need to make it clear what it is. This particular check is done per application, per user, based on the currently requested Resource. The value stored wouldn't necessarily apply to all of the user's applications, since the resource request size could be different. Thoughts, or is there something I'm missing about ResourceLimits?
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496735#comment-14496735 ] Thomas Graves commented on YARN-3434: - I am not saying child needs to know how parent calculate resource limit. I am saying user limit and whether it needs to unreserve to make another reservation has nothing to do with the parent queue (ie it doesn't apply to parent queue). Remember I'm not needing to store user limit, I'm needing to store the fact of whether it needs to unreserve and if it does how much does it need to unreserve. When a node heartbeats it goes through the regular assignments and updates the leafQueue clusterResources based on what the parent passes in. When a node is removed or added then it updates the resource limits (none of these apply to calculation of whether it needs to unreserve or not). Basically it comes down to is this information useful outside of the small window between when it calculates it and when its needed in assignContainer() and my thought is no. And you said it yourself in last bullet above. Although we have been referring to the userLImit and perhaps that is the problem. I don't need to store the userLimit, I need to store whether it needs to unreserve and if so how much. Therefore it fits better as a local transient variable rather then a globally stored one. If you store just the userLImit then you need to recalculate stuff which I'm trying to avoid. I understand why we are storing the current information in ResourceLimits because it has to do with headroom and parent limits and is recalculated at various points, but the current implementation in canAssignToUser doesn't use headroom at all and whether we need to unreserve or not on the last call to assignContainers doesn't affect the headroom calculation. Again basically all we would be doing is placing an extra global variable(s) in the ResourceLimits class just to pass it on down a couple of functions. 
That to me is a parameter. Now if we had multiple things needing this, or updating it, then to me it fits better in ResourceLimits. Interaction between reservations and userlimit can result in significant ULF violation -- Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-3434.patch ULF was set to 1.0. User was able to consume 1.4X queue capacity. It looks like when this application launched, it reserved about 1000 containers, 8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to let the userlimit be surpassed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
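The design being argued for above can be sketched as follows. This is a hypothetical simplification, not the actual patch: the class and method names (LimitsInfo, canAssignToUser, shouldContinue, amountNeededUnreserve) mirror the ones discussed in this thread, but the arithmetic is reduced to plain MB values.

```java
// Hypothetical sketch of the transient holder discussed above: the user-limit
// check computes whether an unreserve is needed and how much, and the result is
// passed down to assignContainer() rather than stored globally in ResourceLimits.
public class LimitsInfoSketch {

    /** Result of the user-limit check; only lives between the check and assignContainer(). */
    static final class LimitsInfo {
        final boolean shouldContinue;      // may we keep trying to allocate?
        final long amountNeededUnreserve;  // MB that must be unreserved first (0 if none)

        LimitsInfo(boolean shouldContinue, long amountNeededUnreserve) {
            this.shouldContinue = shouldContinue;
            this.amountNeededUnreserve = amountNeededUnreserve;
        }
    }

    /** Simplified canAssignToUser: usage may exceed the limit only by reserved resources. */
    static LimitsInfo canAssignToUser(long userUsedMb, long userReservedMb, long userLimitMb) {
        if (userUsedMb <= userLimitMb) {
            return new LimitsInfo(true, 0);          // under the limit, nothing to unreserve
        }
        long overLimit = userUsedMb - userLimitMb;
        if (overLimit <= userReservedMb) {
            return new LimitsInfo(true, overLimit);  // may continue, but must unreserve first
        }
        return new LimitsInfo(false, 0);             // over the limit even ignoring reservations
    }
}
```

The point of the thread is that this object is a parameter threaded from assignContainers down to assignContainer, not state that outlives one scheduling pass.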
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497055#comment-14497055 ] Thomas Graves commented on YARN-3434: - I agree with the "Both" section. I'm not sure I completely follow the "Only" section. Are you suggesting we change the patch to modify ResourceLimits and pass that down rather than using the LimitsInfo class? If so, that won't work, at least not without adding the shouldContinue flag to it. Unless you mean keep the LimitsInfo class for use locally in assignContainers, and then pass ResourceLimits down to assignContainer with the value of amountNeededUnreserve as the limit. That wouldn't really change much except the object we pass down through the functions.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497076#comment-14497076 ] Thomas Graves commented on YARN-3434: - So you are saying add amountNeededUnreserve to ResourceLimits and then set the global currentResourceLimits.amountNeededUnreserve inside of canAssignToUser? This is what I was not in favor of above, and there would be no need to pass it down as a parameter. Or were you saying create a ResourceLimits instance and pass it as a parameter to canAssignToUser and canAssignToThisQueue and modify that instance, which would then be passed down through to assignContainer()? I don't see how else you would set the ResourceLimits.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488011#comment-14488011 ] Thomas Graves commented on YARN-3434: - The code you mention is in the else part of that check, where it would do a reservation. The situation I'm talking about actually allocates a container, not reserves one. I'll try to explain better: the application asks for lots of containers. It acquires some containers, then it reserves some. At this point it hits its normal user limit, which in my example = capacity. It hasn't hit the max amount it can allocate or reserve (shouldAllocOrReserveNewContainer()). The next node that heartbeats in isn't yet reserved and has enough space to place a container on. It is first checked in assignContainers - canAssignToThisQueue. That passes since we haven't hit max capacity. Then it checks assignContainers - canAssignToUser. That passes, but only because used minus reserved is under the user limit. This allows it to continue down into assignContainer. In assignContainer the node has available space and we haven't hit shouldAllocOrReserveNewContainer(). reservationsContinueLooking is on and labels are empty, so it does the check: {noformat} if (!shouldAllocOrReserveNewContainer || Resources.greaterThan(resourceCalculator, clusterResource, minimumUnreservedResource, Resources.none())) {noformat} As I said before, it's allowed to allocate or reserve, so it passes that test. Then it hasn't met its maximum capacity yet (capacity = 30% and max capacity = 100%), so minimumUnreservedResource is none and that check doesn't kick in, so it doesn't go into the block to findNodeToUnreserve(). Then it goes ahead and allocates when it should have needed to unreserve. Basically we needed to also do the user limit check again and force it to do the findNodeToUnreserve.
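The decision path described above can be reduced to a small sketch. This is a hypothetical simplification, not the real CapacityScheduler code: the three boolean-ish inputs stand in for shouldAllocOrReserveNewContainer(), the queue's minimumUnreservedResource, and the user-limit amount-needed-to-unreserve that the fix adds.

```java
// Hypothetical simplification of the assignContainer decision discussed above.
// Pre-fix, only the first two conditions existed, so a user over their limit
// (but "excused" by reservations in canAssignToUser) could allocate directly.
public class AssignContainerSketch {

    /** For a node with free space: allocate directly, or unreserve somewhere first? */
    static String decide(boolean shouldAllocOrReserveNewContainer,
                         long minimumUnreservedMb,       // queue max-capacity pressure
                         long amountNeededUnreserveMb) { // user-limit pressure (the fix)
        if (!shouldAllocOrReserveNewContainer
                || minimumUnreservedMb > 0
                || amountNeededUnreserveMb > 0) { // the added user-limit re-check
            return "unreserve-then-allocate";     // i.e. go through findNodeToUnreserve()
        }
        return "allocate";
    }
}
```

In the scenario above, both pre-existing conditions are false (allocation is still allowed and max capacity is not hit), so only the third condition forces the unreserve.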
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487416#comment-14487416 ] Thomas Graves commented on YARN-3434: - [~wangda] I'm not sure I follow what you are saying. The reservations are already counted in the user's usage, and we do consider reserved when doing the user limit calculations. Look at LeafQueue.assignContainers: the call to allocateResource is where it ends up adding to user usage. canAssignToUser is where it does the user limit check and subtracts the reservations off to see if it can continue. Note I do think we should just get rid of the config for reservationsContinueLooking, but that is a separate issue.
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488061#comment-14488061 ] Thomas Graves commented on YARN-3434: - {quote} And I've a question about continous reservation checking behavior, may or may not related to this issue: Now it will try to unreserve all containers under a user, but actually it will only unreserve at most one container to allocate a new container. Do you think is it fine to change the logic to be: When (continousReservation-enabled) (user.usage + required - min(max-allocation, user.total-reserved) <= user.limit), assignContainers will continue. This will prevent doing impossible allocation when user reserved lots of containers. (As same as queue reservation checking). {quote} I do think the reservation checking and unreserving can be improved. I basically started with a very simple thing and figured we could improve it. I'm not sure how much that check would help in practice. I guess it might help the case where you have one user in the queue and a second one shows up, and your user limit gets decreased by a lot. In that case it may prevent it from continuing when it can short-circuit here, so it would seem to be OK for that.
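The condition quoted above can be written out directly. A minimal sketch, assuming the quantities are plain MB values; min(max-allocation, user.total-reserved) caps the "credit" for reservations at one container's worth, since at most one container is unreserved per allocation.

```java
// Hypothetical sketch of the proposed short-circuit from the quote above:
// continue only if unreserving (at most one container's worth of) reservations
// could actually bring the user back under their limit.
public class ContinueLookingSketch {

    static boolean mayContinue(long userUsageMb, long requiredMb,
                               long maxAllocationMb, long userTotalReservedMb,
                               long userLimitMb) {
        // user.usage + required - min(max-allocation, user.total-reserved) <= user.limit
        long creditMb = Math.min(maxAllocationMb, userTotalReservedMb);
        return userUsageMb + requiredMb - creditMb <= userLimitMb;
    }
}
```

This matches the scenario in the reply: when a second user arrives and the limit drops sharply, usage plus the new request can exceed the limit by more than one container, so the check short-circuits instead of continuing into an impossible allocation.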
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves commented on YARN-3434: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again. I'll have a patch up for this shortly; I would appreciate it if you could take a look and give me feedback.
[jira] [Comment Edited] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485798#comment-14485798 ] Thomas Graves edited comment on YARN-3434 at 4/8/15 6:59 PM: - [~wangda] YARN-3243 fixes part of the problem with the max capacities, but it doesn't solve the user limit side of it. The user limit check is never done again in assignContainer() if it skipped the checks in assignContainers() based on reservations but shouldAllocOrReserveNewContainer then allows it. I'll have a patch up for this shortly; I would appreciate it if you could take a look and give me feedback.
[jira] [Updated] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-3434: Attachment: YARN-3434.patch
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485834#comment-14485834 ] Thomas Graves commented on YARN-3434: - Note I had a reproducible test case for this. Set userlimit% to 100% and user limit factor to 1; 15 nodes, 20GB each; one queue configured for capacity 70, the second queue configured for capacity 30. In the first queue I started a sleep job needing 10 12GB containers. I then started a second job in the second queue that needed 25 12GB containers; the second job got containers but then had to reserve others while waiting for the first job to release some. Without this change, when the first job started releasing containers the second job would grab them and go over the user limit. With this fix it stayed within the user limit.
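Back-of-envelope arithmetic for the repro above, assuming the user limit works out to the queue's capacity share when userlimit% = 100% and ULF = 1 (that mapping is the assumption here; the GB figures come straight from the comment):

```java
// Hypothetical arithmetic for the reproducible test case described above.
public class ReproMath {

    static long clusterGb() {
        return 15 * 20;                          // 15 nodes x 20GB each = 300GB
    }

    static long secondQueueUserLimitGb() {
        long queueCapGb = clusterGb() * 30 / 100; // 2nd queue: 30% capacity = 90GB
        double userLimitFactor = 1.0;             // ULF = 1: one user capped at queue capacity
        return (long) (queueCapGb * userLimitFactor);
    }

    static long secondJobDemandGb() {
        return 25 * 12;                           // 25 containers x 12GB = 300GB asked for
    }
}
```

The second job's demand (300GB) far exceeds the 90GB user limit, so it reserves heavily; the bug was that those reservations then let it blow past 90GB as the first job's containers freed up.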
[jira] [Created] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
Thomas Graves created YARN-3434: --- Summary: Interaction between reservations and userlimit can result in significant ULF violation Key: YARN-3434 URL: https://issues.apache.org/jira/browse/YARN-3434 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Thomas Graves
[jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
[ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392751#comment-14392751 ] Thomas Graves commented on YARN-3434: - The issue here is that if we allow the user to continue past the user limit checks in assignContainers because they have reservations, then when it gets down into the assignContainer routine and it's allowed to get a container and the node has space, we don't double-check the user limit. We recheck in all other cases, but this one is missed.
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392687#comment-14392687 ] Thomas Graves commented on YARN-3432: - That will fix it for the capacity scheduler; we need to see if it breaks the FairScheduler though. Cluster metrics have wrong Total Memory when there is reserved memory on CS --- Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Brahma Reddy Battula I noticed that when reservations happen when using the Capacity Scheduler, the UI and web services report the wrong total memory. For example, I have 300GB of total memory in my cluster; I allocate 50 and I reserve 10. The cluster metrics for total memory get reported as 290GB. This was broken by https://issues.apache.org/jira/browse/YARN-656, so perhaps there is a difference between the fair scheduler and the capacity scheduler.
[jira] [Created] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
Thomas Graves created YARN-3432: --- Summary: Cluster metrics have wrong Total Memory when there is reserved memory on CS Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves
[jira] [Commented] (YARN-656) In scheduler UI, including reserved memory in Memory Total can make it exceed cluster capacity.
[ https://issues.apache.org/jira/browse/YARN-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391472#comment-14391472 ] Thomas Graves commented on YARN-656: Note this broke the UI, at least for the capacity scheduler: it now displays a total that is lacking the reserved memory. Perhaps this is a difference in how the fair scheduler and capacity scheduler keep track of allocations vs reservations. In scheduler UI, including reserved memory in Memory Total can make it exceed cluster capacity. - Key: YARN-656 URL: https://issues.apache.org/jira/browse/YARN-656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 2.1.0-beta Attachments: YARN-656-1.patch, YARN-656.patch Memory Total is currently a sum of availableMB, allocatedMB, and reservedMB. Including reservedMB in this sum can make the total exceed the capacity of the cluster.
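The 290GB-vs-300GB discrepancy from the YARN-3432 report above falls out of simple arithmetic, under the assumption (suggested by this thread) that the capacity scheduler subtracts reserved memory from availableMB while the fair scheduler does not:

```java
// Hypothetical sketch of the two "Memory Total" formulas under discussion.
// If reserved memory has already been subtracted from availableMB (as the
// capacity scheduler appears to do), dropping reservedMB from the sum
// under-reports the cluster's real capacity.
public class TotalMemorySketch {

    /** Pre-YARN-656 formula: available + allocated + reserved. */
    static long totalWithReserved(long availableMb, long allocatedMb, long reservedMb) {
        return availableMb + allocatedMb + reservedMb;
    }

    /** Post-YARN-656 formula: available + allocated only. */
    static long totalWithoutReserved(long availableMb, long allocatedMb) {
        return availableMb + allocatedMb;
    }
}
```

With the reported numbers (300GB cluster, 50 allocated, 10 reserved, leaving 240 available), the old formula yields 300 and the new one yields the buggy 290.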
[jira] [Commented] (YARN-1582) Capacity Scheduler: add a maximum-allocation-mb setting per queue
[ https://issues.apache.org/jira/browse/YARN-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14297753#comment-14297753 ] Thomas Graves commented on YARN-1582: - +1, looks good. Thanks Jason. Feel free to commit. Capacity Scheduler: add a maximum-allocation-mb setting per queue -- Key: YARN-1582 URL: https://issues.apache.org/jira/browse/YARN-1582 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 3.0.0, 0.23.10, 2.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1582-branch-0.23.patch, YARN-1582.002.patch, YARN-1582.003.patch We want to allow certain queues to use larger container sizes while limiting other queues to smaller container sizes. Setting it per queue will help prevent abuse, help limit the impact of reservations, and allow changes in the maximum container size to be rolled out more easily. One reason this is needed is that more application types are becoming available on YARN, and certain applications require more memory to run efficiently. While we want to allow for that, we don't want other applications to abuse it and start requesting bigger containers than what they really need. Note that we could base this on application type, but that might not be totally accurate either, since for example you might want to allow certain users on MapReduce to use larger containers while limiting other users of MapReduce to smaller containers.
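The per-queue setting described above amounts to an override-with-fallback lookup at request-validation time. A minimal sketch, with hypothetical method names; the real configuration key follows the capacity scheduler's yarn.scheduler.capacity.&lt;queue-path&gt;.maximum-allocation-mb naming:

```java
// Hypothetical sketch of a per-queue maximum-allocation-mb override.
// A queue-level value, when configured, takes precedence over the
// scheduler-wide maximum when validating a container request.
public class QueueMaxAllocationSketch {

    /** Effective per-queue max: the queue override if configured, else the cluster-wide max. */
    static long effectiveMaxAllocationMb(Long queueMaxMb, long clusterMaxMb) {
        return queueMaxMb != null ? queueMaxMb : clusterMaxMb;
    }

    /** A container request is rejected if it exceeds the effective per-queue maximum. */
    static boolean requestAllowed(long requestMb, Long queueMaxMb, long clusterMaxMb) {
        return requestMb <= effectiveMaxAllocationMb(queueMaxMb, clusterMaxMb);
    }
}
```

This is why the setting limits abuse per queue: a 6GB request can be rejected in a queue capped at 4GB even though the cluster-wide maximum would have permitted it.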
[jira] [Commented] (YARN-2828) Enable auto refresh of web pages (using http parameter)
[ https://issues.apache.org/jira/browse/YARN-2828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202033#comment-14202033 ] Thomas Graves commented on YARN-2828: - Auto refresh was removed because some pages load a lot of data and you may not actually want them to update; it can make debugging harder if you are looking at a lot of data and the screen keeps refreshing on you. I think the only way to bring it back is to make it optional. Enable auto refresh of web pages (using http parameter) --- Key: YARN-2828 URL: https://issues.apache.org/jira/browse/YARN-2828 Project: Hadoop YARN Issue Type: Improvement Reporter: Tim Robertson Priority: Minor The MR1 Job Tracker had a useful HTTP parameter, e.g. refresh=3, that could be appended to URLs to enable a page reload. This was very useful when developing mapreduce jobs, especially to watch counters changing. This is lost in the YARN interface. It could be implemented as a page element (e.g. a drop-down), but I'd recommend that the page not be made more cluttered, and simply bring back the optional refresh HTTP param. It worked really nicely.
[jira] [Commented] (YARN-443) allow OS scheduling priority of NM to be different than the containers it launches
[ https://issues.apache.org/jira/browse/YARN-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185578#comment-14185578 ] Thomas Graves commented on YARN-443: Can you be more specific: what is different about it, and why is it a problem? The trunk patch shows that there was an existing getRunCommand() routine (before this change), whereas the other didn't have one before (it looks like it was added for Windows support). allow OS scheduling priority of NM to be different than the containers it launches -- Key: YARN-443 URL: https://issues.apache.org/jira/browse/YARN-443 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 0.23.7, 2.0.4-alpha Attachments: YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, YARN-443-branch-0.23.patch, YARN-443-branch-2.patch, YARN-443-branch-2.patch, YARN-443-branch-2.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch, YARN-443.patch It would be nice if we could have the nodemanager run at a different OS scheduling priority than the containers so that you can still communicate with the nodemanager if the containers are out of control. On Linux we could launch the nodemanager at a higher priority, but then all the containers it launches would also be at that higher priority, so we need a way for the container executor to launch them at a lower priority. I'm not sure how this applies to Windows, if at all.
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149565#comment-14149565 ] Thomas Graves commented on YARN-1769: - Thanks for the review, Jason. I'll update the patch and remove some of the logging or make it truly debug. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation.
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Patch with log statements changed to debug.
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Attaching the same patch to kick Jenkins.
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Update tests to handle SystemMetricsPublisher.
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Fixed the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148098#comment-14148098 ] Thomas Graves commented on YARN-1769: - We've been running this on the cluster for quite a while now and it's showing great improvements in the time to get larger containers. I would like to put this in. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2431) NM restart: cgroup is not removed for reacquired containers
[ https://issues.apache.org/jira/browse/YARN-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121796#comment-14121796 ] Thomas Graves commented on YARN-2431: - +1. Thanks Jason! Feel free to check it in. NM restart: cgroup is not removed for reacquired containers --- Key: YARN-2431 URL: https://issues.apache.org/jira/browse/YARN-2431 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2431.patch The cgroup for a reacquired container is not being removed when the container exits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2419) RM applications page doesn't sort application id properly
Thomas Graves created YARN-2419: --- Summary: RM applications page doesn't sort application id properly Key: YARN-2419 URL: https://issues.apache.org/jira/browse/YARN-2419 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Thomas Graves The ResourceManager apps page doesn't sort the application ids properly when the app id rolls over. When it rolls over, the rolled-over application ids end up many pages down by the 0XXX numbers. I assume we just sort alphabetically, so we would need a special sorter that knows about application ids. -- This message was sent by Atlassian JIRA (v6.2#6252)
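A numeric-aware comparator along the lines this report asks for might look like the sketch below. It assumes ids of the form application_&lt;clusterTimestamp&gt;_&lt;sequence&gt;; the class and names here are hypothetical illustrations, not the RM web UI's actual sorter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class AppIdSort {
    // Compare ids such as "application_1408828549478_9999" numerically by
    // cluster timestamp, then by sequence number, instead of string order.
    static final Comparator<String> APP_ID_ORDER = (a, b) -> {
        String[] pa = a.split("_"), pb = b.split("_");
        int byTs = Long.compare(Long.parseLong(pa[1]), Long.parseLong(pb[1]));
        return byTs != 0 ? byTs : Long.compare(Long.parseLong(pa[2]), Long.parseLong(pb[2]));
    };

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>(Arrays.asList(
                "application_1408828549478_10000",
                "application_1408828549478_0001",
                "application_1408828549478_9999"));
        ids.sort(Comparator.naturalOrder()); // alphabetical: "_10000" lands before "_9999"
        System.out.println(ids);
        ids.sort(APP_ID_ORDER);              // numeric-aware: "_10000" sorts last
        System.out.println(ids);
    }
}
```

This demonstrates the bug: plain string comparison puts "10000" between the zero-padded "0XXX" ids and "9999", while the numeric comparator restores the expected order.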
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Fixed merge conflict. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2072) RM/NM UIs and webservices are missing vcore information
[ https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042565#comment-14042565 ] Thomas Graves commented on YARN-2072: - +1, looks good. Thanks Nathan. RM/NM UIs and webservices are missing vcore information --- Key: YARN-2072 URL: https://issues.apache.org/jira/browse/YARN-2072 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 3.0.0, 2.4.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: YARN-2072.patch, YARN-2072.patch Change RM and NM UIs and webservices to include virtual cores. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2072) RM/NM UIs and webservices are missing vcore information
[ https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-2072: Issue Type: Improvement (was: Bug) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2072) RM/NM UIs and webservices are missing vcore information
[ https://issues.apache.org/jira/browse/YARN-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035901#comment-14035901 ] Thomas Graves commented on YARN-2072: - Thanks Nathan, it mostly looks good. A few comments:
- In UserMetricsInfo, getReservedVirtualCores is returning allocatedVirtualCores and should be returning reservedVirtualCores.
- In MetricsOverviewTable you should capitalize the C in "VCores Reserved".
The only other question is whether we want to do something special when the DominantResourceCalculator isn't being used. Right now, with the CapacityScheduler and FifoScheduler, it ends up showing 1 vcore used per container. The FairScheduler reports what the user would be using, but I believe it doesn't enforce it. The FairScheduler reporting seems more intuitive, so perhaps we should change the CapacityScheduler/Fifo to do similar reporting; I think that would be a follow-up jira though. Sandy, do you have any comments on how this shows up with the FairScheduler? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-2171: Priority: Major (was: Critical) AMs block on the CapacityScheduler lock during allocate() - Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2171.patch, YARN-2171v2.patch When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-2171: Priority: Critical (was: Major) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-2171: Target Version/s: 2.5.0 (was: 0.23.11, 2.5.0) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Fixed the patch; I generated it from the wrong directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Patch to fix TestReservations after YARN-1474. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Upmerged to latest. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010358#comment-14010358 ] Thomas Graves commented on YARN-1769: - TestFairScheduler is failing for other reasons; see https://issues.apache.org/jira/browse/YARN-2105. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort
[ https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13972959#comment-13972959 ] Thomas Graves commented on YARN-1946: - {quote} We've replaced the normal AmFilter with one that doesn't proxy under ws/*, looking into its code {quote} Sorry, I don't understand what you are saying here. Are you saying your application has? If it doesn't proxy, it doesn't really follow the YARN security rules. You are correct that the YarnConfiguration is public, but getProxyHostAndPort, and now getProxyHostsAndPortsForAmFilter, handle other things for you: https, RM HA, etc. need Public interface for WebAppUtils.getProxyHostAndPort - Key: YARN-1946 URL: https://issues.apache.org/jira/browse/YARN-1946 Project: Hadoop YARN Issue Type: Sub-task Components: api, webapp Affects Versions: 2.4.0 Reporter: Thomas Graves Priority: Critical ApplicationMasters are supposed to go through the ResourceManager web app proxy if they have web UIs so they are properly secured. There is currently no public interface for ApplicationMasters to conveniently get the proxy host and port. There is a function in WebAppUtils, but that class is private. We should provide this as a utility since any properly written AM will need to do this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort
[ https://issues.apache.org/jira/browse/YARN-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971421#comment-13971421 ] Thomas Graves commented on YARN-1946: - Thanks for the suggestion Steve. That is only the proxy base /proxy/..., which I am actually also using: val proxy = WebAppUtils.getProxyHostAndPort(conf); val uriBase = "http://" + proxy + System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV). Do you know if there is a way to more easily get the proxy host and port? If you are an application that doesn't use the Hadoop HttpServer for webapps, then in order to use the web app proxy you have to install AmIpFilter directly and provide the PROXY_HOST and PROXY_URI_BASE. If you are using the HttpServer then you can use the AmFilterInitializer, which figures out the host and port for you; thus why I need the host and port. Note that I would basically copy this code into the application trying to do this, but if I need to copy it then more than likely other applications would also, so perhaps we should make public utility functions for it. Looking some more, I could also try to use the AmFilterInitializer by creating a class with the interface for FilterContainer and then calling initFilter directly. This however would specialize the code, whereas using the AmIpFilter directly fits in with installing any normal java servlet filter. I would prefer not to do the specialized code, since an application may run on multiple frameworks, yarn being one of them.
Also note that back in hadoop 0.23 this was easier to get (publicly available), as it was in YarnConfiguration.getProxyHostAndPort. It also looks like my code is out of date, as it was updated recently to handle HA: YARN-1811. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1931) Private API change in YARN-1824 in 2.4 broke compatibility with previous releases
[ https://issues.apache.org/jira/browse/YARN-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971567#comment-13971567 ] Thomas Graves commented on YARN-1931: - +1, thanks Sandy. I'll wait until this afternoon to commit in case Vinod has any further comments. Private API change in YARN-1824 in 2.4 broke compatibility with previous releases - Key: YARN-1931 URL: https://issues.apache.org/jira/browse/YARN-1931 Project: Hadoop YARN Issue Type: Bug Components: applications Affects Versions: 2.4.0 Reporter: Thomas Graves Assignee: Sandy Ryza Priority: Blocker Attachments: YARN-1931-1.patch, YARN-1931-2.patch, YARN-1931.patch YARN-1824 broke compatibility with previous 2.x releases by changing the APIs in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}. The old APIs should be added back in. This affects any ApplicationMasters that were using these APIs. It also breaks previously built MapReduce libraries from working with the new YARN release, as MR uses these APIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1942) ConverterUtils should not be Private
Thomas Graves created YARN-1942: --- Summary: ConverterUtils should not be Private Key: YARN-1942 URL: https://issues.apache.org/jira/browse/YARN-1942 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.4.0 Reporter: Thomas Graves ConverterUtils has a bunch of functions that are useful to application masters. It should either be made public, or we should make some of the utilities in it public, or we should provide other external APIs for application masters to use. Note that distributedshell and MR are both using these interfaces. For instance, the main use case I see right now is getting the application attempt id within the appmaster: String containerIdStr = System.getenv(Environment.CONTAINER_ID.name()); ContainerId containerId = ConverterUtils.toContainerId(containerIdStr); ApplicationAttemptId applicationAttemptId = containerId.getApplicationAttemptId(); I don't see any other way for the application master to get this information. If there is, please let me know. -- This message was sent by Atlassian JIRA (v6.2#6252)
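For illustration of what the ConverterUtils call in the report is doing: container id strings have the shape container_&lt;clusterTimestamp&gt;_&lt;appSeq&gt;_&lt;attemptNum&gt;_&lt;containerSeq&gt;, and the application attempt id can be recovered from the pieces. The helper below is a hypothetical standalone sketch of that string-level parsing, not the actual Hadoop implementation (which returns typed ContainerId/ApplicationAttemptId objects rather than strings).

```java
public class AttemptIdSketch {
    // Parse "container_<clusterTs>_<appSeq>_<attemptNum>_<containerSeq>"
    // and rebuild the corresponding application attempt id string.
    static String toApplicationAttemptId(String containerId) {
        String[] p = containerId.split("_");
        // p = ["container", clusterTs, appSeq, attemptNum, containerSeq]
        return String.format("appattempt_%s_%s_%06d",
                p[1], p[2], Integer.parseInt(p[3]));
    }

    public static void main(String[] args) {
        System.out.println(
            toApplicationAttemptId("container_1408828549478_0001_01_000001"));
    }
}
```

An AM would obtain the input string from the CONTAINER_ID environment variable, as shown in the issue description above.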
[jira] [Commented] (YARN-1942) ConverterUtils should not be Private
[ https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969695#comment-13969695 ] Thomas Graves commented on YARN-1942: - Note that Tez and Spark are also using these utils. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1942) ConverterUtils should not be Private
[ https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-1942:
Priority: Critical (was: Major)

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1942) ConverterUtils should not be Private
[ https://issues.apache.org/jira/browse/YARN-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated YARN-1942:
Target Version/s: 3.0.0, 2.4.1

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1946) need Public interface for WebAppUtils.getProxyHostAndPort
Thomas Graves created YARN-1946:
---

Summary: need Public interface for WebAppUtils.getProxyHostAndPort
Key: YARN-1946
URL: https://issues.apache.org/jira/browse/YARN-1946
Project: Hadoop YARN
Issue Type: Bug
Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves
Priority: Critical

ApplicationMasters are supposed to go through the ResourceManager web app proxy if they have web UIs, so they are properly secured. There is currently no public interface for ApplicationMasters to conveniently get the proxy host and port. There is a function in WebAppUtils, but that class is private. We should provide this as a utility, since any properly written AM will need to do this.

--
This message was sent by Atlassian JIRA (v6.2#6252)
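The lookup an AM needs here is essentially a config read with a fallback: use the dedicated proxy address if one is configured, otherwise the RM serves the proxy itself, so fall back to the RM webapp address. A minimal sketch of that logic, using java.util.Properties in place of Hadoop's Configuration (the key names follow yarn-default.xml, but the helper class, method, and the default address are illustrative assumptions, not the actual WebAppUtils code):

```java
import java.util.Properties;

public class ProxyAddressSketch {
    // Sketch of the fallback: a standalone proxy address wins if set;
    // otherwise assume the proxy runs inside the RM and use its webapp
    // address. "0.0.0.0:8088" is a placeholder default for this sketch.
    static String getProxyHostAndPort(Properties conf) {
        String addr = conf.getProperty("yarn.web-proxy.address");
        if (addr == null || addr.isEmpty()) {
            addr = conf.getProperty("yarn.resourcemanager.webapp.address",
                                    "0.0.0.0:8088");
        }
        return addr;
    }
}
```

A public utility with this behavior would let an AM build its tracking URL without depending on the private yarn-server internals.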
[jira] [Commented] (YARN-1931) Private API change in YARN-1824 in 2.4 broke compatibility with previous releases
[ https://issues.apache.org/jira/browse/YARN-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970099#comment-13970099 ]

Thomas Graves commented on YARN-1931:
-

I also agree it makes more sense to make a new utility class. Put these back for backwards compatibility, but ask people to move to the new API. Should this class be marked LimitedPrivate({MapReduce, Yarn})? It's also missing the interface stability annotation.

Private API change in YARN-1824 in 2.4 broke compatibility with previous releases
-
Key: YARN-1931
URL: https://issues.apache.org/jira/browse/YARN-1931
Project: Hadoop YARN
Issue Type: Bug
Components: applications
Affects Versions: 2.4.0
Reporter: Thomas Graves
Assignee: Sandy Ryza
Priority: Blocker
Attachments: YARN-1931-1.patch, YARN-1931.patch

YARN-1824 broke compatibility with previous 2.x releases by changing the APIs in org.apache.hadoop.yarn.util.Apps.{setEnvFromInputString,addToEnvironment}. The old API should be added back in. This affects any ApplicationMasters who were using this API. It also breaks previously built MapReduce libraries from working with the new YARN release, as MR uses this API.

--
This message was sent by Atlassian JIRA (v6.2#6252)
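The "put these back" fix discussed above is the standard compatibility pattern: restore the old signature as a deprecated shim that delegates to the new one. A schematic example, assuming a new overload that takes an explicit separator (the class and the parsing details are invented for illustration; only the setEnvFromInputString name echoes Apps):

```java
import java.util.Map;

public class EnvUtilSketch {
    /** Newer API: takes an explicit entry separator. */
    public static void setEnvFromInputString(Map<String, String> env,
                                             String input, String separator) {
        for (String pair : input.split(separator)) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                env.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    /**
     * Old two-argument API restored so previously built callers keep
     * linking; it simply delegates with the historical default separator.
     * New code should move to the three-argument form.
     */
    @Deprecated
    public static void setEnvFromInputString(Map<String, String> env,
                                             String input) {
        setEnvFromInputString(env, input, ",");
    }
}
```

Keeping the old method byte-for-byte signature-compatible is what lets already-compiled MR (or Tez/Spark) jars run against the new release without recompilation.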
[jira] [Created] (YARN-1939) Improve the packaging of AmIpFilter
Thomas Graves created YARN-1939:
---

Summary: Improve the packaging of AmIpFilter
Key: YARN-1939
URL: https://issues.apache.org/jira/browse/YARN-1939
Project: Hadoop YARN
Issue Type: Improvement
Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves

It is recommended for applications to use the AmIpFilter to properly secure any WebUI that is specific to that application. The AmIpFilter lives in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application to pull in yarn-server as a dependency; that isn't very user friendly for applications wanting to pick up the bare minimum. We should improve the packaging so it can be pulled in independently. We also need to be careful to keep it backwards compatible, at least in the 2.x release line.

--
This message was sent by Atlassian JIRA (v6.2#6252)