[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025156#comment-17025156 ]
Thomas Graves commented on YARN-8200:
-------------------------------------

Hey [~jhung], I am trying out GPU scheduling in Hadoop 2.10, and the first thing I noticed is that it doesn't error properly if you ask for too many GPUs. It seems to happily say it gave them to me, although I think it's really giving me the max configured. Is this a known issue already, or did the configuration change?

I have the GPU max configured at 4 and I try to allocate 8. On Hadoop 3 I get:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater than maximum allowed allocation. Requested resource=<memory:1408, vCores:1, yarn.io/gpu: 8>, maximum allowed allocation=<memory:8192, vCores:4, yarn.io/gpu: 4>, please note that maximum allowed allocation is calculated by scheduler based on maximum resource of registered NodeManagers, which might be less than configured maximum allocation=<memory:8192, vCores:4, yarn.io/gpu: 10>

On Hadoop 2.10 I get a container allocated, but the logs and UI say it only has 4 GPUs.

> Backport resource types/GPU features to branch-3.0/branch-2
> -----------------------------------------------------------
>
>                 Key: YARN-8200
>                 URL: https://issues.apache.org/jira/browse/YARN-8200
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Major
>              Labels: release-blocker
>             Fix For: 2.10.0
>
>         Attachments: YARN-8200-branch-2.001.patch,
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch,
> YARN-8200-branch-3.0.001.patch,
> counter.scheduler.operation.allocate.csv.defaultResources,
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support
> deep learning workloads. However, our main production clusters are running
> older versions of branch-2 (2.7 in our case).
> To prevent supporting too many very different hadoop versions across
> multiple clusters, we would like to backport the resource types/resource
> profiles feature to branch-2, as well as the GPU specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth.
> We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature
> branch and then merge to branch-2 when ready.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org