[ https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025156#comment-17025156 ]
Thomas Graves commented on YARN-8200:
-------------------------------------
Hey [~jhung],
I am trying out the GPU scheduling in Hadoop 2.10, and the first thing I noticed
is that it doesn't error properly if you ask for too many GPUs. It seems to
happily say it gave them to me, although I think it's really giving me the max
configured. Is this a known issue already, or did the configuration change?
I have the GPU max configured at 4 and I try to allocate 8. On Hadoop 3 I get:
Caused by:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException):
Invalid resource request, requested resource type=[yarn.io/gpu] < 0 or greater
than maximum allowed allocation. Requested resource=<memory:1408, vCores:1,
yarn.io/gpu: 8>, maximum allowed allocation=<memory:8192, vCores:4,
yarn.io/gpu: 4>, please note that maximum allowed allocation is calculated by
scheduler based on maximum resource of registered NodeManagers, which might be
less than configured maximum allocation=<memory:8192, vCores:4, yarn.io/gpu: 10>
On Hadoop 2.10 I get a container allocated, but the logs and UI say it only has
4 GPUs.
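The difference between the two behaviors (rejecting an oversized request versus
silently capping it) can be sketched as below. This is a minimal illustration in
plain Java, not the actual YARN scheduler code: the class name
ResourceRequestCheck, its methods, and the choice of IllegalArgumentException
are all hypothetical, and treating an unknown resource type as having a maximum
of 0 is an assumption. Only the resource values and the message wording are
taken from the Hadoop 3 error above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical and this is not YARN code. */
public class ResourceRequestCheck {

    /**
     * Returns the first resource type whose requested value is negative or
     * above the maximum allowed allocation, or null if the request is valid.
     */
    static String firstInvalidType(Map<String, Long> requested,
                                   Map<String, Long> maxAllowed) {
        for (Map.Entry<String, Long> entry : requested.entrySet()) {
            long value = entry.getValue();
            // Assumption: a type absent from maxAllowed has a maximum of 0.
            long max = maxAllowed.getOrDefault(entry.getKey(), 0L);
            if (value < 0 || value > max) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Rejects an invalid request instead of silently capping it. */
    static void validate(Map<String, Long> requested,
                         Map<String, Long> maxAllowed) {
        String bad = firstInvalidType(requested, maxAllowed);
        if (bad != null) {
            throw new IllegalArgumentException(
                "Invalid resource request, requested resource type=[" + bad
                + "] < 0 or greater than maximum allowed allocation."
                + " Requested resource=" + requested
                + ", maximum allowed allocation=" + maxAllowed);
        }
    }

    public static void main(String[] args) {
        // Values taken from the error message quoted above.
        Map<String, Long> maxAllowed = new LinkedHashMap<>();
        maxAllowed.put("memory", 8192L);
        maxAllowed.put("vCores", 4L);
        maxAllowed.put("yarn.io/gpu", 4L);

        Map<String, Long> requested = new LinkedHashMap<>();
        requested.put("memory", 1408L);
        requested.put("vCores", 1L);
        requested.put("yarn.io/gpu", 8L);

        try {
            validate(requested, maxAllowed);
            System.out.println("request accepted");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check of this shape, asking for 8 GPUs against a maximum of 4 fails up
front, which matches what Hadoop 3 reports; the 2.10 behavior described here
looks instead like the request being reduced to the maximum without any error.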
> Backport resource types/GPU features to branch-3.0/branch-2
> -----------------------------------------------------------
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Priority: Major
> Labels: release-blocker
> Fix For: 2.10.0
>
> Attachments: YARN-8200-branch-2.001.patch,
> YARN-8200-branch-2.002.patch, YARN-8200-branch-2.003.patch,
> YARN-8200-branch-3.0.001.patch,
> counter.scheduler.operation.allocate.csv.defaultResources,
> counter.scheduler.operation.allocate.csv.gpuResources, synth_sls.json
>
>
> Currently we have a need for GPU scheduling on our YARN clusters to support
> deep learning workloads. However, our main production clusters are running
> older versions of branch-2 (2.7 in our case). To prevent supporting too many
> very different hadoop versions across multiple clusters, we would like to
> backport the resource types/resource profiles feature to branch-2, as well as
> the GPU specific support.
>
> We have done a trial backport of YARN-3926 and some miscellaneous patches in
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth.
> We also did a trial backport of most of YARN-6223 (sans docker support).
>
> Regarding the backports, perhaps we can do the development in a feature
> branch and then merge to branch-2 when ready.