[
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169655#comment-16169655
]
Zhankun Tang commented on YARN-6620:
------------------------------------
{quote}
Good point, I think we should use node attribute to distinguish them. I think
this might be unavoidable: different DL workload needs different driver
versions / GPU architectures, and different frameworks like OpenCL/CUDA, we
need node attribute anyway.
{quote}
[~wangda], yes, node attributes are a must.
And another thing just came to mind: do we need to support one physical
machine with GPU cards from two different vendors? If this scenario is a real
requirement, we may need to extend the resource handler to manage several
different plugins (I've done this in the prior FPGA patch), as below:
1. In the "bootstrap" method, each GPU vendor's plugin registers with a single
GPU resource handler under the resource name it can handle. For instance,
plugin A registers a resource "A-GPU" and plugin B registers "B-GPU". The GPU
resource handler holds records of <resourceName, pluginInstance>.
2. When "preStart" is invoked, it retrieves the ResourceInformation array from
container.getResource().getResources() to find the proper GPU vendor plugin and
do the plugin callback (or perhaps no callback is needed for GPU; it seems to
be needed for FPGA). It then uses the GPU allocator to allocate the requested
count of this specific type of GPU in a round-robin manner, and finally does
the cgroups isolation.
3. Now back to the AM: it's possible to request a container with one "A-GPU"
named resource in the containerRequest and the node attribute "CUDA v1" at the
same time.
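To make steps 1 and 2 concrete, here is a rough sketch of the idea, not the
patch itself. Every type and method name in it (GpuHandlerSketch,
GpuVendorPlugin, StaticPlugin, onPreStart) is made up for illustration and is
not an existing YARN class. A plain Map<String, Integer> stands in for the
ResourceInformation array, the round-robin cursor does not track which devices
are already in use, and the cgroups step is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types are actual YARN classes.
class GpuHandlerSketch {

  /** Hypothetical vendor plugin: names its resource and lists its devices. */
  interface GpuVendorPlugin {
    String resourceName();                 // e.g. "A-GPU"
    List<Integer> deviceMinorNumbers();    // devices found on this node
    default void onPreStart(String containerId) { } // optional callback
  }

  /** Trivial plugin backed by a fixed device list, for demonstration. */
  static final class StaticPlugin implements GpuVendorPlugin {
    private final String name;
    private final List<Integer> devices;
    StaticPlugin(String name, List<Integer> devices) {
      this.name = name;
      this.devices = devices;
    }
    @Override public String resourceName() { return name; }
    @Override public List<Integer> deviceMinorNumbers() { return devices; }
  }

  // The <resourceName, pluginInstance> records from step 1.
  private final Map<String, GpuVendorPlugin> plugins = new LinkedHashMap<>();
  // Per-resource round-robin cursor into the plugin's device list.
  private final Map<String, Integer> nextIndex = new LinkedHashMap<>();

  /** Step 1: every vendor plugin registers under its resource name. */
  void bootstrap(List<? extends GpuVendorPlugin> discovered) {
    for (GpuVendorPlugin p : discovered) {
      if (plugins.putIfAbsent(p.resourceName(), p) != null) {
        throw new IllegalStateException("Duplicate plugin: " + p.resourceName());
      }
      nextIndex.put(p.resourceName(), 0);
    }
  }

  /**
   * Step 2: for each requested resource, find the matching plugin, run its
   * callback, and pick the requested number of devices round-robin.
   * Resources with no registered plugin (e.g. memory) are skipped.
   */
  Map<String, List<Integer>> preStart(String containerId,
                                      Map<String, Integer> requested) {
    Map<String, List<Integer>> assigned = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : requested.entrySet()) {
      GpuVendorPlugin plugin = plugins.get(e.getKey());
      int count = e.getValue();
      if (plugin == null || count <= 0) {
        continue;
      }
      List<Integer> devices = plugin.deviceMinorNumbers();
      if (count > devices.size()) {
        throw new IllegalArgumentException(
            "Requested " + count + " of " + e.getKey()
                + " but only " + devices.size() + " present");
      }
      plugin.onPreStart(containerId);
      int start = nextIndex.get(e.getKey());
      List<Integer> picked = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        picked.add(devices.get((start + i) % devices.size()));
      }
      nextIndex.put(e.getKey(), (start + count) % devices.size());
      assigned.put(e.getKey(), picked);
      // cgroups isolation for the picked devices would follow here.
    }
    return assigned;
  }

  public static void main(String[] args) {
    GpuHandlerSketch handler = new GpuHandlerSketch();
    handler.bootstrap(List.of(
        new StaticPlugin("A-GPU", List.of(0, 1, 2)),
        new StaticPlugin("B-GPU", List.of(7))));
    System.out.println(handler.preStart("container_1", Map.of("A-GPU", 2)));
  }
}
```

Keying the handler's map by resource name is what lets a container ask for
"A-GPU" and "B-GPU" independently on the same host; the node attribute from
step 3 would still be matched by the scheduler, not by this handler.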
I'm not sure whether one host with devices from different vendors is a real
requirement. If so, it may raise another concern about our current design,
since we implicitly treat them as the same resource. Any ideas?
> [YARN-6223] NM Java side code changes to support isolate GPU devices by using
> CGroups
> -------------------------------------------------------------------------------------
>
> Key: YARN-6620
> URL: https://issues.apache.org/jira/browse/YARN-6620
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Attachments: YARN-6620.001.patch, YARN-6620.002.patch,
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch,
> YARN-6620.006-WIP.patch
>
>
> This JIRA plan to add support of:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery allocated GPU devices