[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

Zhankun Tang (JIRA) Tue, 17 Oct 2017 23:25:03 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208861#comment-16208861
 ]


Zhankun Tang commented on YARN-6620:
------------------------------------

[~wangda], thanks for the clarification.
The code below, which confused me previously, is clear now:

{code:java}
public static final Map<String, ResourceInformation> MANDATORY_RESOURCES =
      ImmutableMap.of(MEMORY_URI, MEMORY_MB, VCORES_URI, VCORES, GPU_URI, GPUS);
...
private static void checkMandatoryResources(
...
if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
            actualType)) {
  ...
}
...
}
{code}

The above code indicates that "yarn.io/gpu" must be defined by the admin in 
resource-types.xml (type name) and node-resources.xml (total count), with 
exactly the unit and type that YARN expects. On the other hand, the 
admin-allowed minor device numbers are declared in yarn-site.xml. Finally, the 
major and minor device numbers are also declared in the gpu section of 
container-executor.cfg (by the root user).
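
To make the check quoted above concrete, here is a minimal, self-contained 
sketch of how I read it. It is not the actual YARN code: ResourceSpec is a 
hypothetical stand-in for ResourceInformation, and the unit and type strings 
("Mi", "", "COUNTABLE"), the skipping of undeclared resources, and the 
exception type are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MandatoryResourceCheck {

  /** Hypothetical stand-in for YARN's ResourceInformation: only unit and type. */
  static final class ResourceSpec {
    final String unit;
    final String type;

    ResourceSpec(String unit, String type) {
      this.unit = unit;
      this.type = type;
    }
  }

  /** Expected definitions; the unit and type values here are assumptions. */
  static final Map<String, ResourceSpec> MANDATORY_RESOURCES = new LinkedHashMap<>();

  static {
    MANDATORY_RESOURCES.put("memory-mb", new ResourceSpec("Mi", "COUNTABLE"));
    MANDATORY_RESOURCES.put("vcores", new ResourceSpec("", "COUNTABLE"));
    MANDATORY_RESOURCES.put("yarn.io/gpu", new ResourceSpec("", "COUNTABLE"));
  }

  /** Rejects an admin-declared resource whose unit or type differs from the expected one. */
  static void checkMandatoryResources(Map<String, ResourceSpec> configured) {
    for (Map.Entry<String, ResourceSpec> entry : MANDATORY_RESOURCES.entrySet()) {
      ResourceSpec expected = entry.getValue();
      ResourceSpec actual = configured.get(entry.getKey());
      if (actual == null) {
        continue; // assumption: a resource the admin did not declare is not validated here
      }
      if (!expected.unit.equals(actual.unit) || !expected.type.equals(actual.type)) {
        throw new IllegalArgumentException(
            "Resource '" + entry.getKey() + "' must have unit '" + expected.unit
                + "' and type '" + expected.type + "'");
      }
    }
  }

  public static void main(String[] args) {
    // An admin declares yarn.io/gpu with an unexpected unit "G".
    Map<String, ResourceSpec> configured = new LinkedHashMap<>();
    configured.put("yarn.io/gpu", new ResourceSpec("G", "COUNTABLE"));
    try {
      checkMandatoryResources(configured);
      System.out.println("configuration accepted");
    } catch (IllegalArgumentException e) {
      System.out.println("configuration rejected: " + e.getMessage());
    }
  }
}
```

In this sketch, a "yarn.io/gpu" declaration with a mismatched unit is rejected, 
while the same declaration with an empty unit and type COUNTABLE passes.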

And as we mentioned before, even with the same "yarn.io/gpu" resource type, 
GPUs from different vendors can be distinguished by node attributes to meet 
scheduling needs in a heterogeneous cluster. More broadly, though, if a 
vendor's device needs a different toolchain for discovery or flashing (in FPGA 
cases), the current single resource handler instance might not be enough to 
handle all toolchain operations.

Anyway, I'm satisfied with the current design and let's evolve it when we get 
more cases.


> Add support in NodeManager to isolate GPU devices by using CGroups
> ------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>             Fix For: 3.1.0
>
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch, 
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, 
> YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, 
> YARN-6620.009.patch, YARN-6620.010.patch, YARN-6620.011.patch, 
> YARN-6620.012.patch, YARN-6620.013.patch, YARN-6620.014.patch, 
> YARN-6620.015.patch, YARN-6620.016.patch, YARN-6620.017.patch
>
>
> This JIRA plans to add support for:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups (Java side).
> 3) NM restart and recovery of allocated GPU devices



