[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216105#comment-16216105 ]

Jonathan Hung commented on YARN-6620:
-------------------------------------

I see, thanks [~leftnoteasy], that makes sense. Seems this is not important for GPU since people will use the GPU plugin for minor numbers / GPU isolation (and therefore number of GPUs per node). But this is probably needed for other resources which don't update via plugin. Created YARN-7383 for this issue.

> Add support in NodeManager to isolate GPU devices by using CGroups
> ------------------------------------------------------------------
>
>                 Key: YARN-6620
>                 URL: https://issues.apache.org/jira/browse/YARN-6620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>             Fix For: 3.1.0
>
>         Attachments: YARN-6620.001.patch, YARN-6620.002.patch,
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch,
> YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch,
> YARN-6620.009.patch, YARN-6620.010.patch, YARN-6620.011.patch,
> YARN-6620.012.patch, YARN-6620.013.patch, YARN-6620.014.patch,
> YARN-6620.015.patch, YARN-6620.016.patch, YARN-6620.017.patch
>
>
> This JIRA plan to add support of:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery allocated GPU devices

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216051#comment-16216051 ]

Wangda Tan commented on YARN-6620:
----------------------------------

[~jhung], thanks for reporting this. I think the parsing logic should be changed, since we want to support the namespace concept in resource names. Please file a bug under YARN-7069 accordingly.

I haven't tried using node-resources.xml to set GPU configs before. What works for me is to set {{YarnConfiguration.NM_GPU_ALLOWED_DEVICES}} either to the allowed devices (such as {{0,1,2,3}}) or to {{auto}}, so the NM can discover them automatically. Inside {{NodeStatusUpdaterImpl#serviceInit}}, it calls {{updateConfiguredResourcesViaPlugins}} to overwrite {{totalResources}} after loading from node-resources.xml.
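To illustrate the override order described above, here is a minimal sketch in plain Java. It is not Hadoop code: the class {{GpuAllowedDevicesSketch}} and both of its methods are hypothetical, and the value format (a comma-separated device list or {{auto}}) is taken only from the example in this comment. It shows the device list replacing whatever GPU total was loaded from node-resources.xml.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch, not part of Hadoop. */
final class GpuAllowedDevicesSketch {
    static final String GPU = "yarn.io/gpu";

    /**
     * Parses a value such as "0,1,2,3". An empty Optional stands for "auto",
     * meaning discovery is left to the NodeManager (assumed behaviour).
     */
    static Optional<List<Integer>> parseAllowedDevices(String value) {
        String trimmed = value.trim();
        if (trimmed.equalsIgnoreCase("auto")) {
            return Optional.empty();
        }
        List<Integer> devices = new ArrayList<>();
        for (String token : trimmed.split(",")) {
            String t = token.trim();
            if (!t.isEmpty()) {
                devices.add(Integer.parseInt(t));
            }
        }
        return Optional.of(devices);
    }

    /** Returns a copy of the loaded totals with the GPU count replaced by the device count. */
    static Map<String, Long> overwriteGpuTotal(Map<String, Long> fromNodeResourcesXml,
                                               List<Integer> devices) {
        Map<String, Long> totals = new HashMap<>(fromNodeResourcesXml);
        totals.put(GPU, (long) devices.size());
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> loaded = new HashMap<>();
        loaded.put(GPU, 8L); // value that node-resources.xml would have supplied
        List<Integer> devices =
            parseAllowedDevices("0,1,2,3").orElseThrow(IllegalStateException::new);
        System.out.println(overwriteGpuTotal(loaded, devices).get(GPU)); // prints 4
    }
}
```

In this sketch the plugin-derived count always wins, which matches the observation that the value from node-resources.xml does not matter for GPU.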
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216012#comment-16216012 ]

Jonathan Hung commented on YARN-6620:
-------------------------------------

Not sure I follow the naming convention of GPU resource. I ran into some issues when trying to initialize GPU capability for a nodemanager. It seems NodeManagerHardwareUtils#getNodeResources is responsible for getting a node's total resources. But when it tries to parse it, ResourceUtils#addResourceInformation uses
{noformat}
String[] parts = prop.split("\\.");
LOG.info("Found resource entry " + prop);
if (parts.length == 4) {
  String resourceType = parts[3];
  if (!nodeResources.containsKey(resourceType)) {
    nodeResources
        .put(resourceType, ResourceInformation.newInstance(resourceType));
  }
  String units = getUnits(value);
  Long resourceValue =
      Long.valueOf(value.substring(0, value.length() - units.length()));
  nodeResources.get(resourceType).setValue(resourceValue);
  nodeResources.get(resourceType).setUnits(units);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Setting value for resource type " + resourceType + " to "
        + resourceValue + " with units " + units);
  }
}
{noformat}
for this. But since the resource name for GPU ({{yarn.io/gpu}}) has a "." in it, it's not parsing correctly. The configuration set in {{node-resources.xml}} was {{yarn.nodemanager.resource-type.yarn.io/gpu}}.

Perhaps the GPU_URI should be renamed? Or the parsing logic should be changed. (Or I have something misconfigured.)
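The failure can be reproduced outside Hadoop. The sketch below is a hypothetical stand-alone class ({{ResourceKeyParsing}} is not a Hadoop class). It mirrors the {{parts.length == 4}} check quoted above and contrasts it with one possible alternative, stripping the known key prefix, which is only an assumption about how the parsing could be changed and not the fix that was actually adopted.

```java
import java.util.Optional;

/** Hypothetical helper, not part of Hadoop: illustrates the key-parsing problem. */
final class ResourceKeyParsing {
    // Prefix taken from the configuration key quoted in the comment above.
    static final String PREFIX = "yarn.nodemanager.resource-type.";

    /** Mirrors the quoted logic: accept only keys with exactly 4 dot-separated parts. */
    static Optional<String> parseBySplit(String prop) {
        String[] parts = prop.split("\\.");
        return parts.length == 4 ? Optional.of(parts[3]) : Optional.empty();
    }

    /** Assumed alternative: everything after the prefix is the resource name. */
    static Optional<String> parseByPrefix(String prop) {
        if (prop.startsWith(PREFIX) && prop.length() > PREFIX.length()) {
            return Optional.of(prop.substring(PREFIX.length()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String key = PREFIX + "yarn.io/gpu";
        // The namespace "yarn.io" adds one more dot, so the key has 5 parts, not 4.
        System.out.println(key.split("\\.").length);        // prints 5
        System.out.println(parseBySplit(key).isPresent());  // prints false
        System.out.println(parseByPrefix(key).get());       // prints yarn.io/gpu
    }
}
```

Because the split-based check silently skips the entry, the GPU value from node-resources.xml is never recorded, which matches the behaviour reported here.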
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208861#comment-16208861 ]

Zhankun Tang commented on YARN-6620:
------------------------------------

[~wangda], thanks for the clarification. The code below, which confused me previously, is clear now:
{code:java}
public static final Map MANDATORY_RESOURCES =
    ImmutableMap.of(MEMORY_URI, MEMORY_MB, VCORES_URI, VCORES, GPU_URI, GPUS);
...
private static void checkMandatoryResources(
...
  if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
      actualType)) {
  ...
  }
...
}
{code}
The above code indicates that "yarn.io/gpu" should be defined by the admin in resource-types.xml (type name) and node-resources.xml (total count), exactly matching what YARN expects. On the other hand, the admin-allowed minor device numbers are declared in yarn-site.xml. Finally, the major and minor device numbers are also declared in the gpu section of container-executor.cfg (by the root user).

And as we mentioned before, even when using the same "yarn.io/gpu" name, GPUs from different vendors can be handled through node attributes to meet scheduling needs in a heterogeneous cluster. More broadly, though, if a vendor's device needs a different toolchain for discovery or flashing (in FPGA cases), a single resource handler instance might not be enough to handle all toolchain operations.

Anyway, I'm satisfied with the current design; let's evolve it when we get more cases.
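The unit/type comparison in the elided snippet can be pictured with the stand-alone sketch below. It does not reproduce the Hadoop implementation: the class {{MandatoryResourceCheckSketch}}, its nested {{Info}} type, the expected units, and the choice to skip resources that are not configured are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a unit/type consistency check, not Hadoop code. */
final class MandatoryResourceCheckSketch {
    /** Minimal stand-in for a resource definition: units plus type. */
    static final class Info {
        final String units;
        final String type;

        Info(String units, String type) {
            this.units = units;
            this.type = type;
        }
    }

    // Assumed expectations; the real names and units live in the Hadoop source.
    static final Map<String, Info> EXPECTED = new LinkedHashMap<>();
    static {
        EXPECTED.put("memory-mb", new Info("Mi", "COUNTABLE"));
        EXPECTED.put("vcores", new Info("", "COUNTABLE"));
        EXPECTED.put("yarn.io/gpu", new Info("", "COUNTABLE"));
    }

    /** Throws if a configured resource disagrees with the expected units or type. */
    static void check(Map<String, Info> configured) {
        for (Map.Entry<String, Info> e : EXPECTED.entrySet()) {
            Info actual = configured.get(e.getKey());
            if (actual == null) {
                continue; // not configured: nothing to validate in this sketch
            }
            Info expected = e.getValue();
            if (!expected.units.equals(actual.units) || !expected.type.equals(actual.type)) {
                throw new IllegalArgumentException("Resource '" + e.getKey()
                    + "' must use units '" + expected.units
                    + "' and type '" + expected.type + "'");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Info> configured = new LinkedHashMap<>();
        configured.put("yarn.io/gpu", new Info("", "COUNTABLE"));
        check(configured); // passes: matches the assumed expectation
        System.out.println("ok");
    }
}
```

Declaring "yarn.io/gpu" with, say, a "G" unit would make {{check}} throw in this sketch, which is the sense in which the admin's definition has to match YARN's expectation.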
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208590#comment-16208590 ]

Wangda Tan commented on YARN-6620:
----------------------------------

[~tangzhankun],

I may not have made it clear: what I meant is that GPU should be a first-class resource rather than a mandatory resource. To me, the only mandatory resources for now are memory and vcores; in the future we might add network/disk as mandatory resources.

The definition of a mandatory resource: a resource that is required in order to run a process.
The definition of a first-class resource: a resource that is officially supported by YARN.

For your questions:

bq. 1. First-class resource should be parsed from resource-types.xml and node-resources.xml(or auto discover) instead of yarn-site.xml?
To me, all resources beyond memory/vcores (which are special for historical reasons) should be defined in resource-types.xml and node-resources.xml, regardless of whether they are mandatory or first-class.

bq. 2. First-class resource handler should register itself with the same resource name defined in xml files?
To me this is true when resource isolation on the NM side is required. All first-class resources should start with the "yarn.io/" namespace.

bq. 3. First-class resource should be shown in a separate user-defined column in web pages?
I'm not sure about this. In the future we may add more and more first-class / mandatory resources, and it might be too much to add a column for every new resource. To me the ideal solution is that the user can select and filter columns in the web UI (supporting this in the new UI should be a trivial task).
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207026#comment-16207026 ]

Zhankun Tang commented on YARN-6620:
------------------------------------

[~wangda], thanks for the great effort. I'll implement the FPGA resource plugin (supporting only OpenCL FPGA applications for now) based on that.

One question: in the discussions above you mentioned that GPU is a mandatory resource like CPU and memory, but what is the difference between a mandatory and a first-class resource? And is there a list of first-class resources?

Currently, if I understand correctly:
1. First-class resource should be parsed from resource-types.xml and node-resources.xml instead of yarn-site.xml?
2. First-class resource handler should register itself with the same resource name defined in xml files?
3. First-class resource should be shown in a separate user-defined column in web pages?
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200864#comment-16200864 ]

Wangda Tan commented on YARN-6620:
----------------------------------

Thanks [~sunilg] for committing the patch, thanks [~devaraj.k]/[~tangzhankun] for reviewing the patch and thanks [~hex108] for offline suggestions.
[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200723#comment-16200723 ]

Hudson commented on YARN-6620:
------------------------------

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #13073 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/13073/])
YARN-6620. Add support in NodeManager to isolate GPU devices by using (sunilg: rev fa5cfc68f37c78b6cf26ce13247b9ff34da806cd)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperation.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/TestGpuDeviceInformationParser.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/TestResourcePluginManager.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/ResourcePlugin.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuNodeResourceUpdateHandler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TestResourceHandlerModule.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/PerGpuMemoryUsage.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/PerGpuDeviceInformation.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/BaseAMRMProxyTest.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceInformation.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/ResourceHandlerChain.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/GpuDeviceInformation.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitorResourceChange.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux