[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-23 Thread Jonathan Hung (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216105#comment-16216105
 ] 

Jonathan Hung commented on YARN-6620:
-

I see, thanks [~leftnoteasy], that makes sense.

It seems this is not important for GPU, since people will use the GPU plugin 
for minor numbers / GPU isolation (and therefore for the number of GPUs per 
node). But it is probably needed for other resources that are not updated via 
a plugin. I created YARN-7383 for this issue.

> Add support in NodeManager to isolate GPU devices by using CGroups
> --
>
> Key: YARN-6620
> URL: https://issues.apache.org/jira/browse/YARN-6620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 3.1.0
>
> Attachments: YARN-6620.001.patch, YARN-6620.002.patch, 
> YARN-6620.003.patch, YARN-6620.004.patch, YARN-6620.005.patch, 
> YARN-6620.006-WIP.patch, YARN-6620.007.patch, YARN-6620.008.patch, 
> YARN-6620.009.patch, YARN-6620.010.patch, YARN-6620.011.patch, 
> YARN-6620.012.patch, YARN-6620.013.patch, YARN-6620.014.patch, 
> YARN-6620.015.patch, YARN-6620.016.patch, YARN-6620.017.patch
>
>
> This JIRA plan to add support of:
> 1) GPU configuration for NodeManagers
> 2) Isolation in CGroups. (Java side).
> 3) NM restart and recovery allocated GPU devices



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216051#comment-16216051
 ] 

Wangda Tan commented on YARN-6620:
--

[~jhung], thanks for reporting this. 

I think the parsing logic should be changed, since we want to support the 
namespace concept in resource names. Please file a bug under YARN-7069 
accordingly.

I haven't tried using node-resources.xml to set GPU configs before. What 
works for me is to set {{YarnConfiguration.NM_GPU_ALLOWED_DEVICES}} to the 
allowed devices (such as {{0,1,2,3}}), or to {{auto}} so the NM can discover 
them automatically. {{NodeStatusUpdaterImpl#serviceInit}} calls 
{{updateConfiguredResourcesViaPlugins}} to overwrite {{totalResources}} after 
loading from node-resources.xml.
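
The override order described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the NodeManager code: {{CountingPlugin}}, {{gpuPlugin}} and {{totalResources}} are hypothetical names, and the "discovered" count for {{auto}} is just a parameter.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NodeResourceOverrideSketch {

  /** Hypothetical stand-in for a resource plugin; not the real YARN interface. */
  interface CountingPlugin {
    String resourceName();
    long count();
  }

  /** Counts devices from an allowed-devices value such as "0,1,2,3" or "auto". */
  static CountingPlugin gpuPlugin(String allowedDevices, long autoDiscovered) {
    final long n = "auto".equals(allowedDevices.trim())
        ? autoDiscovered                     // assumed result of auto discovery
        : allowedDevices.split(",").length;  // one entry per listed device
    return new CountingPlugin() {
      @Override public String resourceName() { return "yarn.io/gpu"; }
      @Override public long count() { return n; }
    };
  }

  /** Starts from the file-based values, then lets each plugin overwrite its entry. */
  static Map<String, Long> totalResources(Map<String, Long> fromFile,
      List<CountingPlugin> plugins) {
    Map<String, Long> total = new LinkedHashMap<>(fromFile);
    for (CountingPlugin p : plugins) {
      total.put(p.resourceName(), p.count());
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> fromFile = new LinkedHashMap<>();
    fromFile.put("yarn.io/gpu", 1L);  // value as if loaded from node-resources.xml
    CountingPlugin gpu = gpuPlugin("0,1,2,3", 0);
    System.out.println(totalResources(fromFile, Collections.singletonList(gpu)));
  }
}
```

In this sketch the plugin's count of 4 replaces the file-based value of 1, which is why a value set in node-resources.xml would not matter for a plugin-managed resource.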




[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-23 Thread Jonathan Hung (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216012#comment-16216012
 ] 

Jonathan Hung commented on YARN-6620:
-

Not sure I follow the naming convention of GPU resource. I ran into some issues 
when trying to initialize GPU capability for a nodemanager. It seems 
NodeManagerHardwareUtils#getNodeResources is responsible for getting a node's 
total resources. But when it tries to parse it, 
ResourceUtils#addResourceInformation uses
{noformat}
String[] parts = prop.split("\\.");
LOG.info("Found resource entry " + prop);
if (parts.length == 4) {
  String resourceType = parts[3];
  if (!nodeResources.containsKey(resourceType)) {
    nodeResources
        .put(resourceType, ResourceInformation.newInstance(resourceType));
  }
  String units = getUnits(value);
  Long resourceValue =
      Long.valueOf(value.substring(0, value.length() - units.length()));
  nodeResources.get(resourceType).setValue(resourceValue);
  nodeResources.get(resourceType).setUnits(units);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Setting value for resource type " + resourceType + " to "
        + resourceValue + " with units " + units);
  }
}
{noformat}
for this. But since the resource name for GPU ({{yarn.io/gpu}}) has a "." in 
it, it's not parsing correctly.

The configuration set in {{node-resources.xml}} was 
{{yarn.nodemanager.resource-type.yarn.io/gpu}}.

Perhaps the GPU_URI should be renamed? Or the parsing logic should be changed. 
(Or I have something misconfigured.)
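
To make the failure concrete, here is a small self-contained sketch of what the dot split does to this key, next to one hypothetical alternative that strips a fixed prefix instead. The class and method names are illustrative only, and the prefix-based variant is an assumption for discussion, not necessarily the fix that was adopted.

```java
import java.util.Arrays;

public class ResourceKeyParsing {
  // Prefix taken from the key quoted above; treating it as fixed is an assumption.
  static final String PREFIX = "yarn.nodemanager.resource-type.";

  /** Mirrors the split used in the quoted snippet. */
  static String[] splitOnDots(String prop) {
    return prop.split("\\.");
  }

  /** Hypothetical alternative: everything after the prefix is the resource name. */
  static String resourceTypeByPrefix(String prop) {
    if (!prop.startsWith(PREFIX)) {
      return null;
    }
    return prop.substring(PREFIX.length());
  }

  public static void main(String[] args) {
    String prop = "yarn.nodemanager.resource-type.yarn.io/gpu";
    String[] parts = splitOnDots(prop);
    // Five parts, so the parts.length == 4 check skips the entry.
    System.out.println(parts.length + " parts: " + Arrays.toString(parts));
    System.out.println(resourceTypeByPrefix(prop));
  }
}
```

The split yields {{yarn}}, {{nodemanager}}, {{resource-type}}, {{yarn}} and {{io/gpu}}, so the entry is silently ignored, while the prefix-based variant returns {{yarn.io/gpu}}.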




[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-17 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208861#comment-16208861
 ] 

Zhankun Tang commented on YARN-6620:


[~wangda], thanks for the clarification. 
The code below confused me previously, but it is clear now:

{code:java}
public static final Map MANDATORY_RESOURCES =
  ImmutableMap.of(MEMORY_URI, MEMORY_MB, VCORES_URI, VCORES, GPU_URI, GPUS);
...
private static void checkMandatoryResources(
...
if (!expectedUnit.equals(actualUnit) || !expectedType.equals(
actualType)) {
  ...
}
...
}
{code}

The above code indicates that "yarn.io/gpu" should be defined by the admin in 
resource-types.xml (type name) and node-resources.xml (total count), exactly 
as YARN expects. In addition, the admin-allowed minor device numbers are 
declared in yarn-site.xml. Finally, the major and minor device numbers are 
also declared in the gpu section of container-executor.cfg (by the root user). 

And as we mentioned before, even when using the same "yarn.io/gpu", a different 
vendor's GPU can be handled by node attributes to meet scheduling needs in a 
heterogeneous cluster. More broadly, though, if a vendor's device needs a 
different toolchain for discovery or flashing (in FPGA cases), the current 
single resource handler instance might not be enough to handle all toolchain 
operations.

Anyway, I'm satisfied with the current design and let's evolve it when we get 
more cases.





[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208590#comment-16208590
 ] 

Wangda Tan commented on YARN-6620:
--

[~tangzhankun], 

I may not have made it clear: what I meant is that GPU should be a first-class 
resource rather than a mandatory resource. To me, the only mandatory resources 
for now are memory and vcores; in the future we might add network/disk as 
mandatory resources.

The definition of a mandatory resource: a resource that is required in order 
to run a process.
The definition of a first-class resource: a resource officially supported by 
YARN.

For your questions:

bq. 1. First-class resource should be parsed from resource-types.xml and 
node-resources.xml(or auto discover) instead of yarn-site.xml?
To me, all resources beyond memory/vcores (which are exceptions for historical 
reasons) should be defined in resource-types.xml and node-resources.xml, 
regardless of whether they are mandatory or first-class.

bq. 2. First-class resource handler should register itself with the same 
resource name defined in xml files?
To me this is true when resource isolation on the NM side is required. All 
first-class resources should start with the "yarn.io/" namespace. 

bq. 3. First-class resource should be shown in a separate user-defined column 
in web pages?
I'm not sure about this. In the future we may add more and more first-class / 
mandatory resources, and it might be too much to add a column for every new 
resource. To me the ideal solution is to let the user select and filter 
columns in the web UI (supporting this in the new UI should be a trivial task).
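
The distinction drawn in this comment can be written down as a tiny classifier. This is only a sketch of the stated definitions: the class name, the enum, and the resource names {{memory-mb}} and {{vcores}} are assumptions for illustration, and it follows the comment's view (memory and vcores mandatory, GPU first-class) rather than any particular version of the code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ResourceClassSketch {
  enum Kind { MANDATORY, FIRST_CLASS, OTHER }

  // Assumed names for the two resources described as mandatory in the comment.
  static final Set<String> MANDATORY_NAMES =
      new HashSet<>(Arrays.asList("memory-mb", "vcores"));

  // Namespace that first-class resources are expected to start with.
  static final String NAMESPACE = "yarn.io/";

  static Kind classify(String name) {
    if (MANDATORY_NAMES.contains(name)) {
      return Kind.MANDATORY;    // required to run any process
    }
    if (name.startsWith(NAMESPACE)) {
      return Kind.FIRST_CLASS;  // officially supported, but optional
    }
    return Kind.OTHER;          // user-defined resource type
  }

  public static void main(String[] args) {
    for (String n : new String[] {"memory-mb", "yarn.io/gpu", "my-resource"}) {
      System.out.println(n + " -> " + classify(n));
    }
  }
}
```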




[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-16 Thread Zhankun Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207026#comment-16207026
 ] 

Zhankun Tang commented on YARN-6620:


[~wangda], thanks for the great effort. I'll implement the FPGA resource 
plugin (supporting only OpenCL FPGA applications for now) based on that.

One question: in the discussion above you mentioned that GPU is a mandatory 
resource like CPU and memory, but what is the difference between a mandatory 
and a first-class resource? And is there a list of first-class resources? 
Currently, if I understand correctly:
1. First-class resource should be parsed from resource-types.xml and 
node-resources.xml instead of yarn-site.xml?
2. First-class resource handler should register itself with the same resource 
name defined in xml files?
3. First-class resource should be shown in a separate user-defined column in 
web pages?




[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200864#comment-16200864
 ] 

Wangda Tan commented on YARN-6620:
--

Thanks [~sunilg] for committing the patch, thanks [~devaraj.k]/[~tangzhankun] 
for reviewing the patch and thanks [~hex108] for offline suggestions.




[jira] [Commented] (YARN-6620) Add support in NodeManager to isolate GPU devices by using CGroups

2017-10-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200723#comment-16200723
 ] 

Hudson commented on YARN-6620:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #13073 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13073/])
YARN-6620. Add support in NodeManager to isolate GPU devices by using (sunilg: 
rev fa5cfc68f37c78b6cf26ce13247b9ff34da806cd)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/Context.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/privileged/PrivilegedOperation.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/TestGpuDeviceInformationParser.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/TestResourcePluginManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/ResourcePlugin.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuNodeResourceUpdateHandler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TestResourceHandlerModule.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/PerGpuMemoryUsage.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/PerGpuDeviceInformation.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/BaseAMRMProxyTest.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ResourceInformation.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/ResourceHandlerChain.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/dao/gpu/GpuDeviceInformation.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitorResourceChange.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux