[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-24 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776288#comment-16776288
 ] 

Zhankun Tang commented on YARN-8821:


[~sunilg], [~Weiwei Yang], thanks for the review!

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, 
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. There has been a discussion in YARN-7481, but 
> we'd like to move the related discussion here.
> Please note that YARN-8851 will provide a pluggable device framework 
> which supports plugging in a custom scheduler. Based on that framework, the GPU 
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are pairs of GPUs and whose values are the 
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all 
> combinations of GPUs and the corresponding cost of each combination. The 
> cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose keys 
> are the combinations of that many GPUs and whose values are the calculated 
> communication cost of each combination. The cost of a combination is the sum 
> of the costs of all distinct GPU pairs in it. For instance, the total cost of 
> GPUs [0,1,2] is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", each taken 
> from the map built in step 1.
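The cost-table construction described in steps 1 and 2 can be sketched as below. This is a minimal illustration with hypothetical names (`CostTableSketch`, `pairCost`, `buildCostTable`), not the code in the patch; the pairwise costs would come from parsing "nvidia-smi topo -m", but here they are supplied by the caller:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the cost-table construction in steps 1 and 2.
// Names are illustrative, not the patch's actual code.
public class CostTableSketch {

  // Step 1 result: cost for each unordered GPU pair, keyed "min-max".
  final Map<String, Integer> pairCost;

  CostTableSketch(Map<String, Integer> pairCost) {
    this.pairCost = pairCost;
  }

  static String pairKey(int a, int b) {
    return Math.min(a, b) + "-" + Math.max(a, b);
  }

  // Cost of a combination = sum of the costs of all distinct pairs in it.
  int combinationCost(List<Integer> gpus) {
    int total = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        total += pairCost.get(pairKey(gpus.get(i), gpus.get(j)));
      }
    }
    return total;
  }

  // Step 2: for each GPU count k, map every k-combination to its cost.
  Map<Integer, Map<List<Integer>, Integer>> buildCostTable(int numGpus) {
    Map<Integer, Map<List<Integer>, Integer>> table = new HashMap<>();
    for (int k = 2; k <= numGpus; k++) {
      Map<List<Integer>, Integer> inner = new HashMap<>();
      for (List<Integer> combo : combinations(numGpus, k)) {
        inner.put(combo, combinationCost(combo));
      }
      table.put(k, inner);
    }
    return table;
  }

  static List<List<Integer>> combinations(int n, int k) {
    List<List<Integer>> out = new ArrayList<>();
    collect(0, n, k, new ArrayList<>(), out);
    return out;
  }

  private static void collect(int start, int n, int k,
      List<Integer> cur, List<List<Integer>> out) {
    if (cur.size() == k) {
      out.add(new ArrayList<>(cur));
      return;
    }
    for (int i = start; i < n; i++) {
      cur.add(i);
      collect(i + 1, n, k, cur, out);
      cur.remove(cur.size() - 1);
    }
  }
}
```

With pair costs \{"0-1"=>2, "0-2"=>4, "1-2"=>4}, `combinationCost([0,1,2])` sums the three pairs to 10, matching the example in the text.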
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication; "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then don't share the same bus to 
> the CPU). The key difference between the two policies is the sort order of 
> the inner map in the cost table. For instance, assume 2 GPUs are wanted. 
> costTable.get(2) returns a map containing all combinations of two GPUs and 
> their costs. If the policy is "PACK", we sort the map by cost in ascending 
> order; the first entry is then the pair of GPUs with the minimum GPU-GPU 
> cost. If the policy is "SPREAD", we sort in descending order and take the 
> first entry, which has the highest GPU-GPU cost and therefore the lowest 
> CPU-GPU cost.
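The PACK/SPREAD choice in step 3 amounts to sorting the inner map's entries by cost and reading from one end or the other. A rough sketch with hypothetical names (not the patch code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of step 3: pick a GPU combination from the inner cost map
// according to the NVIDIA_TOPO_POLICY value. Names are illustrative.
public class TopoPolicySketch {

  static List<Integer> pick(Map<List<Integer>, Integer> combosWithCost,
      String policy) {
    List<Map.Entry<List<Integer>, Integer>> entries =
        new ArrayList<>(combosWithCost.entrySet());
    // Ascending by cost: front = fastest GPU-GPU communication.
    entries.sort(Map.Entry.comparingByValue());
    if ("SPREAD".equals(policy)) {
      // Highest GPU-GPU cost first, i.e. GPUs least likely to share a bus.
      Collections.reverse(entries);
    }
    return entries.get(0).getKey();
  }
}
```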
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination yields a *5% to 185% performance gain* among the test cases with 
> various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should track this closely.
> 2. The "inception3" and "resnet50" networks seem not to be topology 
> sensitive; topology scheduling can only potentially get *about 6.8% to 10%* 
> speedup in the best cases.
> 3. Our current version of the topology scheduling algorithm achieves a *6.8% 
> to 177.1% performance gain in the best cases. On average, it also outperforms 
> the median performance (by 0.8% to 28.2%).*
> *4. The algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That is up to roughly 3X compared to a random GPU scheduling algorithm in a 
> specific scenario*.
>  
> The spreadsheets are here for 

[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776178#comment-16776178
 ] 

Hudson commented on YARN-8821:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #16045 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16045/])
YARN-8821. [YARN-8851] GPU hierarchy/topology scheduling support based (sunilg: 
rev dddcfa4d9f07aa96692a12afd3003ae89ac40b17)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/com/nvidia/NvidiaGPUPluginForRuntimeV2.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DevicePluginScheduler.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/tensorflow-bench-result-for-GPU.csv
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/deviceframework/DeviceMappingManager.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/com/nvidia/TestNvidiaGPUPluginForRuntimeV2.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/deviceframework/FakeTestDevicePlugin1.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/deviceframework/TestDevicePluginAdapter.java
* (delete) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/nvidia/com/TestNvidiaGpuPlugin.java



[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-20 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772807#comment-16772807
 ] 

Weiwei Yang commented on YARN-8821:
---

Thanks [~tangzhankun], LGTM. +1 to v10 patch.

I will commit this patch tomorrow if there are no further comments from others.


[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772672#comment-16772672
 ] 

Hadoop QA commented on YARN-8821:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 34s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 43s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 36s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959367/YARN-8821-trunk.010.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 29f8dd684c95 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/23448/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23448/testReport/ |
| Max. process+thread count | 308 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772640#comment-16772640
 ] 

Zhankun Tang commented on YARN-8821:


[~cheersyang], thanks for the review!
{quote}1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}}

IIRC, lines 396 and 402 sort all combinations for a given count of devices 
every time. Why not maintain an ordered list of these combinations in the 
map, so it only needs to be sorted once (when the cost table is initiated)?
{quote}

Zhankun=> Good point! Yeah, I changed the value of costTable to a list of map 
entries. When constructing the costTable, each list is sorted by cost value in 
ascending order. When doing topology scheduling, we use an iterator of the 
list: for the PACK policy we just loop with the iterator, while for the SPREAD 
policy we use a descending iterator instead.

Comments 2, 3, 4 and 5 are fixed.
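The change described above (sort once at cost-table construction, then only choose an iteration direction per policy) could look roughly like this; the names are hypothetical, not the patch code:

```java
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Sketch of keeping each cost-table value as a list pre-sorted by cost
// (ascending) and choosing the iteration direction per policy, so sorting
// happens only once, at cost-table construction time. Illustrative names.
public class SortedCostListSketch {

  static Iterator<Map.Entry<List<Integer>, Integer>> policyIterator(
      LinkedList<Map.Entry<List<Integer>, Integer>> sortedAscending,
      boolean spread) {
    // PACK walks in ascending cost order; SPREAD walks the same list backwards.
    return spread ? sortedAscending.descendingIterator()
                  : sortedAscending.iterator();
  }

  static Map.Entry<List<Integer>, Integer> entry(List<Integer> gpus, int cost) {
    return new AbstractMap.SimpleEntry<>(gpus, cost);
  }
}
```

`LinkedList#descendingIterator` gives the reverse traversal without re-sorting or copying the list.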


[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771993#comment-16771993
 ] 

Weiwei Yang commented on YARN-8821:
---

Thanks for working on this [~tangzhankun], it looks really good.

For the v9 patch, I think it's almost there; just some minor comments:

1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}}

IIRC, lines 396 and 402 sort all combinations for a given count of devices 
every time. Why not maintain an ordered list of these combinations in the 
map, so it only needs to be sorted once (when the cost table is initiated)?

2. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}}
{code:java}
topologyAwareSchedule(allocation, count, envs, availableDevices, 
this.costTable);
if (allocation.size() != count) {
  LOG.error("Failed to do topology scheduling. Skip to use basic " + 
"scheduling");
}
return allocation;
{code}
this seems to return the allocation result from {{topologyAwareSchedule}} 
instead of falling back to basic scheduling when it fails.
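One possible shape of the fallback this comment asks for: clear the partial result and rerun a basic first-fit pass. Both passes below are simplified stubs; the method names mirror the review discussion but are otherwise hypothetical, not the patch's code:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Sketch of falling back to basic scheduling when the topology-aware pass
// does not produce a full allocation. Stub implementations only.
public class FallbackSketch {

  // Stub topology-aware pass: pretend it only handles counts up to 2.
  static void topologyAwareSchedule(Set<Integer> alloc, int count,
      Set<Integer> available) {
    if (count <= 2) {
      basicSchedule(alloc, count, available);
    }
  }

  // Basic pass: first-fit over the available devices.
  static void basicSchedule(Set<Integer> alloc, int count,
      Set<Integer> available) {
    Iterator<Integer> it = available.iterator();
    while (alloc.size() < count && it.hasNext()) {
      alloc.add(it.next());
    }
  }

  static Set<Integer> allocateDevices(int count, Set<Integer> available) {
    Set<Integer> allocation = new HashSet<>();
    topologyAwareSchedule(allocation, count, available);
    if (allocation.size() != count) {
      // Fall back instead of returning a partial or empty result.
      allocation.clear();
      basicSchedule(allocation, count, available);
    }
    return allocation;
  }
}
```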

3. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}}

 line 249, this logging hides the actual error and stack trace; can we change 
it to LOG.error("", e)? The same comment applies to line 268.

4. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}}

lines 226-235: the 2nd if can be removed and merged into the 1st one.

5. I am wondering if it makes sense to add debug logging to print the cost 
table, as that is the most important data for scheduling; we might need it 
while debugging issues.

Thanks


[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771683#comment-16771683
 ] 

Zhankun Tang commented on YARN-8821:


The unit test failure seems unrelated to this patch.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481. But 
> we'd like to move related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugin custom scheduler. Based on the framework, GPU plugin 
> could have own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose key is all pairs of GPUs and the value is the 
> communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...} which means the minimum cost of GPU 0 to 1 is 2. The cost is set 
> based on the connection type.
> *Step 2*. And then it constructs a _+cost table+_ which caches all 
> combinations of GPUs and corresponding cost between them and cache it. The 
> cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the map is the count of GPUs; the value is a map whose key is a 
> combination of GPUs and whose value is the calculated communication cost of 
> that combination. The cost calculation algorithm sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] 
> is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", where each pairwise 
> cost comes from the map built in step 1.
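The step-2 cost calculation can be sketched as follows; class and method names are assumptions for illustration:

```java
import java.util.List;
import java.util.Map;

public class ComboCost {
  // Total cost of a GPU combination: sum the pairwise costs over all
  // distinct pairs, looked up in the step-1 map. Keys follow the
  // "i - j" convention with i < j, matching the step-1 sketch.
  static int totalCost(List<Integer> gpus, Map<String, Integer> pairCost) {
    int sum = 0;
    for (int a = 0; a < gpus.size(); a++) {
      for (int b = a + 1; b < gpus.size(); b++) {
        sum += pairCost.get(gpus.get(a) + " - " + gpus.get(b));
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Integer> pairCost =
        Map.of("0 - 1", 2, "0 - 2", 4, "1 - 2", 4);
    // Cost of [0,1,2] = ("0 - 1") + ("0 - 2") + ("1 - 2") = 2 + 4 + 4.
    System.out.println(totalCost(List.of(0, 1, 2), pairCost)); // prints 10
  }
}
```

Enumerating every combination for each GPU count (to fill the cost table) is a straightforward loop over subsets and is omitted here.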
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies, which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" 
> or "SPREAD". "PACK" prefers faster GPU-GPU communication; "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then do not share the same bus 
> to the CPU). The key difference between the two policies is the sort order 
> of the inner map in the cost table. For instance, assume 2 GPUs are wanted. 
> costTable.get(2) returns a map containing all combinations of two GPUs and 
> their costs. If the policy is "PACK", we sort the map by cost in ascending 
> order, so the first entry is the GPU combination with the minimum GPU-GPU 
> cost. If the policy is "SPREAD", we sort it in descending order and take 
> the first entry, which has the highest GPU-GPU cost and hence the lowest 
> CPU-GPU cost.
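The step-3 policy selection can be sketched as below; the method and variable names are assumptions, not the patch's actual API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopoPolicy {
  // Pick a GPU combination from one inner map of the cost table
  // (combination -> total cost). PACK sorts ascending and takes the
  // cheapest combination; SPREAD sorts descending and takes the most
  // expensive, i.e. the combination with the lowest CPU-GPU cost.
  static List<Integer> pick(Map<List<Integer>, Integer> combos,
                            String policy) {
    Comparator<Map.Entry<List<Integer>, Integer>> byCost =
        Map.Entry.comparingByValue();
    if ("SPREAD".equals(policy)) {
      byCost = byCost.reversed();   // highest GPU-GPU cost first
    }
    return combos.entrySet().stream()
        .sorted(byCost)
        .findFirst()
        .map(Map.Entry::getKey)
        .orElse(new ArrayList<>());
  }

  public static void main(String[] args) {
    // Hypothetical costTable.get(2) contents for a 3-GPU node.
    Map<List<Integer>, Integer> combos =
        Map.of(List.of(0, 1), 2, List.of(0, 2), 6, List.of(1, 2), 4);
    System.out.println(pick(combos, "PACK"));   // prints [0, 1]
    System.out.println(pick(combos, "SPREAD")); // prints [0, 2]
  }
}
```

In a real allocator the candidate combinations would first be filtered to the GPUs currently free on the node; that filtering is omitted here.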
> h2. Estimation of the algorithm
> We did an initial analysis of the topology scheduling algorithm (using the 
> PACK policy) based on performance tests on an AWS EC2 instance with 8 GPU 
> cards (P3). The figure below shows the performance gain of the topology 
> scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best 
> combination of GPUs can get a *5% to 185%* *performance gain* among the 
> test cases with various factors including CNN model, batch size, GPU 
> subset, etc. The scheduling algorithm should exploit this fact.
> 2. The "inception3" and "resnet50" networks do not seem topology sensitive. 
> Topology scheduling can only potentially get *about 6.8% to 10%* speedup in 
> the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain* in the best cases. On average, it also 
> outperforms the median performance (*0.8% to 28.2%*).
> 4. *The algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *This means roughly a maximum of 3X compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> 

[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-18 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771656#comment-16771656
 ] 

Hadoop QA commented on YARN-8821:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 13s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 57s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 34s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 37s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959202/YARN-8821-trunk.009.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux e51ea419e435 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 67af509 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/23444/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23444/testReport/ |
| Max. process+thread count | 304 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: