[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

Hadoop QA (JIRA) Tue, 19 Feb 2019 22:43:27 -0800


    [ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772672#comment-16772672 ]


Hadoop QA commented on YARN-8821:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 34s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 36s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 42s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959367/YARN-8821-trunk.010.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 29f8dd684c95 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23448/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23448/testReport/ |
| Max. process+thread count | 308 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23448/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, 
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. This was discussed in YARN-7481, but 
> we'd like to move the related discussion here.
> Please note that YARN-8851 will provide a pluggable device framework 
> that lets a plugin supply its own custom scheduler. Based on that framework, 
> the GPU plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as follows:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are the pairs of GPUs and whose values are the 
> communication cost between the two GPUs of each pair. The map looks like 
> \{"0 - 1"=> 2, "0 - 2"=>4, ...}, which means the minimum cost between GPU 0 
> and GPU 1 is 2. The cost is set based on the connection type.
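Step 1 can be sketched in Java roughly as follows. This is a minimal illustration, not the patch's code: the class name `TopoPairCosts`, the method names, and the per-link cost weights are all assumptions made for the example, and the parser expects a plain-text matrix whose header row starts with the GPU columns (real "nvidia-smi topo -m" output may carry extra columns, a legend, or formatting that would need stripping first).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopoPairCosts {
  // Hypothetical weights (lower = faster link); the patch may use other values.
  static int costOf(String link) {
    if (link.startsWith("NV")) {
      return 1; // NVLink entries such as NV1, NV2
    }
    switch (link) {
      case "PIX":  return 2;
      case "PXB":  return 4;
      case "PHB":  return 8;
      case "NODE": return 16;
      case "SYS":  return 32;
      default:
        throw new IllegalArgumentException("Unknown link type: " + link);
    }
  }

  // lines.get(0) is the header row; the next n lines are the GPU rows.
  static Map<String, Integer> parse(List<String> lines) {
    String[] header = lines.get(0).trim().split("\\s+");
    int n = 0;
    while (n < header.length && header[n].matches("GPU\\d+")) {
      n++; // count the leading GPU columns only
    }
    Map<String, Integer> costs = new LinkedHashMap<>();
    for (int i = 0; i < n; i++) {
      // row[0] is the row label ("GPUi"); row[j + 1] is the link to GPU j
      String[] row = lines.get(i + 1).trim().split("\\s+");
      for (int j = i + 1; j < n; j++) {
        costs.put(i + " - " + j, costOf(row[j + 1]));
      }
    }
    return costs;
  }
}
```

Only the upper triangle of the matrix is read, so each unordered pair appears once, keyed in the "i - j" form used above.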
> *Step 2*. It then constructs and caches a _+cost table+_ that holds all 
> combinations of GPUs and the corresponding cost of each combination. The 
> cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the number of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication cost 
> of that combination. The cost of a combination is the sum of the costs of all 
> distinct GPU pairs in it. For instance, the total cost of GPUs [0,1,2] 
> is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2". Each pair cost is 
> looked up in the map built in step 1.
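The cost table construction described in step 2 might look like the sketch below. The names `GpuCostTable`, `build` and `totalCost` are hypothetical, and the sketch assumes a pair-cost map keyed as "i - j" with i < j, as in step 1; it enumerates every subset, which is only practical for small GPU counts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GpuCostTable {
  // Outer key: number of GPUs; inner key: GPU combination; value: total cost.
  static Map<Integer, Map<List<Integer>, Integer>> build(
      int gpuCount, Map<String, Integer> pairCosts) {
    Map<Integer, Map<List<Integer>, Integer>> table = new LinkedHashMap<>();
    for (int size = 2; size <= gpuCount; size++) {
      Map<List<Integer>, Integer> bySubset = new LinkedHashMap<>();
      collect(0, gpuCount, size, new ArrayList<>(), pairCosts, bySubset);
      table.put(size, bySubset);
    }
    return table;
  }

  // Recursively enumerates ascending GPU index combinations of the given size.
  private static void collect(int start, int gpuCount, int size,
      List<Integer> current, Map<String, Integer> pairCosts,
      Map<List<Integer>, Integer> out) {
    if (current.size() == size) {
      out.put(new ArrayList<>(current), totalCost(current, pairCosts));
      return;
    }
    for (int g = start; g < gpuCount; g++) {
      current.add(g);
      collect(g + 1, gpuCount, size, current, pairCosts, out);
      current.remove(current.size() - 1); // backtrack
    }
  }

  // Sums the cost of every distinct pair inside the combination.
  static int totalCost(List<Integer> gpus, Map<String, Integer> pairCosts) {
    int sum = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        sum += pairCosts.get(gpus.get(i) + " - " + gpus.get(j));
      }
    }
    return sum;
  }
}
```

With pair costs "0 - 1"=>2, "0 - 2"=>4 and an assumed "1 - 2"=>4, the combination [0,1,2] gets cost 2 + 4 + 4 = 10, which matches the shape of the example table.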
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies, which a container can select through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers 
> faster CPU-GPU communication (since the chosen GPUs do not share the same 
> bus to the CPU). The key difference between the two policies is the 
> sort order of the inner map in the cost table. For instance, assume 2 
> GPUs are requested. costTable.get(2) returns a map containing all 
> combinations of two GPUs and their costs. If the policy is "PACK", we sort 
> the map by cost in ascending order, so the first entry is the GPU pair with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending 
> order and take the first entry, which has the highest GPU-GPU cost and 
> therefore the lowest CPU-GPU cost.
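The policy choice in step 3 comes down to the sort direction, which could be sketched as below. `TopoPolicy` and `pick` are hypothetical names, and the policy string is passed in directly rather than read from the container environment; the input is the inner map of the cost table for the requested GPU count.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class TopoPolicy {
  // Returns the first combination after sorting the candidates by cost:
  // ascending for PACK, descending for SPREAD.
  static List<Integer> pick(Map<List<Integer>, Integer> candidates,
      String policy) {
    Comparator<Map.Entry<List<Integer>, Integer>> byCost =
        Map.Entry.comparingByValue();
    if ("SPREAD".equals(policy)) {
      byCost = byCost.reversed(); // highest GPU-GPU cost first
    } else if (!"PACK".equals(policy)) {
      throw new IllegalArgumentException("Unknown policy: " + policy);
    }
    return candidates.entrySet().stream()
        .sorted(byCost)
        .map(Map.Entry::getKey)
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("No candidates"));
  }
}
```

A real allocator would also have to skip combinations that contain GPUs already in use; that filtering is left out of this sketch.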
> h2. Evaluation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy) has been done, based on performance tests on an AWS EC2 instance 
> with 8 GPU cards (P3). The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best 
> combination of GPUs can get a *5% to 185%* *performance gain* across the test 
> cases, with various factors including CNN model, batch size, GPU subset, etc. 
> The scheduling algorithm should get close to this.
> 2. The "inception3" and "resnet50" networks do not seem to be topology 
> sensitive. Topology scheduling can only potentially get *about 6.8% to 10%* 
> speedup in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain* in the best cases. On average, it also 
> outperforms the median performance (by 0.8% to 28.2%).
> 4. The algorithm's allocations match the fastest GPU combinations best for 
> "vgg16".
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *This means up to about 3X compared to a random GPU scheduling algorithm in 
> a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
