[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772672#comment-16772672 ]
Hadoop QA commented on YARN-8821:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 34s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 36s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 42s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959367/YARN-8821-trunk.010.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 29f8dd684c95 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23448/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23448/testReport/ |
| Max. process+thread count | 308 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23448/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
> -----------------------------------------------------------------------------------------
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch,
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch,
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch,
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch,
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch,
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. This was discussed in YARN-7481, but
> we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework
> that can support a plugin's custom scheduler. Based on that framework, the GPU
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements the topology algorithm as follows:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m"
> to build a hash map whose keys are all pairs of GPUs and whose values are the
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 -
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ that caches every
> combination of GPUs together with its communication cost. The
> cost table is a map structured like
> {code:java}
> { 2=>{[0,1]=>2,..},
> 3=>{[0,1,2]=>10,..},
> 4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the GPU count; its value is a map whose key
> is a combination of that many GPUs and whose value is the calculated
> communication cost of that combination. The cost of a combination is the sum
> of the costs of all its distinct GPU pairs. For instance, the total cost of
> GPUs [0,1,2] is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2", each of
> which can be looked up in the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on
> topology, we provide two policies, which a container can set through the
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or
> "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers
> faster CPU-GPU communication (since the GPUs are not using the same bus to
> the CPU). The key difference between the two policies is the sort order of
> the inner map in the cost table. For instance, assume 2 GPUs are wanted.
> costTable.get(2) returns a map containing all combinations of two GPUs and
> their costs. If the policy is "PACK", we sort the map by cost in ascending
> order; the first entry is the GPU combination with the minimum GPU-GPU cost.
> If the policy is "SPREAD", we sort it in descending order and take the first
> entry, which has the highest GPU-GPU cost and therefore the lowest CPU-GPU
> cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK
> policy) has been done, based on performance tests on an AWS EC2 instance with
> 8 GPU cards (P3).
> The figure below shows the performance gain of the topology scheduling
> algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best
> GPU combination can get a *5% to 185%* *performance gain* across the test
> cases, which vary factors including CNN model, batch size, GPU subset, etc.
> The scheduling algorithm should come close to this best case.
> 2. The "inception3" and "resnet50" networks do not seem topology sensitive.
> Topology scheduling can potentially get only *about 6.8% to 10%* speedup in
> the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a
> *6.8% to 177.1%* *performance gain* in the best cases. *On average, it also
> outperforms the median performance (0.8% to 28.2%).*
> *4. The algorithm's allocations best match the fastest GPUs needed by
> "vgg16"*.
>
> In summary, the GPU topology scheduling algorithm is effective and can
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to
> 30% on average.
> *This means at most about 3X compared with a random GPU scheduling algorithm
> in a specific scenario*.
>
> The spreadsheets are here for your reference.
>
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
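
For readers following the quoted description, steps 2 and 3 can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the attached patches: the class name `GpuTopologySketch`, its method names, and the pair costs in `examplePairCosts()` are all hypothetical (the costs were chosen so that [0,1,2] sums to 10 and [0,1,2,3] to 18, as in the description's example). Step 1, parsing "nvidia-smi topo -m", is left out and the pair-cost map is taken as given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the cost table (step 2) and policy-based pick (step 3).
class GpuTopologySketch {

    // Key format mirrors the "0 - 1" notation used in the issue description.
    static String pairKey(int a, int b) {
        return Math.min(a, b) + " - " + Math.max(a, b);
    }

    // Made-up pair costs for 4 GPUs; a real map would come from step 1.
    static Map<String, Integer> examplePairCosts() {
        Map<String, Integer> pairCost = new HashMap<>();
        pairCost.put("0 - 1", 2);
        pairCost.put("0 - 2", 4);
        pairCost.put("1 - 2", 4);
        pairCost.put("0 - 3", 5);
        pairCost.put("1 - 3", 2);
        pairCost.put("2 - 3", 1);
        return pairCost;
    }

    // Cost of a combination: sum of the costs of all its distinct GPU pairs.
    // Assumes every pair is present in pairCost.
    static int combinationCost(List<Integer> gpus, Map<String, Integer> pairCost) {
        int total = 0;
        for (int i = 0; i < gpus.size(); i++) {
            for (int j = i + 1; j < gpus.size(); j++) {
                total += pairCost.get(pairKey(gpus.get(i), gpus.get(j)));
            }
        }
        return total;
    }

    // Cost table: GPU count -> (combination -> cost), for counts 2..gpuCount.
    // Subsets are enumerated by bitmask, which is fine for a handful of GPUs.
    static Map<Integer, Map<List<Integer>, Integer>> buildCostTable(
            int gpuCount, Map<String, Integer> pairCost) {
        Map<Integer, Map<List<Integer>, Integer>> table = new HashMap<>();
        for (int mask = 1; mask < (1 << gpuCount); mask++) {
            List<Integer> combo = new ArrayList<>();
            for (int g = 0; g < gpuCount; g++) {
                if ((mask & (1 << g)) != 0) {
                    combo.add(g);
                }
            }
            if (combo.size() < 2) {
                continue; // a single GPU has no pair cost
            }
            table.computeIfAbsent(combo.size(), k -> new LinkedHashMap<>())
                 .put(combo, combinationCost(combo, pairCost));
        }
        return table;
    }

    // PACK sorts ascending by cost, SPREAD descending; the first entry wins.
    static List<Integer> pick(Map<Integer, Map<List<Integer>, Integer>> table,
                              int wanted, String policy) {
        List<Map.Entry<List<Integer>, Integer>> entries =
                new ArrayList<>(table.get(wanted).entrySet());
        entries.sort(Map.Entry.comparingByValue());
        if ("SPREAD".equals(policy)) {
            entries.sort(Map.Entry.<List<Integer>, Integer>comparingByValue().reversed());
        }
        return entries.get(0).getKey();
    }

    public static void main(String[] args) {
        Map<Integer, Map<List<Integer>, Integer>> table =
                buildCostTable(4, examplePairCosts());
        System.out.println("PACK, 2 GPUs:   " + pick(table, 2, "PACK"));
        System.out.println("SPREAD, 2 GPUs: " + pick(table, 2, "SPREAD"));
        System.out.println("Cost of all 4:  " + table.get(4).get(Arrays.asList(0, 1, 2, 3)));
    }
}
```

With these made-up costs, PACK for two GPUs picks the cheapest pair and SPREAD the most expensive one. When several combinations tie on cost, the sketch keeps whichever was enumerated first; how the actual patch breaks ties is not stated in the description.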