[
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771572#comment-16771572
]
Hadoop QA commented on YARN-8821:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} YARN-8821 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8821 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23441/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
> GPU hierarchy/topology scheduling support
> -----------------------------------------
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch,
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch,
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch,
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch
>
>
> h2. Background
> GPU topology affects performance. There has been a discussion in YARN-7481,
> but we'd like to move the related discussions here.
> Please also note that YARN-8851 will provide a pluggable device framework
> which can support a plugin's custom scheduler. Based on that framework, the
> GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements the topology algorithm as follows:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m"
> to build a hash map whose keys are pairs of GPUs and whose values are the
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 -
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The
> cost is set based on the connection type.
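> To make step 1 concrete, here is a minimal sketch of building the pair-cost
> map in Java. The class/method names and the per-connection-type cost values
> are illustrative assumptions, not necessarily what the attached patch uses:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class GpuPairCostBuilder {
>
>   // Hypothetical cost per "nvidia-smi topo -m" connection type
>   // (lower means a closer / faster GPU-GPU link).
>   private static int costOf(String connType) {
>     switch (connType) {
>       case "NV1": case "NV2": return 2;   // NVLink
>       case "PIX": return 4;               // same PCIe switch
>       case "PXB": return 6;               // multiple PCIe switches
>       case "PHB": return 8;               // PCIe host bridge
>       default: return 10;                 // cross CPU socket (SYS/SOC)
>     }
>   }
>
>   /**
>    * Builds a map like {"0 - 1" -> 2, "0 - 2" -> 4, ...} from the GPU-to-GPU
>    * connection matrix reported by "nvidia-smi topo -m".
>    * topoMatrix[i][j] is the connection type between GPU i and GPU j.
>    */
>   public static Map<String, Integer> buildPairCosts(String[][] topoMatrix) {
>     Map<String, Integer> pairCost = new HashMap<>();
>     for (int i = 0; i < topoMatrix.length; i++) {
>       for (int j = i + 1; j < topoMatrix.length; j++) {
>         pairCost.put(i + " - " + j, costOf(topoMatrix[i][j]));
>       }
>     }
>     return pairCost;
>   }
> }
> {code}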
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations
> of GPUs and the corresponding cost of each combination. The cost table is a
> map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
> 3=>{[0,1,2]=>10,..},
> 4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is an inner map
> whose key is a combination of GPUs and whose value is the calculated
> communication cost of that combination. The cost is calculated by summing
> the costs of all distinct GPU pairs in the combination. For instance, the
> total cost of GPUs [0,1,2] is the sum of the costs "0 - 1", "0 - 2" and
> "1 - 2", each of which is looked up in the map built in step 1.
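> As an illustration of step 2, the following sketch enumerates all GPU
> combinations and sums the distinct pair costs from step 1. The names are
> hypothetical and the actual patch may build the table differently:
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> public class GpuCostTableBuilder {
>
>   /** Sums the costs of all distinct GPU pairs inside one combination. */
>   static int combinationCost(List<Integer> gpus, Map<String, Integer> pairCost) {
>     int total = 0;
>     for (int a = 0; a < gpus.size(); a++) {
>       for (int b = a + 1; b < gpus.size(); b++) {
>         total += pairCost.get(gpus.get(a) + " - " + gpus.get(b));
>       }
>     }
>     return total;
>   }
>
>   /** Builds {2 -> {[0,1] -> cost, ...}, 3 -> {...}, ...} for all subsets. */
>   static Map<Integer, Map<List<Integer>, Integer>> buildCostTable(
>       int gpuCount, Map<String, Integer> pairCost) {
>     Map<Integer, Map<List<Integer>, Integer>> costTable = new HashMap<>();
>     for (int k = 2; k <= gpuCount; k++) {
>       Map<List<Integer>, Integer> combos = new HashMap<>();
>       enumerate(gpuCount, k, 0, new ArrayList<>(), combos, pairCost);
>       costTable.put(k, combos);
>     }
>     return costTable;
>   }
>
>   // Recursively enumerates all ascending k-sized GPU index combinations.
>   private static void enumerate(int n, int k, int start, List<Integer> current,
>       Map<List<Integer>, Integer> out, Map<String, Integer> pairCost) {
>     if (current.size() == k) {
>       out.put(new ArrayList<>(current), combinationCost(current, pairCost));
>       return;
>     }
>     for (int g = start; g < n; g++) {
>       current.add(g);
>       enumerate(n, k, g + 1, current, out, pairCost);
>       current.remove(current.size() - 1);
>     }
>   }
> }
> {code}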
> *Step 3*. After the cost table is built, when allocating GPUs based on
> topology, we provide two policies which a container can set through the
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication; "SPREAD"
> means it prefers faster CPU-GPU communication (since the GPUs then do not
> share the same bus to the CPU). The key difference between the two policies
> is the sort order of the inner map in the cost table. For instance, let's
> assume 2 GPUs are wanted. costTable.get(2) returns a map containing all
> combinations of two GPUs and their costs. If the policy is "PACK", we sort
> the map by cost in ascending order, so the first entry is the GPU
> combination with the minimum GPU-GPU cost. If the policy is "SPREAD", we
> sort it in descending order and take the first entry, which has the highest
> GPU-GPU cost and therefore the lowest CPU-GPU cost.
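> A rough sketch of the step-3 policy selection is below, assuming the cost
> table from step 2. Only the "NVIDIA_TOPO_POLICY" variable name comes from
> the description above; the class and method names are made up:
> {code:java}
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
>
> public class GpuTopologyPolicy {
>
>   /**
>    * Picks a GPU combination for the requested count: lowest total GPU-GPU
>    * cost for PACK, highest (most spread across buses) for SPREAD.
>    */
>   static List<Integer> allocate(int requestedGpus,
>       Map<Integer, Map<List<Integer>, Integer>> costTable, String policy) {
>     Map<List<Integer>, Integer> candidates = costTable.get(requestedGpus);
>     Comparator<Map.Entry<List<Integer>, Integer>> byCost =
>         Map.Entry.comparingByValue();
>     if ("SPREAD".equals(policy)) {
>       byCost = byCost.reversed();   // prefer the highest GPU-GPU cost
>     }
>     return candidates.entrySet().stream()
>         .min(byCost)                // first entry of the policy-specific order
>         .map(Map.Entry::getKey)
>         .orElseThrow(() -> new IllegalArgumentException(
>             "No combination of " + requestedGpus + " GPUs in the cost table"));
>   }
>
>   public static void main(String[] args) {
>     // The policy would normally be read from the container's environment.
>     String policy = System.getenv().getOrDefault("NVIDIA_TOPO_POLICY", "PACK");
>     System.out.println("Using policy: " + policy);
>   }
> }
> {code}
> In other words, the two policies only differ in the direction of the
> comparator applied to the same cost table.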
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards
> (P3), has been done.
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU
> combination can get a *5% to 185% performance gain* across the test cases,
> which vary in CNN model, batch size, GPU subset, etc.
> 2. The "inception3" and "resnet50" networks do not seem topology-sensitive.
> Topology scheduling can only potentially get *about 10%* speedup for them.
> 3. Our current version of the topology scheduling algorithm can achieve a
> *3% to 140% performance gain, and the algorithm's allocations match the
> fastest GPUs needed by "vgg16"*.
>
> In summary, the GPU topology scheduling algorithm is effective and can
> potentially get a 5% to 185% performance gain after more optimization.
> *That means roughly up to 3X compared with a random GPU scheduling algorithm
> in a specific scenario*.
>
> The spreadsheets are here for your reference.
>
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]