[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
-------------------------------
    Attachment: YARN-8821-trunk.003.patch

> GPU hierarchy/topology scheduling support
> -----------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8821-trunk.001.patch, YARN-8821-trunk.002.patch, 
> YARN-8821-trunk.003.patch
>
>
> GPU topology affects performance dramatically. There's been a discussion in 
> YARN-7481. But we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> can support plugin custom scheduler. And based on the framework, GPU plugin 
> could have own topology scheduler. The proposed patch has a topology 
> algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose key is all pairs of GPUs and the value is the 
> communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...} which means the minimum cost of GPU 0 to 1 is 2. The cost is set 
> based on the connection type.
> *Step 2*. And then it constructs a _+cost table+_ which caches all 
> combinations of GPUs and corresponding cost between them and cache it. The 
> cost table is a map whose structure is like
>  
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the map is the count of GPUs, the value of it is a map whose key 
> is the combination of GPUs and the value is the calculated communication cost 
> of the numbers of GPUs. The cost calculation algorithm is to sum all 
> non-duplicate pairs of GPU's cost. For instance, the total cost of [0,1,2] 
> GPUs are the sum of cost "0 - 1", "0 - 2" and "1 - 2". And each cost can get 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policy which container can set through an 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". The "PACK" means it prefers faster GPU-GPU communication. The 
> "SPREAD" means it prefers faster CPU-GPU communication( since GPUs are not 
> using the same bus to CPU). And the key difference of the two policy is the 
> sort order of the inner map in the cost table. For instance, let's assume 2 
> GPUs is wanted. The costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we'll sort 
> the map by cost in ascending order. The first entry will be the GPUs has 
> minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending 
> order and get the first one which is the highest GPU-GPU cost which means 
> lowest CPU-GPU costs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to