[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
-------------------------------
    Description: 
GPU topology affects performance dramatically. There has been a discussion in YARN-7481, but we'd like to move the related discussions here.

Please note that YARN-8851 will provide a pluggable device framework which can support plugin-defined custom schedulers. Based on that framework, the GPU plugin could have its own topology scheduler. The proposed patch implements the topology algorithm described below:
 *Step 1*. When the plugin initializes, it parses the output of "nvidia-smi topo -m" to build a hash map whose keys are GPU pairs and whose values are the communication cost between the two GPUs in the pair. The map looks like \{"0 - 1"=>2, "0 - 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The cost is set based on the connection type. CPU affinity and NUMA nodes are not considered yet.
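
Below is a rough, hypothetical sketch of that parsing step (not the actual patch code): the class name, the gpuCount parameter and the per-link cost weights are assumptions for illustration only.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: turn the matrix printed by "nvidia-smi topo -m"
// into a pair-cost map keyed like "0 - 1". Cost weights are assumptions.
public class GpuPairCostParser {

  // Assumed weights: a lower cost means a faster GPU-to-GPU link.
  private static int linkCost(String linkType) {
    if (linkType.startsWith("NV")) {
      return 2;                    // NVLink
    }
    switch (linkType) {
      case "PIX": return 4;        // same PCIe switch
      case "PXB": return 6;        // multiple PCIe switches
      case "PHB": return 8;        // PCIe host bridge, same NUMA node
      default:    return 10;       // SYS/SOC, across CPU sockets
    }
  }

  public static Map<String, Integer> parse(String topoOutput, int gpuCount) {
    Map<String, Integer> pairCost = new HashMap<>();
    for (String line : topoOutput.split("\n")) {
      String[] cols = line.trim().split("\\s+");
      if (!cols[0].matches("GPU\\d+")) {
        continue;                  // skip legend and non-GPU rows
      }
      int row = Integer.parseInt(cols[0].substring(3));
      if (cols.length <= gpuCount || !"X".equals(cols[row + 1])) {
        continue;                  // not a matrix data row (e.g. the header)
      }
      // cols[col + 1] holds the link type between GPU 'row' and GPU 'col'.
      for (int col = row + 1; col < gpuCount; col++) {
        pairCost.put(row + " - " + col, linkCost(cols[col + 1]));
      }
    }
    return pairCost;
  }
}
{code}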

*Step 2*. It then constructs a _+cost table+_ which caches every combination of GPUs together with the total communication cost of that combination. The cost table is a map whose structure looks like:

 
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}
{code}
The key of the outer map is the GPU count; its value is an inner map whose keys are GPU combinations and whose values are the calculated communication cost of each combination. The cost of a combination is the sum of the costs of all distinct GPU pairs in it. For instance, the total cost of GPUs [0,1,2] is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2", each of which comes from the map built in step 1.
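
For illustration only (class and method names are hypothetical, not taken from the patch), the cost table could be built from the step-1 pair map roughly like this:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of step 2: build {gpuCount => {combination => cost}}.
public class GpuCostTableBuilder {

  public static Map<Integer, Map<List<Integer>, Integer>> build(
      Map<String, Integer> pairCost, int totalGpus) {
    Map<Integer, Map<List<Integer>, Integer>> costTable = new HashMap<>();
    for (int size = 2; size <= totalGpus; size++) {
      Map<List<Integer>, Integer> combos = new HashMap<>();
      enumerate(new ArrayList<>(), 0, size, totalGpus, pairCost, combos);
      costTable.put(size, combos);
    }
    return costTable;
  }

  // Recursively enumerate all combinations of 'size' GPUs out of 'totalGpus'.
  private static void enumerate(List<Integer> current, int next, int size,
      int totalGpus, Map<String, Integer> pairCost,
      Map<List<Integer>, Integer> out) {
    if (current.size() == size) {
      out.put(new ArrayList<>(current), combinationCost(current, pairCost));
      return;
    }
    for (int gpu = next; gpu < totalGpus; gpu++) {
      current.add(gpu);
      enumerate(current, gpu + 1, size, totalGpus, pairCost, out);
      current.remove(current.size() - 1);
    }
  }

  // Sum every distinct pair in the combination, e.g. for [0,1,2] the total
  // is cost("0 - 1") + cost("0 - 2") + cost("1 - 2").
  private static int combinationCost(List<Integer> gpus,
      Map<String, Integer> pairCost) {
    int total = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        total += pairCost.getOrDefault(gpus.get(i) + " - " + gpus.get(j), 0);
      }
    }
    return total;
  }
}
{code}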

*Step 3*. After the cost table is built, when allocating GPUs based on topology, we provide two policies which a container can choose through the environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or "SPREAD". "PACK" prefers faster GPU-GPU communication, while "SPREAD" prefers faster CPU-GPU communication (since the allocated GPUs then do not share the same bus to the CPU). The key difference between the two policies is the sort order of the inner map in the cost table. For instance, assume 2 GPUs are wanted. costTable.get(2) returns a map containing all combinations of two GPUs and their cost. If the policy is "PACK", we sort the entries by cost in ascending order, so the first entry is the combination with the minimum GPU-GPU cost. If the policy is "SPREAD", we sort in descending order and take the first entry, which has the highest GPU-GPU cost and therefore the lowest CPU-GPU cost.
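
A minimal sketch of that selection, assuming the cost-table shape above; the class name and the availableGpus filter are illustrative assumptions rather than the patch's actual API:
{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of step 3: pick a GPU combination according to
// NVIDIA_TOPO_POLICY. Assumes requestedGpus >= 2 so the table has an entry.
public class GpuTopologyAllocator {

  public static List<Integer> allocate(
      Map<Integer, Map<List<Integer>, Integer>> costTable,
      int requestedGpus, Set<Integer> availableGpus, String policy) {

    Map<List<Integer>, Integer> combos = costTable.get(requestedGpus);

    // PACK: ascending cost, tightest GPU-GPU connectivity first.
    // SPREAD: descending cost, GPUs spread across buses for CPU-GPU bandwidth.
    Comparator<Map.Entry<List<Integer>, Integer>> byCost =
        Map.Entry.comparingByValue();
    if ("SPREAD".equalsIgnoreCase(policy)) {
      byCost = byCost.reversed();
    }

    return combos.entrySet().stream()
        .filter(e -> availableGpus.containsAll(e.getKey()))  // only free GPUs
        .sorted(byCost)
        .map(Map.Entry::getKey)
        .findFirst()
        .orElse(null);
  }
}
{code}
With the example cost table above, a request for 2 GPUs under "PACK" would return whichever available pair has the lowest summed cost, while "SPREAD" would return the pair with the highest.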



> GPU hierarchy/topology scheduling support
> -----------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8821-trunk.001.patch
>
>


