[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-03-06 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Gitweb: https://git.kernel.org/tip/585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Author: Barry Song
AuthorDate: Wed, 24 Feb 2021 16:09:44 +13:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as the NUMA diameter is > 2, building a sched_domain from a
sibling's child domain will definitely create a sched_domain containing a
sched_group that spans outside the sched_domain:

   +------+        +------+        +------+        +------+
   | node |  12    | node |  20    | node |  12    | node |
   |  0   +--------+  1   +--------+  2   +--------+  3   |
   +------+        +------+        +------+        +------+

domain0     node0           node1           node2           node3

domain1     node0+1         node0+1         node2+3         node2+3
                                              +
domain2     node0+1+2                         |
            group: node0+1                    |
            group: node2+3  <-----------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node3 is outside
the span of the domain covering node0+1+2.

This makes load_balance() run on skewed avg_load and group_type values
from a sched_group spanning outside the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.
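
For reference, the property being violated is the containment check behind
the boot-time warning shown further below: every sched_group's span must
be a subset of its sched_domain's span. A minimal sketch of that check,
assuming the standard cpumask_subset(), sched_group_span() and
sched_domain_span() helpers (not the literal kernel debug function):

	/* Sketch: a group is valid only if its span stays inside the domain. */
	static bool group_spans_domain(struct sched_domain *sd,
				       struct sched_group *group)
	{
		return cpumask_subset(sched_group_span(group),
				      sched_domain_span(sd));
	}

When this is false for a remote group, the debug output below prints
"ERROR: groups don't span domain->span".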

Real servers that suffer from this problem include at least the Kunpeng920
and the 8-node Sun Fire X4600-M2.

Here we move to using the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower-level sgc directly.

   +------+        +------+        +------+        +------+
   | node |  12    | node |  20    | node |  12    | node |
   |  0   +--------+  1   +--------+  2   +--------+  3   |
   +------+        +------+        +------+        +------+

domain0     node0           node1         +-node2           node3
                                          |
domain1     node0+1         node0+1       | node2+3         node2+3
                                          |
domain2     node0+1+2                     |
            group: node0+1                |
            group: node2  <---------------+

While the lower-level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing the grandchild trick.
Therefore, sgc->next_update is still safe since it is only touched by CPUs
that have the group span as their local group. And sgc->imbalance is also
safe because sd_parent remains the same in load_balance() and load
balancing only tries other CPUs from the local group.

Moreover, since local groups are not touched, they still get roughly equal
sizes within a topology level. And should_we_balance() only matters for
local groups, so the pull probability of those groups is still roughly
equal.
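
The core of the change can be sketched as walking down the sibling's
domain hierarchy until its child's span is fully contained in the domain
being built, and only then building the remote group from it. The sketch
below is illustrative only (the helper name is invented for this example),
using the kernel's cpumask_subset() and sched_domain_span() helpers:

	/*
	 * Illustrative sketch: descend the sibling's hierarchy until its
	 * child's span no longer leaks outside the domain being built (sd).
	 */
	static struct sched_domain *
	pick_contained_sibling(struct sched_domain *sd,
			       struct sched_domain *sibling)
	{
		while (sibling->child &&
		       !cpumask_subset(sched_domain_span(sibling->child),
				       sched_domain_span(sd)))
			sibling = sibling->child;

		return sibling;
	}

With the topology above, the descent stops once the candidate child spans
node2 only, which is exactly the new remote group shown in the second
diagram.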

Tested with the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image
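
For clarity, the node distance matrix implied by those options (assuming
QEMU's default local distance of 10 and symmetric remote distances) is:

	        node0  node1  node2  node3
	node0      10     12     20     22
	node1      12     10     22     24
	node2      20     22     10     12
	node3      22     24     12     10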

Without the patch, we get lots of "groups don't span domain->span" errors:
[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]    domain-2: span=0-5 level=NUMA
[0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058]     domain-3: span=0-7 level=NUMA
[0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]    domain-2: span=0-5 level=NUMA
[0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520]     domain-3: span=0-7 level=NUMA
[0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804677] CPU2 attaching 
