[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Gitweb:        https://git.kernel.org/tip/585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building a sched_domain from a sibling's
child domain will definitely create a sched_domain containing a
sched_group that spans out of the sched_domain:

   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

   domain0        node0    node1    node2    node3

   domain1        node0+1  node0+1  node2+3  node2+3
                                                 +
   domain2        node0+1+2                      |
                group: node0+1                   |
                  group: node2+3 <---------------+

When node2 is added into the domain2 of node0, the kernel uses the
child domain of node2's domain2, which is domain1 (node2+3). Node3 is
outside the span of the domain containing node0+1+2.

This makes load_balance() run based on skewed avg_load and group_type
in a sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers that suffer from this problem include at least Kunpeng920
and the 8-node Sun Fire X4600-M2.

Here we move to using the *child* domain of the *child* domain of
node2's domain2 as the newly added sched_group. At the same time, we
re-use the lower-level sgc directly:

   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

   domain0        node0    node1   +- node2  node3
                                   |
   domain1        node0+1  node0+1 | node2+3 node2+3
                                   |
   domain2        node0+1+2        |
                group: node0+1     |
                  group: node2 <---+

While the lower-level sgc is re-used, this patch only changes the
remote sched_groups for those sched_domains playing the grandchild
trick; therefore, sgc->next_update is still safe since it is only
touched by CPUs that have the group span as their local group.
And sgc->imbalance is also safe because sd_parent remains the same in
load_balance(), and LB only tries other CPUs from the local group.
Moreover, since local groups are not touched, they still get roughly
equal size within a topology level (TL). And should_we_balance() only
matters for local groups, so the pull probability of those groups is
still roughly equal.

Tested with the below topology:

qemu-system-aarch64 -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":

[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]    domain-2: span=0-5 level=NUMA
[0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058]     domain-3: span=0-7 level=NUMA
[0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]    domain-2: span=0-5 level=NUMA
[0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520]     domain-3: span=0-7 level=NUMA
[0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804677] CPU2 attaching
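The group-construction problem described above can be illustrated outside the kernel with a short sketch. This is plain Python, not kernel code: the distance matrix is taken from the qemu topology above, and span() is a deliberately simplified stand-in for how the scheduler derives domain spans from sorted NUMA distance levels.

```python
# Sketch (assumptions: span() approximates sched-domain spans by
# distance threshold; nodes stand in for their CPUs).

# NUMA distance matrix for the 4-node qemu topology (10 = local).
dist = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]

# The scheduler's NUMA topology levels come from sorted unique distances.
levels = sorted({d for row in dist for d in row})  # [10, 12, 20, 22, 24]

def span(node, level):
    """Nodes within levels[level] distance of 'node' (~ domain<level> span)."""
    return {n for n in range(4) if dist[node][n] <= levels[level]}

# node0's domain2 spans node0+1+2:
assert span(0, 2) == {0, 1, 2}

# Pre-patch: the remote group for node2 is built from node2's *child*
# domain (domain1 = node2+3), which leaks node3 outside the span:
child = span(2, 1)                         # {2, 3}
assert not child <= span(0, 2)             # ERROR: groups don't span domain->span

# Post-patch: use the child of the child (domain0 = node2 alone),
# which stays inside node0's domain2:
grandchild = span(2, 0)                    # {2}
assert grandchild <= span(0, 2)
```

Note this captures only the span containment argument; the real fix in build_overlap_sched_groups() also has to re-use the grandchild's sgc, as discussed above.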
[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9f4af5753b691b9df558ddcfea13e9f3036e45ca
Gitweb:        https://git.kernel.org/tip/9f4af5753b691b9df558ddcfea13e9f3036e45ca
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 04 Mar 2021 09:56:00 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2