Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
On 16/08/21 16:03, Srikar Dronamraju wrote:
>> Your version is much much better than mine.
>> And I have verified that it works as expected.
>>
>
> Hey Peter/Valentin
>
> Are we waiting for any more feedback/testing for this?
>

I'm not overly fond of that last one, but AFAICT the only alternative is
doing a full-fledged NUMA topology rebuild on new-node onlining (i.e. make
calling sched_init_numa() more than once work). It's a lot more work for a
very particular usecase.

> --
> Thanks and Regards
> Srikar Dronamraju
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
> Your version is much much better than mine.
> And I have verified that it works as expected.
>

Hey Peter/Valentin

Are we waiting for any more feedback/testing for this?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
* Valentin Schneider [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >>     0:  10  20  ??  40
> >>     1:  20  10  ??  40
> >>     2:  ??  ??  ??  ??
> >>     3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> >   node   0   1   2   3
> >     0:  10  20  40  10
> >     1:  20  10  40  10
> >     2:  40  40  10  10
> >     3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3's distance, so it will be as if it is local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.
> >
> >   node   0   1   2   3
> >     0:  10  20  40  40
> >     1:  20  10  40  40
> >     2:  40  40  10  40
> >     3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of NODE->mask(3) because when we online it
> > and find the right distance, there is no API to reset the numa mask of
> > node 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug in to clear/set the
> > node masks whenever the node state changes.
> >
>
> Gotcha, this is now coming back to me...
>
> [...]
>
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
>
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
>
> I got to the below...
>

Yes, I completely missed that we should update sched_numa_topology_type.

> ---
> From: Srikar Dronamraju
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
>
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
>
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
>
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
>

Your version is much much better than mine.
And I have verified that it works as expected.

> Reported-by: Geetika Moolchandani
> Signed-off-by: Srikar Dronamraju
> Signed-off-by: Valentin Schneider
> ---
>  kernel/sched/topology.c | 65 +
>  1 file changed, 65 insertions(+)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int sched_max_numa_distance;
>  static int *sched_domains_numa_distance;
>  static struct cpumask ***sched_domains_numa_masks;
>  int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
>
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>  			sched_domains_numa_masks[i][j] = mask;
>
>  			for_each_node(k) {
> +				/*
> +				 * Distance information can be unreliable for
> +				 * offline nodes, defer building the node
> +				 * masks to its bringup.
> +				 * This relies on all unique distance values
> +				 * still being visible at init time.
> +				 */
> +				if (!node_online(j))
> +					continue;
> +
>  				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>  					sched_numa_warn("Node-distance not symmetric");
>
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>  	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
>
>  	init_numa_topology_type();
> +
> +	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> +	if (!sched_numa_onlined_nodes)
> +		return;
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
On 09/08/21 12:22, Srikar Dronamraju wrote:
> * Valentin Schneider [2021-08-08 16:56:47]:
>> Wait, doesn't the distance matrix (without any offline node) say
>>
>>   distance(0, 3) == 40
>>
>> ? We should have at the very least:
>>
>>   node   0   1   2   3
>>     0:  10  20  ??  40
>>     1:  20  10  ??  40
>>     2:  ??  ??  ??  ??
>>     3:  40  40  ??  10
>>
>
> Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> Note: Node 2-7 and CPU 2-7 are still offline.
>
>   node   0   1   2   3
>     0:  10  20  40  10
>     1:  20  10  40  10
>     2:  40  40  10  10
>     3:  10  10  10  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0
>
> Note: This is with updating Node 2's distance as 40 for figuring out
> the number of numa levels. Since we have all possible distances, we
> dont update Node 3's distance, so it will be as if it is local to node 0.
>
> Now when Node 3 and CPU 3 are onlined
> Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.
>
>   node   0   1   2   3
>     0:  10  20  40  40
>     1:  20  10  40  40
>     2:  40  40  10  40
>     3:  40  40  40  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0,3
>
> CPU 0 continues to be part of NODE->mask(3) because when we online it
> and find the right distance, there is no API to reset the numa mask of
> node 3 to remove CPU 0 from the numa masks.
>
> If we had an API to clear/set sched_domains_numa_masks[node][] when
> the node state changes, we could probably plug in to clear/set the
> node masks whenever the node state changes.
>

Gotcha, this is now coming back to me...

[...]

>> Ok, so it looks like we really can't do without that part - even if we get
>> "sensible" distance values for the online nodes, we can't divine values for
>> the offline ones.
>>
>
> Yes
>

Argh, while your approach does take care of the masks, it leaves
sched_numa_topology_type unchanged. You *can* force an update of it, but
yuck :(

I got to the below...

---
From: Srikar Dronamraju
Date: Thu, 1 Jul 2021 09:45:51 +0530
Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes

The scheduler currently expects NUMA node distances to be stable from init
onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks when
it is first onlined post-init.

Reported-by: Geetika Moolchandani
Signed-off-by: Srikar Dronamraju
Signed-off-by: Valentin Schneider
---
 kernel/sched/topology.c | 65 +
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cba95793a9b7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1482,6 +1482,8 @@ int sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
 int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
+
+static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif

 /*
@@ -1833,6 +1835,16 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;

 			for_each_node(k) {
+				/*
+				 * Distance information can be unreliable for
+				 * offline nodes, defer building the node
+				 * masks to its bringup.
+				 * This relies on all unique distance values
+				 * still being visible at init time.
+				 */
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");

@@ -1886,6 +1898,53 @@ void sched_init_numa(void)
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];

 	init_numa_topology_type();
+
+	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
+	if (!sched_numa_onlined_nodes)
+		return;
+
+	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
+	for_each_online_node(i)
+		bitmap_set(sched_numa_onlined_nodes, i, 1);
+}
+
+void __sched_domains_numa_masks_set(unsigned int node)
+{
+	int i, j;
+
+	/*
+	 * NUMA masks are not built for offline nodes in sched_init_numa().
+	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
+	 * adding that new CPU to the right NU
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
* Valentin Schneider [2021-08-08 16:56:47]:

> A bit late, but technically the week isn't over yet! :D
>
> On 23/07/21 20:09, Srikar Dronamraju wrote:
> > * Valentin Schneider [2021-07-13 17:32:14]:
> >> Now, let's take examples from your cover letter:
> >>
> >>   node distances:
> >>   node   0   1   2   3   4   5   6   7
> >>     0:  10  20  40  40  40  40  40  40
> >>     1:  20  10  40  40  40  40  40  40
> >>     2:  40  40  10  20  40  40  40  40
> >>     3:  40  40  20  10  40  40  40  40
> >>     4:  40  40  40  40  10  20  40  40
> >>     5:  40  40  40  40  20  10  40  40
> >>     6:  40  40  40  40  40  40  10  20
> >>     7:  40  40  40  40  40  40  20  10
> >>
> >> But the system boots with just nodes 0 and 1, thus only this distance
> >> matrix is valid:
> >>
> >>   node   0   1
> >>     0:  10  20
> >>     1:  20  10
> >>
> >> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> >> NODE topology level should cause issues. Let's assume all offline nodes say
> >> they're 10 distance away from everyone else, and that we have one CPU per
> >> node. This would give us:
> >>
> >
> > No,
> > All offline nodes would be at a distance of 10 from node 0 only.
> > So here node distance of all offline nodes from node 1 would be 20.
> >
> >> NODE->mask(0) == 0,2-7
> >> NODE->mask(1) == 1-7
> >
> > so
> >
> > NODE->mask(0) == 0,2-7
> > NODE->mask(1) should be 1
> > and NODE->mask(2-7) == 0,2-7
> >
>
> Ok, so that shouldn't trigger the warning.

Yes, not at this point, but later on when we online a node.

> >>
> >> The intersection is 2-7, we'll trigger the WARN_ON().
> >> Now, with the above snippet, we'll check if that intersection covers any
> >> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> >> end up with an empty intersection and we shouldn't warn - that's the theory
> >> at least.
> >
> > Now lets say we onlined CPU 3 and node 3 which was at an actual distance
> > of 20 from node 0.
> >
> > (If we only consider online CPUs, and since scheduler masks like
> > sched_domains_numa_masks arent updated with offline CPUs,)
> > then
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(3) == 0,3
> >
>
> Wait, doesn't the distance matrix (without any offline node) say
>
>   distance(0, 3) == 40
>
> ? We should have at the very least:
>
>   node   0   1   2   3
>     0:  10  20  ??  40
>     1:  20  10  ??  40
>     2:  ??  ??  ??  ??
>     3:  40  40  ??  10
>

Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
Note: Node 2-7 and CPU 2-7 are still offline.

  node   0   1   2   3
    0:  10  20  40  10
    1:  20  10  40  10
    2:  40  40  10  10
    3:  10  10  10  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0

Note: This is with updating Node 2's distance as 40 for figuring out
the number of numa levels. Since we have all possible distances, we
dont update Node 3's distance, so it will be as if it is local to node 0.

Now when Node 3 and CPU 3 are onlined
Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.

  node   0   1   2   3
    0:  10  20  40  40
    1:  20  10  40  40
    2:  40  40  10  40
    3:  40  40  40  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0,3

CPU 0 continues to be part of NODE->mask(3) because when we online it
and find the right distance, there is no API to reset the numa mask of
node 3 to remove CPU 0 from the numa masks.

If we had an API to clear/set sched_domains_numa_masks[node][] when
the node state changes, we could probably plug in to clear/set the
node masks whenever the node state changes.

> Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if
>
>   distance(0,3) > LOCAL_DISTANCE
>
> then
>
>   node0 ∉ NODE->mask(3)
>
> > cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> > if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> >     cpumask_intersects(intersect, cpu_map))
> >
> > cpu_map is 0,1,3
> > intersect is 0
> >
> > From above NODE->mask(0) is !equal to NODE->mask(1) and
> > cpumask_intersects(intersect, cpu_map) is also true.
> >
> > I picked Node 3 since if Node 1 is online, we would have faked the
> > distance for Node 2 to be at a distance of 40.
> >
> > Any node from 3 to 7, we would have faced the same problem.
> >
> >>
> >> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> >> it doesn't run in the right place wrt where sched_domains_curr_level is
> >> updated. Could you try the below on top of the previous snippet?
> >>
> >> If that doesn't help, could you share the node distances / topology masks
> >> that lead to the WARN_ON()? Thanks.
> >>
> >> ---
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index b77ad49dc14f..cda69dfa4065 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
> >>  	int sd_id, sd_weight, sd_flags = 0;
> >>  	struct cpumask *sd_span;
> >>
> >> -#if
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
A bit late, but technically the week isn't over yet! :D

On 23/07/21 20:09, Srikar Dronamraju wrote:
> * Valentin Schneider [2021-07-13 17:32:14]:
>> Now, let's take examples from your cover letter:
>>
>>   node distances:
>>   node   0   1   2   3   4   5   6   7
>>     0:  10  20  40  40  40  40  40  40
>>     1:  20  10  40  40  40  40  40  40
>>     2:  40  40  10  20  40  40  40  40
>>     3:  40  40  20  10  40  40  40  40
>>     4:  40  40  40  40  10  20  40  40
>>     5:  40  40  40  40  20  10  40  40
>>     6:  40  40  40  40  40  40  10  20
>>     7:  40  40  40  40  40  40  20  10
>>
>> But the system boots with just nodes 0 and 1, thus only this distance
>> matrix is valid:
>>
>>   node   0   1
>>     0:  10  20
>>     1:  20  10
>>
>> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
>> NODE topology level should cause issues. Let's assume all offline nodes say
>> they're 10 distance away from everyone else, and that we have one CPU per
>> node. This would give us:
>>
>
> No,
> All offline nodes would be at a distance of 10 from node 0 only.
> So here node distance of all offline nodes from node 1 would be 20.
>
>> NODE->mask(0) == 0,2-7
>> NODE->mask(1) == 1-7
>
> so
>
> NODE->mask(0) == 0,2-7
> NODE->mask(1) should be 1
> and NODE->mask(2-7) == 0,2-7
>

Ok, so that shouldn't trigger the warning.

>>
>> The intersection is 2-7, we'll trigger the WARN_ON().
>> Now, with the above snippet, we'll check if that intersection covers any
>> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
>> end up with an empty intersection and we shouldn't warn - that's the theory
>> at least.
>
> Now lets say we onlined CPU 3 and node 3 which was at an actual distance
> of 20 from node 0.
>
> (If we only consider online CPUs, and since scheduler masks like
> sched_domains_numa_masks arent updated with offline CPUs,)
> then
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(3) == 0,3
>

Wait, doesn't the distance matrix (without any offline node) say

  distance(0, 3) == 40

? We should have at the very least:

  node   0   1   2   3
    0:  10  20  ??  40
    1:  20  10  ??  40
    2:  ??  ??  ??  ??
    3:  40  40  ??  10

Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if

  distance(0,3) > LOCAL_DISTANCE

then

  node0 ∉ NODE->mask(3)

> cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
>     cpumask_intersects(intersect, cpu_map))
>
> cpu_map is 0,1,3
> intersect is 0
>
> From above NODE->mask(0) is !equal to NODE->mask(1) and
> cpumask_intersects(intersect, cpu_map) is also true.
>
> I picked Node 3 since if Node 1 is online, we would have faked the
> distance for Node 2 to be at a distance of 40.
>
> Any node from 3 to 7, we would have faced the same problem.
>
>>
>> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
>> it doesn't run in the right place wrt where sched_domains_curr_level is
>> updated. Could you try the below on top of the previous snippet?
>>
>> If that doesn't help, could you share the node distances / topology masks
>> that lead to the WARN_ON()? Thanks.
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..cda69dfa4065 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>>  	int sd_id, sd_weight, sd_flags = 0;
>>  	struct cpumask *sd_span;
>>
>> -#ifdef CONFIG_NUMA
>> -	/*
>> -	 * Ugly hack to pass state to sd_numa_mask()...
>> -	 */
>> -	sched_domains_curr_level = tl->numa_level;
>> -#endif
>> -
>>  	sd_weight = cpumask_weight(tl->mask(cpu));
>>
>>  	if (tl->sd_flags)
>> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>
>>  	sd = NULL;
>>  	for_each_sd_topology(tl) {
>> -
>> +#ifdef CONFIG_NUMA
>> +		/*
>> +		 * Ugly hack to pass state to sd_numa_mask()...
>> +		 */
>> +		sched_domains_curr_level = tl->numa_level;
>> +#endif
>>  		if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>>  			goto error;
>>
>>
>
> I tested with the above patch too. However it is still not helping.
>
> Here is the log from my testing.
>
> At Boot.
>
> (Do remember to arrive at sched_max_numa_levels we faked the
> numa_distance of node 1 to be at 20 from node 0. All other offline
> nodes are at a distance of 10 from node 0.)

> [...]

> (First addition of a CPU to a non-online node esp node whose node
> distance was not faked.)
>
> numactl -H
> available: 3 nodes (0,5,7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 5 cpus: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77
> 78 79 80 81
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
On 04/08/21 15:31, Srikar Dronamraju wrote:
> * Srikar Dronamraju [2021-07-23 20:09:14]:
>
>> * Valentin Schneider [2021-07-13 17:32:14]:
>>
>> > On 12/07/21 18:18, Srikar Dronamraju wrote:
>> > > Hi Valentin,
>> > >
>> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> > >> >  {
>
> Hey Valentin / Peter
>
> Did you get a chance to look at this?
>

Barely, I wanted to set some time aside to stare at this and have been
failing miserably. Let me bump it up my todolist, I'll get to it before the
end of the week.

> --
> Thanks and Regards
> Srikar Dronamraju
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
* Srikar Dronamraju [2021-07-23 20:09:14]:

> * Valentin Schneider [2021-07-13 17:32:14]:
>
> > On 12/07/21 18:18, Srikar Dronamraju wrote:
> > > Hi Valentin,
> > >
> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> > >> >  {

Hey Valentin / Peter

Did you get a chance to look at this?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
* Valentin Schneider [2021-07-13 17:32:14]:

> On 12/07/21 18:18, Srikar Dronamraju wrote:
> > Hi Valentin,
> >
> >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >> >  {
> >
> > Unfortunately this is not helping.
> > I tried this patch alone and also with 2/2 patch of this series where
> > we update/fill fake topology numbers. However both cases are still failing.
> >
>
> Thanks for testing it.
>
> Now, let's take examples from your cover letter:
>
>   node distances:
>   node   0   1   2   3   4   5   6   7
>     0:  10  20  40  40  40  40  40  40
>     1:  20  10  40  40  40  40  40  40
>     2:  40  40  10  20  40  40  40  40
>     3:  40  40  20  10  40  40  40  40
>     4:  40  40  40  40  10  20  40  40
>     5:  40  40  40  40  20  10  40  40
>     6:  40  40  40  40  40  40  10  20
>     7:  40  40  40  40  40  40  20  10
>
> But the system boots with just nodes 0 and 1, thus only this distance
> matrix is valid:
>
>   node   0   1
>     0:  10  20
>     1:  20  10
>
> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> NODE topology level should cause issues. Let's assume all offline nodes say
> they're 10 distance away from everyone else, and that we have one CPU per
> node. This would give us:
>

No,
All offline nodes would be at a distance of 10 from node 0 only.
So here node distance of all offline nodes from node 1 would be 20.

> NODE->mask(0) == 0,2-7
> NODE->mask(1) == 1-7

so

NODE->mask(0) == 0,2-7
NODE->mask(1) should be 1
and NODE->mask(2-7) == 0,2-7

>
> The intersection is 2-7, we'll trigger the WARN_ON().
> Now, with the above snippet, we'll check if that intersection covers any
> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> end up with an empty intersection and we shouldn't warn - that's the theory
> at least.

Now lets say we onlined CPU 3 and node 3 which was at an actual distance
of 20 from node 0.

(If we only consider online CPUs, and since scheduler masks like
sched_domains_numa_masks arent updated with offline CPUs,)
then

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(3) == 0,3

cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
    cpumask_intersects(intersect, cpu_map))

cpu_map is 0,1,3
intersect is 0

From above NODE->mask(0) is !equal to NODE->mask(1) and
cpumask_intersects(intersect, cpu_map) is also true.

I picked Node 3 since if Node 1 is online, we would have faked the
distance for Node 2 to be at a distance of 40.

Any node from 3 to 7, we would have faced the same problem.

>
> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> it doesn't run in the right place wrt where sched_domains_curr_level is
> updated. Could you try the below on top of the previous snippet?
>
> If that doesn't help, could you share the node distances / topology masks
> that lead to the WARN_ON()? Thanks.
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cda69dfa4065 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>  	int sd_id, sd_weight, sd_flags = 0;
>  	struct cpumask *sd_span;
>
> -#ifdef CONFIG_NUMA
> -	/*
> -	 * Ugly hack to pass state to sd_numa_mask()...
> -	 */
> -	sched_domains_curr_level = tl->numa_level;
> -#endif
> -
>  	sd_weight = cpumask_weight(tl->mask(cpu));
>
>  	if (tl->sd_flags)
> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
>  	sd = NULL;
>  	for_each_sd_topology(tl) {
> -
> +#ifdef CONFIG_NUMA
> +		/*
> +		 * Ugly hack to pass state to sd_numa_mask()...
> +		 */
> +		sched_domains_curr_level = tl->numa_level;
> +#endif
>  		if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>  			goto error;
>
>

I tested with the above patch too. However it is still not helping.

Here is the log from my testing.

At Boot.

(Do remember to arrive at sched_max_numa_levels we faked the
numa_distance of node 1 to be at 20 from node 0. All other offline
nodes are at a distance of 10 from node 0.)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus:
node 5 size: 32038 MB
node 5 free: 29367 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10

--
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_P
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
On 12/07/21 18:18, Srikar Dronamraju wrote:
> Hi Valentin,
>
>> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> >  {
>>
>> Hmph, so we're playing games with masks of offline nodes - is that really
>> necessary? Your modification of sched_init_numa() still scans all of the
>> nodes (regardless of their online status) to build the distance map, and
>> that is never updated (sched_init_numa() is pretty much an __init
>> function).
>>
>> So AFAICT this is all to cope with topology_span_sane() not applying
>> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
>> light of having bogus distance values for offline nodes, not so much...
>>
>> What about the below instead?
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..c2d9caad4aa6 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>  			       const struct cpumask *cpu_map, int cpu)
>>  {
>> +	struct cpumask *intersect = sched_domains_tmpmask;
>>  	int i;
>>
>>  	/* NUMA levels are allowed to overlap */
>> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>  	for_each_cpu(i, cpu_map) {
>>  		if (i == cpu)
>>  			continue;
>> +
>>  		/*
>> -		 * We should 'and' all those masks with 'cpu_map' to exactly
>> -		 * match the topology we're about to build, but that can only
>> -		 * remove CPUs, which only lessens our ability to detect
>> -		 * overlaps
>> +		 * We shouldn't have to bother with cpu_map here, unfortunately
>> +		 * some architectures (powerpc says hello) have to deal with
>> +		 * offline NUMA nodes reporting bogus distance values. This can
>> +		 * lead to funky NODE domain spans, but since those are offline
>> +		 * we can mask them out.
>>  		 */
>> +		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>>  		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
>> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
>> +		    cpumask_intersects(intersect, cpu_map))
>>  			return false;
>>  	}
>>
>
> Unfortunately this is not helping.
> I tried this patch alone and also with 2/2 patch of this series where
> we update/fill fake topology numbers. However both cases are still failing.
>

Thanks for testing it.

Now, let's take examples from your cover letter:

  node distances:
  node   0   1   2   3   4   5   6   7
    0:  10  20  40  40  40  40  40  40
    1:  20  10  40  40  40  40  40  40
    2:  40  40  10  20  40  40  40  40
    3:  40  40  20  10  40  40  40  40
    4:  40  40  40  40  10  20  40  40
    5:  40  40  40  40  20  10  40  40
    6:  40  40  40  40  40  40  10  20
    7:  40  40  40  40  40  40  20  10

But the system boots with just nodes 0 and 1, thus only this distance
matrix is valid:

  node   0   1
    0:  10  20
    1:  20  10

topology_span_sane() is going to use tl->mask(cpu), and as you reported the
NODE topology level should cause issues. Let's assume all offline nodes say
they're 10 distance away from everyone else, and that we have one CPU per
node. This would give us:

NODE->mask(0) == 0,2-7
NODE->mask(1) == 1-7

The intersection is 2-7, we'll trigger the WARN_ON().
Now, with the above snippet, we'll check if that intersection covers any
online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
end up with an empty intersection and we shouldn't warn - that's the theory
at least.

Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
it doesn't run in the right place wrt where sched_domains_curr_level is
updated. Could you try the below on top of the previous snippet?

If that doesn't help, could you share the node distances / topology masks
that lead to the WARN_ON()? Thanks.

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cda69dfa4065 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
 	int sd_id, sd_weight, sd_flags = 0;
 	struct cpumask *sd_span;

-#ifdef CONFIG_NUMA
-	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
-	 */
-	sched_domains_curr_level = tl->numa_level;
-#endif
-
 	sd_weight = cpumask_weight(tl->mask(cpu));

 	if (tl->sd_flags)
@@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att

 	sd = NULL;
 	for_each_sd_topology(tl) {
-
+#ifdef CONFIG_NUMA
+
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
Hi Valentin,

> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >  {
>
> Hmph, so we're playing games with masks of offline nodes - is that really
> necessary? Your modification of sched_init_numa() still scans all of the
> nodes (regardless of their online status) to build the distance map, and
> that is never updated (sched_init_numa() is pretty much an __init
> function).
>
> So AFAICT this is all to cope with topology_span_sane() not applying
> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
> light of having bogus distance values for offline nodes, not so much...
>
> What about the below instead?
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..c2d9caad4aa6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  			       const struct cpumask *cpu_map, int cpu)
>  {
> +	struct cpumask *intersect = sched_domains_tmpmask;
>  	int i;
>
>  	/* NUMA levels are allowed to overlap */
> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  	for_each_cpu(i, cpu_map) {
>  		if (i == cpu)
>  			continue;
> +
>  		/*
> -		 * We should 'and' all those masks with 'cpu_map' to exactly
> -		 * match the topology we're about to build, but that can only
> -		 * remove CPUs, which only lessens our ability to detect
> -		 * overlaps
> +		 * We shouldn't have to bother with cpu_map here, unfortunately
> +		 * some architectures (powerpc says hello) have to deal with
> +		 * offline NUMA nodes reporting bogus distance values. This can
> +		 * lead to funky NODE domain spans, but since those are offline
> +		 * we can mask them out.
>  		 */
> +		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>  		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> -		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> +		    cpumask_intersects(intersect, cpu_map))
>  			return false;
>  	}
>

Unfortunately this is not helping.
I tried this patch alone and also with 2/2 patch of this series where
we update/fill fake topology numbers. However both cases are still failing.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
On 01/07/21 09:45, Srikar Dronamraju wrote:
> @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>  void sched_domains_numa_masks_set(unsigned int cpu)
>  {
>  	int node = cpu_to_node(cpu);
> -	int i, j;
> +	int i, j, empty;
>
> +	empty = cpumask_empty(sched_domains_numa_masks[0][node]);
>  	for (i = 0; i < sched_domains_numa_levels; i++) {
>  		for (j = 0; j < nr_node_ids; j++) {
> -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +			if (!node_online(j))
> +				continue;
> +
> +			if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
>  				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +
> +				/*
> +				 * We skip updating numa_masks for offline
> +				 * nodes. However now that the node is
> +				 * finally online, CPUs that were added
> +				 * earlier, should now be accommodated into
> +				 * newly onlined node's numa mask.
> +				 */
> +				if (node != j && empty) {
> +					cpumask_or(sched_domains_numa_masks[i][node],
> +						   sched_domains_numa_masks[i][node],
> +						   sched_domains_numa_masks[0][j]);
> +				}
> +			}

Hmph, so we're playing games with masks of offline nodes - is that really
necessary? Your modification of sched_init_numa() still scans all of the
nodes (regardless of their online status) to build the distance map, and
that is never updated (sched_init_numa() is pretty much an __init
function).

So AFAICT this is all to cope with topology_span_sane() not applying
'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
light of having bogus distance values for offline nodes, not so much...

What about the below instead?

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..c2d9caad4aa6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 static bool topology_span_sane(struct sched_domain_topology_level *tl,
 			      const struct cpumask *cpu_map, int cpu)
 {
+	struct cpumask *intersect = sched_domains_tmpmask;
 	int i;

 	/* NUMA levels are allowed to overlap */
@@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
 	for_each_cpu(i, cpu_map) {
 		if (i == cpu)
 			continue;
+
 		/*
-		 * We should 'and' all those masks with 'cpu_map' to exactly
-		 * match the topology we're about to build, but that can only
-		 * remove CPUs, which only lessens our ability to detect
-		 * overlaps
+		 * We shouldn't have to bother with cpu_map here, unfortunately
+		 * some architectures (powerpc says hello) have to deal with
+		 * offline NUMA nodes reporting bogus distance values. This can
+		 * lead to funky NODE domain spans, but since those are offline
+		 * we can mask them out.
 		 */
+		cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
 		if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-		    cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+		    cpumask_intersects(intersect, cpu_map))
 			return false;
 	}
[PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes
Currently the scheduler doesn't check if a node is online before adding
CPUs to the node mask. However, on some architectures the node distance
is only available for nodes that are online. It is not clear how much the
node distance can be relied upon when one of the nodes is offline. If the
reported node distance is fake (since one of the nodes is offline) and the
actual node distance turns out to be different, then the cpumask of such
nodes will be wrong once they come online. This can cause
topology_span_sane() to throw up a warning message, after which the rest
of the topology is not updated properly.

Resolve this by skipping the cpumask update for nodes that are not online.

However, by skipping the update, relevant CPUs may not be set when nodes
are onlined, i.e. when coming up with NUMA masks at a certain NUMA
distance, CPUs that are part of other already-online nodes will not be
part of the NUMA mask. Hence, the first time a CPU is added to the newly
onlined node, add the other CPUs to the numa_mask.

Cc: LKML
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch
Cc: Michael Ellerman
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Gautham R Shenoy
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Rik van Riel
Cc: Geetika Moolchandani
Cc: Laurent Dufour
Reported-by: Geetika Moolchandani
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
v1 link: http://lore.kernel.org/lkml/20210520154427.1041031-4-sri...@linux.vnet.ibm.com/t/#u
Update the NUMA masks whenever the 1st CPU is added to a cpuless node

 kernel/sched/topology.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..f25dbcab4fd2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1833,6 +1833,9 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;

 			for_each_node(k) {
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
@@ -1891,12 +1894,30 @@ void sched_init_numa(void)
 void sched_domains_numa_masks_set(unsigned int cpu)
 {
 	int node = cpu_to_node(cpu);
-	int i, j;
+	int i, j, empty;

+	empty = cpumask_empty(sched_domains_numa_masks[0][node]);
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
-			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+			if (!node_online(j))
+				continue;
+
+			if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+
+				/*
+				 * We skip updating numa_masks for offline
+				 * nodes. However now that the node is
+				 * finally online, CPUs that were added
+				 * earlier, should now be accommodated into
+				 * newly onlined node's numa mask.
+				 */
+				if (node != j && empty) {
+					cpumask_or(sched_domains_numa_masks[i][node],
+						   sched_domains_numa_masks[i][node],
+						   sched_domains_numa_masks[0][j]);
+				}
+			}
 		}
 	}
 }
--
2.27.0