[tip: sched/core] sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Gitweb:        https://git.kernel.org/tip/0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Author:        Barry Song
AuthorDate:    Thu, 25 Mar 2021 15:31:40 +13:00
Committer:     Ingo Molnar
CommitterDate: Thu, 25 Mar 2021 11:41:23 +01:00

sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()

mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so it
must be a subset of sched_group_span(sg).

So the cpumask_and() call is redundant - remove it.

[ mingo: Adjusted the changelog a bit. ]

Signed-off-by: Barry Song
Signed-off-by: Ingo Molnar
Reviewed-by: Valentin Schneider
Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao@hisilicon.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f2066d6..d1aec24 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	int cpu;
 
 	build_balance_mask(sd, sg, mask);
-	cpu = cpumask_first_and(sched_group_span(sg), mask);
+	cpu = cpumask_first(mask);
 
 	sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 	if (atomic_inc_return(&sg->sgc->ref) == 1)
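As an aside, the subset argument can be sanity-checked in isolation. Below is a
tiny userspace sketch (an illustration only, not kernel code; plain unsigned
longs stand in for struct cpumask and __builtin_ctzl() stands in for
cpumask_first()): when every bit of mask is already contained in span, the
first set bit of (span & mask) equals the first set bit of mask, which is why
the extra AND is redundant.

  #include <stdio.h>

  int main(void)
  {
          unsigned long span = 0x3c;  /* CPUs 2-5: the sched_group span        */
          unsigned long mask = 0x0c;  /* CPUs 2-3: built only from bits of span */

          int first_and = __builtin_ctzl(span & mask);  /* cpumask_first_and() analogue */
          int first     = __builtin_ctzl(mask);         /* cpumask_first() analogue     */

          printf("first_and=%d first=%d\n", first_and, first);  /* both print 2 */
          return 0;
  }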
[tip: sched/core] sched/fair: Optimize test_idle_cores() for !SMT
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Gitweb:        https://git.kernel.org/tip/c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Author:        Barry Song
AuthorDate:    Sun, 21 Mar 2021 11:14:32 +13:00
Committer:     Peter Zijlstra
CommitterDate: Tue, 23 Mar 2021 16:01:59 +01:00

sched/fair: Optimize test_idle_cores() for !SMT

update_idle_core() is only done for the case of sched_smt_present, but
test_idle_cores() is done for all machines, even those without SMT.

This can contribute up to an 8%+ hackbench performance loss on a machine
like Kunpeng 920 which has no SMT. This patch removes the redundant
test_idle_cores() for !SMT machines.

Hackbench is run with -g {2..14}; for each g it is run 10 times to get an
average:

  $ numactl -N 0 hackbench -p -T -l 20000 -g $1

The below is the result of hackbench w/ and w/o this patch:

  g =     2      4      6      8      10      12      14
  w/o: 1.8151 3.8499 5.5142 7.2491  9.0340 10.7345 12.0929
  w/ : 1.8428 3.7436 5.4501 6.9522  8.2882  9.9535 11.3367
                            +4.1%   +8.3%   +7.3%   +6.3%

Signed-off-by: Barry Song
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Vincent Guittot
Acked-by: Mel Gorman
Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6aad028..aaa0dfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+		if (sds)
+			return READ_ONCE(sds->has_idle_cores);
+	}
 
 	return def;
 }
[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()
The following commit has been merged into the irq/core branch of tip:

Commit-ID:     cbe16f35bee6880becca6f20d2ebf6b457148552
Gitweb:        https://git.kernel.org/tip/cbe16f35bee6880becca6f20d2ebf6b457148552
Author:        Barry Song
AuthorDate:    Wed, 03 Mar 2021 11:49:15 +13:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:48:00 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
They currently handle this in one of the two ways below:

(1)	irq_set_status_flags(irq, IRQ_NOAUTOEN);
	request_irq(dev, irq...);

(2)	request_irq(dev, irq...);
	disable_irq(irq);

The second way is racy and unsafe: in the small time gap between
request_irq() and disable_irq(), interrupts can still arrive. The first way
is safe, though suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as (1) above. With that, the
various usage sites of (1) and (2) above can be simplified and corrected.

Signed-off-by: Barry Song
Signed-off-by: Thomas Gleixner
Signed-off-by: Ingo Molnar
Cc: dmitry.torok...@gmail.com
Link: https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com
---
 include/linux/interrupt.h |  4 ++++
 kernel/irq/manage.c       | 11 +++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
 *		interrupt handler after suspending interrupts. For system
 *		wakeup devices users need to implement wakeup detection in
 *		their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request it.
+ *		Users will enable it explicitly by enable_irq() or enable_nmi()
+ *		later.
 */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_AUTOEN		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
 	}
 
-	if (irq_settings_can_autoenable(desc)) {
+	if (!(new->flags & IRQF_NO_AUTOEN) &&
+	    irq_settings_can_autoenable(desc)) {
 		irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
 	} else {
 		/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 	 * which interrupt is which (messes up the interrupt freeing
 	 * logic etc).
 	 *
+	 * Also shared interrupts do not go well with disabling auto enable.
+	 * The sharing interrupt might request it while it's still disabled
+	 * and then wait for interrupts forever.
+	 *
 	 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 	 * it cannot be set along with IRQF_NO_SUSPEND.
 	 */
 	if (((irqflags & IRQF_SHARED) && !dev_id) ||
+	    ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
 	    (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
 	    ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
 		return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
 	desc = irq_to_desc(irq);
 
-	if (!desc || irq_settings_can_autoenable(desc) ||
+	if (!desc || (irq_settings_can_autoenable(desc) &&
+	    !(irqflags & IRQF_NO_AUTOEN)) ||
 	    !irq_settings_can_request(desc) ||
 	    WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
 	    !irq_supports_nmi(desc))
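For illustration, a driver-side use of the new flag could look like the sketch
below (the foo_* names are hypothetical; request_irq(), enable_irq() and
IRQF_NO_AUTOEN are the existing/added interfaces referenced by this patch). The
IRQ is requested but left disabled, so there is no window in which it can fire
before the driver is ready, replacing both patterns (1) and (2) above:

  #include <linux/interrupt.h>
  #include <linux/platform_device.h>

  static irqreturn_t foo_irq_handler(int irq, void *dev_id)
  {
  	/* handle the device interrupt */
  	return IRQ_HANDLED;
  }

  static int foo_probe(struct platform_device *pdev)
  {
  	int irq = platform_get_irq(pdev, 0);
  	int ret;

  	if (irq < 0)
  		return irq;

  	/* requested but not enabled: no gap where an early interrupt can arrive */
  	ret = request_irq(irq, foo_irq_handler, IRQF_NO_AUTOEN, "foo", pdev);
  	if (ret)
  		return ret;

  	/* ... finish initialising the hardware ... */

  	enable_irq(irq);	/* enable explicitly once the driver is ready */
  	return 0;
  }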
[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Gitweb:        https://git.kernel.org/tip/585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

            +------+         +------+         +-------+        +------+
            | node |   12    | node |   20    | node  |   12   | node |
            |  0   +---------+  1   +---------+  2    +--------+  3   |
            +------+         +------+         +-------+        +------+

domain0     node0            node1            node2            node3

domain1     node0+1          node0+1          node2+3          node2+3
                                                 +
domain2     node0+1+2                            |
       group: node0+1                            |
         group: node2+3  <-----------------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node3 is outside
the span of the domain including node0+1+2.

This will make load_balance() run based on skewed avg_load and group_type
in the sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower level sgc directly.

            +------+         +------+         +-------+        +------+
            | node |   12    | node |   20    | node  |   12   | node |
            |  0   +---------+  1   +---------+  2    +--------+  3   |
            +------+         +------+         +-------+        +------+

domain0     node0            node1          +- node2            node3
                                            |
domain1     node0+1          node0+1        |  node2+3          node2+3
                                            |
domain2     node0+1+2                       |
       group: node0+1                       |
         group: node2  <--------------------+

While the lower level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing the grandchild trick;
therefore, sgc->next_update is still safe since it's only touched by CPUs
that have the group span as their local group. And sgc->imbalance is also
safe because sd_parent remains the same in load_balance() and LB only
tries other CPUs from the local group.

Moreover, since local groups are not touched, they still get roughly equal
size in a TL. And should_we_balance() only matters with local groups, so
the pull probability of those groups is still roughly equal.

Tested with the below topology:

  qemu-system-aarch64 -M virt -nographic \
   -smp cpus=8 \
   -numa node,cpus=0-1,nodeid=0 \
   -numa node,cpus=2-3,nodeid=1 \
   -numa node,cpus=4-5,nodeid=2 \
   -numa node,cpus=6-7,nodeid=3 \
   -numa dist,src=0,dst=1,val=12 \
   -numa dist,src=0,dst=2,val=20 \
   -numa dist,src=0,dst=3,val=22 \
   -numa dist,src=1,dst=2,val=22 \
   -numa dist,src=2,dst=3,val=12 \
   -numa dist,src=1,dst=3,val=24 \
   -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":

  [    0.802139] CPU0 attaching sched-domain(s):
  [    0.802193]  domain-0: span=0-1 level=MC
  [    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
  [    0.802693]   domain-1: span=0-3 level=NUMA
  [    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
  [    0.802811]    domain-2: span=0-5 level=NUMA
  [    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
  [    0.802881] ERROR: groups don't span domain->span
  [    0.803058]     domain-3: span=0-7 level=NUMA
  [    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
  [    0.804055] CPU1 attaching sched-domain(s):
  [    0.804072]  domain-0: span=0-1 level=MC
  [    0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
  [    0.804152]   domain-1: span=0-3 level=NUMA
  [    0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
  [    0.804219]    domain-2: span=0-5 level=NUMA
  [    0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
  [    0.804302] ERROR: groups don't span domain->span
  [    0.804520]     domain-3: span=0-7 level=NUMA
  [    0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
  [    0.804677] CPU2 attaching
[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()
The following commit has been merged into the irq/core branch of tip:

Commit-ID:     e749df1bbd23f4472082210650514548d8a39e9b
Gitweb:        https://git.kernel.org/tip/e749df1bbd23f4472082210650514548d8a39e9b
Author:        Barry Song
AuthorDate:    Wed, 03 Mar 2021 11:49:15 +13:00
Committer:     Thomas Gleixner
CommitterDate: Thu, 04 Mar 2021 11:47:52 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
They currently handle this in one of the two ways below:

(1)	irq_set_status_flags(irq, IRQ_NOAUTOEN);
	request_irq(dev, irq...);

(2)	request_irq(dev, irq...);
	disable_irq(irq);

The second way is racy and unsafe: in the small time gap between
request_irq() and disable_irq(), interrupts can still arrive. The first way
is safe, though suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as (1) above. With that, the
various usage sites of (1) and (2) above can be simplified and corrected.

Signed-off-by: Barry Song
Signed-off-by: Thomas Gleixner
Cc: dmitry.torok...@gmail.com
Link: https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com
---
 include/linux/interrupt.h |  4 ++++
 kernel/irq/manage.c       | 11 +++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
 *		interrupt handler after suspending interrupts. For system
 *		wakeup devices users need to implement wakeup detection in
 *		their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request it.
+ *		Users will enable it explicitly by enable_irq() or enable_nmi()
+ *		later.
 */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_AUTOEN		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
 	}
 
-	if (irq_settings_can_autoenable(desc)) {
+	if (!(new->flags & IRQF_NO_AUTOEN) &&
+	    irq_settings_can_autoenable(desc)) {
 		irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
 	} else {
 		/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 	 * which interrupt is which (messes up the interrupt freeing
 	 * logic etc).
 	 *
+	 * Also shared interrupts do not go well with disabling auto enable.
+	 * The sharing interrupt might request it while it's still disabled
+	 * and then wait for interrupts forever.
+	 *
 	 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 	 * it cannot be set along with IRQF_NO_SUSPEND.
 	 */
 	if (((irqflags & IRQF_SHARED) && !dev_id) ||
+	    ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
 	    (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
 	    ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
 		return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
 	desc = irq_to_desc(irq);
 
-	if (!desc || irq_settings_can_autoenable(desc) ||
+	if (!desc || (irq_settings_can_autoenable(desc) &&
+	    !(irqflags & IRQF_NO_AUTOEN)) ||
 	    !irq_settings_can_request(desc) ||
 	    WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
 	    !irq_supports_nmi(desc))
[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9f4af5753b691b9df558ddcfea13e9f3036e45ca
Gitweb:        https://git.kernel.org/tip/9f4af5753b691b9df558ddcfea13e9f3036e45ca
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 04 Mar 2021 09:56:00 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

            +------+         +------+         +-------+        +------+
            | node |   12    | node |   20    | node  |   12   | node |
            |  0   +---------+  1   +---------+  2    +--------+  3   |
            +------+         +------+         +-------+        +------+

domain0     node0            node1            node2            node3

domain1     node0+1          node0+1          node2+3          node2+3
                                                 +
domain2     node0+1+2                            |
       group: node0+1                            |
         group: node2+3  <-----------------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node3 is outside
the span of the domain including node0+1+2.

This will make load_balance() run based on skewed avg_load and group_type
in the sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower level sgc directly.

            +------+         +------+         +-------+        +------+
            | node |   12    | node |   20    | node  |   12   | node |
            |  0   +---------+  1   +---------+  2    +--------+  3   |
            +------+         +------+         +-------+        +------+

domain0     node0            node1          +- node2            node3
                                            |
domain1     node0+1          node0+1        |  node2+3          node2+3
                                            |
domain2     node0+1+2                       |
       group: node0+1                       |
         group: node2  <--------------------+

While the lower level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing the grandchild trick;
therefore, sgc->next_update is still safe since it's only touched by CPUs
that have the group span as their local group. And sgc->imbalance is also
safe because sd_parent remains the same in load_balance() and LB only
tries other CPUs from the local group.

Moreover, since local groups are not touched, they still get roughly equal
size in a TL. And should_we_balance() only matters with local groups, so
the pull probability of those groups is still roughly equal.

Tested with the below topology:

  qemu-system-aarch64 -M virt -nographic \
   -smp cpus=8 \
   -numa node,cpus=0-1,nodeid=0 \
   -numa node,cpus=2-3,nodeid=1 \
   -numa node,cpus=4-5,nodeid=2 \
   -numa node,cpus=6-7,nodeid=3 \
   -numa dist,src=0,dst=1,val=12 \
   -numa dist,src=0,dst=2,val=20 \
   -numa dist,src=0,dst=3,val=22 \
   -numa dist,src=1,dst=2,val=22 \
   -numa dist,src=2,dst=3,val=12 \
   -numa dist,src=1,dst=3,val=24 \
   -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":

  [    0.802139] CPU0 attaching sched-domain(s):
  [    0.802193]  domain-0: span=0-1 level=MC
  [    0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
  [    0.802693]   domain-1: span=0-3 level=NUMA
  [    0.802731]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
  [    0.802811]    domain-2: span=0-5 level=NUMA
  [    0.802829]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
  [    0.802881] ERROR: groups don't span domain->span
  [    0.803058]     domain-3: span=0-7 level=NUMA
  [    0.803080]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
  [    0.804055] CPU1 attaching sched-domain(s):
  [    0.804072]  domain-0: span=0-1 level=MC
  [    0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
  [    0.804152]   domain-1: span=0-3 level=NUMA
  [    0.804170]    groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
  [    0.804219]    domain-2: span=0-5 level=NUMA
  [    0.804236]     groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
  [    0.804302] ERROR: groups don't span domain->span
  [    0.804520]     domain-3: span=0-7 level=NUMA
  [    0.804546]      groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
  [    0.804677] CPU2 attaching
[tip: sched/core] sched/fair: Trivial correction of the newidle_balance() comment
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5b78f2dc315354c05300795064f587366a02c6ff
Gitweb:        https://git.kernel.org/tip/5b78f2dc315354c05300795064f587366a02c6ff
Author:        Barry Song
AuthorDate:    Thu, 03 Dec 2020 11:06:41 +13:00
Committer:     Ingo Molnar
CommitterDate: Fri, 11 Dec 2020 10:30:44 +01:00

sched/fair: Trivial correction of the newidle_balance() comment

idle_balance() has been renamed to newidle_balance(). To differentiate it
from nohz_idle_balance(), refining the comment will be helpful for readers
of the code.

Signed-off-by: Barry Song
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efac224..04a3ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10550,7 +10550,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
- * idle_balance is called by schedule() if this_cpu is about to become
+ * newidle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 *
 * Returns:
[tip: sched/core] sched/fair: Trivial correction of the newidle_balance() comment
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     21bf7cbd1b100758cc82f5340576028d3d83119b
Gitweb:        https://git.kernel.org/tip/21bf7cbd1b100758cc82f5340576028d3d83119b
Author:        Barry Song
AuthorDate:    Thu, 03 Dec 2020 11:06:41 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 03 Dec 2020 10:00:36 +01:00

sched/fair: Trivial correction of the newidle_balance() comment

idle_balance() has been renamed to newidle_balance(). To differentiate it
from nohz_idle_balance(), refining the comment will be helpful for readers
of the code.

Signed-off-by: Barry Song
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efac224..04a3ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10550,7 +10550,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
- * idle_balance is called by schedule() if this_cpu is about to become
+ * newidle_balance is called by schedule() if this_cpu is about to become
 * idle. Attempts to pull tasks from other CPUs.
 *
 * Returns:
[tip: sched/core] Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     9032dc211523f7cd5395302a0658c306249553f4
Gitweb:        https://git.kernel.org/tip/9032dc211523f7cd5395302a0658c306249553f4
Author:        Barry Song
AuthorDate:    Sat, 14 Nov 2020 00:50:18 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 19 Nov 2020 11:25:46 +01:00

Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug

This document has been out of date for many, many years, and was even
misspelled from the first day:

  ARCH_HASH_SCHED_TUNE   should be ARCH_HAS_SCHED_TUNE
  ARCH_HASH_SCHED_DOMAIN should be ARCH_HAS_SCHED_DOMAIN

Since v2.6.14, the kernel has completely deleted the relevant code and even
arch_init_sched_domains() was deleted. Right now, the kernel asks
architectures to call set_sched_topology() to override the default sched
domains.

On the other hand, to print the schedule debug information, users need to
set the sched_debug cmdline or enable it by sysfs entry. So this patch also
adds the description for sched_debug.

Signed-off-by: Barry Song
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Link: https://lkml.kernel.org/r/20201113115018.1628-1-song.bao@hisilicon.com
---
 Documentation/scheduler/sched-domains.rst | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst
index 5c4b7f4..8582fa5 100644
--- a/Documentation/scheduler/sched-domains.rst
+++ b/Documentation/scheduler/sched-domains.rst
@@ -65,21 +65,17 @@ of the SMP domain will span the entire machine, with each group having the
 cpumask of a node. Or, you could do multi-level NUMA or Opteron, for example,
 might have just one domain covering its one NUMA level.
 
-The implementor should read comments in include/linux/sched.h:
-struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
-the specifics and what to tune.
+The implementor should read comments in include/linux/sched/sd_flags.h:
+SD_* to get an idea of the specifics and what to tune for the SD flags
+of a sched_domain.
 
-Architectures may retain the regular override the default SD_*_INIT flags
-while using the generic domain builder in kernel/sched/core.c if they wish to
-retain the traditional SMT->SMP->NUMA topology (or some subset of that). This
-can be done by #define'ing ARCH_HASH_SCHED_TUNE.
-
-Alternatively, the architecture may completely override the generic domain
-builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your
-arch_init_sched_domains function. This function will attach domains to all
-CPUs using cpu_attach_domain.
+Architectures may override the generic domain builder and the default SD flags
+for a given topology level by creating a sched_domain_topology_level array and
+calling set_sched_topology() with this array as the parameter.
 
 The sched-domains debugging infrastructure can be enabled by enabling
-CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
-which should catch most possible errors (described above). It also prints out
-the domain structure in a visual format.
+CONFIG_SCHED_DEBUG and adding 'sched_debug' to your cmdline. If you forgot to
+tweak your cmdline, you can also flip the /sys/kernel/debug/sched_debug
+knob. This enables an error checking parse of the sched domains which should
+catch most possible errors (described above). It also prints out the domain
+structure in a visual format.
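For readers who want to see what the updated document describes in practice,
here is a sketch of the set_sched_topology() pattern. It mirrors the
default_topology table in kernel/sched/topology.c; the my_topology array and
my_arch_smp_init() names are made up for illustration, while cpu_smt_mask(),
cpu_coregroup_mask(), cpu_cpu_mask(), the *_flags helpers, SD_INIT_NAME() and
set_sched_topology() are existing kernel interfaces:

  #include <linux/init.h>
  #include <linux/sched/topology.h>

  static struct sched_domain_topology_level my_topology[] = {
  #ifdef CONFIG_SCHED_SMT
  	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
  #endif
  #ifdef CONFIG_SCHED_MC
  	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
  #endif
  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
  	{ NULL, },
  };

  void __init my_arch_smp_init(void)
  {
  	/* replace the default SMT->MC->DIE(+NUMA) levels with the table above */
  	set_sched_topology(my_topology);
  }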
[tip: sched/core] sched/fair: Use dst group while checking imbalance for NUMA balancer
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     233e7aca4c8a2c764f556bba9644c36154017e7f
Gitweb:        https://git.kernel.org/tip/233e7aca4c8a2c764f556bba9644c36154017e7f
Author:        Barry Song
AuthorDate:    Mon, 21 Sep 2020 23:18:49 +01:00
Committer:     Peter Zijlstra
CommitterDate: Fri, 25 Sep 2020 14:23:26 +02:00

sched/fair: Use dst group while checking imbalance for NUMA balancer

Barry Song noted the following

	Something is wrong. In find_busiest_group(), we are checking if
	src has higher load, however, in task_numa_find_cpu(), we are
	checking if dst will have higher load after balancing. It seems
	it is not sensible to check src.

	It maybe cause wrong imbalance value, for example,

	if dst_running = env->dst_stats.nr_running + 1 results in 3 or
	above, and src_running = env->src_stats.nr_running - 1 results
	in 1;

	The current code is thinking imbalance as 0 since src_running is
	smaller than 2. This is inconsistent with load balancer.

Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.

This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases. For example, a NAS
compute load could be parallelised to use 1/3rd of available CPUs but not
all those potential tasks are active at all times allowing this logic to
trigger. An obvious example is specjbb 2005 running various numbers of
warehouses on a 2 socket box with 80 cpus.

specjbb
                             5.9.0-rc4              5.9.0-rc4
                               vanilla        dstbalance-v1r1
Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *
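To make the quoted scenario concrete, here is a small userspace sketch of the
arithmetic (the names mirror task_numa_find_cpu(), the numbers are
hypothetical, and the "smaller than 2" rule is taken from the changelog above
rather than from the kernel source):

  #include <stdio.h>

  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  int main(void)
  {
  	int src_stats_nr_running = 2, dst_stats_nr_running = 2;

  	int src_running = src_stats_nr_running - 1;            /* 1: source after the move      */
  	int dst_running = dst_stats_nr_running + 1;            /* 3: destination after the move */
  	int imbalance   = MAX(0, dst_running - src_running);   /* 2 */

  	/*
  	 * Pre-patch, the "allow a small NUMA imbalance" exception was keyed on
  	 * src_running (1, smaller than 2), so this imbalance was treated as 0
  	 * and the move allowed; keying it on dst_running (3) keeps the check
  	 * consistent with what find_busiest_group() expects.
  	 */
  	printf("src_running=%d dst_running=%d imbalance=%d\n",
  	       src_running, dst_running, imbalance);
  	return 0;
  }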