[tip: sched/core] sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()

2021-03-25 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Gitweb:
https://git.kernel.org/tip/0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Author:        Barry Song
AuthorDate:    Thu, 25 Mar 2021 15:31:40 +13:00
Committer:     Ingo Molnar
CommitterDate: Thu, 25 Mar 2021 11:41:23 +01:00

sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()

mask is built in build_balance_mask() by iterating for_each_cpu(i, sg_span),
so it must be a subset of sched_group_span(sg).

So the cpumask_and() call is redundant - remove it.
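To see why the AND cannot change anything, here is a minimal standalone C
sketch (userspace, not kernel code) that models a cpumask as a bitmap: when
'mask' only contains bits taken from 'span', the first set bit of 'mask'
equals the first set bit of 'span & mask', which is exactly the
cpumask_first() vs cpumask_first_and() equivalence above.

  #include <assert.h>
  #include <stdio.h>

  static int first_bit(unsigned long m)
  {
          return m ? __builtin_ctzl(m) : 64;  /* 64 plays the role of "no CPU", like nr_cpu_ids */
  }

  int main(void)
  {
          unsigned long span = 0xf0;          /* CPUs 4-7, standing in for sched_group_span(sg) */
          unsigned long mask = 0;
          int i;

          /* like build_balance_mask(): only bits visited via for_each_cpu(i, span) can be set */
          for (i = 0; i < 64; i++)
                  if (((span >> i) & 1) && i != 4)    /* an arbitrary subset of span */
                          mask |= 1UL << i;

          /* first(mask) and first_and(span, mask) are necessarily the same bit */
          assert(first_bit(mask) == first_bit(span & mask));
          printf("first(mask) = %d, first_and(span, mask) = %d\n",
                 first_bit(mask), first_bit(span & mask));
          return 0;
  }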

[ mingo: Adjusted the changelog a bit. ]

Signed-off-by: Barry Song 
Signed-off-by: Ingo Molnar 
Reviewed-by: Valentin Schneider 
Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao@hisilicon.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f2066d6..d1aec24 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
int cpu;
 
build_balance_mask(sd, sg, mask);
-   cpu = cpumask_first_and(sched_group_span(sg), mask);
+   cpu = cpumask_first(mask);
 
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
	if (atomic_inc_return(&sg->sgc->ref) == 1)


[tip: sched/core] sched/fair: Optimize test_idle_cores() for !SMT

2021-03-23 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Gitweb:
https://git.kernel.org/tip/c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Author:        Barry Song
AuthorDate:    Sun, 21 Mar 2021 11:14:32 +13:00
Committer:     Peter Zijlstra
CommitterDate: Tue, 23 Mar 2021 16:01:59 +01:00

sched/fair: Optimize test_idle_cores() for !SMT

update_idle_core() is only done in the sched_smt_present case, but
test_idle_cores() is done for all machines, even those without SMT.

This can contribute to an 8%+ hackbench performance loss on a
machine like Kunpeng 920 which has no SMT. This patch removes the
redundant test_idle_cores() for !SMT machines.
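For context, the only writer of has_idle_cores already sits behind
CONFIG_SCHED_SMT, which is what makes it safe to gate the read side on the
same static key; roughly (a from-memory sketch of the existing helper in
kernel/sched/fair.c, not part of this patch):

  #ifdef CONFIG_SCHED_SMT
  static inline void set_idle_cores(int cpu, int val)
  {
          struct sched_domain_shared *sds;

          sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
          if (sds)
                  WRITE_ONCE(sds->has_idle_cores, val);
  }
  #endif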

Hackbench is run with -g {2..14}; for each g it is run 10 times to get
an average.

  $ numactl -N 0 hackbench -p -T -l 2 -g $1

The below is the result of hackbench w/ and w/o this patch:

  g=      2      4      6      8      10      12      14
  w/o: 1.8151 3.8499 5.5142 7.2491  9.0340 10.7345 12.0929
  w/ : 1.8428 3.7436 5.4501 6.9522  8.2882  9.9535 11.3367
                            +4.1%   +8.3%   +7.3%   +6.3%

Signed-off-by: Barry Song 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Acked-by: Mel Gorman 
Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6aad028..aaa0dfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
 {
struct sched_domain_shared *sds;
 
-   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-   if (sds)
-   return READ_ONCE(sds->has_idle_cores);
+	if (static_branch_likely(&sched_smt_present)) {
+   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+   if (sds)
+   return READ_ONCE(sds->has_idle_cores);
+   }
 
return def;
 }


[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

2021-03-06 Thread tip-bot2 for Barry Song
The following commit has been merged into the irq/core branch of tip:

Commit-ID: cbe16f35bee6880becca6f20d2ebf6b457148552
Gitweb:
https://git.kernel.org/tip/cbe16f35bee6880becca6f20d2ebf6b457148552
Author:        Barry Song
AuthorDate:    Wed, 03 Mar 2021 11:49:15 +13:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:48:00 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
So they handle this issue in one of the two ways below:

(1)
  irq_set_status_flags(irq, IRQ_NOAUTOEN);
  request_irq(dev, irq...);

(2)
  request_irq(dev, irq...);
  disable_irq(irq);

The code in the second way is silly and unsafe. In the small time gap
between request_irq() and disable_irq(), interrupts can still come.

The code in the first way is safe though it's suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as #1 above. With that the
various usage sites of #1 and #2 above can be simplified and corrected.
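For illustration, a driver using the new flag would look roughly like the
sketch below (the handler, irq number and device name are made up):

  static irqreturn_t foo_handler(int irq, void *dev_id)
  {
          /* ... handle the device interrupt ... */
          return IRQ_HANDLED;
  }

  /* in probe(): the IRQ is requested but stays disabled */
  ret = request_irq(irq, foo_handler, IRQF_NO_AUTOEN, "foo", dev);
  if (ret)
          return ret;

  /* ... bring the hardware up ... */
  enable_irq(irq);        /* enable explicitly once we can take interrupts */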

Signed-off-by: Barry Song 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
Cc: dmitry.torok...@gmail.com
Link: https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com
---
 include/linux/interrupt.h |  4 
 kernel/irq/manage.c   | 11 +--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request it.
+ *                Users will enable it explicitly by enable_irq() or enable_nmi()
+ *                later.
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_AUTOEN		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 			irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
}
 
-   if (irq_settings_can_autoenable(desc)) {
+   if (!(new->flags & IRQF_NO_AUTOEN) &&
+   irq_settings_can_autoenable(desc)) {
irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
} else {
/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 *
+* Also shared interrupts do not go well with disabling auto enable.
+* The sharing interrupt might request it while it's still disabled
+* and then wait for interrupts forever.
+*
 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 * it cannot be set along with IRQF_NO_SUSPEND.
 */
if (((irqflags & IRQF_SHARED) && !dev_id) ||
+   ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
desc = irq_to_desc(irq);
 
-   if (!desc || irq_settings_can_autoenable(desc) ||
+   if (!desc || (irq_settings_can_autoenable(desc) &&
+   !(irqflags & IRQF_NO_AUTOEN)) ||
!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
!irq_supports_nmi(desc))


[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-03-06 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Gitweb:
https://git.kernel.org/tip/585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

domain0        node0            node1            node2          node3

domain1        node0+1          node0+1          node2+3        node2+3
                                                      +
domain2        node0+1+2                              |
              group: node0+1                          |
                group:node2+3 <-----------------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node 3 is outside
the span of the domain that includes node0+1+2.

This makes load_balance() run based on screwed-up avg_load and group_type
values in a sched_group spanning outside the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower-level sgc directly.
   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        | node2+3        node2+3
                                               |
domain2        node0+1+2                       |
              group: node0+1                   |
                group:node2 <------------------+

While the lower-level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing the grandchild trick; therefore,
sgc->next_update is still safe since it's only touched by CPUs that have
the group span as their local group. And sgc->imbalance is also safe because
sd_parent remains the same in load_balance() and LB only tries other CPUs
from the local group.

Moreover, since local groups are not touched, they still get roughly
equal size in a TL. And should_we_balance() only matters with
local groups, so the pull probability of those groups is still roughly
equal.
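A sketch of the idea only (not the literal hunk from this patch; the function
name and placement are illustrative): when the sibling's child would leak
outside of sd, keep descending until a level is found whose span stays inside
sd, and build the overlapping group from that level:

  static struct sched_domain *
  pick_sibling_level(struct sched_domain *sd, struct sched_domain *sibling)
  {
          /*
           * The proper level is the one whose child no longer spans
           * outside of sd.
           */
          while (sibling->child &&
                 !cpumask_subset(sched_domain_span(sibling->child),
                                 sched_domain_span(sd)))
                  sibling = sibling->child;

          return sibling;
  }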

Tested by the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":
[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]domain-2: span=0-5 level=NUMA
[0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520] domain-3: span=0-7 level=NUMA
[0.804546]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804677] CPU2 attaching 

[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

2021-03-04 Thread tip-bot2 for Barry Song
The following commit has been merged into the irq/core branch of tip:

Commit-ID: e749df1bbd23f4472082210650514548d8a39e9b
Gitweb:
https://git.kernel.org/tip/e749df1bbd23f4472082210650514548d8a39e9b
Author:        Barry Song
AuthorDate:    Wed, 03 Mar 2021 11:49:15 +13:00
Committer:     Thomas Gleixner
CommitterDate: Thu, 04 Mar 2021 11:47:52 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
So they handle this issue in one of the two ways below:

(1)
  irq_set_status_flags(irq, IRQ_NOAUTOEN);
  request_irq(dev, irq...);

(2)
  request_irq(dev, irq...);
  disable_irq(irq);

The code in the second way is silly and unsafe. In the small time gap
between request_irq() and disable_irq(), interrupts can still come.

The code in the first way is safe though it's suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as #1 above. With that the
various usage sites of #1 and #2 above can be simplified and corrected.

Signed-off-by: Barry Song 
Signed-off-by: Thomas Gleixner 
Cc: dmitry.torok...@gmail.com
Link: https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com

---
 include/linux/interrupt.h |  4 
 kernel/irq/manage.c   | 11 +--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request it.
+ *                Users will enable it explicitly by enable_irq() or enable_nmi()
+ *                later.
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_AUTOEN		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 			irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
}
 
-   if (irq_settings_can_autoenable(desc)) {
+   if (!(new->flags & IRQF_NO_AUTOEN) &&
+   irq_settings_can_autoenable(desc)) {
irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
} else {
/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 *
+* Also shared interrupts do not go well with disabling auto enable.
+* The sharing interrupt might request it while it's still disabled
+* and then wait for interrupts forever.
+*
 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 * it cannot be set along with IRQF_NO_SUSPEND.
 */
if (((irqflags & IRQF_SHARED) && !dev_id) ||
+   ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
desc = irq_to_desc(irq);
 
-   if (!desc || irq_settings_can_autoenable(desc) ||
+   if (!desc || (irq_settings_can_autoenable(desc) &&
+   !(irqflags & IRQF_NO_AUTOEN)) ||
!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
!irq_supports_nmi(desc))


[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-03-04 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9f4af5753b691b9df558ddcfea13e9f3036e45ca
Gitweb:
https://git.kernel.org/tip/9f4af5753b691b9df558ddcfea13e9f3036e45ca
Author:        Barry Song
AuthorDate:    Wed, 24 Feb 2021 16:09:44 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 04 Mar 2021 09:56:00 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

domain0        node0            node1            node2          node3

domain1        node0+1          node0+1          node2+3        node2+3
                                                      +
domain2        node0+1+2                              |
              group: node0+1                          |
                group:node2+3 <-----------------------+

When node2 is added into the domain2 of node0, the kernel uses the child
domain of node2's domain2, which is domain1 (node2+3). Node 3 is outside
the span of the domain that includes node0+1+2.

This makes load_balance() run based on screwed-up avg_load and group_type
values in a sched_group spanning outside the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the
lower-level sgc directly.
   +------+         +------+        +-------+       +------+
   | node |  12     |node  | 20     | node  |  12   |node  |
   |  0   +---------+1     +--------+ 2     +-------+3     |
   +------+         +------+        +-------+       +------+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        | node2+3        node2+3
                                               |
domain2        node0+1+2                       |
              group: node0+1                   |
                group:node2 <------------------+

While the lower-level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing the grandchild trick; therefore,
sgc->next_update is still safe since it's only touched by CPUs that have
the group span as their local group. And sgc->imbalance is also safe because
sd_parent remains the same in load_balance() and LB only tries other CPUs
from the local group.

Moreover, since local groups are not touched, they still get roughly
equal size in a TL. And should_we_balance() only matters with
local groups, so the pull probability of those groups is still roughly
equal.

Tested by the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":
[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]domain-2: span=0-5 level=NUMA
[0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520] domain-3: span=0-7 level=NUMA
[0.804546]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
[0.804677] CPU2 attaching 

[tip: sched/core] sched/fair: Trivial correction of the newidle_balance() comment

2020-12-11 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5b78f2dc315354c05300795064f587366a02c6ff
Gitweb:
https://git.kernel.org/tip/5b78f2dc315354c05300795064f587366a02c6ff
Author:        Barry Song
AuthorDate:    Thu, 03 Dec 2020 11:06:41 +13:00
Committer:     Ingo Molnar
CommitterDate: Fri, 11 Dec 2020 10:30:44 +01:00

sched/fair: Trivial correction of the newidle_balance() comment

idle_balance() has been renamed to newidle_balance(). To differentiate it
from nohz_idle_balance(), refining the comment will be helpful for
readers of the code.

Signed-off-by: Barry Song 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efac224..04a3ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10550,7 +10550,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
- * idle_balance is called by schedule() if this_cpu is about to become
+ * newidle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  *
  * Returns:


[tip: sched/core] sched/fair: Trivial correction of the newidle_balance() comment

2020-12-03 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 21bf7cbd1b100758cc82f5340576028d3d83119b
Gitweb:
https://git.kernel.org/tip/21bf7cbd1b100758cc82f5340576028d3d83119b
Author:        Barry Song
AuthorDate:    Thu, 03 Dec 2020 11:06:41 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 03 Dec 2020 10:00:36 +01:00

sched/fair: Trivial correction of the newidle_balance() comment

idle_balance() has been renamed to newidle_balance(). To differentiate it
from nohz_idle_balance(), refining the comment will be helpful for
readers of the code.

Signed-off-by: Barry Song 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efac224..04a3ce2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10550,7 +10550,7 @@ static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
 /*
- * idle_balance is called by schedule() if this_cpu is about to become
+ * newidle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
  *
  * Returns:


[tip: sched/core] Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug

2020-11-20 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9032dc211523f7cd5395302a0658c306249553f4
Gitweb:
https://git.kernel.org/tip/9032dc211523f7cd5395302a0658c306249553f4
Author:        Barry Song
AuthorDate:    Sat, 14 Nov 2020 00:50:18 +13:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 19 Nov 2020 11:25:46 +01:00

Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug

This document has been out of date for many, many years; it has even
contained misspellings since the first day:
ARCH_HASH_SCHED_TUNE should be ARCH_HAS_SCHED_TUNE
ARCH_HASH_SCHED_DOMAIN should be ARCH_HAS_SCHED_DOMAIN

Since v2.6.14, the kernel has completely removed the relevant code, and
even arch_init_sched_domains() was deleted.

Right now, the kernel asks architectures to call set_sched_topology() to
override the default sched domains.
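As a rough illustration of that interface (the array contents below mirror
what the default topology looks like and the arch hook name is hypothetical;
this is not from this patch):

  static struct sched_domain_topology_level example_topology[] = {
  #ifdef CONFIG_SCHED_SMT
          { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
  #endif
  #ifdef CONFIG_SCHED_MC
          { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
  #endif
          { cpu_cpu_mask, SD_INIT_NAME(DIE) },
          { NULL, },
  };

  void __init arch_setup_sched_topology(void)     /* hypothetical arch hook */
  {
          set_sched_topology(example_topology);
  }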

On the other hand, to print the scheduler debug information, users need to
set the sched_debug cmdline parameter or enable it via a sysfs entry. So this
patch also adds a description of sched_debug.

Signed-off-by: Barry Song 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Valentin Schneider 
Link: https://lkml.kernel.org/r/20201113115018.1628-1-song.bao@hisilicon.com
---
 Documentation/scheduler/sched-domains.rst | 26 +-
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/Documentation/scheduler/sched-domains.rst b/Documentation/scheduler/sched-domains.rst
index 5c4b7f4..8582fa5 100644
--- a/Documentation/scheduler/sched-domains.rst
+++ b/Documentation/scheduler/sched-domains.rst
@@ -65,21 +65,17 @@ of the SMP domain will span the entire machine, with each group having the
 cpumask of a node. Or, you could do multi-level NUMA or Opteron, for example,
 might have just one domain covering its one NUMA level.
 
-The implementor should read comments in include/linux/sched.h:
-struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
-the specifics and what to tune.
+The implementor should read comments in include/linux/sched/sd_flags.h:
+SD_* to get an idea of the specifics and what to tune for the SD flags
+of a sched_domain.
 
-Architectures may retain the regular override the default SD_*_INIT flags
-while using the generic domain builder in kernel/sched/core.c if they wish to
-retain the traditional SMT->SMP->NUMA topology (or some subset of that). This
-can be done by #define'ing ARCH_HASH_SCHED_TUNE.
-
-Alternatively, the architecture may completely override the generic domain
-builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your
-arch_init_sched_domains function. This function will attach domains to all
-CPUs using cpu_attach_domain.
+Architectures may override the generic domain builder and the default SD flags
+for a given topology level by creating a sched_domain_topology_level array and
+calling set_sched_topology() with this array as the parameter.
 
 The sched-domains debugging infrastructure can be enabled by enabling
-CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
-which should catch most possible errors (described above). It also prints out
-the domain structure in a visual format.
+CONFIG_SCHED_DEBUG and adding 'sched_debug' to your cmdline. If you forgot to
+tweak your cmdline, you can also flip the /sys/kernel/debug/sched_debug
+knob. This enables an error checking parse of the sched domains which should
+catch most possible errors (described above). It also prints out the domain
+structure in a visual format.


[tip: sched/core] sched/fair: Use dst group while checking imbalance for NUMA balancer

2020-09-29 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 233e7aca4c8a2c764f556bba9644c36154017e7f
Gitweb:
https://git.kernel.org/tip/233e7aca4c8a2c764f556bba9644c36154017e7f
Author:        Barry Song
AuthorDate:    Mon, 21 Sep 2020 23:18:49 +01:00
Committer:     Peter Zijlstra
CommitterDate: Fri, 25 Sep 2020 14:23:26 +02:00

sched/fair: Use dst group while checking imbalance for NUMA balancer

Barry Song noted the following:

	Something is wrong. In find_busiest_group(), we are checking if
	src has higher load; however, in task_numa_find_cpu(), we are
	checking if dst will have higher load after balancing. It seems
	it is not sensible to check src.

	It may cause a wrong imbalance value. For example,

	if dst_running = env->dst_stats.nr_running + 1 results in 3 or
	above, and src_running = env->src_stats.nr_running - 1 results
	in 1, the current code treats the imbalance as 0 since src_running
	is smaller than 2. This is inconsistent with the load balancer.
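A tiny standalone model of that arithmetic (the helper name and the threshold
of 2 roughly mirror adjust_numa_imbalance() in kernel/sched/fair.c at the
time; the program itself is only an illustration):

  #include <stdio.h>

  /* rough model: a nearly idle node (<= 2 runnable tasks) ignores the imbalance */
  static long adjust_numa_imbalance(long imbalance, int nr_running)
  {
          if (nr_running <= 2)
                  return 0;
          return imbalance;
  }

  int main(void)
  {
          int src_running = 1;    /* env->src_stats.nr_running - 1 */
          int dst_running = 3;    /* env->dst_stats.nr_running + 1 */

          /* old check, based on src: imbalance collapses to 0, the move is allowed */
          printf("src-based imbalance: %ld\n", adjust_numa_imbalance(2, src_running));
          /* new check, based on dst after the move: the imbalance is kept */
          printf("dst-based imbalance: %ld\n", adjust_numa_imbalance(2, dst_running));
          return 0;
  }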

Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.

This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases.  For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2-socket box with 80 CPUs.

specjbb
                               5.9.0-rc4              5.9.0-rc4
                                 vanilla        dstbalance-v1r1
Hmean tput-1 46425.00 (   0.00%)43394.00 *  -6.53%*
Hmean tput-2 98416.00 (   0.00%)96031.00 *  -2.42%*
Hmean tput-3150184.00 (   0.00%)   148783.00 *  -0.93%*
Hmean tput-4200683.00 (   0.00%)   197906.00 *  -1.38%*
Hmean tput-5236305.00 (   0.00%)   245549.00 *   3.91%*
Hmean tput-6281559.00 (   0.00%)   285692.00 *   1.47%*
Hmean tput-7338558.00 (   0.00%)   334467.00 *  -1.21%*
Hmean tput-8340745.00 (   0.00%)   372501.00 *   9.32%*
Hmean tput-9424343.00 (   0.00%)   413006.00 *  -2.67%*
Hmean tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
Hmean tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
Hmean tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
Hmean tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
Hmean tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
Hmean tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
Hmean tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
Hmean tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
Hmean tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
Hmean tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
Hmean tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
Hmean tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
Hmean tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
Hmean tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
Hmean tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
Hmean tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
Hmean tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
Hmean tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
Hmean tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
Hmean tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
Hmean tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
Hmean tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
Hmean tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
Hmean tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
Hmean tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
Hmean tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
Hmean tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
Hmean tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
Hmean tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
Hmean tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
Hmean tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
Hmean tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
Hmean tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
Hmean tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
Hmean tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
Hmean tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
Hmean tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
Hmean tput-47  1415057.00 (   0.00%)  1381570.00 *