RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-04-20 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Wednesday, April 21, 2021 6:32 AM
> To: Song Bao Hua (Barry Song) ;
> catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> On 3/23/21 4:21 PM, Song Bao Hua (Barry Song) wrote:
> 
> >>
> >> On 3/18/21 9:16 PM, Barry Song wrote:
> >>> From: Tim Chen 
> >>>
> >>> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> >>> is shared among a cluster of cores instead of being exclusive
> >>> to one single core.
> >>>
> >>> To prevent oversubscription of L2 cache, load should be
> >>> balanced between such L2 clusters, especially for tasks with
> >>> no shared data.
> >>>
> >>> Also with cluster scheduling policy where tasks are woken up
> >>> in the same L2 cluster, we will benefit from keeping tasks
> >>> related to each other and likely sharing data in the same L2
> >>> cluster.
> >>>
> >>> Add CPU masks of CPUs sharing the L2 cache so we can build such
> >>> L2 cluster scheduler domain.
> >>>
> >>> Signed-off-by: Tim Chen 
> >>> Signed-off-by: Barry Song 
> >>
> >>
> >> Barry,
> >>
> >> Can you also add this chunk to the patch.
> >> Thanks.
> >
> > Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.
> >
> 
> Barry,
> 
> This chunk will also need to be added to return cluster id for x86.
> Please add it in your next rev.

Yes. Thanks. I'll put this in either RFC v7 or Patch v1.

For the spreading path, things are much easier, though the packing path is
quite tricky. But it seems RFC v6 is already quite close to what we want
to achieve: packing related tasks by scanning the cluster for tasks within
the same NUMA node:
https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao@hisilicon.com/

If task couples are already in the same LLC (NUMA node), scanning the cluster
will gather them further. If they are running in different NUMA nodes, the
original LLC scanning will move them to the same node; after that,
scanning the cluster might put them closer to each other.

It seems this is kind of the two-level packing Dietmar has suggested.

So perhaps we won't have RFC v7, I will probably send patch v1 afterwards.

> 
> Thanks.
> 
> Tim
> 
> ---
> 
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 800fa48c9fcd..2548d824f103 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -109,6 +109,7 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
>  #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
>  #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
>  #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
> +#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
>  #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
> 
>  extern unsigned int __max_die_per_package;

Thanks
Barry



RE: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within a die

2021-04-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Friday, March 19, 2021 11:02 PM
> To: Jonathan Cameron 
> Cc: Song Bao Hua (Barry Song) ;
> tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within
> a die
> 
> On Fri, Mar 19, 2021 at 09:36:16AM +, Jonathan Cameron wrote:
> > On Fri, 19 Mar 2021 06:57:08 +
> > "Song Bao Hua (Barry Song)"  wrote:
> >
> > > > -Original Message-
> > > > From: Greg KH [mailto:gre...@linuxfoundation.org]
> > > > Sent: Friday, March 19, 2021 7:35 PM
> > > > To: Song Bao Hua (Barry Song) 
> > > > Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com;
> w...@kernel.org;
> > > > r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> > > > t...@linutronix.de; mi...@redhat.com; l...@kernel.org;
> pet...@infradead.org;
> > > > dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> > > > mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> Jonathan
> > > > Cameron ; juri.le...@redhat.com;
> > > > mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> > > > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> > > > linux-a...@vger.kernel.org; x...@kernel.org; xuwei (O)
> ;
> > > > Zengtao (B) ; guodong...@linaro.org;
> yangyicong
> > > > ; Liguozhu (Kenneth) ;
> > > > linux...@openeuler.org; h...@zytor.com
> > > > Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs 
> > > > within
> > > > a die
> > > >
> > > > On Fri, Mar 19, 2021 at 05:16:15PM +1300, Barry Song wrote:
> > > > > diff --git a/Documentation/admin-guide/cputopology.rst
> > > > b/Documentation/admin-guide/cputopology.rst
> > > > > index b90dafc..f9d3745 100644
> > > > > --- a/Documentation/admin-guide/cputopology.rst
> > > > > +++ b/Documentation/admin-guide/cputopology.rst
> > > > > @@ -24,6 +24,12 @@ core_id:
> > > > >   identifier (rather than the kernel's).  The actual value is
> > > > >   architecture and platform dependent.
> > > > >
> > > > > +cluster_id:
> > > > > +
> > > > > + the Cluster ID of cpuX.  Typically it is the hardware platform's
> > > > > + identifier (rather than the kernel's).  The actual value is
> > > > > + architecture and platform dependent.
> > > > > +
> > > > >  book_id:
> > > > >
> > > > >   the book ID of cpuX. Typically it is the hardware platform's
> > > > > @@ -56,6 +62,14 @@ package_cpus_list:
> > > > >   human-readable list of CPUs sharing the same 
> > > > > physical_package_id.
> > > > >   (deprecated name: "core_siblings_list")
> > > > >
> > > > > +cluster_cpus:
> > > > > +
> > > > > + internal kernel map of CPUs within the same cluster.
> > > > > +
> > > > > +cluster_cpus_list:
> > > > > +
> > > > > + human-readable list of CPUs within the same cluster.
> > > > > +
> > > > >  die_cpus:
> > > > >
> > > > >   internal kernel map of CPUs within the same die.
> > > >
> > > > Why are these sysfs files in this file, and not in a Documentation/ABI/
> > > > file which can be correctly parsed and shown to userspace?
> > >
> > > Well. Those ABIs have been there for much a long time. It is like:
> > >
> > > [root@ceph1 topology]# ls
> > > core_id  core_siblings  core_siblings_list  physical_package_id
> thread_siblings  thread_siblings_list
> > > [r

[RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks within one LLC

2021-04-19 Thread Barry Song
On kunpeng920, cpus within one cluster can communicate with each other
much faster than cpus across different clusters. A simple hackbench
run can prove that.
hackbench running on 4 cpus within a single cluster versus 4 cpus spread
across different clusters shows a large contrast:
(1) within a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 2 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 2 messages of 100 bytes
Time: 4.285

(2) across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 2 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 2 messages of 100 bytes
Time: 5.524

This inspires us to change the wake_affine path to scan the cluster to pack
related tasks. Ideally, a two-level packing vs. spreading heuristic
could be introduced to distinguish between llc-packing and even narrower
(cluster or MC-L2)-packing, but doing that properly is not trivial. So this
patch begins with tasks running in the same LLC. This is actually quite
common in real use cases when tasks are bound within one NUMA node.

If users use "numactl -N 0" to bind tasks, this patch will scan cluster
rather than llc to select idle cpu. A hackbench running with some groups
of monogamous sender-receiver model shows a major improvement.
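Roughly, the packing side works like the sketch below (illustrative only,
not the exact RFC v6 diff; select_idle_cpu_llc() here just stands in for the
existing LLC-wide scan in select_idle_sibling()):

static int select_idle_cluster_then_llc(struct task_struct *p, int target)
{
	const struct cpumask *cluster = topology_cluster_cpumask(target);
	int cpu;

	/* 1st level: try to pack onto an idle cpu in target's cluster */
	for_each_cpu_and(cpu, cluster, p->cpus_ptr) {
		if (available_idle_cpu(cpu))
			return cpu;
	}

	/* 2nd level: nothing idle in the cluster, scan the rest of the LLC */
	return select_idle_cpu_llc(p, target);
}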

To evaluate the performance impact on related tasks talking with each
other, we run the hackbench below with the -g parameter varying from 6
to 32 in a NUMA node with 24 cores. For each g, we run the
command 20 times and take the average time:
$ numactl -N 0 hackbench -p -T -l 100 -f 1 -g $1
As -f is set to 1, this means all threads are talking with each other
monogamously.

hackbench reports the time needed to complete a certain number of message
transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 100 -f 1 -g 6
Running in threaded mode with 6 groups using 2 file descriptors each (== 12 
tasks)
Each sender will pass 100 messages of 100 bytes

The below is the result of hackbench:
g=                   6      12     18     24     28     32
w/o             1.2474 1.5635 1.5133 1.4796 1.6177 1.7898
w/domain        1.1458 1.3309 1.3416 1.4990 1.9212 2.3411
w/domain+affine 0.9500 1.0728 1.1756 1.2201 1.4166 1.5464

w/o: without any change
w/domain: added cluster domain without changing wake_affine
w/domain+affine: added cluster domain, changed wake_affine

While g=6, if we use top -H to show the cpus which tasks are running on,
we can easily find that task couples are running in the same CCL (cluster).
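One way to double-check the placement (besides top -H) is to list the cpu
each hackbench thread is currently running on, e.g.:

root@ubuntu:~# ps -eLo pid,tid,psr,comm | grep hackbench

(psr is the processor a thread is assigned to; on kunpeng920, cpus 0-3,
4-7, ... form one cluster each.)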

Signed-off-by: Barry Song 
---
 -v6:
  * emulated a two-level spreading/packing heuristic by only scanning cluster
in the wake_affine path for tasks running in the same LLC (also NUMA).

This partially addressed Dietmar's comment in RFC v3:
"In case we would like to further distinguish between llc-packing and
 even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
 packing vs. spreading heuristic further down in sis().
   
 IMHO, Barry's current implementation doesn't do this right now. Instead
 he's trying to pack on cluster first and if not successful look further
 among the remaining llc CPUs for an idle CPU."

  * adjusted the hackbench parameter to make relatively low and high load.
previous patchsets with "-f 10" ran under an extremely high load with
hundreds of threads, which does not reflect real use cases.

This also addressed Vincent's question in RFC v4:
"In particular, I'm still not convinced that the modification of the wakeup
path is the root of the hackbench improvement; especially with g=14 where
there should not be much idle CPUs with 14*40 tasks on at most 32 CPUs."

 block/blk-mq.c |  2 +-
 include/linux/sched/topology.h |  5 +++--
 kernel/sched/core.c|  9 +---
 kernel/sched/fair.c| 47 +-
 kernel/sched/sched.h   |  3 +++
 kernel/sched/topology.c| 12 +++
 6 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1c..1418981 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -611,7 +611,7 @@ static inline bool blk_mq_complete_need_ipi(struct request 
*rq)
/* same CPU or cache domain?  Complete locally */
if (cpu == rq->mq_ctx->cpu ||
(!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
-cpus_share_cache(cpu, rq->mq_ctx->cpu)))
+cpus_share_cache(cpu, rq->mq_ctx->cpu, 0)))
return false;
 
/* don't try to IPI to an offline CPU */
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 846fcac..d63d6b8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -176,7 +176,8 @@ ex

[RFC PATCH v6 4/4] scheduler: Add cluster scheduler level for x86

2021-04-19 Thread Barry Song
From: Tim Chen 

There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
is shared among a cluster of cores instead of being exclusive
to one single core.

To prevent oversubscription of L2 cache, load should be
balanced between such L2 clusters, especially for tasks with
no shared data.

Also with cluster scheduling policy where tasks are woken up
in the same L2 cluster, we will benefit from keeping tasks
related to each other and likely sharing data in the same L2
cluster.

Add CPU masks of CPUs sharing the L2 cache so we can build such
L2 cluster scheduler domain.

Signed-off-by: Tim Chen 
Signed-off-by: Barry Song 
---
 -v6:
  * added topology_cluster_cpumask() for x86, code provided by Tim.

 arch/x86/Kconfig|  8 
 arch/x86/include/asm/smp.h  |  7 +++
 arch/x86/include/asm/topology.h |  2 ++
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c|  3 +++
 arch/x86/kernel/smpboot.c   | 43 -
 6 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879..d597de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1002,6 +1002,14 @@ config NR_CPUS
  This is purely to save memory: each supported CPU adds about 8KB
  to the kernel image.
 
+config SCHED_CLUSTER
+   bool "Cluster scheduler support"
+   default n
+   help
+Cluster scheduler support improves the CPU scheduler's decision
+making when dealing with machines that have clusters of CPUs
+sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+   return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..800fa48 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include 
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
@@ -114,6 +115,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 
 #ifdef CONFIG_SMP
 #define topology_die_cpumask(cpu)  (per_cpu(cpu_die_map, cpu))
+#define topology_cluster_cpumask(cpu)  (cpu_clustergroup_mask(cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
l2 = new_l2;
 #ifdef CONFIG_SMP
per_cpu(cpu_llc_id, cpu) = l2_id;
+   per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ab640ab..0ba282d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct c

[RFC PATCH v6 2/4] scheduler: add scheduler level for clusters

2021-04-19 Thread Barry Song
ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has its own local L3 tag. On the other hand, cpus within each cluster share
some internal system bus. This means the cache coherence overhead inside one
cluster is much lower than the overhead across clusters.

This patch adds the sched_domain for clusters. On kunpeng 920, without
this patch, domain0 of cpu0 would be MC, containing cpu0~cpu23; with this
patch, MC becomes domain1, and a new domain0 "CLS" contains cpu0~cpu3.
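With CONFIG_SCHED_DEBUG, the new hierarchy can be verified from userspace
(the exact path depends on kernel version; newer kernels expose it under
/sys/kernel/debug/sched/domains/ instead of procfs):

$ cat /proc/sys/kernel/sched_domain/cpu0/domain0/name
CLS
$ cat /proc/sys/kernel/sched_domain/cpu0/domain1/name
MC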

This will help spread unrelated tasks among clusters, thus decreasing
contention and improving throughput. For example, the stream benchmark can
improve by 20%+ when parallelism is 6 and by around 5% when parallelism
is 12:

(1) -P  6
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5

w/o patch:
STREAM copy latency: 2.46 nanoseconds
STREAM copy bandwidth: 39096.28 MB/sec
STREAM scale latency: 2.46 nanoseconds
STREAM scale bandwidth: 38970.26 MB/sec
STREAM add latency: 4.45 nanoseconds
STREAM add bandwidth: 32332.04 MB/sec
STREAM triad latency: 4.07 nanoseconds
STREAM triad bandwidth: 35387.69 MB/sec

w/ patch:
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 47604.47 MB/sec   +21.7%
STREAM scale latency: 2.04 nanoseconds
STREAM scale bandwidth: 47066.84 MB/sec  +20.8%
STREAM add latency: 3.35 nanoseconds
STREAM add bandwidth: 42942.15 MB/sec+32.8%
STREAM triad latency: 3.16 nanoseconds
STREAM triad bandwidth: 45619.18 MB/sec  +28.9%

On the other hand, the stream result can change significantly between
different runs without the patch, e.g.:
a.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 8.45 MB/sec
STREAM scale latency: 2.17 nanoseconds
STREAM scale bandwidth: 44320.77 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38230.54 MB/sec
STREAM triad latency: 3.88 nanoseconds
STREAM triad bandwidth: 37072.10 MB/sec

b.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44403.22 MB/sec
STREAM scale latency: 2.39 nanoseconds
STREAM scale bandwidth: 40173.69 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38232.56 MB/sec
STREAM triad latency: 3.38 nanoseconds
STREAM triad bandwidth: 42592.04 MB/sec

Obviously this is because the 6 threads are placed randomly on 6 cores. Sometimes
they are packed within clusters, sometimes they are spread widely.
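For reference, the two extremes can be reproduced by pinning the threads
manually (assuming the 4-cpus-per-cluster layout cpu0-3, cpu4-7, ... of
kunpeng920), e.g.:

$ taskset -c 0,4,8,12,16,20 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5  # spread, one per cluster
$ taskset -c 0-5 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5             # packed into two clusters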

(2) -P  12
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5

w/o patch:
STREAM copy latency: 3.37 nanoseconds
STREAM copy bandwidth: 57008.80 MB/sec
STREAM scale latency: 3.38 nanoseconds
STREAM scale bandwidth: 56848.47 MB/sec
STREAM add latency: 5.50 nanoseconds
STREAM add bandwidth: 52398.62 MB/sec
STREAM triad latency: 5.09 nanoseconds
STREAM triad bandwidth: 56591.60 MB/sec

w/ patch:
STREAM copy latency: 3.24 nanoseconds
STREAM copy bandwidth: 59338.60 MB/sec  +4.1%
STREAM scale latency: 3.25 nanoseconds
STREAM scale bandwidth: 58993.23 MB/sec +3.7%
STREAM add latency: 5.19 nanoseconds
STREAM add bandwidth: 55517.45 MB/sec   +5.9%
STREAM triad latency: 4.86 nanoseconds
STREAM triad bandwidth: 59245.34 MB/sec +4.7%

Obviously the load balance between clusters helps improve the parallelism
of unrelated tasks.

To evaluate the performance impact on related tasks talking with each
other, we run the hackbench below with the -g parameter varying from 6
to 32 in a NUMA node with 24 cores. For each g, we run the
command 20 times and take the average time:
$ numactl -N 0 hackbench -p -T -l 100 -f 1 -g $1
As -f is set to 1, this means all threads are talking with each other
monogamously.

hackbench reports the time needed to complete a certain number of message
transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 100 -f 1 -g 6
Running in threaded mode with 6 groups using 2 file descriptors each (== 12 
tasks)
Each sender will pass 100 messages of 100 bytes

The below is the result of hackbench w/ and w/o the patch:
g=        6      12     18     24     28     32
w/o: 1.2474 1.5635 1.5133 1.4796 1.6177 1.7898
w/ : 1.1458 1.3309 1.3416 1.4990 1.9212 2.3411

It seems this patch benefits hackbench when the load is relatively low,
while it hurts hackbench much when the load is relatively high (56 and
64 threads on 24 cores).

Signed-off-by: Barry Song 
---
 arch/arm64/Kconfig |  7 +++
 include/linux/sched/cluster.h  | 19 +++
 include/linux/sched/sd_flags.h |  9 +
 include/linux/sched/topology.h |  7 +++
 include/linux/topology.h   |  7 +++
 kernel/sched/core.c| 20 
 kernel/sched/fair.c|  4 
 kernel/sched/sched.h   |  1 +
 kernel/sched/topology.c|  6 ++
 9 files changed, 80 insertions(+)
 create mode 100644 include/linux/sched/cluster.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b4..9432a30 100644
--- a/arch/arm64/Kconfig

[RFC PATCH v6 1/4] topology: Represent clusters of CPUs within a die

2021-04-19 Thread Barry Song
[ASCII diagram of clusters sharing an internal bus within one die; truncated in the archive]

That means the cost to transfer ownership of a cacheline between CPUs
within a cluster is lower than between CPUs in different clusters on
the same die. Hence, it can make sense to tell the scheduler to use
the cache affinity of the cluster to make better decision on thread
migration.

This patch simply exposes this information to userspace libraries
like hwloc by providing cluster_cpus and related sysfs attributes.
PoC of HWLOC support at [2].
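For instance, after this patch userspace can simply read the new attributes
(values are platform dependent):

$ cat /sys/devices/system/cpu/cpu0/topology/cluster_id
$ cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list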

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Currently the ID provided is the offset of the Processor
Hierarchy Nodes Structure within PPTT.  Whilst this is unique
it is not terribly elegant so alternative suggestions welcome.

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology, but that may be unrelated to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron 
Signed-off-by: Barry Song 
---
 -v6:
 * the topology ABI documents required by Greg is not completed yet.
   will have a separate patch for that.

 Documentation/admin-guide/cputopology.rst | 26 +++--
 arch/arm64/kernel/topology.c  |  2 +
 drivers/acpi/pptt.c   | 63 +++
 drivers/base/arch_topology.c  | 15 
 drivers/base/topology.c   | 10 +
 include/linux/acpi.h  |  5 +++
 include/linux/arch_topology.h |  5 +++
 include/linux/topology.h  |  6 +++
 8 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cputopology.rst 
b/Documentation/admin-guide/cputopology.rst
index b90dafc..f9d3745 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -24,6 +24,12 @@ core_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
+cluster_id:
+
+   the Cluster ID of cpuX.  Typically it is the hardware platform's
+   identifier (rather than the kernel's).  The actual value is
+   architecture and platform dependent.
+
 book_id:
 
the book ID of cpuX. Typically it is the hardware platform's
@@ -56,6 +62,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+cluster_cpus:
+
+   internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+   human-readable list of CPUs within the same cluster.
+
 die_cpus:
 
internal kernel map of CPUs within the same die.
@@ -96,11 +110,13 @@ these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
+   #define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index e08a412..d72eb8d 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
cpu_topology[cpu].thread_id  = -1;
cpu_topology[cpu].core_id= topology_id;
}
+   topology_id = find_acpi_cpu_topology_cluster(cpu);
+   cpu_topology[cpu].cluster_id = topology_id;
topology_id = find_acpi_cpu_topology_package(cpu);
cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index 4ae933

[RFC PATCH v6 0/4] scheduler: expose the topology of clusters and add cluster scheduler

2021-04-19 Thread Barry Song
[ASCII diagram of cluster1 and cluster2 within one die; truncated in the archive]

2. gathering related tasks within a cluster, which improves the cache affinity
of tasks talking with each other.
Without a cluster sched_domain, related tasks might be put randomly. In case
task1-8 have relationships as below:
Task1 talks with task5
Task2 talks with task6
Task3 talks with task7
Task4 talks with task8
With the tuning of select_idle_cpu() to scan the local cluster first, those
tasks might get a chance to be gathered like:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 5   | || |3   |  |7|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |2|   |  6   ||| |4| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+
Otherwise, the result might be:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 2   | || |5   |  |6|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |3|   |  4   ||| |7| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+

-v6:
  * added topology_cluster_cpumask() for x86, code provided by Tim.

  * emulated a two-level spreading/packing heuristic by only scanning cluster
in the wake_affine path for tasks running in the same LLC (also NUMA).

This partially addressed Dietmar's comment in RFC v3:
"In case we would like to further distinguish between llc-packing and
 even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
 packing vs. spreading heuristic further down in sis().
   
 IMHO, Barry's current implementation doesn't do this right now. Instead
 he's trying to pack on cluster first and if not successful look further
 among the remaining llc CPUs for an idle CPU."

  * adjusted the hackbench parameter to make relatively low and high load.
previous patchsets with "-f 10" ran under an extremely high load with
hundreds of threads, which does not reflect real use cases.

This also addressed Vincent's question in RFC v4:
"In particular, I'm still not convinced that the modification of the wakeup
path is the root of the hackbench improvement; especially with g=14 where
there should not be much idle CPUs with 14*40 tasks on at most 32 CPUs."

-v5:
  * split "add scheduler level for clusters" into two patches to evaluate the
impact of spreading and gathering separately;
  * add a tracepoint of select_idle_cpu for debug purpose; add bcc script in
commit log;
  * add cluster_id = -1 in reset_cpu_topology()
  * rebased to tip/sched/core

-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key (addressing
Vincent's comment)
  * used acpi_cpu_id for acpi_find_processor_node (addressing Masa's comment)

Barry Song (2):
  scheduler: add scheduler level for clusters
  scheduler: scan idle cpu in cluster for tasks within one LLC

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 +++--
 arch/arm64/Kconfig|  7 
 arch/arm64/kernel/topology.c  |  2 +
 arch/x86/Kconfig  |  8 
 arch/x86/include/asm/smp.h|  7 
 arch/x86/include/asm/topology.h   |  2 +
 arch/x86/kernel/cpu/cacheinfo.c   |  1 +
 arch/x86/kernel/cpu/common.c  |  3 ++
 arch/x86/kernel/smpboot.c | 43 -
 block/

RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-04-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Wednesday, January 13, 2021 12:00 AM
> To: Morten Rasmussen ; Tim Chen
> 
> Cc: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Zengtao (B) ; tiantao (H)
> 
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 11/01/2021 10:28, Morten Rasmussen wrote:
> > On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> >>
> >>
> >> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> >>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>>> On 1/6/21 12:30 AM, Barry Song wrote:
> 
> [...]
> 
> >> I think it is going to depend on the workload.  If there are dependent
> >> tasks that communicate with one another, putting them together
> >> in the same cluster will be the right thing to do to reduce communication
> >> costs.  On the other hand, if the tasks are independent, putting them 
> >> together
> on the same cluster
> >> will increase resource contention and spreading them out will be better.
> >
> > Agree. That is exactly where I'm coming from. This is all about the task
> > placement policy. We generally tend to spread tasks to avoid resource
> > contention, SMT and caches, which seems to be what you are proposing to
> > extend. I think that makes sense given it can produce significant
> > benefits.
> >
> >>
> >> Any thoughts on what is the right clustering "tag" to use to clump
> >> related tasks together?
> >> Cgroup? Pid? Tasks with same mm?
> >
> > I think this is the real question. I think the closest thing we have at
> > the moment is the wakee/waker flip heuristic. This seems to be related.
> > Perhaps the wake_affine tricks can serve as starting point?
> 
> wake_wide() switches between packing (select_idle_sibling(), llc_size
> CPUs) and spreading (find_idlest_cpu(), all CPUs).
> 
> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
> all wakeups are (llc-)packed.
> 
>  select_task_rq_fair()
> 
>for_each_domain(cpu, tmp)
> 
>  if (tmp->flags & sd_flag)
>sd = tmp;
> 
> 
> In case we would like to further distinguish between llc-packing and
> even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
> packing vs. spreading heuristic further down in sis().
> 
> IMHO, Barry's current implementation doesn't do this right now. Instead
> he's trying to pack on cluster first and if not successful look further
> among the remaining llc CPUs for an idle CPU.

Right now, in the main cases of using wake_affine to achieve
better performance, processes are actually bound within one
NUMA node, which is also an LLC on kunpeng920.

Probably LLC=NUMA is also true for X86 Jacobsville, Tim?

So one possible way to emulate a 2-level packing might be:
if the affinity cpusets of waker and wakee are both subsets
of one same LLC, we use the cluster alone as the factor to
determine packing or not and ignore the LLC.
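A minimal sketch of that check (illustrative only; cpu_llc_mask() stands in
for the per-cpu LLC sibling mask and is not an existing generic helper):

static bool both_bound_to_one_llc(struct task_struct *waker,
				  struct task_struct *wakee)
{
	const struct cpumask *llc = cpu_llc_mask(task_cpu(waker));

	/* only then use the cluster alone to decide packing */
	return cpumask_subset(waker->cpus_ptr, llc) &&
	       cpumask_subset(wakee->cpus_ptr, llc);
}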

I haven't really done this, but the below code can produce the
same result by forcing llc_id = cluster_id:

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index d72eb8d..3d78097 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -107,7 +107,7 @@ int __init parse_acpi_topology(void)
cpu_topology[cpu].cluster_id = topology_id;
topology_id = find_acpi_cpu_topology_package(cpu);
cpu_topology[cpu].package_id = topology_id;
-
+#if 0
i = acpi_find_last_cache_level(cpu);

if (i > 0) {
@@ -119,8 +119,11 @@ int __init parse_acpi_topology(void)
if (cache_id > 0)
cpu_topology[cpu].llc_id = cache_id;
}
-   }
+#else
+   cpu_topology[cpu].llc_id = cpu_topology[cpu].cluster_id;
+#endif

+   }
return 0;
 }
 #endif

With this, I have seen some major improvement in hackbench, especially
for the monogamous communication model (fds_num=1, one sender for one
receiver):
numactl -N 0 hackbench -p -T -l 20 -f 1 -g $1

I have tested -g(group_nums) 6, 

RE: [PATCH v1 1/1] i2c: designware: Adjust bus_freq_hz when refuse high speed mode set

2021-03-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andriy.shevche...@linux.intel.com]
> Sent: Thursday, April 1, 2021 12:05 AM
> To: Andy Shevchenko ; Serge Semin
> ; linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: Jarkko Nikula ; Mika Westerberg
> ; w...@kernel.org; yangyicong
> ; Song Bao Hua (Barry Song) 
> 
> Subject: [PATCH v1 1/1] i2c: designware: Adjust bus_freq_hz when refuse high
> speed mode set
> 
> When the hardware doesn't support High Speed Mode, we forget the bus_freq_hz
> timing adjustment. This leaves the timings and the real registers out of
> sync. Adjust bus_freq_hz when refusing the high speed mode setting.
> 
> Fixes: b6e67145f149 ("i2c: designware: Enable high speed mode")
> Reported-by: "Song Bao Hua (Barry Song)" 
> Signed-off-by: Andy Shevchenko 
> ---

Thanks for fixing that.

Reviewed-by: Barry Song 

>  drivers/i2c/busses/i2c-designware-master.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index 34bb4e21bcc3..9bfa06e31eec 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -129,6 +129,7 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> *dev)
>   if ((comp_param1 & DW_IC_COMP_PARAM_1_SPEED_MODE_MASK)
>   != DW_IC_COMP_PARAM_1_SPEED_MODE_HIGH) {
>   dev_err(dev->dev, "High Speed not supported!\n");
> + t->bus_freq_hz = I2C_MAX_FAST_MODE_FREQ;
>   dev->master_cfg &= ~DW_IC_CON_SPEED_MASK;
>   dev->master_cfg |= DW_IC_CON_SPEED_FAST;
>   dev->hs_hcnt = 0;
> --
> 2.30.2



RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-03-31 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, March 24, 2021 12:15 PM
> To: 'Tim Chen' ; catalin.mari...@arm.com;
> w...@kernel.org; r...@rjwysocki.net; vincent.guit...@linaro.org; 
> b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> > -Original Message-
> > From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> > Sent: Wednesday, March 24, 2021 11:51 AM
> > To: Song Bao Hua (Barry Song) ;
> > catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> > vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> > mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> > dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> > mgor...@suse.de
> > Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> > gre...@linuxfoundation.org; Jonathan Cameron ;
> > juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> > aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> > linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> > xuwei (O) ; Zengtao (B) ;
> > guodong...@linaro.org; yangyicong ; Liguozhu
> (Kenneth)
> > ; linux...@openeuler.org; h...@zytor.com
> > Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for
> x86
> >
> >
> >
> > On 3/18/21 9:16 PM, Barry Song wrote:
> > > From: Tim Chen 
> > >
> > > There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> > > is shared among a cluster of cores instead of being exclusive
> > > to one single core.
> > >
> > > To prevent oversubscription of L2 cache, load should be
> > > balanced between such L2 clusters, especially for tasks with
> > > no shared data.
> > >
> > > Also with cluster scheduling policy where tasks are woken up
> > > in the same L2 cluster, we will benefit from keeping tasks
> > > related to each other and likely sharing data in the same L2
> > > cluster.
> > >
> > > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > > L2 cluster scheduler domain.
> > >
> > > Signed-off-by: Tim Chen 
> > > Signed-off-by: Barry Song 
> >
> >
> > Barry,
> >
> > Can you also add this chunk to the patch.
> > Thanks.
> 
> Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

Hi Tim,
You might want to take a look at this qemu patchset:
https://lore.kernel.org/qemu-devel/20210331095343.12172-1-wangyana...@huawei.com/T/#t

someone is trying to leverage this cluster topology
to improve the performance of KVM virtual machines.

> 
> >
> > Tim
> >
> >
> > diff --git a/arch/x86/include/asm/topology.h
> > b/arch/x86/include/asm/topology.h
> > index 2a11ccc14fb1..800fa48c9fcd 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
> >
> >  #ifdef CONFIG_SMP
> >  #define topology_die_cpumask(cpu)  (per_cpu(cpu_die_map, cpu))
> > +#define topology_cluster_cpumask(cpu)  (cpu_clustergroup_mask(cpu))
> >  #define topology_core_cpumask(cpu)     (per_cpu(cpu_core_map, cpu))
> >  #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
> >
> 

Thanks
Barry


RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-31 Thread Song Bao Hua (Barry Song)

> No, please read the code carefully.
> We can duplicate conditional, but it brings a bit of inconsistency to how the 
> counters are printed.

Thanks for the clarification. I am still confused, as the original
code prints the real mode based on dev->master_cfg, while the new
code prints the mode based on frequency.

My understanding is the original code could fall back to a lower
speed when higher speed modes were not set successfully. For
example, high speed mode falls back to fast mode:

if ((dev->master_cfg & DW_IC_CON_SPEED_MASK) ==
DW_IC_CON_SPEED_HIGH) {
if ((comp_param1 & DW_IC_COMP_PARAM_1_SPEED_MODE_MASK)
!= DW_IC_COMP_PARAM_1_SPEED_MODE_HIGH) {
dev_err(dev->dev, "High Speed not supported!\n");
dev->master_cfg &= ~DW_IC_CON_SPEED_MASK;
dev->master_cfg |= DW_IC_CON_SPEED_FAST;
dev->hs_hcnt = 0;
dev->hs_lcnt = 0;
}

the original code was printing the mode based on the new
fallback dev->master_cfg but not the mode calculated from
frequency:

switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
case DW_IC_CON_SPEED_STD:
mode_str = "Standard Mode";
break;
case DW_IC_CON_SPEED_HIGH:
mode_str = "High Speed Mode";
break;
default:
mode_str = "Fast Mode";
}
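In other words, after the High Speed fallback above the two prints can
disagree, roughly (illustrative; the exact string returned by
i2c_freq_mode_string() may differ):

	/*
	 * dev->master_cfg has fallen back to DW_IC_CON_SPEED_FAST, but
	 * t->bus_freq_hz is still 3400000, so a frequency-based print
	 * would still claim high speed mode.
	 */
	dev_dbg(dev->dev, "Bus speed: %s\n", i2c_freq_mode_string(t->bus_freq_hz));

This is exactly what the later "Adjust bus_freq_hz when refuse high speed
mode set" patch addresses by setting t->bus_freq_hz = I2C_MAX_FAST_MODE_FREQ
in the fallback path.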

> > +     return 0;
> >  }
> >
> >  /**
> > --
> > 2.8.1
> 
> 
> --
> With Best Regards,
> Andy Shevchenko


-- 
With Best Regards,
Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, March 31, 2021 10:54 AM
> To: 'Andy Shevchenko' 
> Cc: yangyicong ; w...@kernel.org;
> andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org;
> sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org;
> dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; John Garry ;
> mika.westerb...@linux.intel.com; Zengtao (B) ;
> Linuxarm 
> Subject: RE: [PATCH 5/5] i2c: designware: Switch over to 
> i2c_freq_mode_string()
> 
> 
> 
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Wednesday, March 31, 2021 10:57 AM
> To: Song Bao Hua (Barry Song) 
> Cc: yangyicong ; w...@kernel.org;
> andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org;
> sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org;
> dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; John Garry ;
> mika.westerb...@linux.intel.com; Zengtao (B) ;
> Linuxarm 
> Subject: Re: [PATCH 5/5] i2c: designware: Switch over to 
> i2c_freq_mode_string()
> 
> 
> 
> On Wednesday, March 31, 2021, Song Bao Hua (Barry Song)
>  wrote:
> 
> 
> > -Original Message-
> > From: yangyicong
> > Sent: Wednesday, March 31, 2021 3:19 AM
> > To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> > linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> > linux-kernel@vger.kernel.org
> > Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> > rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> > ; John Garry ;
> > mika.westerb...@linux.intel.com; yangyicong ;
> Zengtao
> > (B) ; Linuxarm 
> > Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> >
> > From: Andy Shevchenko 
> >
> > Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> >
> > Signed-off-by: Andy Shevchenko 
> > Signed-off-by: Yicong Yang 
> > ---
> >  drivers/i2c/busses/i2c-designware-master.c | 20 
> >  1 file changed, 4 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/i2c/busses/i2c-designware-master.c
> > b/drivers/i2c/busses/i2c-designware-master.c
> > index dd27b9d..b64c4c8 100644
> > --- a/drivers/i2c/busses/i2c-designware-master.c
> > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct
> dw_i2c_dev
> > *dev)
> >
> >  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
> >  {
> > -     const char *mode_str, *fp_str = "";
> >       u32 comp_param1;
> >       u32 sda_falling_time, scl_falling_time;
>       struct i2c_timings *t = &dev->timings;
> > +     const char *fp_str = "";
> >       u32 ic_clk;
> >       int ret;
> >
> > @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> > *dev)
> >
> >       ret = i2c_dw_set_sda_hold(dev);
> >       if (ret)
> > -             goto out;
> > -
> > -     switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> > -     case DW_IC_CON_SPEED_STD:
> > -             mode_str = "Standard Mode";
> > -             break;
> > -     case DW_IC_CON_SPEED_HIGH:
> > -             mode_str = "High Speed Mode";
> > -             break;
> > -     default:
> > -             mode_str = "Fast Mode";
> > -     }
> > -     dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> > +             return ret;
> >
> > -out:
> > -     return ret;
> > +     dev_dbg(dev->dev, "Bus speed: %s\n",
> > i2c_freq_mode_string(t->bus_freq_hz));
> 
> > Weird the original code was printing both mode and fp.
> > And you are printing mode only.
> 
> >> Sorry, but I didn’t get what you mean here. The code is equivalent, and 
> >> actually
> it will print even more.
> 
> The original code will print the string fp_str:
> %s%s\n", mode_str, fp_str
> 
> The new code is printing mode_str only:
> %s
> 

Isn't fp_str redundant? Do we need to change

dev_dbg(dev->dev, "Fast Mode:%s HCNT:LCNT = %d:%d\n", fp_str...)

> > +     return 0;
> >  }
> >
> >  /**
> > --
> > 2.8.1
> 
> 
> --
> With Best Regards,
> Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)


From: Andy Shevchenko [mailto:andy.shevche...@gmail.com] 
Sent: Wednesday, March 31, 2021 10:57 AM
To: Song Bao Hua (Barry Song) 
Cc: yangyicong ; w...@kernel.org; 
andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org; 
sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org; 
dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com; 
rmk+ker...@armlinux.org.uk; John Garry ; 
mika.westerb...@linux.intel.com; Zengtao (B) ; 
Linuxarm 
Subject: Re: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()



On Wednesday, March 31, 2021, Song Bao Hua (Barry Song) 
 wrote:


> -Original Message-
> From: yangyicong
> Sent: Wednesday, March 31, 2021 3:19 AM
> To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> linux-kernel@vger.kernel.org
> Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> ; John Garry ;
> mika.westerb...@linux.intel.com; yangyicong ; Zengtao
> (B) ; Linuxarm 
> Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> 
> From: Andy Shevchenko 
> 
> Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> 
> Signed-off-by: Andy Shevchenko 
> Signed-off-by: Yicong Yang 
> ---
>  drivers/i2c/busses/i2c-designware-master.c | 20 
>  1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index dd27b9d..b64c4c8 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct dw_i2c_dev
> *dev)
> 
>  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
>  {
> -     const char *mode_str, *fp_str = "";
>       u32 comp_param1;
>       u32 sda_falling_time, scl_falling_time;
>       struct i2c_timings *t = &dev->timings;
> +     const char *fp_str = "";
>       u32 ic_clk;
>       int ret;
> 
> @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> *dev)
> 
>       ret = i2c_dw_set_sda_hold(dev);
>       if (ret)
> -             goto out;
> -
> -     switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> -     case DW_IC_CON_SPEED_STD:
> -             mode_str = "Standard Mode";
> -             break;
> -     case DW_IC_CON_SPEED_HIGH:
> -             mode_str = "High Speed Mode";
> -             break;
> -     default:
> -             mode_str = "Fast Mode";
> -     }
> -     dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> +             return ret;
> 
> -out:
> -     return ret;
> +     dev_dbg(dev->dev, "Bus speed: %s\n",
> i2c_freq_mode_string(t->bus_freq_hz));

> Weird the original code was printing both mode and fp.
> And you are printing mode only.

>> Sorry, but I didn’t get what you mean here. The code is equivalent, and 
>> actually it will print even more.

The original code will print the string fp_str:
%s%s\n", mode_str, fp_str

The new code is printing mode_str only:
%s

> +     return 0;
>  }
> 
>  /**
> --
> 2.8.1


-- 
With Best Regards,
Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: yangyicong
> Sent: Wednesday, March 31, 2021 3:19 AM
> To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> linux-kernel@vger.kernel.org
> Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> ; John Garry ;
> mika.westerb...@linux.intel.com; yangyicong ; Zengtao
> (B) ; Linuxarm 
> Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> 
> From: Andy Shevchenko 
> 
> Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> 
> Signed-off-by: Andy Shevchenko 
> Signed-off-by: Yicong Yang 
> ---
>  drivers/i2c/busses/i2c-designware-master.c | 20 
>  1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index dd27b9d..b64c4c8 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct dw_i2c_dev
> *dev)
> 
>  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
>  {
> - const char *mode_str, *fp_str = "";
>   u32 comp_param1;
>   u32 sda_falling_time, scl_falling_time;
>   struct i2c_timings *t = &dev->timings;
> + const char *fp_str = "";
>   u32 ic_clk;
>   int ret;
> 
> @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> *dev)
> 
>   ret = i2c_dw_set_sda_hold(dev);
>   if (ret)
> - goto out;
> -
> - switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> - case DW_IC_CON_SPEED_STD:
> - mode_str = "Standard Mode";
> - break;
> - case DW_IC_CON_SPEED_HIGH:
> - mode_str = "High Speed Mode";
> - break;
> - default:
> - mode_str = "Fast Mode";
> - }
> - dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> + return ret;
> 
> -out:
> - return ret;
> + dev_dbg(dev->dev, "Bus speed: %s\n",
> i2c_freq_mode_string(t->bus_freq_hz));

Weird the original code was printing both mode and fp.
And you are printing mode only.

> + return 0;
>  }
> 
>  /**
> --
> 2.8.1



RE: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Muchun Song [mailto:songmuc...@bytedance.com]
> Sent: Tuesday, March 30, 2021 9:09 PM
> To: Michal Hocko 
> Cc: Mike Kravetz ; Linux Memory Management List
> ; LKML ; Roman Gushchin
> ; Shakeel Butt ; Oscar Salvador
> ; David Hildenbrand ; David Rientjes
> ; linmiaohe ; Peter Zijlstra
> ; Matthew Wilcox ; HORIGUCHI NAOYA
> ; Aneesh Kumar K . V ;
> Waiman Long ; Peter Xu ; Mina Almasry
> ; Hillf Danton ; Joonsoo Kim
> ; Song Bao Hua (Barry Song)
> ; Will Deacon ; Andrew Morton
> 
> Subject: Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq 
> safe
> spinlock
> 
> On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko  wrote:
> >
> > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > Ideally, cma_release could be called from any context.  However,
> > > that is not possible because a mutex is used to protect the per-area 
> > > bitmap.
> > > Change the bitmap to an irq safe spinlock.
> >
> > I would phrase the changelog slightly differently "
> > cma_release is currently a sleepable operation because the bitmap
> > manipulation is protected by cma->lock mutex. Hugetlb code which
> > relies on cma_release for CMA backed (giga) hugetlb pages, however,
> > needs to be irq safe.
> >
> > The lock doesn't protect any sleepable operation so it can be changed
> > to a (irq aware) spin lock. The bitmap processing should be quite fast
> > in typical case but if cma sizes grow to TB then we will likely need
> > to replace the lock by a more optimized bitmap implementation.
> > "
> >
> > it seems that you are overusing irqsave variants even from context
> > which are never called from the IRQ context so they do not need storing 
> > flags.
> >
> > [...]
> > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > >   unsigned long start = 0;
> > >   unsigned long nr_part, nr_total = 0;
> > >   unsigned long nbits = cma_bitmap_maxno(cma);
> > > + unsigned long flags;
> > >
> > > - mutex_lock(&cma->lock);
> > > + spin_lock_irqsave(&cma->lock, flags);
> >
> > spin_lock_irq should be sufficient. This is only called from the
> > allocation context and that is never called from IRQ context.
> 
> This makes me think more. I think that spin_lock should be sufficient. Right?
> 

It seems Mike's point is that cma_release might be called from both
irq context and process context.

If it is running in process context, we need to disable irqs to guard
against an interrupt handler that might call cma_release at the same time.

We have never actually seen cma_release being called from irq context
so far, anyway.
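Roughly, the distinction being discussed (illustrative only):

	/* path that may also be entered from hard-irq context, e.g. cma_release(): */
	spin_lock_irqsave(&cma->lock, flags);
	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
	spin_unlock_irqrestore(&cma->lock, flags);

	/*
	 * path only ever called from process context, e.g. cma_alloc():
	 * irqs still need to be disabled so an interrupt calling
	 * cma_release() cannot deadlock on the same lock, but the saved
	 * flags are not needed.
	 */
	spin_lock_irq(&cma->lock);
	bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
	spin_unlock_irq(&cma->lock);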

> 
> >
> > >   pr_info("number of available pages: ");
> > >   for (;;) {
> > >   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits,
> > > start); @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma
> *cma)
> > >   start = next_zero_bit + nr_zero;
> > >   }
> > >   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> > > - mutex_unlock(&cma->lock);
> > > + spin_unlock_irqrestore(&cma->lock, flags);
> > >  }
> > >  #else
> > >  static inline void cma_debug_show_areas(struct cma *cma) { } @@
> > > -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
> > >   unsigned long pfn = -1;
> > >   unsigned long start = 0;
> > >   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > > + unsigned long flags;
> > >   size_t i;
> > >   struct page *page = NULL;
> > >   int ret = -ENOMEM;
> > > @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t 
> > > count,
> unsigned int align,
> > >   goto out;
> > >
> > >   for (;;) {
> > > - mutex_lock(&cma->lock);
> > > + spin_lock_irqsave(&cma->lock, flags);
> > >   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > >   bitmap_maxno, start, bitmap_count, mask,
> > >   offset);
> > >   if (bitmap_no >= bitmap_maxno) {
> > > - mutex_unlock(&cma->lock);
> > > + spin_unlock_irqrestore(&cma->lock, flags);
> > >   break;
> > >   }
> > >   bitmap_set(cma->bitma

RE: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Tuesday, March 30, 2021 12:24 PM
> To: linux...@kvack.org; linux-kernel@vger.kernel.org
> Cc: Roman Gushchin ; Michal Hocko ; Shakeel Butt
> ; Oscar Salvador ; David Hildenbrand
> ; Muchun Song ; David Rientjes
> ; linmiaohe ; Peter Zijlstra
> ; Matthew Wilcox ; HORIGUCHI NAOYA
> ; Aneesh Kumar K . V ;
> Waiman Long ; Peter Xu ; Mina Almasry
> ; Hillf Danton ; Joonsoo Kim
> ; Song Bao Hua (Barry Song)
> ; Will Deacon ; Andrew Morton
> ; Mike Kravetz 
> Subject: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
> 
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the bitmap to an irq safe spinlock.
> 
> Signed-off-by: Mike Kravetz 

It seems the mutex is only protecting bitmap operations, which
should be safe in atomic context.

Reviewed-by: Barry Song 

> ---
>  mm/cma.c   | 20 +++-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c | 10 ++
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..80875fd4487b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned 
> long
> pfn,
>unsigned int count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
> 
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
> 
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
> 
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
> 
> - mutex_init(&cma->lock);
> + spin_lock_init(&cma->lock);
> 
>  #ifdef CONFIG_CMA_DEBUGFS
>   INIT_HLIST_HEAD(&cma->mem_head);
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long start = 0;
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
> + unsigned long flags;
> 
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>   unsigned long pfn = -1;
>   unsigned long start = 0;
>   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> + unsigned long flags;
>   size_t i;
>   struct page *page = NULL;
>   int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>   goto out;
> 
>   for (;;) {
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count,
> unsigned int align,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
> 
>   pfn = cma->base_pfn + (bitmap_no

[tip: sched/core] sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()

2021-03-25 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Gitweb:
https://git.kernel.org/tip/0a2b65c03e9b47493e1442bf9c84badc60d9bffb
Author:Barry Song 
AuthorDate:Thu, 25 Mar 2021 15:31:40 +13:00
Committer: Ingo Molnar 
CommitterDate: Thu, 25 Mar 2021 11:41:23 +01:00

sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()

mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
it must be a subset of sched_group_span(sg).

So the cpumask_and() call is redundant - remove it.

[ mingo: Adjusted the changelog a bit. ]

Signed-off-by: Barry Song 
Signed-off-by: Ingo Molnar 
Reviewed-by: Valentin Schneider 
Link: 
https://lore.kernel.org/r/20210325023140.23456-1-song.bao@hisilicon.com
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f2066d6..d1aec24 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain 
*sd,
int cpu;
 
build_balance_mask(sd, sg, mask);
-   cpu = cpumask_first_and(sched_group_span(sg), mask);
+   cpu = cpumask_first(mask);
 
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)


[PATCH v2] sched/topology: remove redundant cpumask_and in init_overlap_sched_group

2021-03-24 Thread Barry Song
mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
it must be a subset of sched_group_span(sg). Though cpumask_first_and
doesn't lead to a wrong result of balance cpu, it is pointless to do
cpumask_and again.

Signed-off-by: Barry Song 
Reviewed-by: Valentin Schneider 
---
 -v2: add reviewed-by of Valentin, thanks!

 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f2066d682cd8..d1aec244c027 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain 
*sd,
int cpu;
 
build_balance_mask(sd, sg, mask);
-   cpu = cpumask_first_and(sched_group_span(sg), mask);
+   cpu = cpumask_first(mask);
 
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
-- 
2.25.1



RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-24 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, March 24, 2021 8:13 PM
> To: tiantao (H) 
> Cc: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> ; io...@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> On Wed, Mar 24, 2021 at 10:17:38AM +0800, Tian Tao wrote:
> > under some scenarios, it is necessary to compile map_benchmark
> > into module to test iommu, so this patch changed Kconfig and
> > export_symbol to implement map_benchmark compiled into module.
> >
> > On the other hand, map_benchmark is a driver, which is supposed
> > to be able to run as a module.
> >
> > Signed-off-by: Tian Tao 
> 
> Nope, we're not going to export more kthread internals for a test
> module.

The requirement comes from a colleague who frequently changes the
map_benchmark code for some customized test purposes, and he doesn't
want to rebuild the kernel image and reboot every time. So I passed
the requirement on to Tao Tian.

Right now, kthread_bind() is exported. kthread_bind_mask() does seem
a little bit "internal" as you said; maybe a wrapper like
kthread_bind_node() wouldn't be that "internal", compared to exposing
the cpumask?
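
Something along these lines is what I have in mind, just a sketch inside
kernel/kthread.c; kthread_bind_node() is a hypothetical name, not an
existing API:

/*
 * Hypothetical sketch only: bind a just-created kthread to the CPUs of
 * one NUMA node without exposing the cpumask internals to modules.
 * __kthread_bind_mask() is the existing static helper in kernel/kthread.c.
 */
void kthread_bind_node(struct task_struct *p, int node)
{
	__kthread_bind_mask(p, cpumask_of_node(node), TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL_GPL(kthread_bind_node);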
Anyway, since we can't find other driver users for this, I can hardly
convince you it is worthwhile.

Thanks
Barry


RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: tiantao (H)
> Sent: Wednesday, March 24, 2021 3:18 PM
> To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> 
> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org; tiantao
> (H) 
> Subject: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> under some scenarios, it is necessary to compile map_benchmark
> into module to test iommu, so this patch changed Kconfig and
> export_symbol to implement map_benchmark compiled into module.
> 
> On the other hand, map_benchmark is a driver, which is supposed
> to be able to run as a module.
> 
> Signed-off-by: Tian Tao 
> ---

Acked-by: Barry Song 

Looks sensible to me. I like the idea that map_benchmark is
a driver; it seems unreasonable to always require it to be built in.


>  kernel/dma/Kconfig | 2 +-
>  kernel/kthread.c   | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 77b4055..0468293 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
> If unsure, say N.
> 
>  config DMA_MAP_BENCHMARK
> - bool "Enable benchmarking of streaming DMA mapping"
> + tristate "Enable benchmarking of streaming DMA mapping"
>   depends on DEBUG_FS
>   help
> Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 1578973..fa4736f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct
> cpumask *mask)
>  {
>   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
>  }
> +EXPORT_SYMBOL(kthread_bind_mask);
> 
>  /**
>   * kthread_bind - bind a just-created kthread to a cpu.
> --
> 2.7.4

Thanks
Barry



RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-03-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Wednesday, March 24, 2021 11:51 AM
> To: Song Bao Hua (Barry Song) ;
> catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> On 3/18/21 9:16 PM, Barry Song wrote:
> > From: Tim Chen 
> >
> > There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> > is shared among a cluster of cores instead of being exclusive
> > to one single core.
> >
> > To prevent oversubscription of L2 cache, load should be
> > balanced between such L2 clusters, especially for tasks with
> > no shared data.
> >
> > Also with cluster scheduling policy where tasks are woken up
> > in the same L2 cluster, we will benefit from keeping tasks
> > related to each other and likely sharing data in the same L2
> > cluster.
> >
> > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > L2 cluster scheduler domain.
> >
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Barry Song 
> 
> 
> Barry,
> 
> Can you also add this chunk to the patch.
> Thanks.

Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

> 
> Tim
> 
> 
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 2a11ccc14fb1..800fa48c9fcd 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
> 
>  #ifdef CONFIG_SMP
>  #define topology_die_cpumask(cpu)(per_cpu(cpu_die_map, cpu))
> +#define topology_cluster_cpumask(cpu)
> (cpu_clustergroup_mask(cpu))
>  #define topology_core_cpumask(cpu)   (per_cpu(cpu_core_map, cpu))
>  #define topology_sibling_cpumask(cpu)
> (per_cpu(cpu_sibling_map, cpu))
> 

Thanks
Barry




[tip: sched/core] sched/fair: Optimize test_idle_cores() for !SMT

2021-03-23 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Gitweb:
https://git.kernel.org/tip/c8987ae5af793a73e2c0d6ce804d8ff454ea377c
Author:Barry Song 
AuthorDate:Sun, 21 Mar 2021 11:14:32 +13:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 23 Mar 2021 16:01:59 +01:00

sched/fair: Optimize test_idle_cores() for !SMT

update_idle_core() is only done for the case of sched_smt_present.
but test_idle_cores() is done for all machines even those without
SMT.

This can contribute to up 8%+ hackbench performance loss on a
machine like kunpeng 920 which has no SMT. This patch removes the
redundant test_idle_cores() for !SMT machines.

Hackbench is ran with -g {2..14}, for each g it is ran 10 times to get
an average.

  $ numactl -N 0 hackbench -p -T -l 2 -g $1

The below is the result of hackbench w/ and w/o this patch:

  g=2  4 6   8  10 12  14
  w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
  w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
+4.1%  +8.3%  +7.3%   +6.3%

Signed-off-by: Barry Song 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Vincent Guittot 
Acked-by: Mel Gorman 
Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao@hisilicon.com
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6aad028..aaa0dfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
 {
struct sched_domain_shared *sds;
 
-   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-   if (sds)
-   return READ_ONCE(sds->has_idle_cores);
+   if (static_branch_likely(&sched_smt_present)) {
+   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+   if (sds)
+   return READ_ONCE(sds->has_idle_cores);
+   }
 
return def;
 }


RE: [Linuxarm] Re: [PATCH] sched/fair: remove redundant test_idle_cores for non-smt

2021-03-21 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Li, Aubrey [mailto:aubrey...@linux.intel.com]
> Sent: Monday, March 22, 2021 5:37 PM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mi...@redhat.com; pet...@infradead.org;
> juri.le...@redhat.com; dietmar.eggem...@arm.com; rost...@goodmis.org;
> bseg...@google.com; mgor...@suse.de
> Cc: valentin.schnei...@arm.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; xuwei (O) ; Zengtao (B)
> ; guodong...@linaro.org; yangyicong
> ; Liguozhu (Kenneth) ;
> linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH] sched/fair: remove redundant test_idle_cores
> for non-smt
> 
> Hi Barry,
> 
> On 2021/3/21 6:14, Barry Song wrote:
> > update_idle_core() is only done for the case of sched_smt_present.
> > but test_idle_cores() is done for all machines even those without
> > smt.
> 
> The patch looks good to me.
> May I know for what case we need to keep CONFIG_SCHED_SMT for non-smt
> machines?


Hi Aubrey,

I think the defconfig of arm64 has always enabled
CONFIG_SCHED_SMT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig

It is probably true for x86 as well.

I don't think Linux distributions build a separate kernel for
machines without SMT, so basically the kernel depends on parsing
the topology at runtime to figure out whether SMT is present,
rather than depending on a rebuild.


> 
> Thanks,
> -Aubrey
> 
> 
> > this could contribute to up 8%+ hackbench performance loss on a
> > machine like kunpeng 920 which has no smt. this patch removes the
> > redundant test_idle_cores() for non-smt machines.
> >
> > we run the below hackbench with different -g parameter from 2 to
> > 14, for each different g, we run the command 10 times and get the
> > average time:
> > $ numactl -N 0 hackbench -p -T -l 2 -g $1
> >
> > hackbench will report the time which is needed to complete a certain
> > number of messages transmissions between a certain number of tasks,
> > for example:
> > $ numactl -N 0 hackbench -p -T -l 2 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)
> > Each sender will pass 2 messages of 100 bytes
> >
> > The below is the result of hackbench w/ and w/o this patch:
> > g=2  4 6   8  10 12      14
> > w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> > w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
> >   +4.1%  +8.3%  +7.3%   +6.3%
> >
> > Signed-off-by: Barry Song 
> > ---
> >  kernel/sched/fair.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2e2ab1e..de42a32 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
> >  {
> > struct sched_domain_shared *sds;
> >
> > -   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > -   if (sds)
> > -   return READ_ONCE(sds->has_idle_cores);
> > +   if (static_branch_likely(&sched_smt_present)) {
> > +   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +   if (sds)
> > +   return READ_ONCE(sds->has_idle_cores);
> > +   }
> >
> > return def;
> >  }

Thanks
Barry



[PATCH] sched/fair: remove redundant test_idle_cores for non-smt

2021-03-20 Thread Barry Song
update_idle_core() is only done for the case of sched_smt_present,
but test_idle_cores() is done for all machines, even those without
SMT.
This can contribute to an 8%+ hackbench performance loss on a
machine like Kunpeng 920 which has no SMT. This patch removes the
redundant test_idle_cores() for non-SMT machines.

we run the below hackbench with different -g parameter from 2 to
14, for each different g, we run the command 10 times and get the
average time:
$ numactl -N 0 hackbench -p -T -l 2 -g $1

hackbench reports the time needed to complete a certain
number of message transmissions between a certain number of tasks,
for example:
$ numactl -N 0 hackbench -p -T -l 2 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 2 messages of 100 bytes

The below is the result of hackbench w/ and w/o this patch:
g=2  4 6   8  10 12  14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
  +4.1%  +8.3%  +7.3%   +6.3%

Signed-off-by: Barry Song 
---
 kernel/sched/fair.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e2ab1e..de42a32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
 {
struct sched_domain_shared *sds;
 
-   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-   if (sds)
-   return READ_ONCE(sds->has_idle_cores);
+   if (static_branch_likely(&sched_smt_present)) {
+   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+   if (sds)
+   return READ_ONCE(sds->has_idle_cores);
+   }
 
return def;
 }
-- 
1.8.3.1



RE: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc

2021-03-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, March 19, 2021 5:16 PM
> To: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com; Song Bao Hua
> (Barry Song) 
> Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before 
> scanning
> the whole llc
> 
> On kunpeng920, cpus within one cluster can communicate with each other
> much faster than cpus across different clusters. A simple hackbench
> can prove that.
> hackbench running on 4 cpus in single one cluster and 4 cpus in
> different clusters shows a large contrast:
> (1) within a cluster:
> root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 2 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 2 messages of 100 bytes
> Time: 4.285
> 
> (2) across clusters:
> root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 2 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 2 messages of 100 bytes
> Time: 5.524
> 
> This inspires us to change the wake_affine path to scan cluster before
> scanning the whole LLC to try to gather related tasks in one cluster,
> which is done by this patch.
> 
> To evaluate the performance impact to related tasks talking with each
> other, we run the below hackbench with different -g parameter from 2
> to 14, for each different g, we run the command 10 times and get the
> average time:
> $ numactl -N 0 hackbench -p -T -l 2 -g $1
> 
> hackbench will report the time which is needed to complete a certain number
> of messages transmissions between a certain number of tasks, for example:
> $ numactl -N 0 hackbench -p -T -l 2 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 2 messages of 100 bytes
> 
> The below is the result of hackbench w/ and w/o cluster patch:
> g=2  4 6   8  10 12  14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608
> 
> Obviously some recent commits have improved the hackbench. So the change
> in wake_affine path brings less increase on hackbench compared to what
> we got in RFC v4.
> And obviously it is much more tricky to leverage wake_affine compared to
> leveraging the scatter of tasks in the previous patch as load balance
> might pull tasks which have been compact in a cluster so alternative
> suggestions welcome.
> 
> In order to figure out how many times cpu is picked from the cluster and
> how many times cpu is picked out of the cluster, a tracepoint for debug
> purpose is added in this patch. And an userspace bcc script to print the
> histogram of the result of select_idle_cpu():
> #!/usr/bin/python
> #
> # selectidlecpu.py    select idle cpu histogram.
> #
> # A Ctrl-C will print the gathered histogram then exit.
> #
> # 18-March-2021 Barry Song Created this.
> 
> from __future__ import print_function
> from bcc import BPF
> from time import sleep
> 
> # load BPF program
> b = BPF(text="""
> 
> BPF_HISTOGRAM(dist);
> 
> TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
> {
>   u32 e;
>   if (args->idle / 4 == args->target/4)
>   e = 0; /* idle cpu from same cluster */

Oops here: with integer division, -1/4 = 1/4 = 2/4 = 3/4 = 0,
so part of the -1 (no idle cpu) results is incorrectly counted here as
"local cluster".

>   else if (args->idle != -1)
>   e = 1; /* idle cpu from different clusters */
>   else
>   e = 2; /* no idle cpu */
> 
>   dist.increment(e);
>   return 0;
> }
> """)

Fixed it to:

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
u32 e;
if (args->idle == -1)
e = 2; /* no idle cpu */
else if (args->idle / 4 == args->target / 4)
e = 0; /* idle cpu from same cluster */
else
e = 1; /* idle cpu fr

RE: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within a die

2021-03-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Friday, March 19, 2021 7:35 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com; Jonathan
> Cameron ; juri.le...@redhat.com;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; x...@kernel.org; xuwei (O) ;
> Zengtao (B) ; guodong...@linaro.org; yangyicong
> ; Liguozhu (Kenneth) ;
> linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within
> a die
> 
> On Fri, Mar 19, 2021 at 05:16:15PM +1300, Barry Song wrote:
> > diff --git a/Documentation/admin-guide/cputopology.rst
> b/Documentation/admin-guide/cputopology.rst
> > index b90dafc..f9d3745 100644
> > --- a/Documentation/admin-guide/cputopology.rst
> > +++ b/Documentation/admin-guide/cputopology.rst
> > @@ -24,6 +24,12 @@ core_id:
> > identifier (rather than the kernel's).  The actual value is
> > architecture and platform dependent.
> >
> > +cluster_id:
> > +
> > +   the Cluster ID of cpuX.  Typically it is the hardware platform's
> > +   identifier (rather than the kernel's).  The actual value is
> > +   architecture and platform dependent.
> > +
> >  book_id:
> >
> > the book ID of cpuX. Typically it is the hardware platform's
> > @@ -56,6 +62,14 @@ package_cpus_list:
> > human-readable list of CPUs sharing the same physical_package_id.
> > (deprecated name: "core_siblings_list")
> >
> > +cluster_cpus:
> > +
> > +   internal kernel map of CPUs within the same cluster.
> > +
> > +cluster_cpus_list:
> > +
> > +   human-readable list of CPUs within the same cluster.
> > +
> >  die_cpus:
> >
> > internal kernel map of CPUs within the same die.
> 
> Why are these sysfs files in this file, and not in a Documentation/ABI/
> file which can be correctly parsed and shown to userspace?

Well, those ABIs have been there for a long time. For example:

[root@ceph1 topology]# ls
core_id  core_siblings  core_siblings_list  physical_package_id thread_siblings 
 thread_siblings_list
[root@ceph1 topology]# pwd
/sys/devices/system/cpu/cpu100/topology
[root@ceph1 topology]# cat core_siblings_list
64-127
[root@ceph1 topology]#

> 
> Any chance you can fix that up here as well?

Yes. we will send a separate patch to address this, which won't
be in this patchset. This patchset will base on that one.

> 
> Also note that "list" is not something that goes in sysfs, sysfs is "one
> value per file", and a list is not "one value".  How do you prevent
> overflowing the buffer of the sysfs file if you have a "list"?
> 

At a glance, the file uses a "-" range rather than enumerating every CPU:
[root@ceph1 topology]# cat core_siblings_list
64-127

Anyway, I will take a look at whether it has any chance to overflow.
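
For reference, a rough sketch of how such a cpulist attribute is typically
emitted; this is illustrative only, not the actual drivers/base/topology.c
code, but it shows why the output is bounded: cpumap_print_to_pagebuf()
formats the range list into the single PAGE_SIZE sysfs buffer.

/* Illustrative sketch only, not the real topology sysfs implementation */
static ssize_t cluster_cpus_list_show(struct device *dev,
				      struct device_attribute *attr,
				      char *buf)
{
	/* true => "64-127" style list; false => hex mask */
	return cpumap_print_to_pagebuf(true, buf,
				       topology_cluster_cpumask(dev->id));
}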

> thanks,
> 
> greg k-h

Thanks
Barry



RE: [PATCH] tty: serial: samsung_tty: remove spinlock flags in interrupt handlers

2021-03-19 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Tuesday, March 16, 2021 10:41 PM
> To: Johan Hovold ; Finn Thain ;
> Song Bao Hua (Barry Song) 
> Cc: Krzysztof Kozlowski ; Greg
> Kroah-Hartman ; Jiri Slaby ;
> linux-arm Mailing List ; Linux Samsung
> SOC ; open list:SERIAL DRIVERS
> ; Linux Kernel Mailing List
> ; Hector Martin ; Arnd
> Bergmann 
> Subject: Re: [PATCH] tty: serial: samsung_tty: remove spinlock flags in
> interrupt handlers
> 
> On Tue, Mar 16, 2021 at 11:02 AM Johan Hovold  wrote:
> >
> > On Mon, Mar 15, 2021 at 07:12:12PM +0100, Krzysztof Kozlowski wrote:
> > > Since interrupt handler is called with disabled local interrupts, there
> > > is no need to use the spinlock primitives disabling interrupts as well.
> >
> > This isn't generally true due to "threadirqs" and that can lead to
> > deadlocks if the console code is called from hard irq context.
> >
> > Now, this is *not* the case for this particular driver since it doesn't
> > even bother to take the port lock in console_write(). That should
> > probably be fixed instead.
> >
> > See https://lore.kernel.org/r/X7kviiRwuxvPxC8O@localhost.
> 
> Finn, Barry, something to check I think?

My understanding is that spin_lock_irqsave() can't protect against
console_write() being entered from hardirq context in the threaded-irq
case, mainly in PREEMPT_RT scenarios, because spin_lock_irqsave()
doesn't disable IRQs at all there.
See:
https://www.kernel.org/doc/html/latest/locking/locktypes.html
spinlock_t and PREEMPT_RT
On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:
Preemption is not disabled.
The hard interrupt related suffixes for spin_lock / spin_unlock operations
(_irq, _irqsave / _irqrestore) do not affect the CPU’s interrupt disabled
state.

So if console_write() can interrupt our code in hardirq context, we
should move this driver to raw_spin_lock_irqsave().

I think it is almost always wrong to call spin_lock_irqsave() in hardirq
context.
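
To make that concrete, a rough sketch (not the samsung_tty patch itself;
the names below are illustrative) of a console write path using a raw
spinlock, which keeps its IRQ-disabling semantics even on PREEMPT_RT:

#include <linux/spinlock.h>

/*
 * Sketch only: a raw_spinlock_t stays a real spinning, IRQ-disabling
 * lock on PREEMPT_RT, so it is safe for a path reachable from hardirq.
 */
static DEFINE_RAW_SPINLOCK(demo_port_lock);

static void demo_console_write(const char *s, unsigned int count)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&demo_port_lock, flags);
	/* ... push the count characters of s to the UART FIFO ... */
	raw_spin_unlock_irqrestore(&demo_port_lock, flags);
}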

> 
> --
> With Best Regards,
> Andy Shevchenko

Thanks
Barry


[RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-03-18 Thread Barry Song
From: Tim Chen 

There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
is shared among a cluster of cores instead of being exclusive
to one single core.

To prevent oversubscription of L2 cache, load should be
balanced between such L2 clusters, especially for tasks with
no shared data.

Also with cluster scheduling policy where tasks are woken up
in the same L2 cluster, we will benefit from keeping tasks
related to each other and likely sharing data in the same L2
cluster.

Add CPU masks of CPUs sharing the L2 cache so we can build such
L2 cluster scheduler domain.

Signed-off-by: Tim Chen 
Signed-off-by: Barry Song 
---
 arch/x86/Kconfig|  8 
 arch/x86/include/asm/smp.h  |  7 +++
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c|  3 +++
 arch/x86/kernel/smpboot.c   | 43 -
 6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879..d597de2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1002,6 +1002,14 @@ config NR_CPUS
  This is purely to save memory: each supported CPU adds about 8KB
  to the kernel image.
 
+config SCHED_CLUSTER
+   bool "Cluster scheduler support"
+   default n
+   help
+Cluster scheduler support improves the CPU scheduler's decision
+making when dealing with machines that have clusters of CPUs
+sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+   return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2a11ccc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include 
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
l2 = new_l2;
 #ifdef CONFIG_SMP
per_cpu(cpu_llc_id, cpu) = l2_id;
+   per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ab640ab..0ba282d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -501,6 +503,21 @@ static bool match_llc(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
return topology_sane(c, o, "llc");
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+   int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+   /* Do not match if we do not have a valid APICID for cpu: */
+   if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+   return false;

[RFC PATCH v5 0/4] scheduler: expose the topology of clusters and add cluster scheduler

2021-03-18 Thread Barry Song
   ||| |2   |  |
| ++|| ++  |
|   || |
|   cluster1|| cluster2|
+---++-+

2. gathering related tasks within a cluster, which improves the cache affinity 
of tasks
talking with each other.
Without cluster sched_domain, related tasks might be put randomly. In case 
task1-8 have
relationship as below:
Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task4 wakes up task7
With the tuning of select_idle_cpu() to scan local cluster first, those tasks 
might
get a chance to be gathered like:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 4   | || |2   |  |5|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |3|   |  6   ||| |4| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+
Otherwise, the result might be:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 2   | || |5   |  |6|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |3|   |  4   ||| |7| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+

-v5:
  * split "add scheduler level for clusters" into two patches to evaluate the
impact of spreading and gathering separately;
  * add a tracepoint of select_idle_cpu for debug purpose; add bcc script in
commit log;
  * add cluster_id = -1 in reset_cpu_topology()
  * rebased to tip/sched/core

-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key(addressing
Vincent's comment
  * used acpi_cpu_id for acpi_find_processor_node(addressing Masa's comment)

Barry Song (2):
  scheduler: add scheduler level for clusters
  scheduler: scan idle cpu in cluster before scanning the whole llc

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 +++--
 arch/arm64/Kconfig|  7 
 arch/arm64/kernel/topology.c  |  2 +
 arch/x86/Kconfig  |  8 
 arch/x86/include/asm/smp.h|  7 
 arch/x86/include/asm/topology.h   |  1 +
 arch/x86/kernel/cpu/cacheinfo.c   |  1 +
 arch/x86/kernel/cpu/common.c  |  3 ++
 arch/x86/kernel/smpboot.c | 43 -
 drivers/acpi/pptt.c   | 63 +++
 drivers/base/arch_topology.c  | 15 
 drivers/base/topology.c   | 10 +
 include/linux/acpi.h  |  5 +++
 include/linux/arch_topology.h |  5 +++
 include/linux/sched/cluster.h | 19 ++
 include/linux/sched/topology.h|  7 
 include/linux/topology.h  | 13 +++
 include/trace/events/sched.h  | 22 +++
 kernel/sched/core.c   | 20 ++
 kernel/sched/fair.c   | 36 +-
 kernel/sched/sched.h  |  1 +
 kernel/sched/topology.c   |  5 +++
 22 files changed, 313 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/sched/cluster.h

-- 
1.8.3.1



[RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc

2021-03-18 Thread Barry Song
On kunpeng920, cpus within one cluster can communicate with each other
much faster than cpus across different clusters. A simple hackbench
can prove that.
hackbench running on 4 cpus in one single cluster and 4 cpus in
different clusters shows a large contrast:
(1) within a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 2 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 2 messages of 100 bytes
Time: 4.285

(2) across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 2 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 2 messages of 100 bytes
Time: 5.524

This inspires us to change the wake_affine path to scan the cluster
before scanning the whole LLC, trying to gather related tasks in one
cluster, which is what this patch does.

To evaluate the performance impact to related tasks talking with each
other, we run the below hackbench with different -g parameter from 2
to 14, for each different g, we run the command 10 times and get the
average time:
$ numactl -N 0 hackbench -p -T -l 2 -g $1

hackbench reports the time needed to complete a certain number of
message transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 2 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 2 messages of 100 bytes

The below is the result of hackbench w/ and w/o cluster patch:
g=2  4 6   8  10 12  14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608

Obviously some recent commits have improved hackbench, so the change in
the wake_affine path brings a smaller gain on hackbench compared to what
we got in RFC v4.
And obviously it is much more tricky to leverage wake_affine than to
leverage the scattering of tasks in the previous patch, as load balancing
might pull apart tasks that have been packed into a cluster, so
alternative suggestions are welcome.
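
For reference, the core of the approach can be sketched as below. This is
a simplified sketch, not the exact hunk later in this patch;
_select_idle_cpu() is the scanning helper introduced earlier in this
series, and cpus is the scratch cpumask already used by select_idle_cpu().

/*
 * Simplified sketch of a cluster-first scan in select_idle_cpu():
 * scan the waker's cluster first, then fall back to the rest of the LLC.
 */
static int scan_cluster_then_llc(struct task_struct *p, struct sched_domain *sd,
				 bool smt, int target, struct cpumask *cpus,
				 int *idle_cpu, int *nr)
{
	int i;

	/* 1) idle CPU inside the local cluster, if any */
	cpumask_and(cpus, cpu_cluster_mask(target), p->cpus_ptr);
	i = _select_idle_cpu(smt, p, target, cpus, idle_cpu, nr);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	/* 2) otherwise the rest of the LLC, excluding the cluster above */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
	cpumask_andnot(cpus, cpus, cpu_cluster_mask(target));
	return _select_idle_cpu(smt, p, target, cpus, idle_cpu, nr);
}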

In order to figure out how many times a cpu is picked from the cluster and
how many times a cpu is picked outside the cluster, a tracepoint is added
in this patch for debugging purposes, and the userspace bcc script below
prints a histogram of the result of select_idle_cpu():
#!/usr/bin/python
#
# selectidlecpu.py  select idle cpu histogram.
#
# A Ctrl-C will print the gathered histogram then exit.
#
# 18-March-2021 Barry Song Created this.

from __future__ import print_function
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""

BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
u32 e;
if (args->idle / 4 == args->target/4)
e = 0; /* idle cpu from same cluster */
else if (args->idle != -1)
e = 1; /* idle cpu from different clusters */
else
e = 2; /* no idle cpu */

dist.increment(e);
return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
sleep()
except KeyboardInterrupt:
print()

# output

print("\nlinear histogram")
print("")
b["dist"].print_linear_hist("idle")

Even with g=14, when the system is quite busy, we can see there is still
some chance that an idle cpu is picked from the local cluster:
linear histogram
~~
 idle  : count distribution
0  : 15234281 |*** |
1  : 18494||
2  : 53066152 ||

0: local cluster
1: out of the cluster
2: select_idle_cpu() returns -1

Signed-off-by: Barry Song 
---
 include/trace/events/sched.h | 22 ++
 kernel/sched/fair.c  | 32 +++-
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cbe3e15..86608cf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -136,6 +136,28 @@
 );
 
 /*
+ * Tracepoint for select_idle_cpu:
+ */
+TRACE_EVENT(sched_select_idle_cpu,
+
+   TP_PROTO(int target, int idle),
+
+   TP_ARGS(target, idle),
+
+   TP_STRUCT__entry(
+   __field(int,target  )
+   __field(int,idle)
+   ),
+
+   TP_fast_assign(
+   __entry->target = target;
+   __entry->idle = idle;
+   ),
+
+   TP_printk("target=%d idle=%d", __entry->target, __entry->idle)
+);
+
+/*
  * Tracepoint for waking up a task:
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index

[RFC PATCH v5 2/4] scheduler: add scheduler level for clusters

2021-03-18 Thread Barry Song
ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus. This means cache coherence overhead inside one
cluster is much less than the overhead across clusters.

This patch adds the sched_domain for clusters. On kunpeng 920, without
this patch, domain0 of cpu0 would be MC, covering cpu0~cpu23; with this
patch, MC becomes domain1, and a new domain0 "CLS" covers cpu0-cpu3.
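
For reference, the new level sits below MC in the sched-domain topology
table roughly as below. This is an illustrative sketch following the
default_topology[] convention, not the exact hunk of this patch;
cpu_clustergroup_mask() and cpu_cluster_flags() follow the naming used
elsewhere in this series.

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
	/* the new level: CPUs sharing a cluster (e.g. 4 cpus on kunpeng 920) */
	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};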

This will help spread unrelated tasks among clusters, thus decreasing
contention and improving throughput; for example, the stream benchmark
improves by 20%+ when parallelism is 6 and by around 5% when parallelism
is 12:

(1) -P  6
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 6 -M 1024M -N 5

w/o patch:
STREAM copy latency: 2.46 nanoseconds
STREAM copy bandwidth: 39096.28 MB/sec
STREAM scale latency: 2.46 nanoseconds
STREAM scale bandwidth: 38970.26 MB/sec
STREAM add latency: 4.45 nanoseconds
STREAM add bandwidth: 32332.04 MB/sec
STREAM triad latency: 4.07 nanoseconds
STREAM triad bandwidth: 35387.69 MB/sec

w/ patch:
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 47604.47 MB/sec   +21.7%
STREAM scale latency: 2.04 nanoseconds
STREAM scale bandwidth: 47066.84 MB/sec  +20.8%
STREAM add latency: 3.35 nanoseconds
STREAM add bandwidth: 42942.15 MB/sec+32.8%
STREAM triad latency: 3.16 nanoseconds
STREAM triad bandwidth: 45619.18 MB/sec  +28.9%

On the other hand, without the patch the stream results can change
significantly between runs, e.g.:
a.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 8.45 MB/sec
STREAM scale latency: 2.17 nanoseconds
STREAM scale bandwidth: 44320.77 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38230.54 MB/sec
STREAM triad latency: 3.88 nanoseconds
STREAM triad bandwidth: 37072.10 MB/sec

b.
STREAM copy latency: 2.16 nanoseconds
STREAM copy bandwidth: 44403.22 MB/sec
STREAM scale latency: 2.39 nanoseconds
STREAM scale bandwidth: 40173.69 MB/sec
STREAM add latency: 3.77 nanoseconds
STREAM add bandwidth: 38232.56 MB/sec
STREAM triad latency: 3.38 nanoseconds
STREAM triad bandwidth: 42592.04 MB/sec

Obviously this is because the 6 threads are placed randomly on 6 cores:
sometimes they are packed into clusters, sometimes they are spread widely.

(2) -P  12
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5

w/o patch:
STREAM copy latency: 3.37 nanoseconds
STREAM copy bandwidth: 57008.80 MB/sec
STREAM scale latency: 3.38 nanoseconds
STREAM scale bandwidth: 56848.47 MB/sec
STREAM add latency: 5.50 nanoseconds
STREAM add bandwidth: 52398.62 MB/sec
STREAM triad latency: 5.09 nanoseconds
STREAM triad bandwidth: 56591.60 MB/sec

w/ patch:
STREAM copy latency: 3.24 nanoseconds
STREAM copy bandwidth: 59338.60 MB/sec  +4.1%
STREAM scale latency: 3.25 nanoseconds
STREAM scale bandwidth: 58993.23 MB/sec +3.7%
STREAM add latency: 5.19 nanoseconds
STREAM add bandwidth: 55517.45 MB/sec   +5.9%
STREAM triad latency: 4.86 nanoseconds
STREAM triad bandwidth: 59245.34 MB/sec +4.7%

To evaluate the performance impact to related tasks talking with each
other, we run the below hackbench with different -g parameter from 2
to 14, for each different g, we run the command 10 times and get the
average time:
$ numactl -N 0 hackbench -p -T -l 2 -g $1

hackbench reports the time needed to complete a certain number of
message transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 2 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 2 messages of 100 bytes

The below is the result of hackbench w/ and w/o the patch:
g=2  4 6   8  10 12  14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.8396 3.8250 5.4780 7.3442 9.0172 10.5950 11.9113

Obviously this patch doesn't impact hackbench too much.

Signed-off-by: Barry Song 
---
 arch/arm64/Kconfig |  7 +++
 include/linux/sched/cluster.h  | 19 +++
 include/linux/sched/topology.h |  7 +++
 include/linux/topology.h   |  7 +++
 kernel/sched/core.c| 20 
 kernel/sched/fair.c|  4 
 kernel/sched/sched.h   |  1 +
 kernel/sched/topology.c|  5 +
 8 files changed, 70 insertions(+)
 create mode 100644 include/linux/sched/cluster.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b4..9432a30 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -977,6 +977,13 @@ config SCHED_MC
  making when dealing with multi-core CPU chips at a cost of slightly
  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+   bool "Cluster scheduler support"
+   help
+ Cluster scheduler support improves the CPU 

[RFC PATCH v5 1/4] topology: Represent clusters of CPUs within a die

2021-03-18 Thread Barry Song
|   | |
|  |  ||  | |  |   |   | |
|  +--++--+ |  +---+   | |
|   |  +-+
+---+

That means the cost to transfer ownership of a cacheline between CPUs
within a cluster is lower than between CPUs in different clusters on
the same die. Hence, it can make sense to tell the scheduler to use
the cache affinity of the cluster to make better decisions on thread
migration.

This patch simply exposes this information to userspace libraries
like hwloc by providing cluster_cpus and related sysfs attributes.
PoC of HWLOC support at [2].

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Currently the ID provided is the offset of the Processor
Hierarchy Nodes Structure within PPTT.  Whilst this is unique
it is not terribly elegant so alternative suggestions welcome.

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology, but that may be unrelated to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron 
Signed-off-by: Barry Song 
---
 Documentation/admin-guide/cputopology.rst | 26 +++--
 arch/arm64/kernel/topology.c  |  2 +
 drivers/acpi/pptt.c   | 63 +++
 drivers/base/arch_topology.c  | 15 
 drivers/base/topology.c   | 10 +
 include/linux/acpi.h  |  5 +++
 include/linux/arch_topology.h |  5 +++
 include/linux/topology.h  |  6 +++
 8 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cputopology.rst 
b/Documentation/admin-guide/cputopology.rst
index b90dafc..f9d3745 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -24,6 +24,12 @@ core_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
+cluster_id:
+
+   the Cluster ID of cpuX.  Typically it is the hardware platform's
+   identifier (rather than the kernel's).  The actual value is
+   architecture and platform dependent.
+
 book_id:
 
the book ID of cpuX. Typically it is the hardware platform's
@@ -56,6 +62,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+cluster_cpus:
+
+   internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+   human-readable list of CPUs within the same cluster.
+
 die_cpus:
 
internal kernel map of CPUs within the same die.
@@ -96,11 +110,13 @@ these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
+   #define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index e08a412..d72eb8d 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
cpu_topology[cpu].thread_id  = -1;
cpu_topology[cpu].core_id= topology_id;
}
+   topology_id = find_acpi_cpu_topology_cluster(cpu);
+   cpu_topology[cpu].cluster_id = topology_id;
topology_id = find_acpi_cpu_topology_package(cpu);
cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index 4ae9335..11f8b02 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -737,6 +737,69 @@ int find_acpi_cpu_topolo

RE: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters

2021-03-16 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Tuesday, March 2, 2021 11:43 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
> 
> On Tue, Mar 02, 2021 at 11:59:39AM +1300, Barry Song wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 88a2e2b..d805e59 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7797,6 +7797,16 @@ int sched_cpu_activate(unsigned int cpu)
> > if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > static_branch_inc_cpuslocked(&sched_smt_present);
> >  #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +   /*
> > +* When going up, increment the number of cluster cpus with
> > +* cluster present.
> > +*/
> > +   if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > +   static_branch_inc_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > set_cpu_active(cpu, true);
> >
> > if (sched_smp_initialized) {
> > @@ -7873,6 +7883,14 @@ int sched_cpu_deactivate(unsigned int cpu)
> > static_branch_dec_cpuslocked(&sched_smt_present);
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +   /*
> > +* When going down, decrement the number of cpus with cluster present.
> > +*/
> > +   if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > +   static_branch_dec_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > if (!sched_smp_initialized)
> > return 0;
> 
> I don't think that's correct. IIUC this will mean the
> sched_cluster_present thing will be enabled on anything with SMT (very
> much including x86 big cores after the next patch).
> 
> I'm thinking that at the very least you should check a CLS domain
> exists, but that might be hard at this point, because the sched domains
> haven't been build yet.

might be able to achieve the same goal by:

int cls_wt = cpumask_weight(cpu_cluster_mask(cpu));

if (cls_wt > cpumask_weight(cpu_smt_mask(cpu)) &&
    cls_wt < cpumask_weight(cpu_coregroup_mask(cpu)))
        sched_cluster_present...

> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b..3db7b07 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6009,6 +6009,11 @@ static inline int __select_idle_cpu(int cpu)
> > return -1;
> >  }
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +DEFINE_STATIC_KEY_FALSE(sched_cluster_present);
> > +EXPORT_SYMBOL_GPL(sched_cluster_present);
> 
> I really rather think this shouldn't be exported

Ok. Make sense.

> 
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_SMT
> >  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> >  EXPORT_SYMBOL_GPL(sched_smt_present);
> 
> This is a KVM wart, it needs to know because mitigation crap.
> 

Ok.

> > @@ -6116,6 +6121,26 @@ static inline int select_idle_core(struct task_struct
> *p, int core, struct cpuma
> >
> >  #endif /* CONFIG_SCHED_SMT */
> >
> > +static inline int _select_idle_cpu(bool smt, struct task_struct *p, int
> target, struct cpumask *cpus, int *idle_cpu, int *nr)
> > +{
> > +   int cpu, i;
> > +
> > +   for_each_cpu_wrap(cpu, cpus, target) {
> > +   if (smt) {
> > +   i = select_idle_core(p, cpu, cpus, idle_cpu);
> > +   } else {
> > +   if (!--*nr)
> > +   return -1;
> > +   i = __select_idle_cpu(cpu);
> > +   }
> > +
> > +   if ((unsigned int)i < nr_cpumask_bits)
> > +   return i;
> > +   }
> > +
> > +   return -1;
> > +}
> > +
> >  /*
> >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >   * comparing the

RE: [RFC PATCH v4 1/3] topology: Represent clusters of CPUs within a die.

2021-03-14 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, March 2, 2021 12:00 PM
> To: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com; Jonathan
> Cameron ; Song Bao Hua (Barry Song)
> 
> Subject: [RFC PATCH v4 1/3] topology: Represent clusters of CPUs within a die.
> 
> From: Jonathan Cameron 
> 
> Both ACPI and DT provide the ability to describe additional layers of
> topology between that of individual cores and higher level constructs
> such as the level at which the last level cache is shared.
> In ACPI this can be represented in PPTT as a Processor Hierarchy
> Node Structure [1] that is the parent of the CPU cores and in turn
> has a parent Processor Hierarchy Nodes Structure representing
> a higher level of topology.
> 
> For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> has local L3 tag. On the other hand, each clusters will share some
> internal system bus.
> 
> [ASCII diagram: each cluster contains 4 CPUs (e.g. CPU0-CPU3) and has its
> own local L3 tag; all clusters share the L3 data.]

[RESEND PATCH v2 0/2] scripts/gdb: clarify the platforms supporting lx_current and add arm64 support

2021-03-14 Thread Barry Song
lx_current depends on the per_cpu current_task variable, which exists on x86
only, so it actually works on x86 only. The 1st patch documents this clearly;
the 2nd patch adds support for arm64.

-resend
 resending to Andrew as Kieran Bingham explained patches of scripts/gdb
 usually go through the tree of Andrew Morton;

Barry Song (2):
  scripts/gdb: document lx_current is only supported by x86
  scripts/gdb: add lx_current support for arm64

 .../dev-tools/gdb-kernel-debugging.rst|  2 +-
 scripts/gdb/linux/cpus.py | 23 +--
 2 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.25.1



[RESEND PATCH v2 2/2] scripts/gdb: add lx_current support for arm64

2021-03-14 Thread Barry Song
arm64 uses SP_EL0 to save the current task_struct address. While running
in EL0, SP_EL0 is clobbered by userspace. So if the upper bit is not 1
(not TTBR1), the current address is invalid. This patch checks the upper
bit of SP_EL0: if the upper bit is 1, lx_current() on arm64 will return
the dereference of the current task; otherwise, lx_current() will tell
users they are running in userspace (EL0).

While arm64 is running in EL0, it is actually pointless to print the
current task as kernel memory is not accessible in EL0.

Signed-off-by: Barry Song 
---
 Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
 scripts/gdb/linux/cpus.py| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst 
b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 1586901b683c..8e0f1fe8d17a 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
 [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
reserved
 
 
-- Examine fields of the current task struct(supported by x86 only)::
+- Examine fields of the current task struct(supported by x86 and arm64 only)::
 
 (gdb) p $lx_current().pid
 $1 = 4998
diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
index f382762509d3..15fc4626d236 100644
--- a/scripts/gdb/linux/cpus.py
+++ b/scripts/gdb/linux/cpus.py
@@ -16,6 +16,9 @@ import gdb
 from linux import tasks, utils
 
 
+task_type = utils.CachedType("struct task_struct")
+
+
 MAX_CPUS = 4096
 
 
@@ -157,9 +160,19 @@ Note that VAR has to be quoted as string."""
 PerCpu()
 
 def get_current_task(cpu):
+    task_ptr_type = task_type.get_type().pointer()
+
     if utils.is_target_arch("x86"):
         var_ptr = gdb.parse_and_eval("&current_task")
         return per_cpu(var_ptr, cpu).dereference()
+    elif utils.is_target_arch("aarch64"):
+        current_task_addr = gdb.parse_and_eval("$SP_EL0")
+        if ((current_task_addr >> 63) != 0):
+            current_task = current_task_addr.cast(task_ptr_type)
+            return current_task.dereference()
+        else:
+            raise gdb.GdbError("Sorry, obtaining the current task is not "
+                               "allowed while running in userspace(EL0)")
     else:
         raise gdb.GdbError("Sorry, obtaining the current task is not yet "
                            "supported with this arch")
-- 
2.25.1



[RESEND PATCH v2 1/2] scripts/gdb: document lx_current is only supported by x86

2021-03-14 Thread Barry Song
x86 is the only architecture which has per_cpu current_task:
arch$ git grep current_task | grep -i per_cpu
x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *, current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;

On other architectures, lx_current() will lead to a python exception:
(gdb) p $lx_current().pid
Python Exception  No symbol "current_task" in current context.:
Error occurred in Python: No symbol "current_task" in current context.

To avoid more people struggling and wasting time in other architectures,
document it.

Cc: Jan Kiszka 
Signed-off-by: Barry Song 
---
 Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
 scripts/gdb/linux/cpus.py| 10 --
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst 
b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 4756f6b3a04e..1586901b683c 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
 [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
reserved
 
 
-- Examine fields of the current task struct::
+- Examine fields of the current task struct(supported by x86 only)::
 
 (gdb) p $lx_current().pid
 $1 = 4998
diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
index 008e62f3190d..f382762509d3 100644
--- a/scripts/gdb/linux/cpus.py
+++ b/scripts/gdb/linux/cpus.py
@@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
 
 PerCpu()
 
+def get_current_task(cpu):
+    if utils.is_target_arch("x86"):
+        var_ptr = gdb.parse_and_eval("&current_task")
+        return per_cpu(var_ptr, cpu).dereference()
+    else:
+        raise gdb.GdbError("Sorry, obtaining the current task is not yet "
+                           "supported with this arch")
 
 class LxCurrentFunc(gdb.Function):
     """Return current task.
@@ -167,8 +174,7 @@ number. If CPU is omitted, the CPU of the current context is used."""
         super(LxCurrentFunc, self).__init__("lx_current")
 
     def invoke(self, cpu=-1):
-        var_ptr = gdb.parse_and_eval("&current_task")
-        return per_cpu(var_ptr, cpu).dereference()
+        return get_current_task(cpu)
 
 
 LxCurrentFunc()
-- 
2.25.1



RE: [Linuxarm] Re: [RFC PATCH v4 3/3] scheduler: Add cluster scheduler level for x86

2021-03-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Thursday, March 4, 2021 7:34 AM
> To: Peter Zijlstra ; Song Bao Hua (Barry Song)
> 
> Cc: catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; dietmar.eggem...@arm.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: [Linuxarm] Re: [RFC PATCH v4 3/3] scheduler: Add cluster scheduler
> level for x86
> 
> 
> 
> On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> > On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> >> From: Tim Chen 
> >>
> >> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> >> is shared among a cluster of cores instead of being exclusive
> >> to one single core.
> >
> > Isn't that most atoms one way or another? Tremont seems to have it per 4
> > cores, but earlier it was per 2 cores.
> >
> 
> Yes, older Atoms have 2 cores sharing L2.  I probably should
> rephrase my comments to not leave the impression that sharing
> L2 among cores is new for Atoms.
> 
> Tremont based Atom CPUs increases the possible load imbalance more
> with 4 cores per L2 instead of 2.  And also with more overall cores on a die,
> the
> chance increases for packing running tasks on a few clusters while leaving
> others empty on light/medium loaded systems.  We did see
> this effect on Jacobsville.
> 
> So load balancing between the L2 clusters is more
> useful on Tremont based Atom CPUs compared to the older Atoms.

It seems sensible that the more CPUs we get in a cluster, the more
we need the kernel to be aware of its existence.

Tim, is it possible for you to bring up the cpu_cluster_mask and
cluster_sibling for x86 so that the topology can be represented
in sysfs and used by the scheduler? It seems your patch lacks this
part.

BTW, I wonder if x86 can make some improvement to KMP_AFFINITY
by leveraging the cluster topology level.
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

KMP_AFFINITY has thread affinity modes like compact and scatter;
it seems "compact" and "scatter" could also use the cluster
information, as you can see we are struggling with the same
"compact" and "scatter" issues here in this patchset :-)

Thanks
Barry


RE: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters

2021-03-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Vincent Guittot [mailto:vincent.guit...@linaro.org]
> Sent: Tuesday, March 9, 2021 12:26 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Tim Chen ; Catalin Marinas
> ; Will Deacon ; Rafael J. Wysocki
> ; Borislav Petkov ; Thomas Gleixner
> ; Ingo Molnar ; Cc: Len Brown
> ; Peter Zijlstra ; Dietmar Eggemann
> ; Steven Rostedt ; Ben Segall
> ; Mel Gorman ; Juri Lelli
> ; Mark Rutland ; Aubrey Li
> ; H. Peter Anvin ; Zengtao (B)
> ; Guodong Xu ;
> gre...@linuxfoundation.org; Sudeep Holla ; linux-kernel
> ; linux...@openeuler.org; ACPI Devel Maling
> List ; xuwei (O) ; Jonathan
> Cameron ; yangyicong ;
> x86 ; msys.miz...@gmail.com; Liguozhu (Kenneth)
> ; Valentin Schneider ;
> LAK 
> Subject: Re: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
> 
> On Tue, 2 Mar 2021 at 00:08, Barry Song  wrote:
> >
> > ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > has local L3 tag. On the other hand, each cluster will share some
> > internal system bus. This means cache coherence overhead inside one
> > cluster is much less than the overhead across clusters.
> >
> > This patch adds the sched_domain for clusters. On kunpeng 920, without
> > this patch, domain0 of cpu0 would be MC covering cpu0~cpu23; with this
> > patch, MC becomes domain1, a new domain0 "CLS" including cpu0-cpu3.
> >
> > This will help spread unrelated tasks among clusters, thus decrease the
> > contention and improve the throughput, for example, stream benchmark can
> > improve around 4.3%~6.3% by this patch:
> >
> > w/o patch:
> > numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.36 nanoseconds
> > STREAM copy bandwidth: 57072.50 MB/sec
> > STREAM scale latency: 3.40 nanoseconds
> > STREAM scale bandwidth: 56542.52 MB/sec
> > STREAM add latency: 5.10 nanoseconds
> > STREAM add bandwidth: 56482.83 MB/sec
> > STREAM triad latency: 5.14 nanoseconds
> > STREAM triad bandwidth: 56069.52 MB/sec
> >
> > w/ patch:
> > $ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.22 nanoseconds
> > STREAM copy bandwidth: 59660.96 MB/sec->  +4.5%
> > STREAM scale latency: 3.25 nanoseconds
> > STREAM scale bandwidth: 59002.29 MB/sec   ->  +4.3%
> > STREAM add latency: 4.80 nanoseconds
> > STREAM add bandwidth: 60036.62 MB/sec ->  +6.3%
> > STREAM triad latency: 4.86 nanoseconds
> > STREAM triad bandwidth: 59228.30 MB/sec   ->  +5.6%
> >
> > On the other hand, while doing WAKE_AFFINE, this patch will try to find
> > a core in the target cluster before scanning the whole llc domain. So it
> > helps gather related tasks within one cluster.
> 
> Could you split this patch in 2 patches ? One for adding a cluster
> sched domain level and one for modifying the wake up path ?

Yes. If this is helpful, I would like to split into two patches.

> 
> This would ease the review and I would be curious about the impact of
> each feature in the performance. In particular, I'm still not
> convinced that the modification of the wakeup path is the root of the
> hackbench improvement; especially with g=14 where there should not be
> much idle CPUs with 14*40 tasks on at most 32 CPUs.  IIRC, there was

My understanding is that threads can still be blocked on the pipes, so
CPUs still have some chance to be idle even with a big g. Also note the
default g of hackbench is 10.

Anyway, I'd like to add some tracepoints to get the percentages of how
many wakees are picked from within the cluster and how many are selected
from cpus outside the cluster, roughly as sketched below.
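
A minimal illustration of what such instrumentation could look like at the
end of the idle-CPU scan (illustrative only: the trace_printk() is not part
of any posted patch, and cpu_cluster_mask() is the helper this series
introduces):

	/* Log whether the selected idle CPU sits in the target's cluster. */
	if ((unsigned int)i < nr_cpumask_bits)
		trace_printk("picked cpu %d %s the cluster of target %d\n",
			     i,
			     cpumask_test_cpu(i, cpu_cluster_mask(target)) ?
				"inside" : "outside",
			     target);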

> no obvious improvement with the changes in select_idle_cpu unless you
> hack the behavior to not fall back to llc domain
> 

You have a good memory. In a very old version I once mentioned that. But
at that time, I didn't decrease nr after scanning the cluster, so it was
scanning at least 8 cpus (4 within the cluster, 4 outside it). I guess
that is the reason my hack of not falling back to the llc domain could
bring an actual hackbench improvement.

> > we run the below hackbench with different -g parameter from 2 to 14, for
> > each different g, we run the command 10 times and get the average time
> > $ numactl -N 0 hackbench -p -T -l 2 -g $1
> >
> > hackbench will report the time which is needed to complete a certain number
> > of messages transmissions between a certain number of tasks, for example:
> > $ numactl -N 0 hackbench -p -T -l 2 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)

[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

2021-03-06 Thread tip-bot2 for Barry Song
The following commit has been merged into the irq/core branch of tip:

Commit-ID: cbe16f35bee6880becca6f20d2ebf6b457148552
Gitweb:
https://git.kernel.org/tip/cbe16f35bee6880becca6f20d2ebf6b457148552
Author:Barry Song 
AuthorDate:Wed, 03 Mar 2021 11:49:15 +13:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:48:00 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
So they are handling this issue by either way of the below two:

(1)
  irq_set_status_flags(irq, IRQ_NOAUTOEN);
  request_irq(dev, irq...);

(2)
  request_irq(dev, irq...);
  disable_irq(irq);

The code in the second way is silly and unsafe. In the small time gap
between request_irq() and disable_irq(), interrupts can still come.

The code in the first way is safe though it's suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as #1 above. With that the
various usage sites of #1 and #2 above can be simplified and corrected.
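
For illustration, converting a driver that uses pattern #2 then looks roughly
like this (the device, handler and "foo" names are made up):

	/* before: interrupts can fire in the window before disable_irq() */
	ret = request_irq(irq, foo_isr, IRQF_TRIGGER_HIGH, "foo", foo);
	if (ret)
		return ret;
	disable_irq(irq);

	/* after: the IRQ stays disabled until the driver calls enable_irq() */
	ret = request_irq(irq, foo_isr, IRQF_TRIGGER_HIGH | IRQF_NO_AUTOEN,
			  "foo", foo);
	if (ret)
		return ret;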

Signed-off-by: Barry Song 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
Cc: dmitry.torok...@gmail.com
Link: 
https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com
---
 include/linux/interrupt.h |  4 
 kernel/irq/manage.c   | 11 +--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *interrupt handler after suspending interrupts. For system
  *wakeup devices users need to implement wakeup detection in
  *their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request 
it.
+ *Users will enable it explicitly by enable_irq() or 
enable_nmi()
+ *later.
  */
 #define IRQF_SHARED0x0080
 #define IRQF_PROBE_SHARED  0x0100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD 0x0001
 #define IRQF_EARLY_RESUME  0x0002
 #define IRQF_COND_SUSPEND  0x0004
+#define IRQF_NO_AUTOEN 0x0008
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | 
IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, 
struct irqaction *new)
irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
}
 
-   if (irq_settings_can_autoenable(desc)) {
+   if (!(new->flags & IRQF_NO_AUTOEN) &&
+   irq_settings_can_autoenable(desc)) {
irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
} else {
/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, 
irq_handler_t handler,
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 *
+* Also shared interrupts do not go well with disabling auto enable.
+* The sharing interrupt might request it while it's still disabled
+* and then wait for interrupts forever.
+*
 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 * it cannot be set along with IRQF_NO_SUSPEND.
 */
if (((irqflags & IRQF_SHARED) && !dev_id) ||
+   ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
desc = irq_to_desc(irq);
 
-   if (!desc || irq_settings_can_autoenable(desc) ||
+   if (!desc || (irq_settings_can_autoenable(desc) &&
+   !(irqflags & IRQF_NO_AUTOEN)) ||
!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
!irq_supports_nmi(desc))


[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-03-06 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Gitweb:
https://git.kernel.org/tip/585b6d2723dc927ebc4ad884c4e879e4da8bc21f
Author:Barry Song 
AuthorDate:Wed, 24 Feb 2021 16:09:44 +13:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

   +--+ +--++---+   +--+
   | node |  12 |node  | 20 | node  |  12   |node  |
   |  0   +-+1 ++ 2 +---+3 |
   +--+ +--++---+   +--+

domain0        node0            node1            node2           node3

domain1        node0+1          node0+1          node2+3         node2+3
                                                    +
domain2        node0+1+2                            |
               group: node0+1                       |
                      group:node2+3  <--------------+

when node2 is added into the domain2 of node0, kernel is using the child
domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
the span of the domain including node0+1+2.

This will make load_balance() run based on screwed avg_load and group_type
in the sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the lower
level sgc directly.
   +--+ +--++---+   +--+
   | node |  12 |node  | 20 | node  |  12   |node  |
   |  0   +-+1 ++ 2 +---+3 |
   +--+ +--++---+   +--+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        |  node2+3        node2+3
                                               |
domain2        node0+1+2                       |
               group: node0+1                  |
                      group:node2  <-----------+

While the lower level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing grandchild trick, therefore,
sgc->next_update is still safe since it's only touched by CPUs that have
the group span as local group. And sgc->imbalance is also safe because
sd_parent remains the same in load_balance and LB only tries other CPUs
from the local group.
Moreover, since local groups are not touched, they are still getting
roughly equal size in a TL. And should_we_balance() only matters with
local groups, so the pull probability of those groups are still roughly
equal.

Tested by the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":
[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 
mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]domain-2: span=0-5 level=NUMA
[0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520] domain-3: span=0-7 level=NUMA
[0.804546]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 
mask=6-7 cap=4077 }
[ 

RE: [PATCH] sched/topology: remove redundant cpumask_and in init_overlap_sched_group

2021-03-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Saturday, March 6, 2021 12:49 AM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mi...@redhat.com; pet...@infradead.org;
> juri.le...@redhat.com; dietmar.eggem...@arm.com; rost...@goodmis.org;
> bseg...@google.com; mgor...@suse.de
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org; Song Bao Hua (Barry
> Song) 
> Subject: Re: [PATCH] sched/topology: remove redundant cpumask_and in
> init_overlap_sched_group
> 
> On 05/03/21 11:29, Barry Song wrote:
> > mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
> > it must be a subset of sched_group_span(sg).
> 
> So we should indeed have
> 
>   cpumask_subset(sched_group_span(sg), mask)
> 
> but that doesn't imply
> 
>   cpumask_first(sched_group_span(sg)) == cpumask_first(mask)
> 
> does it? I'm thinking if in your topology of N CPUs, CPUs 0 and N-1 are the
> furthest away, you will most likely hit

It is true that
cpumask_first(sched_group_span(sg)) != cpumask_first(mask)
can happen, but

cpumask_first_and(sched_group_span(sg), mask) == cpumask_first(mask)

always holds, since mask is always a subset of sched_group_span(sg).

/**
 * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
 * @src1p: the first input
 * @src2p: the second input
 *
 * Returns >= nr_cpu_ids if no cpus set in both.  See also cpumask_next_and().
 */

*srcp2 is a subset of *srcp1, so *srcp1 & *srcp2 == *srcp2.
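
As a concrete (made-up) example: if sched_group_span(sg) is {0-3} and the
balance mask built from it is {2,3}, then cpumask_first(sched_group_span(sg))
is 0 while cpumask_first(mask) is 2, yet
cpumask_first_and(sched_group_span(sg), mask) is also 2, so dropping the
extra AND cannot change the chosen balance cpu.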

> 
>   !cpumask_equal(sg_pan, sched_domain_span(sibling->child))
>  ^^^
>  CPUN-1CPU0
> 
> which should be the case on your Kunpeng920 system.
> 
> > Though cpumask_first_and
> > doesn't lead to a wrong result of balance cpu, it is pointless to do
> > cpumask_and again.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  kernel/sched/topology.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 12f8058..45f3db2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain
> *sd,
> > int cpu;
> >
> > build_balance_mask(sd, sg, mask);
> > -   cpu = cpumask_first_and(sched_group_span(sg), mask);
> > +   cpu = cpumask_first(mask);
> >
> > sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
> > --
> > 1.8.3.1

Thanks
Barry



[PATCH] sched/topology: remove redundant cpumask_and in init_overlap_sched_group

2021-03-04 Thread Barry Song
mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
it must be a subset of sched_group_span(sg). Though cpumask_first_and
doesn't lead to a wrong result of balance cpu, it is pointless to do
cpumask_and again.

Signed-off-by: Barry Song 
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 12f8058..45f3db2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain 
*sd,
int cpu;
 
build_balance_mask(sd, sg, mask);
-   cpu = cpumask_first_and(sched_group_span(sg), mask);
+   cpu = cpumask_first(mask);
 
sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
if (atomic_inc_return(&sg->sgc->ref) == 1)
-- 
1.8.3.1



[tip: irq/core] genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

2021-03-04 Thread tip-bot2 for Barry Song
The following commit has been merged into the irq/core branch of tip:

Commit-ID: e749df1bbd23f4472082210650514548d8a39e9b
Gitweb:
https://git.kernel.org/tip/e749df1bbd23f4472082210650514548d8a39e9b
Author:Barry Song 
AuthorDate:Wed, 03 Mar 2021 11:49:15 +13:00
Committer: Thomas Gleixner 
CommitterDate: Thu, 04 Mar 2021 11:47:52 +01:00

genirq: Add IRQF_NO_AUTOEN for request_irq/nmi()

Many drivers don't want interrupts enabled automatically via request_irq().
So they are handling this issue by either way of the below two:

(1)
  irq_set_status_flags(irq, IRQ_NOAUTOEN);
  request_irq(dev, irq...);

(2)
  request_irq(dev, irq...);
  disable_irq(irq);

The code in the second way is silly and unsafe. In the small time gap
between request_irq() and disable_irq(), interrupts can still come.

The code in the first way is safe though it's suboptimal.

Add a new IRQF_NO_AUTOEN flag which can be handed in by drivers to
request_irq() and request_nmi(). It prevents the automatic enabling of the
requested interrupt/nmi in the same safe way as #1 above. With that the
various usage sites of #1 and #2 above can be simplified and corrected.

Signed-off-by: Barry Song 
Signed-off-by: Thomas Gleixner 
Cc: dmitry.torok...@gmail.com
Link: 
https://lore.kernel.org/r/20210302224916.13980-2-song.bao@hisilicon.com

---
 include/linux/interrupt.h |  4 
 kernel/irq/manage.c   | 11 +--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e257..76f1161 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *interrupt handler after suspending interrupts. For system
  *wakeup devices users need to implement wakeup detection in
  *their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request 
it.
+ *Users will enable it explicitly by enable_irq() or 
enable_nmi()
+ *later.
  */
 #define IRQF_SHARED0x0080
 #define IRQF_PROBE_SHARED  0x0100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD 0x0001
 #define IRQF_EARLY_RESUME  0x0002
 #define IRQF_COND_SUSPEND  0x0004
+#define IRQF_NO_AUTOEN 0x0008
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | 
IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73..97c231a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, 
struct irqaction *new)
irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
}
 
-   if (irq_settings_can_autoenable(desc)) {
+   if (!(new->flags & IRQF_NO_AUTOEN) &&
+   irq_settings_can_autoenable(desc)) {
irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
} else {
/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, 
irq_handler_t handler,
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 *
+* Also shared interrupts do not go well with disabling auto enable.
+* The sharing interrupt might request it while it's still disabled
+* and then wait for interrupts forever.
+*
 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 * it cannot be set along with IRQF_NO_SUSPEND.
 */
if (((irqflags & IRQF_SHARED) && !dev_id) ||
+   ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
desc = irq_to_desc(irq);
 
-   if (!desc || irq_settings_can_autoenable(desc) ||
+   if (!desc || (irq_settings_can_autoenable(desc) &&
+   !(irqflags & IRQF_NO_AUTOEN)) ||
!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
!irq_supports_nmi(desc))


[tip: sched/core] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-03-04 Thread tip-bot2 for Barry Song
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9f4af5753b691b9df558ddcfea13e9f3036e45ca
Gitweb:
https://git.kernel.org/tip/9f4af5753b691b9df558ddcfea13e9f3036e45ca
Author:Barry Song 
AuthorDate:Wed, 24 Feb 2021 16:09:44 +13:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 04 Mar 2021 09:56:00 +01:00

sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

As long as NUMA diameter > 2, building sched_domain by sibling's child
domain will definitely create a sched_domain with sched_group which will
span out of the sched_domain:

   +--+ +--++---+   +--+
   | node |  12 |node  | 20 | node  |  12   |node  |
   |  0   +-+1 ++ 2 +---+3 |
   +--+ +--++---+   +--+

domain0        node0            node1            node2           node3

domain1        node0+1          node0+1          node2+3         node2+3
                                                    +
domain2        node0+1+2                            |
               group: node0+1                       |
                      group:node2+3  <--------------+

when node2 is added into the domain2 of node0, kernel is using the child
domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
the span of the domain including node0+1+2.

This will make load_balance() run based on screwed avg_load and group_type
in the sched_group spanning out of the sched_domain, and it also makes
select_task_rq_fair() pick an idle CPU outside the sched_domain.

Real servers which suffer from this problem include Kunpeng920 and 8-node
Sun Fire X4600-M2, at least.

Here we move to use the *child* domain of the *child* domain of node2's
domain2 as the newly added sched_group. At the same time, we re-use the lower
level sgc directly.
   +--+ +--++---+   +--+
   | node |  12 |node  | 20 | node  |  12   |node  |
   |  0   +-+1 ++ 2 +---+3 |
   +--+ +--++---+   +--+

domain0        node0            node1          +- node2          node3
                                               |
domain1        node0+1          node0+1        |  node2+3        node2+3
                                               |
domain2        node0+1+2                       |
               group: node0+1                  |
                      group:node2  <-----------+

While the lower level sgc is re-used, this patch only changes the remote
sched_groups for those sched_domains playing grandchild trick, therefore,
sgc->next_update is still safe since it's only touched by CPUs that have
the group span as local group. And sgc->imbalance is also safe because
sd_parent remains the same in load_balance and LB only tries other CPUs
from the local group.
Moreover, since local groups are not touched, they are still getting
roughly equal size in a TL. And should_we_balance() only matters with
local groups, so the pull probability of those groups are still roughly
equal.

Tested by the below topology:
qemu-system-aarch64  -M virt -nographic \
 -smp cpus=8 \
 -numa node,cpus=0-1,nodeid=0 \
 -numa node,cpus=2-3,nodeid=1 \
 -numa node,cpus=4-5,nodeid=2 \
 -numa node,cpus=6-7,nodeid=3 \
 -numa dist,src=0,dst=1,val=12 \
 -numa dist,src=0,dst=2,val=20 \
 -numa dist,src=0,dst=3,val=22 \
 -numa dist,src=1,dst=2,val=22 \
 -numa dist,src=2,dst=3,val=12 \
 -numa dist,src=1,dst=3,val=24 \
 -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image

w/o patch, we get lots of "groups don't span domain->span":
[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 
mask=6-7 cap=4077 }
[0.804055] CPU1 attaching sched-domain(s):
[0.804072]  domain-0: span=0-1 level=MC
[0.804096]   groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 }
[0.804152]   domain-1: span=0-3 level=NUMA
[0.804170]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.804219]domain-2: span=0-5 level=NUMA
[0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.804302] ERROR: groups don't span domain->span
[0.804520] domain-3: span=0-7 level=NUMA
[0.804546]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 
mask=6-7 cap=40

[PATCH v5 2/2] Input: move to use request_irq by IRQF_NO_AUTOEN flag

2021-03-03 Thread Barry Song
disable_irq() after request_irq() still has a time gap in which
interrupts can come. request_irq() with IRQF_NO_AUTOEN flag will
disable IRQ auto-enable because of requesting.

On the other hand, request_irq() after setting IRQ_NOAUTOEN as
below
irq_set_status_flags(irq, IRQ_NOAUTOEN);
request_irq(dev, irq...);
can also be replaced by request_irq() with IRQF_NO_AUTOEN flag.

Signed-off-by: Barry Song 
---
 drivers/input/keyboard/tca6416-keypad.c  | 3 +--
 drivers/input/keyboard/tegra-kbc.c   | 5 ++---
 drivers/input/touchscreen/ar1021_i2c.c   | 5 +
 drivers/input/touchscreen/atmel_mxt_ts.c | 5 ++---
 drivers/input/touchscreen/bu21029_ts.c   | 4 ++--
 drivers/input/touchscreen/cyttsp_core.c  | 5 ++---
 drivers/input/touchscreen/melfas_mip4.c  | 5 ++---
 drivers/input/touchscreen/mms114.c   | 4 ++--
 drivers/input/touchscreen/stmfts.c   | 3 +--
 drivers/input/touchscreen/wm831x-ts.c| 3 +--
 drivers/input/touchscreen/zinitix.c  | 4 ++--
 11 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/drivers/input/keyboard/tca6416-keypad.c 
b/drivers/input/keyboard/tca6416-keypad.c
index 9b0f9665dcb0..2a9755910065 100644
--- a/drivers/input/keyboard/tca6416-keypad.c
+++ b/drivers/input/keyboard/tca6416-keypad.c
@@ -274,7 +274,7 @@ static int tca6416_keypad_probe(struct i2c_client *client,
error = request_threaded_irq(chip->irqnum, NULL,
 tca6416_keys_isr,
 IRQF_TRIGGER_FALLING |
-   IRQF_ONESHOT,
+IRQF_ONESHOT | IRQF_NO_AUTOEN,
 "tca6416-keypad", chip);
if (error) {
dev_dbg(&client->dev,
@@ -282,7 +282,6 @@ static int tca6416_keypad_probe(struct i2c_client *client,
chip->irqnum, error);
goto fail1;
}
-   disable_irq(chip->irqnum);
}
 
error = input_register_device(input);
diff --git a/drivers/input/keyboard/tegra-kbc.c 
b/drivers/input/keyboard/tegra-kbc.c
index 9671842a082a..570fe18c0ce9 100644
--- a/drivers/input/keyboard/tegra-kbc.c
+++ b/drivers/input/keyboard/tegra-kbc.c
@@ -694,14 +694,13 @@ static int tegra_kbc_probe(struct platform_device *pdev)
input_set_drvdata(kbc->idev, kbc);
 
err = devm_request_irq(&pdev->dev, kbc->irq, tegra_kbc_isr,
-  IRQF_TRIGGER_HIGH, pdev->name, kbc);
+  IRQF_TRIGGER_HIGH | IRQF_NO_AUTOEN,
+  pdev->name, kbc);
if (err) {
dev_err(>dev, "failed to request keyboard IRQ\n");
return err;
}
 
-   disable_irq(kbc->irq);
-
err = input_register_device(kbc->idev);
if (err) {
dev_err(>dev, "failed to register input device\n");
diff --git a/drivers/input/touchscreen/ar1021_i2c.c 
b/drivers/input/touchscreen/ar1021_i2c.c
index c0d5c2413356..dc6a85362a40 100644
--- a/drivers/input/touchscreen/ar1021_i2c.c
+++ b/drivers/input/touchscreen/ar1021_i2c.c
@@ -125,7 +125,7 @@ static int ar1021_i2c_probe(struct i2c_client *client,
 
error = devm_request_threaded_irq(&client->dev, client->irq,
  NULL, ar1021_i2c_irq,
- IRQF_ONESHOT,
+ IRQF_ONESHOT | IRQF_NO_AUTOEN,
  "ar1021_i2c", ar1021);
if (error) {
dev_err(&client->dev,
@@ -133,9 +133,6 @@ static int ar1021_i2c_probe(struct i2c_client *client,
return error;
}
 
-   /* Disable the IRQ, we'll enable it in ar1021_i2c_open() */
-   disable_irq(client->irq);
-
error = input_register_device(ar1021->input);
if (error) {
dev_err(&client->dev,
diff --git a/drivers/input/touchscreen/atmel_mxt_ts.c 
b/drivers/input/touchscreen/atmel_mxt_ts.c
index 383a848eb601..3c837c7b24b3 100644
--- a/drivers/input/touchscreen/atmel_mxt_ts.c
+++ b/drivers/input/touchscreen/atmel_mxt_ts.c
@@ -3156,15 +3156,14 @@ static int mxt_probe(struct i2c_client *client, const 
struct i2c_device_id *id)
}
 
error = devm_request_threaded_irq(&client->dev, client->irq,
- NULL, mxt_interrupt, IRQF_ONESHOT,
+ NULL, mxt_interrupt,
+ IRQF_ONESHOT | IRQF_NO_AUTOEN,
  client->name, data);
if (error) {
dev_err(>dev, "Failed to register interrupt\n");
return error;
}
 
-   disable_irq(client->irq);
-
error = regulator_bulk_enable(ARRAY_SIZE(data->regu

[PATCH v5 1/2] genirq: add IRQF_NO_AUTOEN for request_irq

2021-03-03 Thread Barry Song
Many drivers don't want interrupts enabled automatically via
request_irq(). So they are handling this issue in either of the
two ways below:
(1)
irq_set_status_flags(irq, IRQ_NOAUTOEN);
request_irq(dev, irq...);
(2)
request_irq(dev, irq...);
disable_irq(irq);

The code in the second way is silly and unsafe. In the small time
gap between request_irq() and disable_irq(), interrupts can still
come.
The code in the first way is safe though we might be able to do it
in the generic irq code.

With this patch, drivers can request_irq with IRQF_NO_AUTOEN flag.
They will need neither irq_set_status_flags() nor disable_irq().

In the meantime, drivers using the below pattern for NMI
irq_set_status_flags(irq, IRQ_NOAUTOEN);
request_nmi(dev, irq...);

can also move to request_nmi() with IRQF_NO_AUTOEN flag.

Cc: Dmitry Torokhov 
Signed-off-by: Barry Song 
---
-v5:
  * add the same check for IRQF_NO_AUTOEN in request_nmi()

 include/linux/interrupt.h |  4 
 kernel/irq/manage.c   | 11 +--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 967e25767153..76f1161a441a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,9 @@
  *interrupt handler after suspending interrupts. For system
  *wakeup devices users need to implement wakeup detection in
  *their interrupt handlers.
+ * IRQF_NO_AUTOEN - Don't enable IRQ or NMI automatically when users request 
it.
+ *Users will enable it explicitly by enable_irq() or 
enable_nmi()
+ *later.
  */
 #define IRQF_SHARED0x0080
 #define IRQF_PROBE_SHARED  0x0100
@@ -74,6 +77,7 @@
 #define IRQF_NO_THREAD 0x0001
 #define IRQF_EARLY_RESUME  0x0002
 #define IRQF_COND_SUSPEND  0x0004
+#define IRQF_NO_AUTOEN 0x0008
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND | 
IRQF_NO_THREAD)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dec3f73e8db9..97c231a5644c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, 
struct irqaction *new)
irqd_set(&desc->irq_data, IRQD_NO_BALANCING);
}
 
-   if (irq_settings_can_autoenable(desc)) {
+   if (!(new->flags & IRQF_NO_AUTOEN) &&
+   irq_settings_can_autoenable(desc)) {
irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
} else {
/*
@@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq, 
irq_handler_t handler,
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 *
+* Also shared interrupts do not go well with disabling auto enable.
+* The sharing interrupt might request it while it's still disabled
+* and then wait for interrupts forever.
+*
 * Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
 * it cannot be set along with IRQF_NO_SUSPEND.
 */
if (((irqflags & IRQF_SHARED) && !dev_id) ||
+   ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;
@@ -2245,7 +2251,8 @@ int request_nmi(unsigned int irq, irq_handler_t handler,
 
desc = irq_to_desc(irq);
 
-   if (!desc || irq_settings_can_autoenable(desc) ||
+   if (!desc || (irq_settings_can_autoenable(desc) &&
+   !(irqflags & IRQF_NO_AUTOEN)) ||
!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)) ||
!irq_supports_nmi(desc))
-- 
2.25.1



[PATCH v5 0/2] add IRQF_NO_AUTOEN for request_irq

2021-03-03 Thread Barry Song
-v5:
  * add the same check for IRQF_NO_AUTOEN in request_nmi()
  * combine a dozen of separate patches of input into one (hopefully
this could easy the life of the maintainers)

-v4:
  * remove the irq_settings magic for NOAUTOEN with respect to
Thomas's comment

Barry Song (2):
  genirq: add IRQF_NO_AUTOEN for request_irq
  Input: move to use request_irq by IRQF_NO_AUTOEN flag

 drivers/input/keyboard/tca6416-keypad.c  |  3 +--
 drivers/input/keyboard/tegra-kbc.c   |  5 ++---
 drivers/input/touchscreen/ar1021_i2c.c   |  5 +
 drivers/input/touchscreen/atmel_mxt_ts.c |  5 ++---
 drivers/input/touchscreen/bu21029_ts.c   |  4 ++--
 drivers/input/touchscreen/cyttsp_core.c  |  5 ++---
 drivers/input/touchscreen/melfas_mip4.c  |  5 ++---
 drivers/input/touchscreen/mms114.c   |  4 ++--
 drivers/input/touchscreen/stmfts.c   |  3 +--
 drivers/input/touchscreen/wm831x-ts.c|  3 +--
 drivers/input/touchscreen/zinitix.c  |  4 ++--
 include/linux/interrupt.h|  4 
 kernel/irq/manage.c  | 11 +--
 13 files changed, 31 insertions(+), 30 deletions(-)

-- 
2.25.1



[PATCH] Documentation/admin-guide: kernel-parameters: correct the architectures for numa_balancing

2021-03-02 Thread Barry Song
X86 isn't the only architecture supporting NUMA_BALANCING. ARM64, PPC,
S390 and RISCV also support it:

arch$ git grep NUMA_BALANCING
arm64/Kconfig:  select ARCH_SUPPORTS_NUMA_BALANCING
arm64/configs/defconfig:CONFIG_NUMA_BALANCING=y
arm64/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
powerpc/configs/powernv_defconfig:CONFIG_NUMA_BALANCING=y
powerpc/configs/ppc64_defconfig:CONFIG_NUMA_BALANCING=y
powerpc/configs/pseries_defconfig:CONFIG_NUMA_BALANCING=y
powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
powerpc/include/asm/book3s/64/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */
powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
powerpc/include/asm/book3s/64/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */
powerpc/include/asm/nohash/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
powerpc/include/asm/nohash/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */
powerpc/platforms/Kconfig.cputype:  select ARCH_SUPPORTS_NUMA_BALANCING
riscv/Kconfig:  select ARCH_SUPPORTS_NUMA_BALANCING
riscv/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
s390/Kconfig:   select ARCH_SUPPORTS_NUMA_BALANCING
s390/configs/debug_defconfig:CONFIG_NUMA_BALANCING=y
s390/configs/defconfig:CONFIG_NUMA_BALANCING=y
s390/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
x86/Kconfig:select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
x86/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING
x86/include/asm/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */

On the other hand, setup_numabalancing() is implemented in mm/mempolicy.c
which doesn't depend on architectures.
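
For example, on any of these architectures automatic NUMA balancing can be
turned off at boot by appending "numa_balancing=disable" to the kernel
command line, matching the allowed values documented below.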

Cc: Mel Gorman 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: "Paul E. McKenney" 
Cc: Randy Dunlap 
Cc: Andrew Morton 
Cc: Thomas Gleixner 
Cc: Mauro Carvalho Chehab 
Cc: Viresh Kumar 
Cc: Mike Kravetz 
Cc: Peter Zijlstra 
Signed-off-by: Barry Song 
---
 Documentation/admin-guide/kernel-parameters.rst | 1 +
 Documentation/admin-guide/kernel-parameters.txt | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.rst 
b/Documentation/admin-guide/kernel-parameters.rst
index 1132796a8d96..24302cad174a 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -140,6 +140,7 @@ parameter is applicable::
PPT Parallel port support is enabled.
PS2 Appropriate PS/2 support is enabled.
RAM RAM disk support is enabled.
+   RISCV   RISCV architecture is enabled.
RDT Intel Resource Director Technology.
S390S390 architecture is enabled.
SCSIAppropriate SCSI support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..371a02ae1e21 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3472,7 +3472,8 @@
 
nr_uarts=   [SERIAL] maximum number of UARTs to be registered.
 
-   numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
+   numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable 
automatic
+   NUMA balancing.
Allowed values are enable and disable
 
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-- 
2.25.1



[RFC PATCH v4 3/3] scheduler: Add cluster scheduler level for x86

2021-03-01 Thread Barry Song
From: Tim Chen 

There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
is shared among a cluster of cores instead of being exclusive
to one single core.

To prevent oversubscription of L2 cache, load should be
balanced between such L2 clusters, especially for tasks with
no shared data.

Also with cluster scheduling policy where tasks are woken up
in the same L2 cluster, we will benefit from keeping tasks
related to each other and likely sharing data in the same L2
cluster.

Add CPU masks of CPUs sharing the L2 cache so we can build such
L2 cluster scheduler domain.

Signed-off-by: Tim Chen 
Signed-off-by: Barry Song 
---
 arch/x86/Kconfig|  8 
 arch/x86/include/asm/smp.h  |  7 +++
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c|  3 +++
 arch/x86/kernel/smpboot.c   | 43 -
 6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d3338a8..40110de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1009,6 +1009,14 @@ config NR_CPUS
  This is purely to save memory: each supported CPU adds about 8KB
  to the kernel image.
 
+config SCHED_CLUSTER
+   bool "Cluster scheduler support"
+   default n
+   help
+Cluster scheduler support improves the CPU scheduler's decision
+making when dealing with machines that have clusters of CPUs
+sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+   return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2a11ccc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include 
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
l2 = new_l2;
 #ifdef CONFIG_SMP
per_cpu(cpu_llc_id, cpu) = l2_id;
+   per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad848..fb08c73 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -501,6 +503,21 @@ static bool match_llc(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
return topology_sane(c, o, "llc");
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+   int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+   /* Do not match if we do not have a valid APICID for cpu: */
+   if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+   return false;

[RFC PATCH v4 2/3] scheduler: add scheduler level for clusters

2021-03-01 Thread Barry Song
ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each cluster will share some
internal system bus. This means cache coherence overhead inside one
cluster is much less than the overhead across clusters.

This patch adds the sched_domain for clusters. On kunpeng 920, without
this patch, domain0 of cpu0 would be MC covering cpu0~cpu23; with this
patch, MC becomes domain1, and a new domain0 "CLS" includes cpu0-cpu3.

This will help spread unrelated tasks among clusters, thus decrease the
contention and improve the throughput, for example, stream benchmark can
improve around 4.3%~6.3% by this patch:

w/o patch:
numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
STREAM copy latency: 3.36 nanoseconds
STREAM copy bandwidth: 57072.50 MB/sec
STREAM scale latency: 3.40 nanoseconds
STREAM scale bandwidth: 56542.52 MB/sec
STREAM add latency: 5.10 nanoseconds
STREAM add bandwidth: 56482.83 MB/sec
STREAM triad latency: 5.14 nanoseconds
STREAM triad bandwidth: 56069.52 MB/sec

w/ patch:
$ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
STREAM copy latency: 3.22 nanoseconds
STREAM copy bandwidth: 59660.96 MB/sec->  +4.5%
STREAM scale latency: 3.25 nanoseconds
STREAM scale bandwidth: 59002.29 MB/sec   ->  +4.3%
STREAM add latency: 4.80 nanoseconds
STREAM add bandwidth: 60036.62 MB/sec ->  +6.3%
STREAM triad latency: 4.86 nanoseconds
STREAM triad bandwidth: 59228.30 MB/sec   ->  +5.6%

On the other hand, while doing WAKE_AFFINE, this patch will try to find
a core in the target cluster before scanning the whole llc domain. So it
helps gather related tasks within one cluster.
We run the below hackbench with different -g parameters from 2 to 14; for
each g, we run the command 10 times and take the average time:
$ numactl -N 0 hackbench -p -T -l 2 -g $1

hackbench will report the time which is needed to complete a certain number
of messages transmissions between a certain number of tasks, for example:
$ numactl -N 0 hackbench -p -T -l 2 -g 10
Running in threaded mode with 10 groups using 40 file descriptors each
(== 400 tasks)
Each sender will pass 2 messages of 100 bytes
Time: 8.874

The below is the result of hackbench w/ and w/o the patch:
g=2  4 6   8  10 12  14
w/o: 1.9596 4.0506 5.9654 8.0068 9.8147 11.4900 13.1163
w/ : 1.9362 3.9197 5.6570 7.1376 8.5263 10.0512 11.3256
+3.3%  +5.2%  +10.9% +13.2%  +12.8%  +13.7%

Signed-off-by: Barry Song 
---
-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key (addressing
    Vincent's comment)

 arch/arm64/Kconfig |  7 +
 include/linux/sched/cluster.h  | 19 
 include/linux/sched/sd_flags.h |  9 ++
 include/linux/sched/topology.h |  7 +
 include/linux/topology.h   |  7 +
 kernel/sched/core.c| 18 
 kernel/sched/fair.c| 66 +-
 kernel/sched/sched.h   |  1 +
 kernel/sched/topology.c|  6 
 9 files changed, 126 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/sched/cluster.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f39568b..158b0fa 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -971,6 +971,13 @@ config SCHED_MC
  making when dealing with multi-core CPU chips at a cost of slightly
  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+   bool "Cluster scheduler support"
+   help
+ Cluster scheduler support improves the CPU scheduler's decision
+ making when dealing with machines that have clusters(sharing internal
+ bus or sharing LLC cache tag). If unsure say N here.
+
 config SCHED_SMT
bool "SMT scheduler support"
help
diff --git a/include/linux/sched/cluster.h b/include/linux/sched/cluster.h
new file mode 100644
index 000..ea6c475
--- /dev/null
+++ b/include/linux/sched/cluster.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SCHED_CLUSTER_H
+#define _LINUX_SCHED_CLUSTER_H
+
+#include 
+
+#ifdef CONFIG_SCHED_CLUSTER
+extern struct static_key_false sched_cluster_present;
+
+static __always_inline bool sched_cluster_active(void)
+{
+   return static_branch_likely(&sched_cluster_present);
+}
+#else
+static inline bool sched_cluster_active(void) { return false; }
+
+#endif
+
+#endif
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 34b21e9..fc3c894 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -100,6 +100,15 @@
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
 /*
+ * Domain members share 

[RFC PATCH v4 0/3] scheduler: expose the topology of clusters and add cluster scheduler

2021-03-01 Thread Barry Song
[diagram: unrelated tasks such as task1 and task2 spread across cluster1 and
cluster2]

2. gathering related tasks within a cluster, which improves the cache affinity
of tasks talking with each other.
Without a cluster sched_domain, related tasks might be placed randomly. In case
tasks 1-8 have the wakeup relationships below:
Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task4 wakes up task7
With the tuning of select_idle_cpu() to scan the local cluster first, those
tasks might get a chance to be gathered like:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 4   | || |2   |  |5|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |3|   |  6   ||| |4| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+
Otherwise, the result might be:
+---++--+
| +++-+ || ++  +-+  |
| |task||task | || |task|  |task |  |
| |1   || 2   | || |5   |  |6|  |
| +++-+ || ++  +-+  |
|   ||  |
|   cluster1|| cluster2 |
|   ||  |
|   ||  |
| +-+   +--+|| +-+ +--+ |
| |task |   | task ||| |task | |task  | |
| |3|   |  4   ||| |7| |8 | |
| +-+   +--+|| +-+ +--+ |
+---++--+

-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key (addressing
    Vincent's comment)
  * used acpi_cpu_id for acpi_find_processor_node(addressing Masa's comment)

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 ++--
 arch/arm64/Kconfig|  7 
 arch/arm64/kernel/topology.c  |  2 +
 arch/x86/Kconfig  |  8 
 arch/x86/include/asm/smp.h|  7 
 arch/x86/include/asm/topology.h   |  1 +
 arch/x86/kernel/cpu/cacheinfo.c   |  1 +
 arch/x86/kernel/cpu/common.c  |  3 ++
 arch/x86/kernel/smpboot.c | 43 +++-
 drivers/acpi/pptt.c   | 63 +
 drivers/base/arch_topology.c  | 14 +++
 drivers/base/topology.c   | 10 +
 include/linux/acpi.h  |  5 +++
 include/linux/arch_topology.h |  5 +++
 include/linux/sched/cluster.h | 19 +
 include/linux/sched/sd_flags.h|  9 +
 include/linux/sched/topology.h|  7 
 include/linux/topology.h  | 13 ++
 kernel/sched/core.c   | 18 +
 kernel/sched/fair.c   | 66 ---
 kernel/sched/sched.h  |  1 +
 kernel/sched/topology.c   |  6 +++
 22 files changed, 315 insertions(+), 19 deletions(-)
 create mode 100644 include/linux/sched/cluster.h

-- 
1.8.3.1



[RFC PATCH v4 1/3] topology: Represent clusters of CPUs within a die.

2021-03-01 Thread Barry Song
[truncated ASCII diagram: CPUs grouped into clusters within a die]

That means the cost to transfer ownership of a cacheline between CPUs
within a cluster is lower than between CPUs in different clusters on
the same die. Hence, it can make sense to tell the scheduler to use
the cache affinity of the cluster to make better decisions on thread
migration.

This patch simply exposes this information to userspace libraries
like hwloc by providing cluster_cpus and related sysfs attributes.
PoC of HWLOC support at [2].
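
(For illustration only, not part of this patch: once the attributes below
exist, a user-space consumer can read them directly; the sketch assumes the
cluster_cpus_list file added by this patch under the usual topology path.)

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

	if (!f) {
		perror("cluster_cpus_list");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("CPU0 cluster siblings: %s", buf);
	fclose(f);
	return 0;
}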

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Currently the ID provided is the offset of the Processor
Hierarchy Nodes Structure within PPTT.  Whilst this is unique
it is not terribly elegant, so alternative suggestions are welcome.

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology but that may be unrelated to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron 
Signed-off-by: Barry Song 
---
  -v4:
  * used acpi_cpu_id for acpi_find_processor_node() (addressing Masa's comment)

 Documentation/admin-guide/cputopology.rst | 26 +++--
 arch/arm64/kernel/topology.c  |  2 +
 drivers/acpi/pptt.c   | 63 +++
 drivers/base/arch_topology.c  | 14 +++
 drivers/base/topology.c   | 10 +
 include/linux/acpi.h  |  5 +++
 include/linux/arch_topology.h |  5 +++
 include/linux/topology.h  |  6 +++
 8 files changed, 127 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cputopology.rst 
b/Documentation/admin-guide/cputopology.rst
index b90dafc..f9d3745 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -24,6 +24,12 @@ core_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
+cluster_id:
+
+   the Cluster ID of cpuX.  Typically it is the hardware platform's
+   identifier (rather than the kernel's).  The actual value is
+   architecture and platform dependent.
+
 book_id:
 
the book ID of cpuX. Typically it is the hardware platform's
@@ -56,6 +62,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+cluster_cpus:
+
+   internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+   human-readable list of CPUs within the same cluster.
+
 die_cpus:
 
internal kernel map of CPUs within the same die.
@@ -96,11 +110,13 @@ these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
+   #define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index f6faa69..fe076b3 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
cpu_topology[cpu].thread_id  = -1;
cpu_topology[cpu].core_id= topology_id;
}
+   topology_id = find_acpi_cpu_topology_cluster(cpu);
+   cpu_topology[cpu].cluster_id = topology_id;
topology_id = find_acpi_cpu_topology_package(cpu);
cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index 4ae9335..11f8b02 100644
--- a/drivers/a

RE: [Linuxarm] [PATCH v1] drm/nouveau/device: append a NUL-terminated character for the string which filled by strncpy()

2021-02-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Luo Jiaxing [mailto:luojiax...@huawei.com]
> Sent: Friday, February 26, 2021 12:39 AM
> To: nouv...@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> bske...@redhat.com
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org; luojiaxing
> 
> Subject: [Linuxarm] [PATCH v1] drm/nouveau/device: append a NUL-terminated
> character for the string which filled by strncpy()
> 
> Following warning is found when using W=1 to build kernel:
> 
> In function ‘nvkm_udevice_info’,
> inlined from ‘nvkm_udevice_mthd’ at
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:195:10:
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:164:2: warning: ‘strncpy’
> specified bound 16 equals destination size [-Wstringop-truncation]
>   164 |  strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip));
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:165:2: warning: ‘strncpy’
> specified bound 64 equals destination size [-Wstringop-truncation]
>   165 |  strncpy(args->v0.name, device->name, sizeof(args->v0.name));
> 
> The reason for this warning is that strncpy() does not guarantee that the
> destination buffer will be NUL-terminated. If the length of the source string
> is bigger than the number passed as the third parameter, only the first
> [number] characters are copied to the destination and no NUL terminator is
> automatically added. There are some potential risks.
> 
> Signed-off-by: Luo Jiaxing 
> ---
>  drivers/gpu/drm/nouveau/nvkm/engine/device/user.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> b/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> index fea9d8f..2a32fe0 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> @@ -161,8 +161,10 @@ nvkm_udevice_info(struct nvkm_udevice *udev, void *data,
> u32 size)
>   if (imem && args->v0.ram_size > 0)
>   args->v0.ram_user = args->v0.ram_user - imem->reserved;
> 
> - strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip));
> - strncpy(args->v0.name, device->name, sizeof(args->v0.name));
> + strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip) - 1);
> + args->v0.chip[sizeof(args->v0.chip) - 1] = '\0';
> + strncpy(args->v0.name, device->name, sizeof(args->v0.name) - 1);
> + args->v0.name[sizeof(args->v0.name) - 1] = '\0';


Isn't it better to use snprintf()?
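
(A stand-alone sketch, not nouveau code, contrasting the two approaches;
the buffer size and string below are made up for the demo.)

#include <stdio.h>
#include <string.h>

int main(void)
{
	char chip[8];
	const char *name = "a-very-long-chip-name";

	/* strncpy() does not NUL-terminate on truncation, so a manual
	 * fix-up is required. */
	strncpy(chip, name, sizeof(chip) - 1);
	chip[sizeof(chip) - 1] = '\0';
	printf("strncpy : %s\n", chip);

	/* snprintf() always NUL-terminates, no fix-up needed. */
	snprintf(chip, sizeof(chip), "%s", name);
	printf("snprintf: %s\n", chip);

	return 0;
}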

>   return 0;
>  }
> 
Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-24 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 24, 2021 6:21 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization
> for SCSI drivers
> 
> On Tue, 23 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > >
> > > Regarding m68k, your analysis overlooks the timing issue. E.g. patch
> > > 11/32 could be a problem because removing the irqsave would allow PDMA
> > > transfers to be interrupted. Aside from the timing issues, I agree
> > > with your analysis above regarding m68k.
> >
> > You mentioned you need realtime so you want an interrupt to be able to
> > preempt another one.
> 
> That's not what I said. But for the sake of discussion, yes, I do know
> people who run Linux on ARM hardware (if Android vendor kernels can be
> called "Linux") and who would benefit from realtime support on those
> devices.

Realtime is definitely a genuine requirement on ARM Linux.

I once talked/worked with some people who were using ARM for realtime
systems.
The feasible approaches include:
1. Dual OS (RTOS + Linux): e.g. QNX+Linux, Xenomai+Linux, L4+Linux
2. preempt-rt
which is continuously maintained, e.g.:
https://lore.kernel.org/lkml/20210218201041.65fknr7bdplwq...@linutronix.de/
3. bootargs isolcpus=
to isolate a CPU for a specific realtime task or interrupt
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime
4. ARM FIQ, which has a separate FIQ API; an example in fsl sound:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/sound/soc/fsl/imx-pcm-fiq.c
5. Making one core invisible to Linux and
running a bare-metal program or an RTOS on that core

Honestly, I've never seen anyone who depends on irq priority to support
realtime in ARM Linux, though ARM's RTOSes use it quite commonly.

> 
> > Now you said you want an interrupt not to be preempted as it will make a
> > timing issue.
> 
> mac_esp deliberately constrains segment sizes so that it can harmlessly
> disable interrupts for the duration of the transfer.
> 
> Maybe the irqsave in this driver is over-cautious. Who knows? The PDMA
> timing problem relates to SCSI bus signalling and the tolerance of real-
> world SCSI devices to same. The other problem is that the PDMA logic
> circuit is undocumented hardware. So there may be further timing
> requirements lurking there. Therefore, patch 11/32 is too risky.
> 
> > If this PDMA transfer will have some problem when it is preempted, I
> > believe we need some enhanced ways to handle this, otherwise, once we
> > enable preempt_rt or threaded_irq, it will get the timing issue. so here
> > it needs a clear comment and IRQF_NO_THREAD if this is the case.
> >
> 
> People who require fast response times cannot expect random drivers or
> platforms to meet such requirements. I fear you may be asking too much
> from Mac Quadra machines.

Once preempt_rt is enabled, those who want a fast irq environment need
the IRQF_NO_THREAD flag, or need to set their irq thread to a higher
SCHED_FIFO/RR priority.

> 
> > >
> > > With regard to other architectures and platforms, in specific cases,
> > > e.g. where there's never more than one IRQ involved, then I could
> > > agree that your assumptions probably hold and an irqsave would be
> > > probably redundant.
> > >
> > > When you find a redundant irqsave, to actually patch it would bring a
> > > risk of regression with little or no reward. It's not my place to veto
> > > this entire patch series on that basis but IMO this kind of churn is
> > > misguided.
> >
> > Nope.
> >
> > I would say the real misguidance is that the code adds one lock while it
> > doesn't need the lock. Easily we can add redundant locks or exaggerate
> > the coverage range of locks, but the smarter way is that people add
> > locks only when they really need the lock by considering concurrency and
> > realtime performance.
> >
> 
> You appear to be debating a strawman. No-one is advocating excessive
> locking in new code.
> 

I actually meant most irqsave(s) in hardirq handlers were added carelessly.
When an irq handler and threads could access the same data, people added
irqsave in the threads; that is perfectly fine as it blocks the irq. But
people were also likely to put an irqsave in the irq handler without any
thought.

We do have some drivers which do that with a clear intention, such as your
sonic_interrupt(), but I bet most were done aimlessly.

Anyway, the debate is long enough, let's move to some more important
things. I appreciate that you shared a lot of knowledge of m68k.

Thanks
Barry


[PATCH v4] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-23 Thread Barry Song
span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 
mask=6-7 cap=6102 }
[1.523748] CPU3 attaching sched-domain(s):
[1.523774]  domain-0: span=2-3 level=MC
[1.523825]   groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 }
[1.524009]   domain-1: span=0-3 level=NUMA
[1.524086]groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 }
[1.524281]domain-2: span=0-5 level=NUMA
[1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 
cap=1949 }
[1.524534] domain-3: span=0-7 level=NUMA
[1.524586]  groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 
mask=6-7 cap=6102 }
[1.524847] CPU4 attaching sched-domain(s):
[1.524873]  domain-0: span=4-5 level=MC
[1.524954]   groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 }
[1.525105]   domain-1: span=4-7 level=NUMA
[1.525153]groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[1.525368]domain-2: span=0-1,4-7 level=NUMA
[1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[1.532726] domain-3: span=0-7 level=NUMA
[1.532811]  groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 
mask=2-3 cap=4037 }
[1.534125] CPU5 attaching sched-domain(s):
[1.534159]  domain-0: span=4-5 level=MC
[1.534303]   groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 }
[1.534490]   domain-1: span=4-7 level=NUMA
[1.534572]groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 }
[1.534734]domain-2: span=0-1,4-7 level=NUMA
[1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 }
[1.536057] domain-3: span=0-7 level=NUMA
[1.536430]  groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 
mask=2-3 cap=3896 }
[1.536815] CPU6 attaching sched-domain(s):
[1.536846]  domain-0: span=6-7 level=MC
[1.536934]   groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 }
[1.537144]   domain-1: span=4-7 level=NUMA
[1.537262]groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[1.537553]domain-2: span=0-1,4-7 level=NUMA
[1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 
cap=1805 }
[1.537872] domain-3: span=0-7 level=NUMA
[1.537998]  groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 
mask=2-3 cap=5845 }
[1.538448] CPU7 attaching sched-domain(s):
[1.538505]  domain-0: span=6-7 level=MC
[1.538586]   groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 }
[1.538746]   domain-1: span=4-7 level=NUMA
[1.538798]groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 }
[1.539048]domain-2: span=0-1,4-7 level=NUMA
[1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 
cap=1805 }
[1.539571] domain-3: span=0-7 level=NUMA
[1.539610]  groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 
mask=2-3 cap=5845 }

Reported-by: Valentin Schneider 
Tested-by: Meelis Roos 
Reviewed-by: Valentin Schneider 
Signed-off-by: Barry Song 
---
 -v4:
 no code changed; mainly rewrote changelog:
 * add Reviewed-by of Valentin;
   While the grandchild approach was started by me, Valentin contributed the
   most useful edit;
 * add description about sgc->imbalance and next_update according to the
   comment of Vincent;
 * add description about the equal size of local group to address Peter's
   comment

 kernel/sched/topology.c | 91 +++--
 1 file changed, 61 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 09d35044bd88..12f80587e127 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -723,35 +723,6 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
for (tmp = sd; tmp; tmp = tmp->parent)
numa_distance += !!(tmp->flags & SD_NUMA);
 
-   /*
-* FIXME: Diameter >=3 is misrepresented.
-*
-* Smallest diameter=3 topology is:
-*
-*   node   0   1   2   3
-* 0:  10  20  30  40
-* 1:  20  10  20  30
-* 2:  30  20  10  20
-* 3:  40  30  20  10
-*
-*   0 --- 1 --- 2 --- 3
-*
-* NUMA-3   0-3 N/A N/A 0-3
-*  groups: {0-2},{1-3} 
{1-3},{0-2}
-*
-* NUMA-2   0-2 0-3 0-3 1-3
-*  groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} 
{2-3},{0-2}
-*
-* NUMA-1   0-1 0-2 1-3 2-3
-*  groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2}
-*
-* NUMA-0   0   1   2   3
-*
-* The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
-* group span isn't a subset of the domain span.
-*/
-   WARN_ONCE(numa_distance >

[PATCH v2 0/2] scripts/gdb: clarify the platforms supporting lx_current and add arm64 support

2021-02-23 Thread Barry Song
lx_current depends on the per_cpu current_task variable, which exists on x86
only, so it actually works on x86 only. The 1st patch documents this clearly;
the 2nd patch adds support for arm64.

Barry Song (2):
  scripts/gdb: document lx_current is only supported by x86
  scripts/gdb: add lx_current support for arm64

 .../dev-tools/gdb-kernel-debugging.rst|  2 +-
 scripts/gdb/linux/cpus.py | 23 +--
 2 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.25.1



[PATCH v2 1/2] scripts/gdb: document lx_current is only supported by x86

2021-02-23 Thread Barry Song
x86 is the only architecture which has per_cpu current_task:
arch$ git grep current_task | grep -i per_cpu
x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *, current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;

On other architectures, lx_current() will lead to a python exception:
(gdb) p $lx_current().pid
Python Exception  No symbol "current_task" in current 
context.:
Error occurred in Python: No symbol "current_task" in current context.

To avoid more people struggling and wasting time in other architectures,
document it.

Cc: Jan Kiszka 
Signed-off-by: Barry Song 
---
 Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
 scripts/gdb/linux/cpus.py| 10 --
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst 
b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 4756f6b3a04e..1586901b683c 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
 [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
reserved
 
 
-- Examine fields of the current task struct::
+- Examine fields of the current task struct(supported by x86 only)::
 
 (gdb) p $lx_current().pid
 $1 = 4998
diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
index 008e62f3190d..f382762509d3 100644
--- a/scripts/gdb/linux/cpus.py
+++ b/scripts/gdb/linux/cpus.py
@@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
 
 PerCpu()
 
+def get_current_task(cpu):
+if utils.is_target_arch("x86"):
+ var_ptr = gdb.parse_and_eval("&current_task")
+ return per_cpu(var_ptr, cpu).dereference()
+else:
+raise gdb.GdbError("Sorry, obtaining the current task is not yet "
+   "supported with this arch")
 
 class LxCurrentFunc(gdb.Function):
 """Return current task.
@@ -167,8 +174,7 @@ number. If CPU is omitted, the CPU of the current context 
is used."""
 super(LxCurrentFunc, self).__init__("lx_current")
 
 def invoke(self, cpu=-1):
-var_ptr = gdb.parse_and_eval("&current_task")
-return per_cpu(var_ptr, cpu).dereference()
+return get_current_task(cpu)
 
 
 LxCurrentFunc()
-- 
2.25.1



[PATCH v2 2/2] scripts/gdb: add lx_current support for arm64

2021-02-23 Thread Barry Song
arm64 uses SP_EL0 to save the current task_struct address. While running
in EL0, SP_EL0 is clobbered by userspace. So if the upper bit is not 1
(not TTBR1), the current address is invalid. This patch checks the upper
bit of SP_EL0; if the upper bit is 1, lx_current() on arm64 will return
the dereference of the current task. Otherwise, lx_current() will tell
users they are running in userspace (EL0).

While arm64 is running in EL0, it is actually pointless to print the
current task, as kernel memory is not accessible from EL0.
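
(A stand-alone illustration of the check described above, not the gdb helper
itself; the two sample values are made up.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A TTBR1 (kernel) virtual address on arm64 has bit 63 set, so SP_EL0
 * only points at a valid task_struct when its top bit is 1. */
static bool sp_el0_is_kernel_va(uint64_t sp_el0)
{
	return (sp_el0 >> 63) != 0;
}

int main(void)
{
	printf("%d\n", sp_el0_is_kernel_va(0xffff000012345678ULL)); /* 1 */
	printf("%d\n", sp_el0_is_kernel_va(0x0000aaaabbbbccccULL)); /* 0 */
	return 0;
}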

Signed-off-by: Barry Song 
---
 Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
 scripts/gdb/linux/cpus.py| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst 
b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 1586901b683c..8e0f1fe8d17a 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
 [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
reserved
 
 
-- Examine fields of the current task struct(supported by x86 only)::
+- Examine fields of the current task struct(supported by x86 and arm64 only)::
 
 (gdb) p $lx_current().pid
 $1 = 4998
diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
index f382762509d3..15fc4626d236 100644
--- a/scripts/gdb/linux/cpus.py
+++ b/scripts/gdb/linux/cpus.py
@@ -16,6 +16,9 @@ import gdb
 from linux import tasks, utils
 
 
+task_type = utils.CachedType("struct task_struct")
+
+
 MAX_CPUS = 4096
 
 
@@ -157,9 +160,19 @@ Note that VAR has to be quoted as string."""
 PerCpu()
 
 def get_current_task(cpu):
+task_ptr_type = task_type.get_type().pointer()
+
 if utils.is_target_arch("x86"):
  var_ptr = gdb.parse_and_eval("&current_task")
  return per_cpu(var_ptr, cpu).dereference()
+elif utils.is_target_arch("aarch64"):
+ current_task_addr = gdb.parse_and_eval("$SP_EL0")
+ if((current_task_addr >> 63) != 0):
+ current_task = current_task_addr.cast(task_ptr_type)
+ return current_task.dereference()
+ else:
+ raise gdb.GdbError("Sorry, obtaining the current task is not allowed "
+"while running in userspace(EL0)")
 else:
 raise gdb.GdbError("Sorry, obtaining the current task is not yet "
"supported with this arch")
-- 
2.25.1



RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, February 23, 2021 9:30 PM
> To: 'Jan Kiszka' ; kieran.bing...@ideasonboard.com;
> cor...@lwn.net; linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: RE: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> 
> 
> > -Original Message-
> > From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> > Sent: Tuesday, February 23, 2021 8:27 PM
> > To: Song Bao Hua (Barry Song) ;
> > kieran.bing...@ideasonboard.com; cor...@lwn.net; linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> > Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by
> x86
> >
> > On 22.02.21 22:18, Song Bao Hua (Barry Song) wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> > >> Sent: Tuesday, February 23, 2021 12:06 AM
> > >> To: Song Bao Hua (Barry Song) ; 
> > >> cor...@lwn.net;
> > >> linux-...@vger.kernel.org; jan.kis...@siemens.com
> > >> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> > >> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported
> by
> > x86
> > >>
> > >> Hi Barry
> > >>
> > >> On 21/02/2021 21:35, Barry Song wrote:
> > >>> lx_current depends on the per_cpu current_task which exists on x86 only:
> > >>>
> > >>> arch$ git grep current_task | grep -i per_cpu
> > >>> x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> > >> current_task);
> > >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> > current_task)
> > >> cacheline_aligned =
> > >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> > current_task)
> > >> = _task;
> > >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > >>> x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;
> > >>>
> > >>> On other architectures, lx_current() will lead to a python exception:
> > >>> (gdb) p $lx_current().pid
> > >>> Python Exception  No symbol "current_task" in current
> > >> context.:
> > >>> Error occurred in Python: No symbol "current_task" in current context.
> > >>>
> > >>> To avoid more people struggling and wasting time in other architectures,
> > >>> document it.
> > >>>
> > >>> Cc: Jan Kiszka 
> > >>> Signed-off-by: Barry Song 
> > >>> ---
> > >>>  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> > >>>  scripts/gdb/linux/cpus.py| 10 --
> > >>>  2 files changed, 9 insertions(+), 3 deletions(-)
> > >>>
> > >>> diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> index 4756f6b3a04e..1586901b683c 100644
> > >>> --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> > >>>  [ 0.00] BIOS-e820: [mem
> > 0x0009fc00-0x0009]
> > >> reserved
> > >>>  
> > >>>
> > >>> -- Examine fields of the current task struct::
> > >>> +- Examine fields of the current task struct(supported by x86 only)::
> > >>>
> > >>>  (gdb) p $lx_current().pid
> > >>>  $1 = 4998
> > >>> diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> > >>> index 008e62f3190d..f382762509d3 100644
> > >>> --- a/scripts/gdb/linux/cpus.py
> > >>> +++ b/scripts/gdb/linux/cpus.py
> > >>> @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> > >>>
> > >>>  PerCpu()
> > >>>
> > >>> +def get_current_task(cpu):
> > >>> +if utils.is_target_arch("x86"):
> > >>> + var_ptr = gdb.parse_and_eval("_task")
> > >>> +  

RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> Sent: Tuesday, February 23, 2021 8:27 PM
> To: Song Bao Hua (Barry Song) ;
> kieran.bing...@ideasonboard.com; cor...@lwn.net; linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> On 22.02.21 22:18, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> >> Sent: Tuesday, February 23, 2021 12:06 AM
> >> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> >> linux-...@vger.kernel.org; jan.kis...@siemens.com
> >> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> >> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by
> x86
> >>
> >> Hi Barry
> >>
> >> On 21/02/2021 21:35, Barry Song wrote:
> >>> lx_current depends on the per_cpu current_task which exists on x86 only:
> >>>
> >>> arch$ git grep current_task | grep -i per_cpu
> >>> x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> >> current_task);
> >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> current_task)
> >> cacheline_aligned =
> >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> current_task)
> >> = _task;
> >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> >>> x86/kernel/smpboot.c: per_cpu(current_task, cpu) = idle;
> >>>
> >>> On other architectures, lx_current() will lead to a python exception:
> >>> (gdb) p $lx_current().pid
> >>> Python Exception  No symbol "current_task" in current
> >> context.:
> >>> Error occurred in Python: No symbol "current_task" in current context.
> >>>
> >>> To avoid more people struggling and wasting time in other architectures,
> >>> document it.
> >>>
> >>> Cc: Jan Kiszka 
> >>> Signed-off-by: Barry Song 
> >>> ---
> >>>  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> >>>  scripts/gdb/linux/cpus.py| 10 --
> >>>  2 files changed, 9 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> >> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> index 4756f6b3a04e..1586901b683c 100644
> >>> --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> >>>  [ 0.00] BIOS-e820: [mem
> 0x0009fc00-0x0009]
> >> reserved
> >>>  
> >>>
> >>> -- Examine fields of the current task struct::
> >>> +- Examine fields of the current task struct(supported by x86 only)::
> >>>
> >>>  (gdb) p $lx_current().pid
> >>>  $1 = 4998
> >>> diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> >>> index 008e62f3190d..f382762509d3 100644
> >>> --- a/scripts/gdb/linux/cpus.py
> >>> +++ b/scripts/gdb/linux/cpus.py
> >>> @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> >>>
> >>>  PerCpu()
> >>>
> >>> +def get_current_task(cpu):
> >>> +if utils.is_target_arch("x86"):
> >>> + var_ptr = gdb.parse_and_eval("_task")
> >>> + return per_cpu(var_ptr, cpu).dereference()
> >>> +else:
> >>> +raise gdb.GdbError("Sorry, obtaining the current task is not yet
> "
> >>> +   "supported with this arch")
> >>
> >> I've wondered in the past how we should handle the architecture specific
> >> layers.
> >>
> >> Perhaps we need to have an interface of functionality to implement on
> >> each architecture so that we can create a per-arch set of helpers.
> >>
> >> or break it up into arch specific subdirs at least...
> >>
> >>
> >>>  class LxCurrentFunc(gdb.Function):
> >>>  """Return current task.
> >>> @@ -167

RE: [Linuxarm] Re: [PATCH] Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64

2021-02-22 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Anshuman Khandual [mailto:anshuman.khand...@arm.com]
> Sent: Tuesday, February 23, 2021 7:10 PM
> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> linux-...@vger.kernel.org; a...@linux-foundation.org; linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux...@openeuler.org; Mel Gorman ; Andy Lutomirski
> ; Catalin Marinas ; Will Deacon
> 
> Subject: [Linuxarm] Re: [PATCH] Documentation/features: mark
> BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64
> 
> 
> 
> On 2/23/21 6:02 AM, Barry Song wrote:
> > BATCHED_UNMAP_TLB_FLUSH is used on x86 to do batched tlb shootdown by
> > sending one IPI to TLB flush all entries after unmapping pages rather
> > than sending an IPI to flush each individual entry.
> > On arm64, tlb shootdown is done by hardware. Flush instructions are
> > innershareable. The local flushes are limited to the boot (1 per CPU)
> > and when a task is getting a new ASID.
> 
> Is there any previous discussion around this ?

I copied the description of local flushes from:

"ARM64 Linux kernel is SMP-aware (no possibility to build only for UP).
Most of the flush instructions are innershareable. The local flushes are
limited to the boot (1 per CPU) and when a task is getting a new ASIC."

https://patchwork.kernel.org/project/xen-devel/patch/1461756173-10300-1-git-send-email-julien.gr...@arm.com/

I am not sure if getting a new ASID and the boot are the only two
cases of local flushes, but I think this is probably true.

But even if we find more corner cases, the conclusion that arm64
doesn't need BATCHED_UNMAP_TLB_FLUSH will hardly change.

> 
> > So marking this feature as "TODO" is not proper. ".." isn't good as
> > well. So this patch adds a "N/A" for this kind of features which are
> > not needed on some architectures.
> >
> > Cc: Mel Gorman 
> > Cc: Andy Lutomirski 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Signed-off-by: Barry Song 
> > ---
> >  Documentation/features/arch-support.txt| 1 +
> >  Documentation/features/vm/TLB/arch-support.txt | 2 +-
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/features/arch-support.txt
> b/Documentation/features/arch-support.txt
> > index d22a1095e661..118ae031840b 100644
> > --- a/Documentation/features/arch-support.txt
> > +++ b/Documentation/features/arch-support.txt
> > @@ -8,4 +8,5 @@ The meaning of entries in the tables is:
> >  | ok |  # feature supported by the architecture
> >  |TODO|  # feature not yet supported by the architecture
> >  | .. |  # feature cannot be supported by the hardware
> > +| N/A|  # feature doesn't apply to the architecture
> 
> NA might be better here. s/doesn't apply/not applicable/ in order to match NA.
> Still wondering if NA is really needed when there is already ".." ? Regardless
> either way should be fine.

I don't think ".." is proper here. ".." means hardware doesn't support
the feature. But here it is just opposite, arm64 has the hardware
support of tlb shootdown rather than depending on a software IPI.

> 
> >
> > diff --git a/Documentation/features/vm/TLB/arch-support.txt
> b/Documentation/features/vm/TLB/arch-support.txt
> > index 30f75a79ce01..0d070f9f98d8 100644
> > --- a/Documentation/features/vm/TLB/arch-support.txt
> > +++ b/Documentation/features/vm/TLB/arch-support.txt
> > @@ -9,7 +9,7 @@
> >  |   alpha: | TODO |
> >  | arc: | TODO |
> >  | arm: | TODO |
> > -|   arm64: | TODO |
> > +|   arm64: | N/A  |
> >  | c6x: |  ..  |
> >  |csky: | TODO |
> >  |   h8300: |  ..  |
> >
Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Tuesday, February 23, 2021 6:25 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Mon, 22 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Thu, 18 Feb 2021, Xiaofei Tan wrote:
> > >
> > > > On 2021/2/9 13:06, Finn Thain wrote:
> > > > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > > > >
> > > > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > > > drivers. There are no function changes, but may speed up if
> > > > > > > > interrupt happen too often.
> > > > > > >
> > > > > > > This change doesn't necessarily work on platforms that support
> > > > > > > nested interrupts.
> > > > > > >
> > > > > > > Were you able to measure any benefit from this change on some
> > > > > > > other platform?
> > > > > >
> > > > > > I think the code disabling irq in hardIRQ is simply wrong. Since
> > > > > > this commit
> > > > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > > > genirq: Run irq handlers with interrupts disabled
> > > > > >
> > > > > > interrupt handlers are definitely running in a irq-disabled
> > > > > > context unless irq handlers enable them explicitly in the
> > > > > > handler to permit other interrupts.
> > > > > >
> > > > >
> > > > > Repeating the same claim does not somehow make it true. If you put
> > > > > your claim to the test, you'll see that that interrupts are not
> > > > > disabled on m68k when interrupt handlers execute.
> > > > >
> > > > > The Interrupt Priority Level (IPL) can prevent any given irq
> > > > > handler from being re-entered, but an irq with a higher priority
> > > > > level may be handled during execution of a lower priority irq
> > > > > handler.
> > > > >
> > > > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > > > avoid issues relating to this. This kind of locking may be needed
> > > > > in the drivers you are trying to patch. Or it might not.
> > > > > Apparently, no-one has looked.
> > > > >
> > > >
> > > > According to your discussion with Barry, it seems that m68k is a
> > > > little different from other architecture, and this kind of
> > > > modification of this patch cannot be applied to m68k. So, could help
> > > > to point out which driver belong to m68k architecture in this patch
> > > > set of SCSI? I can remove them.
> > > >
> > >
> > > If you would claim that "there are no function changes" in your
> > > patches (as above) then the onus is on you to support that claim.
> > >
> > > I assume that there are some platforms on which your assumptions hold.
> > >
> > > With regard to drivers for those platforms, you might want to explain
> > > why your patches should be applied there, given that the existing code
> > > is superior for being more portable.
> >
> > I don't think it has nothing to do with portability. In the case of
> > sonic_interrupt() you pointed out, on m68k, there is a high-priority
> > interrupt can preempt low-priority interrupt, they will result in access
> > the same critical data. M68K's spin_lock_irqsave() can disable the
> > high-priority interrupt and avoid the race condition of the data. So the
> > case should not be touched. I'd like to accept the reality and leave
> > sonic_interrupt() alone.
> >
> > However, even on m68k, spin_lock_irqsave is not needed for other
> > ordinary cases.
> > If there is no other irq handler coming to access same critical data,
> > it is pointless to hold a redundant irqsave lock in irqhandler even
> > on m68k.
> >
> > In thread conte

[PATCH] Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64

2021-02-22 Thread Barry Song
BATCHED_UNMAP_TLB_FLUSH is used on x86 to do batched tlb shootdown by
sending one IPI to flush all TLB entries after unmapping pages, rather
than sending an IPI to flush each individual entry.
On arm64, tlb shootdown is done by hardware. Flush instructions are
innershareable. The local flushes are limited to the boot (1 per CPU)
and when a task is getting a new ASID.
So marking this feature as "TODO" is not proper, and ".." isn't good
either. So this patch adds an "N/A" for this kind of feature, which is
not needed on some architectures.
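
(For context, an illustrative sketch of the batching the feature name refers
to, not the actual mm/rmap.c code; the demo_* names are made up.)

#include <linux/cpumask.h>
#include <linux/smp.h>

static cpumask_t demo_pending_cpus;

/* While unmapping many pages, just note which CPUs may hold stale
 * translations instead of flushing per page. */
static void demo_note_unmap(int cpu)
{
	cpumask_set_cpu(cpu, &demo_pending_cpus);
}

static void demo_flush_one(void *info)
{
	/* the arch-specific local TLB flush would go here */
}

/* One IPI to all noted CPUs at the end, instead of one per page.
 * On arm64 the broadcast is done in hardware by inner-shareable TLBI,
 * which is why this software batching does not apply there. */
static void demo_flush_batch(void)
{
	on_each_cpu_mask(&demo_pending_cpus, demo_flush_one, NULL, true);
	cpumask_clear(&demo_pending_cpus);
}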

Cc: Mel Gorman 
Cc: Andy Lutomirski 
Cc: Catalin Marinas 
Cc: Will Deacon 
Signed-off-by: Barry Song 
---
 Documentation/features/arch-support.txt| 1 +
 Documentation/features/vm/TLB/arch-support.txt | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/features/arch-support.txt 
b/Documentation/features/arch-support.txt
index d22a1095e661..118ae031840b 100644
--- a/Documentation/features/arch-support.txt
+++ b/Documentation/features/arch-support.txt
@@ -8,4 +8,5 @@ The meaning of entries in the tables is:
 | ok |  # feature supported by the architecture
 |TODO|  # feature not yet supported by the architecture
 | .. |  # feature cannot be supported by the hardware
+| N/A|  # feature doesn't apply to the architecture
 
diff --git a/Documentation/features/vm/TLB/arch-support.txt 
b/Documentation/features/vm/TLB/arch-support.txt
index 30f75a79ce01..0d070f9f98d8 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
 |   alpha: | TODO |
 | arc: | TODO |
 | arm: | TODO |
-|   arm64: | TODO |
+|   arm64: | N/A  |
 | c6x: |  ..  |
 |csky: | TODO |
 |   h8300: |  ..  |
-- 
2.25.1



RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-22 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> Sent: Tuesday, February 23, 2021 12:06 AM
> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> linux-...@vger.kernel.org; jan.kis...@siemens.com
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> Hi Barry
> 
> On 21/02/2021 21:35, Barry Song wrote:
> > lx_current depends on the per_cpu current_task which exists on x86 only:
> >
> > arch$ git grep current_task | grep -i per_cpu
> > x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> current_task);
> > x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task)
> cacheline_aligned =
> > x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task)
> = _task;
> > x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;
> >
> > On other architectures, lx_current() will lead to a python exception:
> > (gdb) p $lx_current().pid
> > Python Exception  No symbol "current_task" in current
> context.:
> > Error occurred in Python: No symbol "current_task" in current context.
> >
> > To avoid more people struggling and wasting time in other architectures,
> > document it.
> >
> > Cc: Jan Kiszka 
> > Signed-off-by: Barry Song 
> > ---
> >  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> >  scripts/gdb/linux/cpus.py| 10 --
> >  2 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > index 4756f6b3a04e..1586901b683c 100644
> > --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> >  [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009]
> reserved
> >  
> >
> > -- Examine fields of the current task struct::
> > +- Examine fields of the current task struct(supported by x86 only)::
> >
> >  (gdb) p $lx_current().pid
> >  $1 = 4998
> > diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> > index 008e62f3190d..f382762509d3 100644
> > --- a/scripts/gdb/linux/cpus.py
> > +++ b/scripts/gdb/linux/cpus.py
> > @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> >
> >  PerCpu()
> >
> > +def get_current_task(cpu):
> > +if utils.is_target_arch("x86"):
> > + var_ptr = gdb.parse_and_eval("_task")
> > + return per_cpu(var_ptr, cpu).dereference()
> > +else:
> > +raise gdb.GdbError("Sorry, obtaining the current task is not yet "
> > +   "supported with this arch")
> 
> I've wondered in the past how we should handle the architecture specific
> layers.
> 
> Perhaps we need to have an interface of functionality to implement on
> each architecture so that we can create a per-arch set of helpers.
> 
> or break it up into arch specific subdirs at least...
> 
> 
> >  class LxCurrentFunc(gdb.Function):
> >  """Return current task.
> > @@ -167,8 +174,7 @@ number. If CPU is omitted, the CPU of the current 
> > context
> is used."""
> >  super(LxCurrentFunc, self).__init__("lx_current")
> >
> >  def invoke(self, cpu=-1):
> > -var_ptr = gdb.parse_and_eval("_task")
> > -return per_cpu(var_ptr, cpu).dereference()
> > +return get_current_task(cpu)
> >
> 
> And then perhaps we simply shouldn't even expose commands which can not
> be supported on those architectures?

I feel it is better to tell users this function is not supported on their
arch than to simply hide the function.

If we hide it, users still have many chances to try it, as they will have
got information about lx_current from Google or somewhere else. They will
try it, and if lx_current turns out not to be in the list and an error like
"invalid data type for function to be called" shows up, they will probably
suspect their gdb/python environment is not set up correctly and continue
to waste time checking their environment.
Finally they figure out this function is not supported by their arch, so it
is not exposed. But they have wasted a couple of 

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Saturday, February 20, 2021 6:18 PM
> To: tanxiaofei 
> Cc: Song Bao Hua (Barry Song) ; 
> j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: Re: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Thu, 18 Feb 2021, Xiaofei Tan wrote:
> 
> > On 2021/2/9 13:06, Finn Thain wrote:
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > >
> > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > drivers. There are no function changes, but may speed up if
> > > > > > interrupt happen too often.
> > > > >
> > > > > This change doesn't necessarily work on platforms that support
> > > > > nested interrupts.
> > > > >
> > > > > Were you able to measure any benefit from this change on some
> > > > > other platform?
> > > >
> > > > I think the code disabling irq in hardIRQ is simply wrong.
> > > > Since this commit
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > genirq: Run irq handlers with interrupts disabled
> > > >
> > > > interrupt handlers are definitely running in a irq-disabled context
> > > > unless irq handlers enable them explicitly in the handler to permit
> > > > other interrupts.
> > > >
> > >
> > > Repeating the same claim does not somehow make it true. If you put
> > > your claim to the test, you'll see that that interrupts are not
> > > disabled on m68k when interrupt handlers execute.
> > >
> > > The Interrupt Priority Level (IPL) can prevent any given irq handler
> > > from being re-entered, but an irq with a higher priority level may be
> > > handled during execution of a lower priority irq handler.
> > >
> > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > avoid issues relating to this. This kind of locking may be needed in
> > > the drivers you are trying to patch. Or it might not. Apparently,
> > > no-one has looked.
> > >
> >
> > According to your discussion with Barry, it seems that m68k is a little
> > different from other architecture, and this kind of modification of this
> > patch cannot be applied to m68k. So, could help to point out which
> > driver belong to m68k architecture in this patch set of SCSI? I can
> > remove them.
> >
> 
> If you would claim that "there are no function changes" in your patches
> (as above) then the onus is on you to support that claim.
> 
> I assume that there are some platforms on which your assumptions hold.
> 
> With regard to drivers for those platforms, you might want to explain why
> your patches should be applied there, given that the existing code is
> superior for being more portable.

I don't think it has anything to do with portability. In the case of
sonic_interrupt() you pointed out, on m68k a high-priority interrupt
can preempt a low-priority interrupt, and both can end up accessing
the same critical data. M68k's spin_lock_irqsave() can disable the
high-priority interrupt and avoid the race condition on that data.
So that case should not be touched; I'd like to accept the reality
and leave sonic_interrupt() alone.

However, even on m68k, spin_lock_irqsave() is not needed in other
ordinary cases.
If there is no other irq handler coming in to access the same critical
data, it is pointless to hold a redundant irqsave lock in an irq
handler, even on m68k.

In thread contexts, we always need it if an irq handler can preempt
those threads and access the same data. In hardirq, if there is a
high-priority interrupt which can cut in on m68k and access the
critical data which needs protection, we use spin_lock_irqsave() as
you have done in sonic_interrupt(). Otherwise, the irqsave is also
redundant for m68k.
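
(To make the distinction concrete, a minimal sketch of the pattern I mean,
not taken from any real driver; the demo_* names are made up.)

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);
static unsigned int demo_events;

/* Hardirq handler: interrupts are already disabled here on most
 * architectures, so a plain spin_lock() is enough. */
static irqreturn_t demo_irq_handler(int irq, void *dev_id)
{
	spin_lock(&demo_lock);
	demo_events++;
	spin_unlock(&demo_lock);
	return IRQ_HANDLED;
}

/* Process/thread context: irqsave is needed so the handler above cannot
 * cut in on the same CPU while the lock is held. */
static void demo_reset_events(void)
{
	unsigned long flags;

	spin_lock_irqsave(&demo_lock, flags);
	demo_events = 0;
	spin_unlock_irqrestore(&demo_lock, flags);
}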

> 
> > BTW, sonic_interrupt() is from net driver natsemi, right?  It would be
> > appreciative if only discuss SCSI drivers in this patch set. thanks.
> >
> 
> The 'net' subsystem does have some different requirements than the 'scsi'
> subsystem. But I don't see how that's relevant. Perhaps you can explain
> it. Thanks.

The difference is whether there are two co-existing interrupts which can
access the same critical data on m68k. I don't think net vs. scsi matters.
What really matters is the specific driver.

Thanks
Barry



[PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-21 Thread Barry Song
lx_current depends on the per_cpu current_task which exists on x86 only:

arch$ git grep current_task | grep -i per_cpu
x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *, current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;

On other architectures, lx_current() will lead to a python exception:
(gdb) p $lx_current().pid
Python Exception  No symbol "current_task" in current 
context.:
Error occurred in Python: No symbol "current_task" in current context.

To avoid more people struggling and wasting time in other architectures,
document it.

Cc: Jan Kiszka 
Signed-off-by: Barry Song 
---
 Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
 scripts/gdb/linux/cpus.py| 10 --
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst 
b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 4756f6b3a04e..1586901b683c 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
 [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
reserved
 
 
-- Examine fields of the current task struct::
+- Examine fields of the current task struct(supported by x86 only)::
 
 (gdb) p $lx_current().pid
 $1 = 4998
diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
index 008e62f3190d..f382762509d3 100644
--- a/scripts/gdb/linux/cpus.py
+++ b/scripts/gdb/linux/cpus.py
@@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
 
 PerCpu()
 
+def get_current_task(cpu):
+if utils.is_target_arch("x86"):
+ var_ptr = gdb.parse_and_eval("&current_task")
+ return per_cpu(var_ptr, cpu).dereference()
+else:
+raise gdb.GdbError("Sorry, obtaining the current task is not yet "
+   "supported with this arch")
 
 class LxCurrentFunc(gdb.Function):
 """Return current task.
@@ -167,8 +174,7 @@ number. If CPU is omitted, the CPU of the current context 
is used."""
 super(LxCurrentFunc, self).__init__("lx_current")
 
 def invoke(self, cpu=-1):
-var_ptr = gdb.parse_and_eval("&current_task")
-return per_cpu(var_ptr, cpu).dereference()
+return get_current_task(cpu)
 
 
 LxCurrentFunc()
-- 
2.25.1



RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-18 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, February 19, 2021 1:41 AM
> To: Song Bao Hua (Barry Song) ; Peter Zijlstra
> 
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
> 
> 
> Hi Barry,
> 
> On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
> > Hi Valentin,
> >
> > I understand Peter's concern is that the local group has different
> > size with remote groups. Is this patch resolving Peter's concern?
> > To me, it seems not :-)
> >
> 
> If you remove the '&& i != cpu' in build_overlap_sched_groups() you get that,
> but then you also get some extra warnings :-)
> 
> Now yes, should_we_balance() only matters for the local group. However I'm
> somewhat wary of messing with the local groups; for one it means you would 
> have
> more than one tl now accessing the same sgc->next_update, sgc->{min,
> max}capacity, sgc->group_imbalance (as Vincent had pointed out).
> 
> By ensuring only remote (i.e. !local) groups are modified (which is what your
> patch does), we absolve ourselves of this issue, which is why I prefer this
> approach ATM.

Yep. The grandchild approach still seems to be the feasible way for the moment.

> 
> > Though I don’t understand why different group sizes will be harmful
> > since all groups are calculating avg_load and group_type based on
> > their own capacities. Thus, for a smaller group, its capacity would be
> > smaller.
> >
> > Is it because a bigger group has relatively less chance to pull, so
> > load balancing will be completed more slowly while small groups have
> > high load?
> >
> 
> Peter's point is that, if at a given tl you have groups that look like
> 
> g0: 0-4, g1: 5-6, g2: 7-8
> 
> Then g0 is half as likely to pull tasks with load_balance() than g1 or g2 (due
> to the group size vs should_we_balance())

Yep. The difference is that g1 and g2 won't be local groups of any CPU in
this tl.
The smaller groups g1 and g2 are only remote groups, so should_we_balance()
doesn't matter for them here.

> 
> 
> However, I suppose one "trick" to be aware of here is that since your patch
> *doesn't* change the local group, we do have e.g. on CPU0:
> 
> [0.374840]domain-2: span=0-5 level=NUMA
> [0.375054] groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }
> 
> *but* on CPU4 we get:
> 
> [0.387019]domain-2: span=0-1,4-7 level=NUMA
> [0.387211] groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }
> 
> IOW, at a given tl, all *local* groups have /roughly/ the same size and thus
> similar pull probability (it took me writing this mail to see it that way).
> So perhaps this is all fine already?

Yep. Since should_we_balance() only matters for local groups and we haven't
changed the size of local groups in the grandchild approach, all local groups
still get a similar pull probability at this topology level.

Since we still prefer the grandchild approach ATM, if Peter has no more
concerns about the unequal size between local groups and remote groups, I
would be glad to send v4 of the grandchild approach, rewriting the changelog
to explain the update issue of sgc->next_update, sgc->{min, max}capacity and
sgc->group_imbalance that Vincent pointed out, and also to describe that the
local groups are not touched and thus remain of equal size.

Thanks
Barry



RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-18 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, February 12, 2021 8:55 AM
> To: Peter Zijlstra ; Song Bao Hua (Barry Song)
> 
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
> 
> On 10/02/21 12:21, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2021 at 08:58:15PM +, Song Bao Hua (Barry Song) wrote:
> >> So historically, the code has never tried to make sched_groups result
> >> in equal size. And it permits the overlapping of local group and remote
> >> groups.
> >
> > Histrorically groups have (typically) always been the same size though.
> >
> > The reason I did ask is because when you get one large and a bunch of
> > smaller groups, the load-balancing 'pull' is relatively smaller to the
> > large groups.
> >
> > That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> > continues the load-balancing pass. So if, for example, we have one group
> > of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> > 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> >
> > By making sure all groups are of the same level, and thus of equal size,
> > this doesn't happen.
> 
> So I hacked something that tries to do this, with the notable exception
> that it doesn't change the way the local group is generated. Breaking the
> assumption that the local group always spans the child domain doesn't sound
> like the best thing to do.
> 
> Anywho, the below makes it so all !local NUMA groups are built using the
> same sched_domain_topology_level. Some of it is absolutely disgusting
> (esp. the sched_domains_curr_level stuff), I didn't bother with handling
> domain degeneration yet, and I trigger the WARN_ON in get_group()... But at
> least the topology gets built!
> 
> AFAICT Barry's topology is handled the same. On that sunfire topology, it
> pretty much turns all remote groups into groups spanning a single
> node. That does almost double the number of groups for some domains,
> compared to Barry's current v3.
> 
> If that is really a route we want to go down, I'll try to clean the below.
> 
Hi Valentin,

I understand Peter's concern is that the local group has a different
size from the remote groups. Does this patch resolve Peter's concern?
To me, it seems not :-)

Though I don't understand why different group sizes would be harmful, since
all groups calculate avg_load and group_type based on their own capacities.
Thus, a smaller group simply has a smaller capacity.

Is it because a bigger group has relatively less chance to pull, so load
balancing completes more slowly when the small groups carry a high load?
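
What I mean is the capacity scaling which, if I remember update_sg_lb_stats()
correctly, looks roughly like the helper below (field and macro usage are my
paraphrase, not the exact mainline code):

/*
 * Group load is normalised by the group's own capacity, so a group is
 * not considered under-loaded merely because it has fewer CPUs.
 */
static unsigned long group_avg_load_sketch(unsigned long group_load,
					   unsigned long group_capacity)
{
	return group_load * SCHED_CAPACITY_SCALE / group_capacity;
}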

> (deposit your drinks before going any further)
> --->8---
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..e74f48bdd226 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -187,7 +187,10 @@ struct sched_domain_topology_level {
>   sched_domain_mask_f mask;
>   sched_domain_flags_f sd_flags;
>   int flags;
> +#ifdef CONFIG_NUMA
>   int numa_level;
> + int numa_sibling_level;
> +#endif
>   struct sd_data  data;
>  #ifdef CONFIG_SCHED_DEBUG
>   char*name;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3c50cc7285c9..5a9e6a4d5d89 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -742,6 +742,34 @@ enum s_alloc {
>   sa_none,
>  };
> 
> +/*
> + * Topology list, bottom-up.
> + */
> +static struct sched_domain_topology_level default_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> + { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +#endif
> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> + { NULL, },
> +};
> +
> +static struct sched_domain_topology_level *sched_domain_topology =
> + default_topology;
> +
> +#define for_each_sd_topology(tl) \
> + for (tl = sched_domain_topology; tl->mask; tl++)
> +
> +void set_sched_topology(struct sched_domain_topology_level *tl)
> +{
> + if (WARN_ON_ONCE(s

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-17 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Sunday, February 14, 2021 6:11 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Arnd Bergmann ; t...@linutronix.de;
> gre...@linuxfoundation.org; a...@arndb.de; ge...@linux-m68k.org;
> fun...@jurai.org; ph...@gnu.org; cor...@lwn.net; mi...@redhat.com;
> linux-m...@lists.linux-m68k.org; linux-kernel@vger.kernel.org
> Subject: RE: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, 13 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > So what is really confusing and a pain to me is that:
> > For years people like me have been writing device drivers
> > with the idea that irq handlers run with interrupts
> > disabled after those commits in genirq. So I don't need
> > to care about if some other IRQs on the same cpu will
> > jump out to access the data the current IRQ handler
> > is accessing.
> >
> > but it turns out the assumption is not true on some platform.
> > So should I start to program devices driver with the new idea
> > interrupts can actually come while irqhandler is running?
> >
> > That's the question which really bothers me.
> >
> 
> That scenario seems a little contrived to me (drivers for two or more
> devices sharing state through their interrupt handlers). Is it real?
> I suppose every platform has its quirks. The irq lock in sonic_interrupt()
> is only there because of a platform quirk (the same device can trigger
> either of two IRQs). Anyway, no-one expects all drivers to work on all
> platforms; I don't know why it bothers you so much when platforms differ.

Basically, we write drivers with the assumption that a driver will be
cross-platform. (Of course some drivers can only work on one platform, for
example when the device IP is only used as an internal component of a
specific SoC.)

So once a device has two or more interrupts, if we also want to support the
driver on m68k, we need to consider that one interrupt might preempt another
on the same CPU there. This usually doesn't matter on other platforms.
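
A minimal sketch of what that means in practice, for a hypothetical driver
whose two IRQ lines share state (the same pattern sonic_interrupt() uses):

static irqreturn_t foo_irq(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;		/* shared by both IRQ lines */
	unsigned long flags;

	/*
	 * On m68k the higher-priority line can preempt this handler on the
	 * same CPU, so a plain spin_lock() would not be enough even though
	 * we are already in hardirq context.
	 */
	spin_lock_irqsave(&priv->lock, flags);
	/* ... touch the state shared with the other line's handler ... */
	spin_unlock_irqrestore(&priv->lock, flags);

	return IRQ_HANDLED;
}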

On the other hand, there are more than 400 irqs_disabled() calls in the
kernel. I am really not sure whether they are written with the knowledge
that irqs_disabled() being true on m68k only means some interrupts are off
while others are still open, or with the assumption that irqs_disabled()
being true means IRQs are totally quiet. If the latter, those drivers might
fail to work on m68k as well.

Thanks
Barry


RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Sunday, February 14, 2021 11:13 AM
> To: 'Arnd Bergmann' 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: RE: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> 
> 
> > -Original Message-
> > From: Arnd Bergmann [mailto:a...@kernel.org]
> > Sent: Sunday, February 14, 2021 5:32 AM
> > To: Song Bao Hua (Barry Song) 
> > Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> > ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> > mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> > fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> > Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not
> NMI)
> > enabled on some platform
> >
> > On Sat, Feb 13, 2021 at 12:50 AM Song Bao Hua (Barry Song)
> >  wrote:
> >
> > > So I was actually trying to warn this unusual case - interrupts
> > > get nested while both in_hardirq() and irqs_disabled() are true.
> > >
> > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > > index 7c9d6a2d7e90..b8ca27555c76 100644
> > > --- a/include/linux/hardirq.h
> > > +++ b/include/linux/hardirq.h
> > > @@ -32,6 +32,7 @@ static __always_inline void 
> > > rcu_irq_enter_check_tick(void)
> > >   */
> > >  #define __irq_enter()  \
> > > do {\
> > > +   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested
> > > interrupts\n"); \
> > > preempt_count_add(HARDIRQ_OFFSET);  \
> >
> > That seems to be a rather heavyweight change in a critical path.
> >
> > A more useful change might be to implement lockdep support for m68k
> > and see if that warns about any actual problems. I'm not sure
> > what is actually missing for that, but these are the commits that
> > added it for other architectures in the past:
> >
> > 3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup
> TRACE_IRQFLAGS_SUPPORT")
> > 000591f1ca33 ("csky: Enable LOCKDEP_SUPPORT")
> > 78cdfb5cf15e ("openrisc: enable LOCKDEP_SUPPORT and irqflags tracing")
> > 8f371c752154 ("xtensa: enable lockdep support")
> > bf2d80966890 ("microblaze: Lockdep support")
> >
> 
> Yes. M68k lacks lockdep support which might be added.

BTW, m68k probably won't run into any actual problem that lockdep would
catch, as it has been running this way for decades. It is just like how
interrupts were widely allowed to preempt irq handlers on all platforms
before IRQF_DISABLED was dropped and commit e58aa3d2d0cc ("genirq: Run
irq handlers with interrupts disabled"). We rarely really ran into the
stack overflow issue that commit e58aa3d2d0cc mentioned at that time.
Before those commits we had already shipped thousands of successful
Linux products running irq handlers with interrupts enabled.

So what is really confusing and a pain to me is this:
for years, people like me have been writing device drivers with the idea
that irq handlers run with interrupts disabled after those commits in
genirq, so we don't need to worry about some other IRQ on the same CPU
coming in and accessing the data the current IRQ handler is accessing.

But it turns out the assumption is not true on some platforms.
So should I start to write device drivers with the new idea that
interrupts can actually come in while an irq handler is running?

That's the question which really bothers me.

> 
> > > And I also think it is better for m68k's arch_irqs_disabled() to
> > > return true only when both low and high priority interrupts are
> > > disabled rather than try to mute this warn in genirq by a weaker
> > > condition:
> > >  if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled
> > interrupts\n",
> > >  irq, action->handler))
> > >local_irq_disable();
> > > }
> > >
> > > This warn is not activated on m68k because its arch_irqs_disabled() return
> > > true though its high-priority interrupts are still enabled.
> >
> > Then it would just end up always warning when a nested hardirq happens,
> > right? That seems no different to dropping support for nested hardirqs
> > on 

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Sunday, February 14, 2021 5:32 AM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, Feb 13, 2021 at 12:50 AM Song Bao Hua (Barry Song)
>  wrote:
> 
> > So I was actually trying to warn this unusual case - interrupts
> > get nested while both in_hardirq() and irqs_disabled() are true.
> >
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index 7c9d6a2d7e90..b8ca27555c76 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
> >   */
> >  #define __irq_enter()  \
> > do {\
> > +   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested
> > interrupts\n"); \
> > preempt_count_add(HARDIRQ_OFFSET);  \
> 
> That seems to be a rather heavyweight change in a critical path.
> 
> A more useful change might be to implement lockdep support for m68k
> and see if that warns about any actual problems. I'm not sure
> what is actually missing for that, but these are the commits that
> added it for other architectures in the past:
> 
> 3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup TRACE_IRQFLAGS_SUPPORT")
> 000591f1ca33 ("csky: Enable LOCKDEP_SUPPORT")
> 78cdfb5cf15e ("openrisc: enable LOCKDEP_SUPPORT and irqflags tracing")
> 8f371c752154 ("xtensa: enable lockdep support")
> bf2d80966890 ("microblaze: Lockdep support")
> 

Yes. M68k lacks lockdep support which might be added.

> > And I also think it is better for m68k's arch_irqs_disabled() to
> > return true only when both low and high priority interrupts are
> > disabled rather than try to mute this warn in genirq by a weaker
> > condition:
> >  if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled
> interrupts\n",
> >  irq, action->handler))
> >local_irq_disable();
> > }
> >
> > This warn is not activated on m68k because its arch_irqs_disabled() return
> > true though its high-priority interrupts are still enabled.
> 
> Then it would just end up always warning when a nested hardirq happens,
> right? That seems no different to dropping support for nested hardirqs
> on m68k altogether, which of course is what you suggested already.

This won't end up warning on other architectures like arm, arm64, x86 etc.,
as interrupts won't come in while arch_irqs_disabled() is true in hardIRQ.
For example, on ARM the I bit of CPSR is set, and:

static inline int arch_irqs_disabled_flags(unsigned long flags)
{
	return flags & IRQMASK_I_BIT;
}

So it would only give a backtrace on platforms whose arch_irqs_disabled()
returns true while only some interrupts are disabled and others are still
open, so that nested interrupts can come in without any explicit code to
enable interrupts.

This warning seems to give a consistent interpretation of what "Run irq
handlers with interrupts disabled" means in commit e58aa3d2d0cc ("genirq:
Run irq handlers with interrupts disabled").

> 
>Arnd

Thanks
Barry


RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 12:06 PM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, Feb 13, 2021 at 12:00 AM Song Bao Hua (Barry Song)
>  wrote:
> > > -Original Message-
> > > From: Arnd Bergmann [mailto:a...@kernel.org]
> > > Sent: Saturday, February 13, 2021 11:34 AM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> > > ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> > > mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> > > fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> > > Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not
> NMI)
> > > enabled on some platform
> > >
> > > On Fri, Feb 12, 2021 at 2:18 AM Song Bao Hua (Barry Song)
> > >  wrote:
> > >
> > > > So I am requesting comments on:
> > > > 1. are we expecting all interrupts except NMI to be disabled in irq 
> > > > handler,
> > > > or do we actually allow some high-priority interrupts between low and
> NMI
> > > to
> > > > come in some platforms?
> > >
> > > I tried to come to an answer but this does not seem particularly 
> > > well-defined.
> > > There are a few things I noticed:
> > >
> > > - going through the local_irq_save()/restore() implementations on all
> > >   architectures, I did not find any other ones besides m68k that leave
> > >   high-priority interrupts enabled. I did see that at least alpha and 
> > > openrisc
> > >   are designed to support that in hardware, but the code just leaves the
> > >   interrupts disabled.
> >
> > The case is a little different. Explicit local_irq_save() does disable all
> > high priority interrupts on m68k. The only difference is 
> > arch_irqs_disabled()
> > of m68k will return true while low-priority interrupts are masked and high
> > -priority are still open. M68k's hardIRQ also runs in this context with high
> > priority interrupts enabled.
> 
> My point was that on most other architectures, local_irq_save()/restore()
> always disables/enables all interrupts, while on m68k it restores the
> specific level they were on before. On alpha, it does the same as on m68k,
> but then the top-level interrupt handler just disables them all before calling
> into any other code.

That's what I think m68k would be better off doing.

It looks weird that nested interrupts can come in while arch_irqs_disabled()
is true on m68k: masking the low-priority interrupts while high-priority
interrupts are still enabled is enough to make m68k's arch_irqs_disabled()
return true, and that is exactly the environment m68k's irq handler runs in.
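
Conceptually (just a sketch of the semantics, not the actual code in
arch/m68k/include/asm/irqflags.h), the check behaves like:

/*
 * Any non-zero interrupt priority mask counts as "disabled", even though
 * interrupts above that level can still be delivered.
 */
static bool m68k_irqs_disabled_sketch(unsigned int ipl)
{
	return ipl != 0;	/* true for IPL 1..7, not only for IPL 7 */
}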

So I was actually trying to warn about this unusual case - interrupts
getting nested while both in_hardirq() and irqs_disabled() are true.

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c9d6a2d7e90..b8ca27555c76 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
account_hardirq_enter(current); \
@@ -44,6 +45,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter_raw()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
} while (0)

And I also think it is better for m68k's arch_irqs_disabled() to return
true only when both low- and high-priority interrupts are disabled, rather
than muting this warning in genirq with a weaker condition:

irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int 
*flags)
{
...

trace_irq_handler_entry(irq, action)

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 11:34 AM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Fri, Feb 12, 2021 at 2:18 AM Song Bao Hua (Barry Song)
>  wrote:
> 
> > So I am requesting comments on:
> > 1. are we expecting all interrupts except NMI to be disabled in irq handler,
> > or do we actually allow some high-priority interrupts between low and NMI
> to
> > come in some platforms?
> 
> I tried to come to an answer but this does not seem particularly well-defined.
> There are a few things I noticed:
> 
> - going through the local_irq_save()/restore() implementations on all
>   architectures, I did not find any other ones besides m68k that leave
>   high-priority interrupts enabled. I did see that at least alpha and openrisc
>   are designed to support that in hardware, but the code just leaves the
>   interrupts disabled.

The case is a little different. An explicit local_irq_save() does disable all
high-priority interrupts on m68k. The only difference is that m68k's
arch_irqs_disabled() will return true while low-priority interrupts are
masked and high-priority ones are still open. M68k's hardIRQ also runs in
this context, with high-priority interrupts enabled.

> 
> - The generic code is clearly prepared to handle nested hardirqs, and
>the irq_enter()/irq_exit() functions have a counter in preempt_count
>for the nesting level, using a 4-bit number for hardirq, plus another
>4-bit number for NMI.

Yes, I understand nested interrupts are supported via an explicit
local_irq_enable_in_hardirq(). M68k's case is different: nested interrupts
can come in while arch_irqs_disabled() is true and while nobody has called
local_irq_enable_in_hardirq() in the outer hardIRQ, because hardIRQ keeps
high-priority interrupts open.

> 
> - There are a couple of (ancient) drivers that enable interrupts in their
>interrupt handlers, see the four callers of local_irq_enable_in_hardirq()
>(all in the old drivers/ide stack) and arch/ia64/kernel/time.c, which
>enables interupts in its timer function (I recently tried removing this
>and my patch broke ia64 timers, but I'm not sure if the cause was
>the local_irq_enable() or something else).
> 
> - The local_irq_enable_in_hardirq() function itself turns into a nop
>   when lockdep is enabled, since d7e9629de051 ("[PATCH] lockdep:
>   add local_irq_enable_in_hardirq() API"). According to the comment
>   in there, lockdep already enforces the behavior you suggest. Note that
>   lockdep support is missing on m68k (and also alpha, h8300, ia64, nios2,
>   and parisc).
> 
> > 2. If either side is true, I think we need to document it somewhere as there
> > is always confusion about this.
> >
> > Personally, I would expect all interrupts to be disabled and I like the way
> > of ARM64 to only use high-priority interrupt as pseudo NMI:
> > https://lwn.net/Articles/755906/
> > Though Finn argued that this will contribute to lose hardware feature of 
> > m68k.
> 
> Regardless of what is documented, I would argue that any platform
> that relies on this is at the minimum doing something risky because at
> the minimum this runs into hard to debug code paths that are not
> exercised on any of the common architectures.
> 
> Arnd


Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 9:23 AM
> To: Grygorii Strashko 
> Cc: Song Bao Hua (Barry Song) ; Andy Shevchenko
> ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 12:53 PM Grygorii Strashko
>  wrote:
> >
> > The worst RT case I can imagine is when gpio API is still called from hard
> IRQ context by some
> > other device driver - some toggling for example.
> > Note. RT or "threadirqs" does not mean gpiochip become sleepable.
> >
> > In this case:
> >   threaded handler
> > raw_spin_lock
> > IRQ from other device
> >hard_irq handler
> >  gpiod_x()
> > raw_spin_lock_irqsave() -- oops
> >
> 
> Good point, I had missed the fact that drivers can call gpio functions from
> hardirq context when I replied earlier, gpio is clearly special here.


Yes. GPIO provides APIs to consumers, so other drivers can call directly into
the gpio driver's territory.
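
For example, with a hypothetical consumer driver (just restating Grygorii's
scenario in code):

static irqreturn_t other_device_irq(int irq, void *data)
{
	struct other_dev *od = data;

	/*
	 * Runs in hard IRQ context and ends up taking the gpio driver's
	 * lock. If a threaded gpiochip handler on this CPU already holds
	 * that lock with a plain raw_spin_lock(), we deadlock.
	 */
	gpiod_set_value(od->reset_gpio, 1);

	return IRQ_HANDLED;
}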

Another one which is even more special might be m68k, which I cc-ed you
yesterday:
https://lore.kernel.org/lkml/c46ddb954cfe45d9849c911271d7e...@hisilicon.com/

> 
>   Arnd

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Saturday, February 13, 2021 3:09 AM
> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> 
> Cc: Arnd Bergmann ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 12/02/2021 15:12, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >> Sent: Saturday, February 13, 2021 12:53 AM
> >> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> >> 
> >> Cc: Arnd Bergmann ; luojiaxing ;
> Linus
> >> Walleij ; Santosh Shilimkar
> ;
> >> Kevin Hilman ; open list:GPIO SUBSYSTEM
> >> ; linux-kernel@vger.kernel.org;
> >> linux...@openeuler.org
> >> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>
> >>
> >>
> >> On 12/02/2021 13:29, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> >>>> Sent: Friday, February 12, 2021 11:57 PM
> >>>> To: Song Bao Hua (Barry Song) 
> >>>> Cc: Grygorii Strashko ; Arnd Bergmann
> >>>> ; luojiaxing ; Linus Walleij
> >>>> ; Santosh Shilimkar ;
> Kevin
> >>>> Hilman ; open list:GPIO SUBSYSTEM
> >>>> ; linux-kernel@vger.kernel.org;
> >>>> linux...@openeuler.org
> >>>> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >>>> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>>>
> >>>> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) 
> >>>> wrote:
> >>>>>> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >>>>>> Sent: Friday, February 12, 2021 11:28 PM
> >>>>>> On 12/02/2021 11:45, Arnd Bergmann wrote:
> >>>>>>> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >>>>>>>  wrote:
> >>>>
> >>>>>>>>> Note. there is also generic_handle_irq() call inside.
> >>>>>>>>
> >>>>>>>> So generic_handle_irq() is not safe to run in thread thus requires
> >>>>>>>> an interrupt-disabled environment to run? If so, I'd rather this
> >>>>>>>> irqsave moved into generic_handle_irq() rather than asking everyone
> >>>>>>>> calling it to do irqsave.
> >>>>>>>
> >>>>>>> In a preempt-rt kernel, interrupts are run in task context, so they
> clearly
> >>>>>>> should not be called with interrupts disabled, that would defeat the
> >>>>>>> purpose of making them preemptible.
> >>>>>>>
> >>>>>>> generic_handle_irq() does need to run with in_irq()==true though,
> >>>>>>> but this should be set by the caller of the gpiochip's handler, and
> >>>>>>> it is not set by raw_spin_lock_irqsave().
> >>>>>>
> >>>>>> It will produce warning from __handle_irq_event_percpu(), as this is
> IRQ
> >>>>>> dispatcher
> >>>>>> and generic_handle_irq() will call one of handle_level_irq or
> >>>> handle_edge_irq.
> >>>>>>
> >>>>>> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert
> to
> >>>> use
> >>>>>> generic irq handler").
> >>>>>>
> >>>>>> The resent related discussion:
> >>>>>> https://lkml.org/lkml/2020/12/5/208
> >>>>>
> >>>>> Ok, second thought. irqsave before generic_handle_irq() won't defeat
> >>>>> the purpose of preemption too much as the dispatched irq handlers by
> >>>>> gpiochip will run in their own threads but not in the thread of
> >>>>> gpiochip's handler.
> >>>>>
> >>>>> so looks like this patch ca

RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Saturday, February 13, 2021 12:53 AM
> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> 
> Cc: Arnd Bergmann ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 12/02/2021 13:29, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> >> Sent: Friday, February 12, 2021 11:57 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: Grygorii Strashko ; Arnd Bergmann
> >> ; luojiaxing ; Linus Walleij
> >> ; Santosh Shilimkar ; Kevin
> >> Hilman ; open list:GPIO SUBSYSTEM
> >> ; linux-kernel@vger.kernel.org;
> >> linux...@openeuler.org
> >> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>
> >> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) wrote:
> >>>> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >>>> Sent: Friday, February 12, 2021 11:28 PM
> >>>> On 12/02/2021 11:45, Arnd Bergmann wrote:
> >>>>> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >>>>>  wrote:
> >>
> >>>>>>> Note. there is also generic_handle_irq() call inside.
> >>>>>>
> >>>>>> So generic_handle_irq() is not safe to run in thread thus requires
> >>>>>> an interrupt-disabled environment to run? If so, I'd rather this
> >>>>>> irqsave moved into generic_handle_irq() rather than asking everyone
> >>>>>> calling it to do irqsave.
> >>>>>
> >>>>> In a preempt-rt kernel, interrupts are run in task context, so they 
> >>>>> clearly
> >>>>> should not be called with interrupts disabled, that would defeat the
> >>>>> purpose of making them preemptible.
> >>>>>
> >>>>> generic_handle_irq() does need to run with in_irq()==true though,
> >>>>> but this should be set by the caller of the gpiochip's handler, and
> >>>>> it is not set by raw_spin_lock_irqsave().
> >>>>
> >>>> It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> >>>> dispatcher
> >>>> and generic_handle_irq() will call one of handle_level_irq or
> >> handle_edge_irq.
> >>>>
> >>>> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to
> >> use
> >>>> generic irq handler").
> >>>>
> >>>> The resent related discussion:
> >>>> https://lkml.org/lkml/2020/12/5/208
> >>>
> >>> Ok, second thought. irqsave before generic_handle_irq() won't defeat
> >>> the purpose of preemption too much as the dispatched irq handlers by
> >>> gpiochip will run in their own threads but not in the thread of
> >>> gpiochip's handler.
> >>>
> >>> so looks like this patch can improve by:
> >>> * move other raw_spin_lock_irqsave to raw_spin_lock;
> >>> * keep the raw_spin_lock_irqsave before generic_handle_irq() to mute
> >>> the warning in genirq.
> >>
> >> Isn't the idea of irqsave is to prevent dead lock from the process context
> when
> >> we get interrupt on the *same* CPU?
> >
> > Anyway, gpiochip is more tricky as it is also a irq dispatcher. Moving
> > spin_lock_irq to spin_lock in the irq handler of non-irq dispatcher
> > driver is almost always correct.
> >
> > But for gpiochip, would the below be true though it is almost alway true
> > for non-irq dispatcher?
> >
> > 1. While gpiochip's handler runs in hardIRQ, interrupts are disabled, so no
> more
> > interrupt on the same cpu -> No deadleak.
> >
> > 2. While gpiochip's handler runs in threads
> > * other non-threaded interrupts such as timer tick might come on same cpu,
> > but they are an irrelevant driver and thus they are not going to get the
> > lock gpiochip's handler has held. -> no deadlock.
> > * other devices attached to this gpiochip might get interrupts, s

RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Friday, February 12, 2021 11:57 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Grygorii Strashko ; Arnd Bergmann
> ; luojiaxing ; Linus Walleij
> ; Santosh Shilimkar ; Kevin
> Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) wrote:
> > > From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> > > Sent: Friday, February 12, 2021 11:28 PM
> > > On 12/02/2021 11:45, Arnd Bergmann wrote:
> > > > On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> > > >  wrote:
> 
> > > >>> Note. there is also generic_handle_irq() call inside.
> > > >>
> > > >> So generic_handle_irq() is not safe to run in thread thus requires
> > > >> an interrupt-disabled environment to run? If so, I'd rather this
> > > >> irqsave moved into generic_handle_irq() rather than asking everyone
> > > >> calling it to do irqsave.
> > > >
> > > > In a preempt-rt kernel, interrupts are run in task context, so they 
> > > > clearly
> > > > should not be called with interrupts disabled, that would defeat the
> > > > purpose of making them preemptible.
> > > >
> > > > generic_handle_irq() does need to run with in_irq()==true though,
> > > > but this should be set by the caller of the gpiochip's handler, and
> > > > it is not set by raw_spin_lock_irqsave().
> > >
> > > It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> > > dispatcher
> > > and generic_handle_irq() will call one of handle_level_irq or
> handle_edge_irq.
> > >
> > > The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to
> use
> > > generic irq handler").
> > >
> > > The resent related discussion:
> > > https://lkml.org/lkml/2020/12/5/208
> >
> > Ok, second thought. irqsave before generic_handle_irq() won't defeat
> > the purpose of preemption too much as the dispatched irq handlers by
> > gpiochip will run in their own threads but not in the thread of
> > gpiochip's handler.
> >
> > so looks like this patch can improve by:
> > * move other raw_spin_lock_irqsave to raw_spin_lock;
> > * keep the raw_spin_lock_irqsave before generic_handle_irq() to mute
> > the warning in genirq.
> 
> Isn't the idea of irqsave is to prevent dead lock from the process context 
> when
> we get interrupt on the *same* CPU?

Anyway, gpiochip is trickier as it is also an irq dispatcher. Moving
spin_lock_irq to spin_lock in the irq handler of a non-dispatcher driver is
almost always correct.

But for gpiochip, would the below be true, even though it is almost always
true for a non-dispatcher?

1. While gpiochip's handler runs in hardIRQ, interrupts are disabled, so no
more interrupts on the same cpu -> no deadlock.

2. While gpiochip's handler runs in a thread:
* other non-threaded interrupts such as the timer tick might come on the same
cpu, but they belong to an unrelated driver and thus will not take the lock
gpiochip's handler holds -> no deadlock.
* other devices attached to this gpiochip might get interrupts; since
gpiochip's handler is running in a thread, raw_spin_lock is enough to keep
two threads from messing up the critical data -> still no deadlock.

> 
> --
> With Best Regards,
> Andy Shevchenko
> 

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Friday, February 12, 2021 11:28 PM
> To: Arnd Bergmann ; Song Bao Hua (Barry Song)
> 
> Cc: luojiaxing ; Linus Walleij
> ; Andy Shevchenko ; Andy
> Shevchenko ; Santosh Shilimkar
> ; Kevin Hilman ; open list:GPIO
> SUBSYSTEM ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> Hi Arnd,
> 
> On 12/02/2021 11:45, Arnd Bergmann wrote:
> > On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >  wrote:
> >>> -Original Message-
> >
> >>>
> >>> Note. there is also generic_handle_irq() call inside.
> >>
> >> So generic_handle_irq() is not safe to run in thread thus requires
> >> an interrupt-disabled environment to run? If so, I'd rather this
> >> irqsave moved into generic_handle_irq() rather than asking everyone
> >> calling it to do irqsave.
> >
> > In a preempt-rt kernel, interrupts are run in task context, so they clearly
> > should not be called with interrupts disabled, that would defeat the
> > purpose of making them preemptible.
> >
> > generic_handle_irq() does need to run with in_irq()==true though,
> > but this should be set by the caller of the gpiochip's handler, and
> > it is not set by raw_spin_lock_irqsave().
> 
> It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> dispatcher
> and generic_handle_irq() will call one of handle_level_irq or handle_edge_irq.
> 
> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to use
> generic irq handler").
> 
> The resent related discussion:
> https://lkml.org/lkml/2020/12/5/208

Ok, second thought: irqsave before generic_handle_irq() won't defeat the
purpose of preemption too much, as the irq handlers dispatched by gpiochip
will run in their own threads, not in the thread of gpiochip's handler.

So it looks like this patch could be improved by:
* moving the other raw_spin_lock_irqsave calls to raw_spin_lock;
* keeping the raw_spin_lock_irqsave around generic_handle_irq() to mute
the warning in genirq.
Something like the rough sketch below.
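
A rough sketch of that shape (simplified, hypothetical names rather than the
actual driver code; I am assuming a per-bank lock plus a separate lock used
only around the dispatch):

static irqreturn_t omap_gpio_irq_handler_sketch(struct gpio_bank *bank)
{
	unsigned long pending, flags;
	unsigned int bit;

	raw_spin_lock(&bank->lock);	/* handler context: no irqsave needed */
	pending = 0;			/* read and ack the bank's IRQ status here */
	raw_spin_unlock(&bank->lock);

	for_each_set_bit(bit, &pending, 32) {
		/* keep IRQs off only around the dispatch, to keep genirq quiet */
		raw_spin_lock_irqsave(&bank->wa_lock, flags);
		generic_handle_irq(irq_find_mapping(bank->domain, bit));
		raw_spin_unlock_irqrestore(&bank->wa_lock, flags);
	}

	return IRQ_HANDLED;
}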

> 
> 
> 
> --
> Best regards,
> Grygorii

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Friday, February 12, 2021 10:45 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Grygorii Strashko ; luojiaxing
> ; Linus Walleij ; Andy
> Shevchenko ; Andy Shevchenko
> ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> , linux-kernel@vger.kernel.org
> ; linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
>  wrote:
> > > -Original Message-
> 
> > >
> > > Note. there is also generic_handle_irq() call inside.
> >
> > So generic_handle_irq() is not safe to run in thread thus requires
> > an interrupt-disabled environment to run? If so, I'd rather this
> > irqsave moved into generic_handle_irq() rather than asking everyone
> > calling it to do irqsave.
> 
> In a preempt-rt kernel, interrupts are run in task context, so they clearly
> should not be called with interrupts disabled, that would defeat the
> purpose of making them preemptible.

Yes. Sounds sensible. Irqsave in generic_handle_irq() will defeat
the purpose of RT.

> 
> generic_handle_irq() does need to run with in_irq()==true though,
> but this should be set by the caller of the gpiochip's handler, and
> it is not set by raw_spin_lock_irqsave().
> 

So it sounds like this issue of in_irq()==true is irrelevant to this patch.

I guess this should have been set by the caller of the gpiochip's handler
somewhere; otherwise, gpiochip's irq handler couldn't be threaded at all.
Has it been set somewhere?
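
For reference, my understanding (paraphrasing include/linux/preempt.h from
memory) is that in_irq() only checks the HARDIRQ bits of preempt_count, which
irq_enter()/__irq_enter() add and which raw_spin_lock_irqsave() never touches:

/* paraphrase of the preempt.h definitions */
#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
#define in_irq()	(hardirq_count())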

>Arnd

Thanks
Barry


RE: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)

2021-02-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Oleksandr Natalenko [mailto:oleksa...@natalenko.name]
> Sent: Friday, February 12, 2021 8:43 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Mikhail Gavrilov ;
> sjenn...@linux.vnet.ibm.com; Linux List Kernel Mailing
> ; Linux Memory Management List
> 
> Subject: Re: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)
> 
> Hello.
> 
> On Thu, Feb 11, 2021 at 10:43:18AM +, Song Bao Hua (Barry Song) wrote:
> > Are you using zsmalloc? There is a known bug on the combination
> > of zsmalloc and zswap, fixed by patches of tiantao:
> >
> > mm: set the sleep_mapped to true for zbud and z3fold
> > mm/zswap: fix variable 'entry' is uninitialized when used
> > mm/zswap: fix potential memory leak
> > mm/zswap: add the flag can_sleep_mapped
> >
> > at Linux-next:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=author=tiantao6%40hisilicon.com
> 
> Is this a future stable-5.11 material (and/or, potentially, older stable
> branches
> as well)?

I believe this should go into 5.11. I will ask Andrew.

> 
> --
>   Oleksandr Natalenko (post-factum)

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Friday, February 12, 2021 9:17 AM
> To: Arnd Bergmann 
> Cc: luojiaxing ; Linus Walleij
> ; Andy Shevchenko ; Andy
> Shevchenko ; Santosh Shilimkar
> ; Kevin Hilman ; open list:GPIO
> SUBSYSTEM , linux-kernel@vger.kernel.org
> ; linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 11/02/2021 21:39, Arnd Bergmann wrote:
> > On Thu, Feb 11, 2021 at 7:25 PM Grygorii Strashko
> >  wrote:
> >> On 08/02/2021 10:56, Luo Jiaxing wrote:
> >>> There is no need to use API with _irqsave in omap_gpio_irq_handler(),
> >>> because it already be in a irq-disabled context.
> >>
> >> NACK.
> >> Who said that this is always hard IRQ handler?
> >> What about RT-kernel or boot with "threadirqs"?
> >
> > In those cases, the interrupt handler is just a normal thread, so the
> > preempt_disable() that is implied by raw_spin_lock() is sufficient.
> >
> > Disabling interrupts inside of an interrupt handler is always incorrect,
> > the patch looks like a useful cleanup to me, if only for readability.
> 
> Note. there is also generic_handle_irq() call inside.

So generic_handle_irq() is not safe to run in a thread and thus requires an
interrupt-disabled environment? If so, I'd rather the irqsave be moved into
generic_handle_irq() than ask everyone calling it to do irqsave themselves.

On the other hand, the author changed a couple of spin_lock_irqsave calls to
spin_lock; if only this one is incorrect, it seems worth a new version to
fix it.

> 
> --
> Best regards,
> grygorii

Thanks
Barry



RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 1:09 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI
> drivers
> 
> On Fri, 12 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > > -Original Message-
> > > From: Finn Thain [mailto:fth...@telegraphics.com.au]
> > > Sent: Friday, February 12, 2021 12:57 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: tanxiaofei ; j...@linux.ibm.com;
> > > martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; linux...@openeuler.org;
> > > linux-m...@vger.kernel.org
> > > Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for
> SCSI
> > > drivers
> > >
> > >
> > > On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > >
> > > > Actually in m68k, I also saw its IRQ entry disabled interrupts by
> > > > ' move  #0x2700,%sr /* disable intrs */'
> > > >
> > > > arch/m68k/include/asm/entry.h:
> > > >
> > > > .macro SAVE_ALL_SYS
> > > > move#0x2700,%sr /* disable intrs */
> > > > btst#5,%sp@(2)  /* from user? */
> > > > bnes6f  /* no, skip */
> > > > movel   %sp,sw_usp  /* save user sp */
> > > > ...
> > > >
> > > > .macro SAVE_ALL_INT
> > > > SAVE_ALL_SYS
> > > > moveq   #-1,%d0 /* not system call entry */
> > > > movel   %d0,%sp@(PT_OFF_ORIG_D0)
> > > > .endm
> > > >
> > > > arch/m68k/kernel/entry.S:
> > > >
> > > > /* This is the main interrupt handler for autovector interrupts */
> > > >
> > > > ENTRY(auto_inthandler)
> > > > SAVE_ALL_INT
> > > > GET_CURRENT(%d0)
> > > > |  put exception # in d0
> > > > bfextu  %sp@(PT_OFF_FORMATVEC){#4,#10},%d0
> > > > subw#VEC_SPUR,%d0
> > > >
> > > > movel   %sp,%sp@-
> > > > movel   %d0,%sp@-   |  put vector # on stack
> > > > auto_irqhandler_fixup = . + 2
> > > > jsr do_IRQ  |  process the IRQ
> > > > addql   #8,%sp  |  pop parameters off stack
> > > > jra ret_from_exception
> > > >
> > > > So my question is that " move   #0x2700,%sr" is actually disabling
> > > > all interrupts? And is m68k actually running irq handlers
> > > > with interrupts disabled?
> > > >
> > >
> > > When sonic_interrupt() executes, the IPL is 2 or 3 (since either IRQ may
> > > be involved). That is, SR & 0x700 is 0x200 or 0x300. The level 3 interrupt
> > > may interrupt execution of the level 2 handler so an irq lock is used to
> > > avoid re-entrance.
> > >
> > > This patch,
> > >
> > > diff --git a/drivers/net/ethernet/natsemi/sonic.c
> > > b/drivers/net/ethernet/natsemi/sonic.c
> > > index d17d1b4f2585..041354647bad 100644
> > > --- a/drivers/net/ethernet/natsemi/sonic.c
> > > +++ b/drivers/net/ethernet/natsemi/sonic.c
> > > @@ -355,6 +355,8 @@ static irqreturn_t sonic_interrupt(int irq, void 
> > > *dev_id)
> > >  */
> > > spin_lock_irqsave(>lock, flags);
> > >
> > > +   printk_once(KERN_INFO "%s: %08lx\n", __func__, flags);
> > > +
> > > status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
> > > if (!status) {
> > > spin_unlock_irqrestore(>lock, flags);
> > >
> > > produces this output,
> > >
> > > [3.80] sonic_interrupt: 2300
> >
> > I actually hope you can directly read the register rather than reading
> > a flag which might be a software one not from register.
> >
> 
> Again, the implementation of arch_local_irq_save() may be found in
> arch/m68k/include/asm/irqflags.h

Yes. I have read it. Anyway, I started a discussion in genirq
with you cc-ed:
https://lore.kernel.org/lkml/c46ddb954cfe45d9849c911271d7e...@hisilicon.com/

And thanks very much for all your efforts to help me understand m68k.
Let's get this clarified thoroughly at the genirq level.

On ARM, we also have some special high-priority interrupts which are not
NMIs but are able to preempt normal IRQs. They are managed by arch-specific
extended APIs rather than the common APIs.

Neither the arch_irqs_disabled() nor the local_irq_disable() API touches
this kind of interrupt; they are handled with ARM-specific helpers like:
local_fiq_disable()
local_fiq_enable()
set_fiq_handler()
disable_fiq()
enable_fiq()
...

So FIQ doesn't bother us at all at the genirq level.

> 
> > >
> > > I ran that code in QEMU, but experience shows that Apple hardware works
> > > exactly the same. Please do confirm this for yourself, if you still think
> > > the code and comments in sonic_interrupt are wrong.
> > >
> > > > Best Regards
> > > > Barry
> > > >
> >

Thanks
Barry



[RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-11 Thread Song Bao Hua (Barry Song)
Hi,

I am getting a very long debate with Finn in this thread:
https://lore.kernel.org/lkml/1612697823-8073-1-git-send-email-tanxiao...@huawei.com/

In short, the debate is about whether high-priority IRQs (*not NMI*)
are allowed to preempt a running IRQ handler in hardIRQ context.

In my understanding, IRQ handlers nowadays run with *all* interrupts
disabled, ever since IRQF_DISABLED was dropped and this commit:
e58aa3d2d0cc ("genirq: Run irq handlers with interrupts disabled")

b738a50a2026 ("genirq: Warn when handler enables interrupts"):
We run all handlers with interrupts disabled and expect them not to
enable them. Warn when we catch one who does.

While that seems to be true on almost all platforms, it is false on m68k.

According to Finn, while an IRQ handler is running, higher-priority
interrupts can still come in on m68k. A driver which handles this issue is
drivers/net/ethernet/natsemi/sonic.c; you can read the comment:
static irqreturn_t sonic_interrupt(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct sonic_local *lp = netdev_priv(dev);
	int status;
	unsigned long flags;

	/* The lock has two purposes. Firstly, it synchronizes sonic_interrupt()
	 * with sonic_send_packet() so that the two functions can share state.
	 * Secondly, it makes sonic_interrupt() re-entrant, as that is required
	 * by macsonic which must use two IRQs with different priority levels.
	 */
	spin_lock_irqsave(&lp->lock, flags);

	status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
	if (!status) {
		spin_unlock_irqrestore(&lp->lock, flags);

		return IRQ_NONE;
	}
}

So m68k does allow a higher-priority interrupt to preempt a hardIRQ handler,
and the code needs to call irqsave to protect against this. That is to say,
some interrupts are not disabled during hardIRQ on m68k.

But m68k doesn't trigger any warning for !irqs_disabled() in genirq:

irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags)
{
	...

	trace_irq_handler_entry(irq, action);
	res = action->handler(irq, action->dev_id);
	trace_irq_handler_exit(irq, action, res);

	if (WARN_ONCE(!irqs_disabled(), "irq %u handler %pS enabled interrupts\n",
		      irq, action->handler))
		local_irq_disable();
}

The reason is:
* arch_irqs_disabled() returns true while low-priority interrupts are disabled
even though high-priority interrupts are still open;
* local_irq_disable(), spin_lock_irqsave() etc. will disable high-priority
interrupts as well (IPL 7);
* arch_irqs_disabled() also returns true while both low- and high-priority
interrupts are disabled.
Note m68k has several interrupt levels, but in the above description I simply
abstract them as high and low to aid understanding.

I think m68k lets arch_irqs_disabled() return true under a relatively weaker
condition, pretending all IRQs are disabled while high-priority IRQs are
still open, and thus passes all the sanity checks in genirq and the core
kernel. But Finn strongly disagreed.

I am not saying I am right and Finn is wrong, but I think we need to clarify
this problem somewhere.

Personally, I would prefer "interrupts disabled" to mean "all except NMI",
so I'd like to guard this with:

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c9d6a2d7e90..b8ca27555c76 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
account_hardirq_enter(current); \
@@ -44,6 +45,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter_raw()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
} while (0)

Though Finn thought it lacks any justification.

So I am requesting comments on:
1. Are we expecting all interrupts except NMI to be disabled in irq handlers,
or do we actually allow some high-priority interrupts (between the low ones
and NMI) to come in on some platforms?

2. If either side is true, I think we need to document it somewhere as there
is always confusion about this.

Personally, I would expect all interrupts to be disabled, and I like the
arm64 way of only using high-priority interrupts as pseudo-NMIs:
https://lwn.net/Articles/755906/
Though Finn argued that this will contribute to lose 

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 12:58 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > > > should just reflect the status of all interrupts have been
> > > > > > disabled except NMI.
> > > > > >
> > > > > > irqs_disabled() should be consistent with the calling of APIs
> > > > > > such as local_irq_disable, local_irq_save, spin_lock_irqsave
> > > > > > etc.
> > > > > >
> > > > >
> > > > > When irqs_disabled() returns true, we cannot infer that
> > > > > arch_local_irq_disable() was called. But I have not yet found
> > > > > driver code or core kernel code attempting that inference.
> > > > >
> > > > > > >
> > > > > > > > Isn't arch_irqs_disabled() a status reflection of irq
> > > > > > > > disable API?
> > > > > > > >
> > > > > > >
> > > > > > > Why not?
> > > > > >
> > > > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > > > masked except NMI as NMI is unmaskable.
> > > > > >
> > > > >
> > > > > Can you support that claim with a reference to core kernel code or
> > > > > documentation? (If some arch code agrees with you, that's neither
> > > > > here nor there.)
> > > >
> > > > I think those links I share you have supported this. Just you don't
> > > > believe :-)
> > > >
> > >
> > > Your links show that the distinction between fast and slow handlers
> > > was removed. Your links don't support your claim that
> > > "arch_irqs_disabled() should mean all interrupts have been masked".
> > > Where is the code that makes that inference? Where is the
> > > documentation that supports your claim?
> >
> > (1)
> > https://lwn.net/Articles/380931/
> > Looking at all these worries, one might well wonder if a system which
> > *disabled interrupts for all handlers* would function well at all. So it
> > is interesting to note one thing: any system which has the lockdep
> > locking checker enabled has been running all handlers that way for some
> > years now. Many developers and testers run lockdep-enabled kernels, and
> > they are available for some of the more adventurous distributions
> > (Rawhide, for example) as well. So we have quite a bit of test coverage
> > for this mode of operation already.
> >
> 
> IIUC, your claim is that CONFIG_LOCKDEP involves code that contains the
> inference, "arch_irqs_disabled() means all interrupts have been masked".
> 
> Unfortunately, m68k lacks CONFIG_LOCKDEP support so I can't easily confirm
> this. I suppose there may be other architectures that support both LOCKDEP
> and nested interrupts (?)
> 
> Anyway, if you would point to the code that contains said inference, that
> would help a lot.
> 
> > (2)
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> >
> > "We run all handlers *with interrupts disabled* and expect them not to
> > enable them. Warn when we catch one who does."
> >
> 
> Again, you don't see that warning because irqs_disabled() correctly
> returns true. You can confirm this fact in QEMU or Aranym if you are
> interested.
> 
> > (3)
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > genirq: Run irq handlers *with interrupts disabled*
> >
> > Running interrupt handlers with interrupts enabled can cause stack
> > overflows. That has been observed with multiqueue NICs delivering all
> > their interrupts to a single core. We might band aid that somehow by
> > checking the interrupt stack

RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 12:57 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI
> drivers
> 
> 
> On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > Actually in m68k, I also saw its IRQ entry disabled interrupts by
> > ' move  #0x2700,%sr /* disable intrs */'
> >
> > arch/m68k/include/asm/entry.h:
> >
> > .macro SAVE_ALL_SYS
> > move#0x2700,%sr /* disable intrs */
> > btst#5,%sp@(2)  /* from user? */
> > bnes6f  /* no, skip */
> > movel   %sp,sw_usp  /* save user sp */
> > ...
> >
> > .macro SAVE_ALL_INT
> > SAVE_ALL_SYS
> > moveq   #-1,%d0 /* not system call entry */
> > movel   %d0,%sp@(PT_OFF_ORIG_D0)
> > .endm
> >
> > arch/m68k/kernel/entry.S:
> >
> > /* This is the main interrupt handler for autovector interrupts */
> >
> > ENTRY(auto_inthandler)
> > SAVE_ALL_INT
> > GET_CURRENT(%d0)
> > |  put exception # in d0
> > bfextu  %sp@(PT_OFF_FORMATVEC){#4,#10},%d0
> > subw#VEC_SPUR,%d0
> >
> > movel   %sp,%sp@-
> > movel   %d0,%sp@-   |  put vector # on stack
> > auto_irqhandler_fixup = . + 2
> > jsr do_IRQ  |  process the IRQ
> > addql   #8,%sp  |  pop parameters off stack
> > jra ret_from_exception
> >
> > So my question is that " move   #0x2700,%sr" is actually disabling
> > all interrupts? And is m68k actually running irq handlers
> > with interrupts disabled?
> >
> 
> When sonic_interrupt() executes, the IPL is 2 or 3 (since either IRQ may
> be involved). That is, SR & 0x700 is 0x200 or 0x300. The level 3 interrupt
> may interrupt execution of the level 2 handler so an irq lock is used to
> avoid re-entrance.
> 
> This patch,
> 
> diff --git a/drivers/net/ethernet/natsemi/sonic.c
> b/drivers/net/ethernet/natsemi/sonic.c
> index d17d1b4f2585..041354647bad 100644
> --- a/drivers/net/ethernet/natsemi/sonic.c
> +++ b/drivers/net/ethernet/natsemi/sonic.c
> @@ -355,6 +355,8 @@ static irqreturn_t sonic_interrupt(int irq, void *dev_id)
>  */
> spin_lock_irqsave(>lock, flags);
> 
> +   printk_once(KERN_INFO "%s: %08lx\n", __func__, flags);
> +
> status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
> if (!status) {
> spin_unlock_irqrestore(>lock, flags);
> 
> produces this output,
> 
> [3.80] sonic_interrupt: 2300

I was actually hoping you could read the status register directly rather than
a saved flags value, which might be a software copy rather than the register
itself.
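
Something like the below untested sketch is what I had in mind, i.e. dumping
SR itself rather than the flags value saved by spin_lock_irqsave() (purely
illustrative, not a patch I'm asking you to apply):

	/* read the live m68k status register inside sonic_interrupt() */
	unsigned short sr;

	asm volatile ("movew %%sr,%0" : "=d" (sr));
	printk_once(KERN_INFO "%s: sr=%04x ipl=%u\n",
		    __func__, sr, (sr >> 8) & 7);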

> 
> I ran that code in QEMU, but experience shows that Apple hardware works
> exactly the same. Please do confirm this for yourself, if you still think
> the code and comments in sonic_interrupt are wrong.
> 
> > Best Regards
> > Barry
> >

Thanks
Barry



RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)
> >
> > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> >
> > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > >
> > > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > > >
> > > > > > > > There is no warning from m68k builds. That's because
> > > > > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > > > > >
> > > > > > > So for m68k, the case is arch_irqs_disabled() is true, but
> > > > > > > interrupts can still come?
> > > > > > >
> > > > > > > Then it seems it is very confusing. If prioritized interrupts
> > > > > > > can still come while arch_irqs_disabled() is true,
> > > > > >
> > > > > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > > > > present priority mask will get serviced.
> > > > > >
> > > > > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > > > > serviced regardless.
> > > > > >
> > > > > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > > > > >
> > > > > > It raises the the mask level to 7. Again, please see
> > > > > > arch/m68k/include/asm/irqflags.h
> > > > >
> > > > > Hi Finn,
> > > > > Thanks for your explanation again.
> > > > >
> > > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > > should just reflect the status of all interrupts have been disabled
> > > > > except NMI.
> > > > >
> > > > > irqs_disabled() should be consistent with the calling of APIs such
> > > > > as local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> > > > >
> > > >
> > > > When irqs_disabled() returns true, we cannot infer that
> > > > arch_local_irq_disable() was called. But I have not yet found driver
> > > > code or core kernel code attempting that inference.
> > > >
> > > > > >
> > > > > > > Isn't arch_irqs_disabled() a status reflection of irq disable
> > > > > > > API?
> > > > > > >
> > > > > >
> > > > > > Why not?
> > > > >
> > > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > > masked except NMI as NMI is unmaskable.
> > > > >
> > > >
> > > > Can you support that claim with a reference to core kernel code or
> > > > documentation? (If some arch code agrees with you, that's neither here
> > > > nor there.)
> > >
> > > I think those links I share you have supported this. Just you don't
> > > believe :-)
> > >
> >
> > Your links show that the distinction between fast and slow handlers was
> > removed. Your links don't support your claim that "arch_irqs_disabled()
> > should mean all interrupts have been masked". Where is the code that makes
> > that inference? Where is the documentation that supports your claim?
> 
> (1)
> https://lwn.net/Articles/380931/
> Looking at all these worries, one might well wonder if a system which 
> *disabled
> interrupts for all handlers* would function well at all. So it is interesting
> to note one thing: any system which has the lockdep locking checker enabled
> has been running all handlers that way for some years now. Many developers
> and testers run lockdep-enabled kernels, and they are available for some of
> the more adventurous distributions (Rawhide, for example) as well. So we
> have quite a bit of test coverage for this mode of operation already.
> 
> (2)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> 
> "We run all handlers *with interrupts disabled* and expect them not to
> enable them. Warn when we catch one who does."
> 
> (3)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> genirq: Run irq handlers *with interrupts disabled*
> 
> Running interrupt handlers with interrupts enabled can cause stack
> overflows. That has been observed with multiqueue NICs delivering all
> their interrupts to a single core. We might band aid that somehow by
> checking the interrupt stacks, but the real safe fix is to *run the irq
> handlers with interrupts disabled*.

RE: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Mikhail Gavrilov [mailto:mikhail.v.gavri...@gmail.com]
> Sent: Thursday, February 11, 2021 9:58 PM
> To: sjenn...@linux.vnet.ibm.com; Song Bao Hua (Barry Song)
> 
> Cc: Linux List Kernel Mailing ; Linux Memory
> Management List 
> Subject: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)
> 
> Hi folks.
> During the 5.11 test cycle I caught a rare but repeatable problem when
> after a day uptime happens "BUG at mm/zswap.c:1275!". I am still not
> having an idea how to reproduce it, but maybe the authors of this code
> could explain what happens here?

Are you using zsmalloc? There is a known bug in the combination of zsmalloc
and zswap, fixed by tiantao's patches:

mm: set the sleep_mapped to true for zbud and z3fold
mm/zswap: fix variable 'entry' is uninitialized when used
mm/zswap: fix potential memory leak
mm/zswap: add the flag can_sleep_mapped

at Linux-next:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=author=tiantao6%40hisilicon.com


> 
> $ grep "mm/zswap.c" dmesg*.txt
> dmesg101.txt:[127850.513201] kernel BUG at mm/zswap.c:1275!
> dmesg11.txt:[52211.962861] kernel BUG at mm/zswap.c:1275!
> dmesg8.txt:[46610.641843] kernel BUG at mm/zswap.c:1275!
> 
> [127850.513193] [ cut here ]
> [127850.513201] kernel BUG at mm/zswap.c:1275!
> [127850.513210] invalid opcode:  [#1] SMP NOPTI
> [127850.513214] CPU: 6 PID: 485132 Comm: brave Tainted: GW
>- ---  5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64
> #1
> [127850.513218] Hardware name: System manufacturer System Product
> Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021
> [127850.513221] RIP: 0010:zswap_frontswap_load+0x258/0x260
> [127850.513228] Code: ab 83 aa f0 2f 00 00 01 65 ff 0d c3 73 cd 54 eb
> 88 48 8d 7b 10 e8 78 b9 9f 00 c7 43 10 00 00 00 00 44 8b 63 70 e9 4a
> ff ff ff <0f> 0b 0f 0b 0f 0b 66 90 0f 1f 44 00 00 41 57 31 c0 b9 0c 00
> 00 00
> [127850.513231] RSP: :a92e866c7c48 EFLAGS: 00010282
> [127850.513235] RAX: 0006 RBX: c92e7ca61830 RCX:
> 0001
> [127850.513238] RDX:  RSI: ab3429fe RDI:
> 97f4d0393010
> [127850.513240] RBP: 97ee5544d1c0 R08: 0001 R09:
> 
> [127850.513242] R10:  R11:  R12:
> ffea
> [127850.513244] R13: 97ee016800c8 R14: 97ee016800c0 R15:
> c0d54020
> [127850.513247] FS:  7fcbe628de40() GS:97f50760()
> knlGS:
> [127850.513249] CS:  0010 DS:  ES:  CR0: 80050033
> [127850.513252] CR2: 381208c29250 CR3: 0001c54ea000 CR4:
> 00350ee0
> [127850.513254] Call Trace:
> [127850.513261]  __frontswap_load+0xc3/0x160
> [127850.513265]  swap_readpage+0x1ca/0x3a0
> [127850.513270]  swapin_readahead+0x2ee/0x4e0
> [127850.513276]  do_swap_page+0x4a4/0x900
> [127850.513279]  ? lock_release+0x1e9/0x400
> [127850.513283]  ? trace_hardirqs_on+0x1b/0xe0
> [127850.513288]  handle_mm_fault+0xe7d/0x19d0
> [127850.513294]  do_user_addr_fault+0x1c7/0x4c0
> [127850.513299]  exc_page_fault+0x67/0x2a0
> [127850.513304]  ? asm_exc_page_fault+0x8/0x30
> [127850.513307]  asm_exc_page_fault+0x1e/0x30
> [127850.513310] RIP: 0033:0x560297642f44
> [127850.513314] Code: 64 75 07 45 8b 76 03 4d 03 f5 45 8b 56 ff 4d 03
> d5 66 41 81 7a 07 83 00 0f 85 4f 01 00 00 8b 5f 13 49 03 dd 8b 5b 03
> 49 03 dd <8b> 4b ff 49 03 cd 66 81 79 07 a5 00 0f 85 0f 00 00 00 8b 4b
> 0f f6
> [127850.513317] RSP: 002b:7ffc04cd4b30 EFLAGS: 00010202
> [127850.513320] RAX:  RBX: 381208c29251 RCX:
> 560297642f00
> [127850.513322] RDX: 3812080423b1 RSI: 381209b11231 RDI:
> 381209b1141d
> [127850.513323] RBP: 7ffc04cd4b90 R08: 0043 R09:
> 0024
> [127850.513325] R10: 381208042a1d R11: 381209b1141f R12:
> 09b1141d
> [127850.513327] R13: 3812 R14: 381208b368ed R15:
> 3d2fb6b7da10
> [127850.51] Modules linked in: tun snd_seq_dummy snd_hrtimer
> uinput rfcomm nft_objref nf_conntrack_netbios_ns
> nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw
> ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set
> nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac
> bnep zstd sunrpc vfat fat hid_logitech_hidpp hid_logitech_dj
> snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio
> snd_hda_codec_hdmi snd_hda_intel snd

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 2:12 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > > There is no warning from m68k builds. That's because
> > > > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > > > >
> > > > > > So for m68k, the case is arch_irqs_disabled() is true, but
> > > > > > interrupts can still come?
> > > > > >
> > > > > > Then it seems it is very confusing. If prioritized interrupts
> > > > > > can still come while arch_irqs_disabled() is true,
> > > > >
> > > > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > > > present priority mask will get serviced.
> > > > >
> > > > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > > > serviced regardless.
> > > > >
> > > > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > > > >
> > > > > It raises the the mask level to 7. Again, please see
> > > > > arch/m68k/include/asm/irqflags.h
> > > >
> > > > Hi Finn,
> > > > Thanks for your explanation again.
> > > >
> > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > should just reflect the status of all interrupts have been disabled
> > > > except NMI.
> > > >
> > > > irqs_disabled() should be consistent with the calling of APIs such
> > > > as local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> > > >
> > >
> > > When irqs_disabled() returns true, we cannot infer that
> > > arch_local_irq_disable() was called. But I have not yet found driver
> > > code or core kernel code attempting that inference.
> > >
> > > > >
> > > > > > Isn't arch_irqs_disabled() a status reflection of irq disable
> > > > > > API?
> > > > > >
> > > > >
> > > > > Why not?
> > > >
> > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > masked except NMI as NMI is unmaskable.
> > > >
> > >
> > > Can you support that claim with a reference to core kernel code or
> > > documentation? (If some arch code agrees with you, that's neither here
> > > nor there.)
> >
> > I think those links I share you have supported this. Just you don't
> > believe :-)
> >
> 
> Your links show that the distinction between fast and slow handlers was
> removed. Your links don't support your claim that "arch_irqs_disabled()
> should mean all interrupts have been masked". Where is the code that makes
> that inference? Where is the documentation that supports your claim?

(1)
https://lwn.net/Articles/380931/
Looking at all these worries, one might well wonder if a system which *disabled
interrupts for all handlers* would function well at all. So it is interesting
to note one thing: any system which has the lockdep locking checker enabled
has been running all handlers that way for some years now. Many developers
and testers run lockdep-enabled kernels, and they are available for some of
the more adventurous distributions (Rawhide, for example) as well. So we
have quite a bit of test coverage for this mode of operation already.

(2)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b738a50a

"We run all handlers *with interrupts disabled* and expect them not to
enable them. Warn when we catch one who does."

(3) 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e58aa3d2d0cc
genirq: Run irq handlers *with interrupts disabled*

Running interrupt handlers with interrupts enabled can cause stack
overflows. That has been observed with multiqueue NICs delivering all
their interrupts to a single core. We might band aid that somehow by
> checking the interrupt stacks, but the real safe fix is to *run the irq
> handlers with interrupts disabled*.

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 11:35 AM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > >
> > > > > There is no warning from m68k builds. That's because
> > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > >
> > > > So for m68k, the case is
> > > > arch_irqs_disabled() is true, but interrupts can still come?
> > > >
> > > > Then it seems it is very confusing. If prioritized interrupts can
> > > > still come while arch_irqs_disabled() is true,
> > >
> > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > present priority mask will get serviced.
> > >
> > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > serviced regardless.
> > >
> > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > >
> > > It raises the the mask level to 7. Again, please see
> > > arch/m68k/include/asm/irqflags.h
> >
> > Hi Finn,
> > Thanks for your explanation again.
> >
> > TBH, that is why m68k is so confusing. irqs_disabled() on m68k should
> > just reflect the status of all interrupts have been disabled except NMI.
> >
> > irqs_disabled() should be consistent with the calling of APIs such as
> > local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> >
> 
> When irqs_disabled() returns true, we cannot infer that
> arch_local_irq_disable() was called. But I have not yet found driver code
> or core kernel code attempting that inference.
> 
> > >
> > > > Isn't arch_irqs_disabled() a status reflection of irq disable API?
> > > >
> > >
> > > Why not?
> >
> > If so, arch_irqs_disabled() should mean all interrupts have been masked
> > except NMI as NMI is unmaskable.
> >
> 
> Can you support that claim with a reference to core kernel code or
> documentation? (If some arch code agrees with you, that's neither here nor
> there.)

I think those links I shared with you support this. You just don't
believe it :-)

> 
> > >
> > > Are all interrupts (including NMI) masked whenever
> > > arch_irqs_disabled() returns true on your platforms?
> >
> > On my platform, once irqs_disabled() is true, all interrupts are masked
> > except NMI. NMI just ignore spin_lock_irqsave or local_irq_disable.
> >
> > On ARM64, we also have high-priority interrupts, but they are running as
> > PESUDO_NMI:
> > https://lwn.net/Articles/755906/
> >
> 
> A glance at the ARM GIC specification suggests that your hardware works
> much like 68000 hardware.
> 
>When enabled, a CPU interface takes the highest priority pending
>interrupt for its connected processor and determines whether the
>interrupt has sufficient priority for it to signal the interrupt
>request to the processor. [...]
> 
>When the processor acknowledges the interrupt at the CPU interface, the
>Distributor changes the status of the interrupt from pending to either
>active, or active and pending. At this point the CPU interface can
>signal another interrupt to the processor, to preempt interrupts that
>are active on the processor. If there is no pending interrupt with
>sufficient priority for signaling to the processor, the interface
>deasserts the interrupt request signal to the processor.
> 
> https://developer.arm.com/documentation/ihi0048/b/
> 
> Have you considered that Linux/arm might benefit if it could fully exploit
> hardware features already available, such as the interrupt priority
> masking feature in the GIC in existing arm systems?

I guess not :-) There are only two levels: IRQ and NMI. Injecting a
high-priority IRQ level between them makes no sense.

To me, arm64's design is quite clear and not confusing at all.

> 
> > On m68k, it seems you mean:
> > irq_disabled() is true, but high-priority interrupts can still come;
> > local_irq_disable() can disable high-priority interrupts, and at that
> > time, irq_disabled() is also true.
> >
> > TBH, this is wrong and confusing on m68k.

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Thursday, February 11, 2021 7:04 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:
> 
> > The problem is that SVA declares we can use any memory of a process
> > to do I/O. And in real scenarios, we are unable to customize most
> > applications to make them use the pool. So we are looking for some
> > extension generically for applications such as Nginx, Ceph.
> 
> But those applications will suffer jitter even if their are using CPU
> to do the same work. I fail to see why adding an accelerator suddenly
> means the application owner will care about jitter introduced by
> migration/etc.

The only point here is that when migration hits memory the accelerator is
using, the impact/jitter is much bigger than it is on the CPU. The accelerator
might then become unhelpful.

> 
> Again in proper SVA it should be quite unlikely to take a fault caused
> by something like migration, on the same likelyhood as the CPU. If
> things are faulting so much this is a problem then I think it is a
> system level problem with doing too much page motion.

My point is that a single SVA application shouldn't require the system to make
global changes, such as disabling NUMA balancing or THP, to decrease its page
fault frequency, since that affects other applications.

Anyway, people are away for the Lunar New Year. Hopefully we will get more real
benchmark data afterwards to make the discussion more targeted.

> 
> Jason

Thanks
Barry


RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 10:07 AM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > > sonic_interrupt() uses an irq lock within an interrupt handler
> > > > > > to avoid issues relating to this. This kind of locking may be
> > > > > > needed in the drivers you are trying to patch. Or it might not.
> > > > > > Apparently, no-one has looked.
> > > >
> > > > Is the comment in sonic_interrupt() outdated according to this:
> > > > m68k: irq: Remove IRQF_DISABLED
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=77a4279
> > > > http://lkml.iu.edu/hypermail/linux/kernel/1109.2/01687.html
> > > >
> > >
> > > The removal of IRQF_DISABLED isn't relevant to this driver. Commit
> > > 77a42796786c ("m68k: Remove deprecated IRQF_DISABLED") did not disable
> > > interrupts, it just removed some code to enable them.
> > >
> > > The code and comments in sonic_interrupt() are correct. You can
> > > confirm this for yourself quite easily using QEMU and a
> > > cross-compiler.
> > >
> > > > and this: genirq: Warn when handler enables interrupts
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> > > >
> > > > wouldn't genirq report a warning on m68k?
> > > >
> > >
> > > There is no warning from m68k builds. That's because
> > > arch_irqs_disabled() returns true when the IPL is non-zero.
> >
> >
> > So for m68k, the case is
> > arch_irqs_disabled() is true, but interrupts can still come?
> >
> > Then it seems it is very confusing. If prioritized interrupts can still
> > come while arch_irqs_disabled() is true,
> 
> Yes, on m68k CPUs, an IRQ having a priority level higher than the present
> priority mask will get serviced.
> 
> Non-Maskable Interrupt (NMI) is not subject to this rule and gets serviced
> regardless.
> 
> > how could spin_lock_irqsave() block the prioritized interrupts?
> 
> It raises the the mask level to 7. Again, please see
> arch/m68k/include/asm/irqflags.h

Hi Finn,
Thanks for your explanation again.

TBH, that is why m68k is so confusing. irqs_disabled() on m68k should just
reflect whether all interrupts except NMI have been disabled.

irqs_disabled() should be consistent with the calling of APIs such
as local_irq_disable, local_irq_save, spin_lock_irqsave etc.

> 
> > Isn't arch_irqs_disabled() a status reflection of irq disable API?
> >
> 
> Why not?

If so, arch_irqs_disabled() should mean all interrupts have been masked
except NMI as NMI is unmaskable.

> 
> Are all interrupts (including NMI) masked whenever arch_irqs_disabled()
> returns true on your platforms?

On my platform, once irqs_disabled() is true, all interrupts are masked except
NMI. NMI simply ignores spin_lock_irqsave() and local_irq_disable().

On ARM64, we also have high-priority interrupts, but they run as PSEUDO_NMI:
https://lwn.net/Articles/755906/

On m68k, it seems you mean:
irqs_disabled() is true, but high-priority interrupts can still arrive;
local_irq_disable() can disable high-priority interrupts, and at that
time, irqs_disabled() is also true.

TBH, this is wrong and confusing on m68k.
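
To spell out what I find confusing, here is a simplified sketch of the m68k
semantics as I understand them from this thread (set_ipl() and the naming are
made up for illustration; this is not the real arch/m68k/include/asm/irqflags.h):

/* irqs_disabled() reports true for *any* non-zero priority mask ... */
static inline bool m68k_style_irqs_disabled(unsigned long sr)
{
	return ((sr >> 8) & 7) != 0;
}

/* ... but only local_irq_disable() raises the mask all the way to 7 */
static inline void m68k_style_local_irq_disable(void)
{
	set_ipl(7);		/* hypothetical helper */
}

/*
 * So with IPL == 2, irqs_disabled() already reports true, yet a level-3
 * interrupt can still be delivered; only IPL == 7 masks everything except
 * NMI. That intermediate state is exactly what I find confusing.
 */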

> 
> > Thanks
> > Barry
> >

Thanks
Barry


RE: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to replace spin_lock_irqsave with spin_lock

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Thursday, February 11, 2021 3:57 AM
> To: Song Bao Hua (Barry Song) 
> Cc: luojiaxing ; Linus Walleij
> ; Grygorii Strashko ;
> Santosh Shilimkar ; Kevin Hilman ;
> open list:GPIO SUBSYSTEM ; Linux Kernel Mailing
> List ; linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches
> to replace spin_lock_irqsave with spin_lock
> 
> On Wed, Feb 10, 2021 at 11:50:45AM +, Song Bao Hua (Barry Song) wrote:
> > > -Original Message-
> > > From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> > > Sent: Wednesday, February 10, 2021 11:51 PM
> > > On Wed, Feb 10, 2021 at 5:43 AM luojiaxing  wrote:
> > > > On 2021/2/9 17:42, Andy Shevchenko wrote:
> 
> ...
> 
> > > > Between IRQ handler A and IRQ handle A, it's no need for a SLIS.
> > >
> > > Right, but it's not the case in the patches you provided.
> >
> > The code still holds spin_lock. So if two cpus call same IRQ handler,
> > spin_lock makes them spin; and if interrupts are threaded, spin_lock
> > makes two threads run the same handler one by one.
> 
> If you run on an SMP system and it happens that spin_lock_irqsave() just
> immediately after spin_unlock(), you will get into the troubles. Am I 
> mistaken?

Hi Andy,
Thanks for your reply.

But I don't agree that spin_lock_irqsave() immediately after spin_unlock()
could be a problem on SMP.
When the first CPU releases the spinlock via spin_unlock(), it has completed
its critical section on the shared data; only then does the second CPU acquire
the lock. The two CPUs won't overlap while accessing the same data.

> 
> I think this entire activity is a carefully crafted mine field for the future
> syzcaller and fuzzers alike. I don't believe there are no side effects in a
> long
> term on all possible systems and configurations (including forced threaded IRQ
> handlers).

Also, I don't understand why forced threaded IRQs could be a problem. Once the
IRQ handler runs as a thread, the situation is actually much simpler than with
a non-threaded IRQ: since all threads, including the IRQ threads, have to take
the spin_lock, they won't access the same critical data at the same time either.
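
To make the pattern concrete, the kind of change being discussed is roughly
the below sketch (the foo_* names are hypothetical, not from any real driver):

#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct foo_dev {			/* hypothetical device */
	spinlock_t lock;
};

/* Hard IRQ handler: the series assumes it already runs with interrupts
 * disabled on the local CPU, so plain spin_lock() is used here.
 */
static irqreturn_t foo_irq_handler(int irq, void *dev_id)
{
	struct foo_dev *foo = dev_id;

	spin_lock(&foo->lock);		/* was: spin_lock_irqsave() */
	/* ... handle the events ... */
	spin_unlock(&foo->lock);	/* was: spin_unlock_irqrestore() */

	return IRQ_HANDLED;
}

/* Process context still uses the irqsave variant, otherwise the handler
 * above could deadlock against this path on the same CPU.
 */
static void foo_submit(struct foo_dev *foo)
{
	unsigned long flags;

	spin_lock_irqsave(&foo->lock, flags);
	/* ... queue work for the device ... */
	spin_unlock_irqrestore(&foo->lock, flags);
}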

> 
> I would love to see a better explanation in the commit message of such patches
> which makes it clear that there are *no* side effects.
> 

People had the same questions before, but I guess the discussion around this
commit led to a better commit log:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4eb7d0cd59

> For time being, NAK to the all patches of this kind.

Fair enough. If you expect a better explanation, I agree the commit log is too
short.

> 
> --
> With Best Regards,
> Andy Shevchenko
> 

Thanks
Barry



RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Thursday, February 11, 2021 12:22 AM
> To: Song Bao Hua (Barry Song) 
> Cc: valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On Tue, Feb 09, 2021 at 08:58:15PM +, Song Bao Hua (Barry Song) wrote:
> 
> > > I've finally had a moment to think about this, would it make sense to
> > > also break up group: node0+1, such that we then end up with 3 groups of
> > > equal size?
> >
> 
> > Since the sched_domain[n-1] of a part of node[m]'s siblings are able
> > to cover the whole span of sched_domain[n] of node[m], there is no
> > necessity to scan over all siblings of node[m], once sched_domain[n]
> > of node[m] has been covered, we can stop making more sched_groups. So
> > the number of sched_groups is small.
> >
> > So historically, the code has never tried to make sched_groups result
> > in equal size. And it permits the overlapping of local group and remote
> > groups.
> 
> Histrorically groups have (typically) always been the same size though.

This is probably true for other platforms. But unfortunately it has never
been true on my platform :-)

node   0   1   2   3 
  0:  10  12  20  22 
  1:  12  10  22  24 
  2:  20  22  10  12 
  3:  22  24  12  10

In this case, we have only two CPUs in each NUMA node.

CPU0's domain-3 reports no broken sched_group, but its first group covers
0-5 (node0-node2) while the second group covers 4-7 (node2-node3):

[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }


> 
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
> 
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> 
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.

As you can see, even if we give all groups of domain-2 equal size by breaking
up both the local group and the remote groups, we will hit the same problem in
domain-3. What's more tricky is that domain-3 doesn't report the
"groups don't span domain->span" error at all.

So it seems we would need to change both domain-2 and domain-3, even though
domain-3 shows no "groups don't span domain->span" issue.

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to replace spin_lock_irqsave with spin_lock

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Wednesday, February 10, 2021 11:51 PM
> To: luojiaxing 
> Cc: Linus Walleij ; Andy Shevchenko
> ; Grygorii Strashko
> ; Santosh Shilimkar ; Kevin
> Hilman ; open list:GPIO SUBSYSTEM
> ; Linux Kernel Mailing List
> ; linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to
> replace spin_lock_irqsave with spin_lock
> 
> On Wed, Feb 10, 2021 at 5:43 AM luojiaxing  wrote:
> > On 2021/2/9 17:42, Andy Shevchenko wrote:
> > > On Tue, Feb 9, 2021 at 11:24 AM luojiaxing  wrote:
> > >> On 2021/2/8 21:28, Andy Shevchenko wrote:
> > >>> On Mon, Feb 8, 2021 at 11:11 AM luojiaxing  
> > >>> wrote:
> >  On 2021/2/8 16:56, Luo Jiaxing wrote:
> > > There is no need to use API with _irqsave in hard IRQ handler, So 
> > > replace
> > > those with spin_lock.
> > >>> How do you know that another CPU in the system can't serve the
> > > The keyword here is: *another*.
> >
> > ooh, sorry, now I got your point.
> >
> > As to me, I don't think another CPU can serve the IRQ when one CPU
> > runing hard IRQ handler,
> 
> Why is it so?
> Each CPU can serve IRQs separately.
> 
> > except it's a per CPU interrupts.
> 
> I didn't get how it is related.
> 
> > The following is a simple call logic when IRQ come.
> >
> > elx_irq -> handle_arch_irq -> __handle_domain_irq -> desc->handle_irq ->
> > handle_irq_event
> 
> What is `elx_irq()`? I haven't found any mention of this in the kernel
> source tree.
> But okay, it shouldn't prevent our discussion.
> 
> > Assume that two CPUs receive the same IRQ and enter the preceding
> > process. Both of them will go to desc->handle_irq().
> 
> Ah, I'm talking about the same IRQ by number (like Linux IRQ number,
> means from the same source), but with different sequence number (means
> two consequent events).
> 
> > In handle_irq(), raw_spin_lock(>lock) always be called first.
> > Therefore, even if two CPUs are running handle_irq(),
> >
> > only one can get the spin lock. Assume that CPU A obtains the spin lock.
> > Then CPU A will sets the status of irq_data to
> >
> > IRQD_IRQ_INPROGRESS in handle_irq_event() and releases the spin lock.
> > Even though CPU B gets the spin lock later and
> >
> > continue to run handle_irq(), but the check of irq_may_run(desc) causes
> > it to exit.
> >
> >
> > so, I think we don't own the situation that two CPU server the hard IRQ
> > handler at the same time.
> 
> Okay. Assuming your analysis is correct, have you considered the case
> when all IRQ handlers are threaded? (There is a kernel command line
> option to force this)
> 
> > >>> following interrupt from the hardware at the same time?
> > >> Yes, I have some question before.
> > >>
> > >> There are some similar discussion here,  please take a look, Song baohua
> > >> explained it more professionally.
> > >>
> > >>
> https://lore.kernel.org/lkml/e949a474a9284ac6951813bfc8b34...@hisilicon.co
> m/
> > >>
> > >> Here are some excerpts from the discussion:
> > >>
> > >> I think the code disabling irq in hardIRQ is simply wrong.
> > > Why?
> >
> >
> > I mention the following call before.
> >
> > elx_irq -> handle_arch_irq -> __handle_domain_irq -> desc->handle_irq ->
> > handle_irq_event
> >
> >
> > __handle_domain_irq() will call irq_enter(), it ensures that the IRQ
> > processing of the current CPU can not be preempted.
> >
> > So I think this is the reason why Song baohua said it's not need to
> > disable IRQ in hardIRQ handler.
> >
> > >> Since this commit
> > >>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > >> genirq: Run irq handlers with interrupts disabled
> > >>
> > >> interrupt handlers are definitely running in a irq-disabled context
> > >> unless irq handlers enable them explicitly in the handler to permit
> > >> other interrupts.
> > > This doesn't explain any changes in the behaviour on SMP.
> > > IRQ line can be disabled on a few stages:
> > >   a) on the source (IP that generates an event)
> > >   b) on IRQ router / controller
> > >   c) on CPU side
> >
> > yes, you are right.
> >
> > > The commit above is discussing (rightfully!) the problem when all
> > > interrupts are being served by a *single* core. Nobody prevents them
> > > from being served by *different* cores simultaneously. Also, see [1].
> > >
> > > [1]: https://www.kernel.org/doc/htmldocs/kernel-locking/cheatsheet.html
> >
> > I check [1], quite useful description about locking, thanks. But you can
> > see Table of locking Requirements
> >
> > Between IRQ handler A and IRQ handle A, it's no need for a SLIS.
> 
> Right, but it's not the case in the patches you provided.

The code still holds the spin_lock. So if two CPUs enter the same IRQ handler,
the spin_lock makes one of them spin; and if interrupts are threaded, the
spin_lock makes the two threads run the handler one after the other.

> 
> --
> With Best Regards,
> Andy Shevchenko

Thanks
Barry



RE: [PATCH v3] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Meelis Roos [mailto:mr...@linux.ee]
> Sent: Wednesday, February 10, 2021 1:40 AM
> To: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; pet...@infradead.org; dietmar.eggem...@arm.com;
> morten.rasmus...@arm.com; linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org
> Subject: Re: [PATCH v3] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> I did a rudimentary benchmark on the same 8-node Sun Fire X4600-M2, on top of
> todays  5.11.0-rc7-2-ge0756cfc7d7c.
> 
> The test: building clean kernel with make -j64 after make clean and 
> drop_caches.
> 
> While running clean kernel / 3 tries):
> 
> real2m38.574s
> user46m18.387s
> sys 6m8.724s
> 
> real2m37.647s
> user46m34.171s
> sys 6m11.993s
> 
> real2m37.832s
> user46m34.910s
> sys 6m12.013s
> 
> 
> While running patched kernel:
> 
> real2m40.072s
> user46m22.610s
> sys 6m6.658s
> 
> 
> for real time, seems to be 1.5s-2s slower out of 160s (noise?) User and system
> time are slightly less, on the other hand, so seems good to me.

I ran the same test on the machine with the below topology:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0-31
node 0 size: 64144 MB
node 0 free: 62356 MB
node 1 cpus: 32-63
node 1 size: 64509 MB
node 1 free: 62996 MB
node 2 cpus: 64-95
node 2 size: 64509 MB
node 2 free: 63020 MB
node 3 cpus: 96-127
node 3 size: 63991 MB
node 3 free: 62647 MB
node distances:
node   0   1   2   3 
  0:  10  12  20  22 
  1:  12  10  22  24 
  2:  20  22  10  12 
  3:  22  24  12  10

Basically, the influence on the kernel build is within the noise. I ran a
couple of rounds of the following commands:

make clean
echo 3 > /proc/sys/vm/drop_caches
make Image -j100

w/ patch:                 w/o patch:

real    1m17.644s         real    1m19.510s
user    32m12.074s        user    32m14.133s
sys     4m35.827s         sys     4m38.198s

real    1m15.855s         real    1m17.303s
user    32m7.700s         user    32m14.128s
sys     4m35.868s         sys     4m40.094s

real    1m18.918s         real    1m19.583s
user    32m13.352s        user    32m13.205s
sys     4m40.161s         sys     4m40.696s

real    1m20.329s         real    1m17.819s
user    32m7.255s         user    32m11.753s
sys     4m36.706s         sys     4m41.371s

real    1m17.773s         real    1m16.763s
user    32m19.912s        user    32m15.607s
sys     4m36.989s         sys     4m41.297s

real    1m14.943s         real    1m18.551s
user    32m14.549s        user    32m18.521s
sys     4m38.670s         sys     4m41.392s

real    1m16.439s         real    1m18.154s
user    32m12.864s        user    32m14.540s
sys     4m39.424s         sys     4m40.364s

Our team members who ran unixbench with the 3-hops fix patch reported the
following scores (3 rounds):

w/o patch:    w/ patch:
1228.6        1254.9
1231.4        1265.7
1226.1        1266.1

One interesting thing is that if we change the kernel to clear the balancing
flags below for the last hop,

sd->flags &= ~(SD_BALANCE_EXEC |
               SD_BALANCE_FORK |
               SD_WAKE_AFFINE);

we see a further increase in the unixbench score. So it sounds like that
balancing shouldn't go that far. But that is a different topic.

> 
> --
> Meelis Roos 

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 10, 2021 5:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization
> for SCSI drivers
> 
> On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > > avoid issues relating to this. This kind of locking may be needed in
> > > > the drivers you are trying to patch. Or it might not. Apparently,
> > > > no-one has looked.
> >
> > Is the comment in sonic_interrupt() outdated according to this:
> > m68k: irq: Remove IRQF_DISABLED
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=77a4279
> > http://lkml.iu.edu/hypermail/linux/kernel/1109.2/01687.html
> >
> 
> The removal of IRQF_DISABLED isn't relevant to this driver. Commit
> 77a42796786c ("m68k: Remove deprecated IRQF_DISABLED") did not disable
> interrupts, it just removed some code to enable them.
> 
> The code and comments in sonic_interrupt() are correct. You can confirm
> this for yourself quite easily using QEMU and a cross-compiler.
> 
> > and this:
> > genirq: Warn when handler enables interrupts
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> >
> > wouldn't genirq report a warning on m68k?
> >
> 
> There is no warning from m68k builds. That's because arch_irqs_disabled()
> returns true when the IPL is non-zero.


So for m68k, the situation is that arch_irqs_disabled() is true, but
interrupts can still arrive?

Then it seems very confusing. If prioritized interrupts can still arrive while
arch_irqs_disabled() is true, how could spin_lock_irqsave() block those
prioritized interrupts? Isn't arch_irqs_disabled() a status reflection of the
irq-disable APIs?

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 10, 2021 1:29 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > >
> > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > drivers. There are no function changes, but may speed up if
> > > > > > interrupt happen too often.
> > > > >
> > > > > This change doesn't necessarily work on platforms that support
> > > > > nested interrupts.
> > > > >
> > > > > Were you able to measure any benefit from this change on some
> > > > > other platform?
> > > >
> > > > I think the code disabling irq in hardIRQ is simply wrong. Since
> > > > this commit
> > > >
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > genirq: Run irq handlers with interrupts disabled
> > > >
> > > > interrupt handlers are definitely running in a irq-disabled context
> > > > unless irq handlers enable them explicitly in the handler to permit
> > > > other interrupts.
> > > >
> > >
> > > Repeating the same claim does not somehow make it true.
> >
> > Sorry for I didn't realize xiaofei had replied.
> >
> 
> I was referring to the claim in patch 00/32, i.e. that interrupt handlers
> only run when irqs are disabled.
> 
> > > If you put your claim to the test, you'll see that that interrupts are
> > > not disabled on m68k when interrupt handlers execute.
> >
> > Sounds like an implementation issue of m68k since IRQF_DISABLED has been
> > totally removed.
> >
> 
> It's true that IRQF_DISABLED could be used to avoid the need for irq locks
> in interrupt handlers. So, if you want to remove irq locks from interrupt
> handlers, today you can't use IRQF_DISABLED to help you. So what?
> 
> > >
> > > The Interrupt Priority Level (IPL) can prevent any given irq handler
> > > from being re-entered, but an irq with a higher priority level may be
> > > handled during execution of a lower priority irq handler.
> > >
> >
> > We used to have IRQF_DISABLED to support so-called "fast interrupt" to
> > avoid this.
> >
> > But the concept has been totally removed. That is interesting if m68k
> > still has this issue.
> >
> 
> Prioritized interrupts are beneficial. Why would you want to avoid them?
> 

I doubt this is true, as this has already been considered unnecessary
in Linux:
https://lwn.net/Articles/380931/

> Moreover, there's no reason to believe that m68k is the only platform that
> supports nested interrupts.

I doubt that is true, as genirq runs under the assumption that hardIRQ
handlers execute in an irq-disabled context:
"We run all handlers with interrupts disabled and expect them not to
enable them. Warn when we catch one who does."
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b738a50a

If it does, m68k violates that assumption of genirq.
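
Roughly speaking, the check added by that commit looks like the below
(paraphrased from memory, not a verbatim copy of kernel/irq/handle.c); on m68k
it never fires precisely because arch_irqs_disabled() already returns true once
the IPL has been raised:

	res = action->handler(irq, action->dev_id);

	if (WARN_ONCE(!irqs_disabled(),
		      "irq %u handler %pF enabled interrupts\n",
		      irq, action->handler))
		local_irq_disable();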

> 
> > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > avoid issues relating to this. This kind of locking may be needed in
> > > the drivers you are trying to patch. Or it might not. Apparently,
> > > no-one has looked.
> >

Thanks
Barry


RE: [PATCH v4 01/12] genirq: add IRQF_NO_AUTOEN for request_irq

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, January 29, 2021 11:35 AM
> To: t...@linutronix.de; dmitry.torok...@gmail.com; m...@kernel.org;
> gre...@linuxfoundation.org; linux-in...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; Song Bao Hua (Barry Song)
> 
> Subject: [PATCH v4 01/12] genirq: add IRQF_NO_AUTOEN for request_irq
> 
> Many drivers don't want interrupts enabled automatically due to
> request_irq(). So they are handling this issue by either way of
> the below two:
> (1)
> irq_set_status_flags(irq, IRQ_NOAUTOEN);
> request_irq(dev, irq...);
> (2)
> request_irq(dev, irq...);
> disable_irq(irq);
> 
> The code in the second way is silly and unsafe. In the small time
> gap between request_irq() and disable_irq(), interrupts can still
> come.
> The code in the first way is safe though we might be able to do it
> in the generic irq code.
> 
> With this patch, drivers can request_irq with IRQF_NO_AUTOEN flag.
> They will need neither irq_set_status_flags() nor disable_irq().
> Hundreds of drivers with this problem will be handled afterwards.
> 
> Cc: Dmitry Torokhov 
> Signed-off-by: Barry Song 
> ---
>  -v4: remove the irq_settings magic for NOAUTOEN

Hi Thomas,
Any further comments on this? Does it have a chance of hitting 5.12 so that we
can begin to convert those drivers in 5.12?

Thanks
Barry

> 
>  include/linux/interrupt.h | 3 +++
>  kernel/irq/manage.c   | 8 +++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index bb8ff9083e7d..0f22d277078c 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -61,6 +61,8 @@
>   *interrupt handler after suspending interrupts. For system
>   *wakeup devices users need to implement wakeup detection in
>   *their interrupt handlers.
> + * IRQF_NO_AUTOEN - Don't enable IRQ automatically when users request it. Users
> + *  will enable it explicitly by enable_irq() later.
>   */
>  #define IRQF_SHARED          0x00000080
>  #define IRQF_PROBE_SHARED    0x00000100
> @@ -74,6 +76,7 @@
>  #define IRQF_NO_THREAD       0x00010000
>  #define IRQF_EARLY_RESUME    0x00020000
>  #define IRQF_COND_SUSPEND    0x00040000
> +#define IRQF_NO_AUTOEN       0x00080000
> 
>  #define IRQF_TIMER   (__IRQF_TIMER | IRQF_NO_SUSPEND | 
> IRQF_NO_THREAD)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index dec3f73e8db9..95014073bd2e 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc,
> struct irqaction *new)
>   irqd_set(>irq_data, IRQD_NO_BALANCING);
>   }
> 
> - if (irq_settings_can_autoenable(desc)) {
> + if (!(new->flags & IRQF_NO_AUTOEN) &&
> + irq_settings_can_autoenable(desc)) {
>   irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
>   } else {
>   /*
> @@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq,
> irq_handler_t handler,
>* which interrupt is which (messes up the interrupt freeing
>* logic etc).
>*
> +  * Also shared interrupts do not go well with disabling auto enable.
> +  * The sharing interrupt might request it while it's still disabled
> +  * and then wait for interrupts forever.
> +  *
>* Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
>* it cannot be set along with IRQF_NO_SUSPEND.
>*/
>   if (((irqflags & IRQF_SHARED) && !dev_id) ||
> + ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
>   (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
>   ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
>   return -EINVAL;
> --
> 2.25.1



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, February 10, 2021 2:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 03:01:42AM +, Song Bao Hua (Barry Song) wrote:
> 
> > On the other hand, wouldn't it be the benefit of hardware accelerators
> > to have a lower and more stable latency zip/encryption than CPU?
> 
> No, I don't think so.

Fortunately or unfortunately, I think my colleagues do have the target of
achieving lower-latency and more stable zip/encryption by using accelerators;
otherwise they will just use the CPU directly, as there would be no advantage
in the accelerators.

> 
> If this is an important problem then it should apply equally to CPU
> and IO jitter.
> 
> Honestly I find the idea that occasional migration jitters CPU and DMA
> to not be very compelling. Such specialized applications should
> allocate special pages to avoid this, not adding an API to be able to
> lock down any page

That is exactly what we have done: we provide a hugeTLB pool so that
applications can allocate memory from this pool.

 +----------------------------------+
 |                                  |
 |  applications using accelerators |
 +--------+----------------+--------+
          |                |
   alloc from pool    free to pool
          |                |
 +--------+----------------+--------+
 |                                  |
 |        HugeTLB memory pool       |
 |                                  |
 +----------------------------------+

The problem is that SVA declares that any memory of a process can be used for
I/O. And in real scenarios, we are unable to customize most applications to
make them use the pool. So we are looking for some generic extension for
applications such as Nginx and Ceph.

I am also thinking about leveraging vm.compact_unevictable_allowed, which David
suggested, and extending it, for example permitting users to disable compaction
and NUMA balancing on the unevictable pages of an SVA process, which might be a
smaller change.
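
For completeness, the pool approach above boils down to something like the
following userspace sketch (illustrative only, error handling omitted), and
this is exactly the kind of change we cannot push into applications such as
Nginx or Ceph:

#include <sys/mman.h>
#include <stddef.h>

#define BUF_SIZE	(2UL * 1024 * 1024)	/* one 2MB huge page */

/* Carve an I/O buffer out of the hugeTLB pool and lock it, so the pages
 * backing accelerator I/O are neither swapped out nor easily migrated.
 */
static void *alloc_stable_io_buffer(void)
{
	void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;

	mlock(buf, BUF_SIZE);	/* keep it resident */
	return buf;
}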

> 
> Jason

Thanks
Barry


