RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-04-20 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Wednesday, April 21, 2021 6:32 AM
> To: Song Bao Hua (Barry Song) ;
> catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> On 3/23/21 4:21 PM, Song Bao Hua (Barry Song) wrote:
> 
> >>
> >> On 3/18/21 9:16 PM, Barry Song wrote:
> >>> From: Tim Chen 
> >>>
> >>> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> >>> is shared among a cluster of cores instead of being exclusive
> >>> to one single core.
> >>>
> >>> To prevent oversubscription of L2 cache, load should be
> >>> balanced between such L2 clusters, especially for tasks with
> >>> no shared data.
> >>>
> >>> Also with cluster scheduling policy where tasks are woken up
> >>> in the same L2 cluster, we will benefit from keeping tasks
> >>> related to each other and likely sharing data in the same L2
> >>> cluster.
> >>>
> >>> Add CPU masks of CPUs sharing the L2 cache so we can build such
> >>> L2 cluster scheduler domain.
> >>>
> >>> Signed-off-by: Tim Chen 
> >>> Signed-off-by: Barry Song 
> >>
> >>
> >> Barry,
> >>
> >> Can you also add this chunk to the patch.
> >> Thanks.
> >
> > Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.
> >
> 
> Barry,
> 
> This chunk will also need to be added to return cluster id for x86.
> Please add it in your next rev.

Yes. Thanks. I'll put this in either RFC v7 or patch v1.

The spreading path is much easier, while the packing path is quite
tricky. But RFC v6 seems quite close to what we want to achieve:
packing related tasks by scanning the cluster for tasks within the
same NUMA node:
https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao@hisilicon.com/

If a couple of related tasks are already in the same LLC (NUMA node),
scanning the cluster will gather them further. If they are running in
different NUMA nodes, the original LLC scan will move them to the same
node; after that, the cluster scan might put them even closer to each
other, as the sketch below illustrates.
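
As a rough sketch of that flow (scan_cluster() and scan_llc() are
placeholder names here, not the helpers used in the actual patchset):

	/*
	 * Two-level packing sketch: try the waker's cluster first, then
	 * fall back to the remaining CPUs of the LLC.
	 */
	static int select_idle_cpu_two_level(struct task_struct *p, int target)
	{
		int cpu;

		/* level 1: look for an idle cpu in the target's cluster */
		cpu = scan_cluster(p, target);
		if (cpu >= 0)
			return cpu;

		/* level 2: fall back to the rest of the LLC */
		return scan_llc(p, target);
	}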

This is essentially the two-level packing Dietmar has suggested.

So perhaps we won't have an RFC v7; I will probably send patch v1 afterwards.

> 
> Thanks.
> 
> Tim
> 
> ---
> 
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 800fa48c9fcd..2548d824f103 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -109,6 +109,7 @@ extern const struct cpumask *cpu_clustergroup_mask(int cpu);
>  #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
>  #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
>  #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
> +#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
>  #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
> 
>  extern unsigned int __max_die_per_package;

Thanks
Barry



RE: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within a die

2021-04-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Friday, March 19, 2021 11:02 PM
> To: Jonathan Cameron 
> Cc: Song Bao Hua (Barry Song) ;
> tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within
> a die
> 
> On Fri, Mar 19, 2021 at 09:36:16AM +, Jonathan Cameron wrote:
> > On Fri, 19 Mar 2021 06:57:08 +
> > "Song Bao Hua (Barry Song)"  wrote:
> >
> > > > -Original Message-
> > > > From: Greg KH [mailto:gre...@linuxfoundation.org]
> > > > Sent: Friday, March 19, 2021 7:35 PM
> > > > To: Song Bao Hua (Barry Song) 
> > > > Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com;
> w...@kernel.org;
> > > > r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> > > > t...@linutronix.de; mi...@redhat.com; l...@kernel.org;
> pet...@infradead.org;
> > > > dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> > > > mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> Jonathan
> > > > Cameron ; juri.le...@redhat.com;
> > > > mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> > > > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> > > > linux-a...@vger.kernel.org; x...@kernel.org; xuwei (O)
> ;
> > > > Zengtao (B) ; guodong...@linaro.org;
> yangyicong
> > > > ; Liguozhu (Kenneth) ;
> > > > linux...@openeuler.org; h...@zytor.com
> > > > Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs 
> > > > within
> > > > a die
> > > >
> > > > On Fri, Mar 19, 2021 at 05:16:15PM +1300, Barry Song wrote:
> > > > > diff --git a/Documentation/admin-guide/cputopology.rst
> > > > b/Documentation/admin-guide/cputopology.rst
> > > > > index b90dafc..f9d3745 100644
> > > > > --- a/Documentation/admin-guide/cputopology.rst
> > > > > +++ b/Documentation/admin-guide/cputopology.rst
> > > > > @@ -24,6 +24,12 @@ core_id:
> > > > >   identifier (rather than the kernel's).  The actual value is
> > > > >   architecture and platform dependent.
> > > > >
> > > > > +cluster_id:
> > > > > +
> > > > > + the Cluster ID of cpuX.  Typically it is the hardware platform's
> > > > > + identifier (rather than the kernel's).  The actual value is
> > > > > + architecture and platform dependent.
> > > > > +
> > > > >  book_id:
> > > > >
> > > > >   the book ID of cpuX. Typically it is the hardware platform's
> > > > > @@ -56,6 +62,14 @@ package_cpus_list:
> > > > >   human-readable list of CPUs sharing the same physical_package_id.
> > > > >   (deprecated name: "core_siblings_list")
> > > > >
> > > > > +cluster_cpus:
> > > > > +
> > > > > + internal kernel map of CPUs within the same cluster.
> > > > > +
> > > > > +cluster_cpus_list:
> > > > > +
> > > > > + human-readable list of CPUs within the same cluster.
> > > > > +
> > > > >  die_cpus:
> > > > >
> > > > >   internal kernel map of CPUs within the same die.
> > > >
> > > > Why are these sysfs files in this file, and not in a Documentation/ABI/
> > > > file which can be correctly parsed and shown to userspace?
> > >
> > > Well. Those ABIs have been there for quite a long time. It is like:
> > >
> > > [root@ceph1 topology]# ls
> > > core_id  core_siblings  core_siblings_list  physical_package_id  thread_siblings  thread_siblings_list
> > > [r

RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-04-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Wednesday, January 13, 2021 12:00 AM
> To: Morten Rasmussen ; Tim Chen
> 
> Cc: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Zengtao (B) ; tiantao (H)
> 
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 11/01/2021 10:28, Morten Rasmussen wrote:
> > On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> >>
> >>
> >> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> >>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>>> On 1/6/21 12:30 AM, Barry Song wrote:
> 
> [...]
> 
> >> I think it is going to depend on the workload.  If there are dependent
> >> tasks that communicate with one another, putting them together
> >> in the same cluster will be the right thing to do to reduce communication
> >> costs.  On the other hand, if the tasks are independent, putting them 
> >> together
> on the same cluster
> >> will increase resource contention and spreading them out will be better.
> >
> > Agree. That is exactly where I'm coming from. This is all about the task
> > placement policy. We generally tend to spread tasks to avoid resource
> > contention, SMT and caches, which seems to be what you are proposing to
> > extend. I think that makes sense given it can produce significant
> > benefits.
> >
> >>
> >> Any thoughts on what is the right clustering "tag" to use to clump
> >> related tasks together?
> >> Cgroup? Pid? Tasks with same mm?
> >
> > I think this is the real question. I think the closest thing we have at
> > the moment is the wakee/waker flip heuristic. This seems to be related.
> > Perhaps the wake_affine tricks can serve as starting point?
> 
> wake_wide() switches between packing (select_idle_sibling(), llc_size
> CPUs) and spreading (find_idlest_cpu(), all CPUs).
> 
> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
> all wakeups are (llc-)packed.
> 
>  select_task_rq_fair()
> 
>for_each_domain(cpu, tmp)
> 
>  if (tmp->flags & sd_flag)
>sd = tmp;
> 
> 
> In case we would like to further distinguish between llc-packing and
> even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
> packing vs. spreading heuristic further down in sis().
> 
> IMHO, Barry's current implementation doesn't do this right now. Instead
> he's trying to pack on cluster first and if not successful look further
> among the remaining llc CPUs for an idle CPU.

Right now, in the main cases where wake_affine achieves better
performance, processes are actually bound within one NUMA node, which
is also an LLC on kunpeng920.

Probably LLC=NUMA is also true for x86 Jacobsville, Tim?

So one possible way to approximate a two-level packing might be:
if the affinity cpusets of the waker and the wakee are both subsets
of one single LLC, we use the cluster alone as the factor to
determine packing or not, and ignore the LLC.
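
As a sketch of that gating test (cpumask_subset() and p->cpus_ptr are
real kernel APIs, but this exact check and the select_idle_cluster()
helper are illustrative only, not from any posted patch):

	/*
	 * If both waker (current) and wakee (p) are affined inside one
	 * LLC, let the cluster alone decide packing; llc_mask is assumed
	 * to be that LLC's cpumask.
	 */
	if (cpumask_subset(current->cpus_ptr, llc_mask) &&
	    cpumask_subset(p->cpus_ptr, llc_mask))
		target = select_idle_cluster(p, target);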

I haven't really done this yet, but the below code can produce the
same result by forcing llc_id = cluster_id:

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index d72eb8d..3d78097 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -107,7 +107,7 @@ int __init parse_acpi_topology(void)
 		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
-
+#if 0
 		i = acpi_find_last_cache_level(cpu);
 
 		if (i > 0) {
@@ -119,8 +119,11 @@ int __init parse_acpi_topology(void)
 			if (cache_id > 0)
 				cpu_topology[cpu].llc_id = cache_id;
 		}
-	}
+#else
+		cpu_topology[cpu].llc_id = cpu_topology[cpu].cluster_id;
+#endif
 
+	}
 	return 0;
 }
 #endif

With this, I have seen some major improvements in hackbench, especially
for the one-to-one communication model (fds_num=1, one sender for one
receiver):
numactl -N 0 hackbench -p -T -l 20 -f 1 -g $1

I have tested -g (group_nums) 6, 

RE: [PATCH v1 1/1] i2c: designware: Adjust bus_freq_hz when refuse high speed mode set

2021-03-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andriy.shevche...@linux.intel.com]
> Sent: Thursday, April 1, 2021 12:05 AM
> To: Andy Shevchenko ; Serge Semin
> ; linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: Jarkko Nikula ; Mika Westerberg
> ; w...@kernel.org; yangyicong
> ; Song Bao Hua (Barry Song) 
> 
> Subject: [PATCH v1 1/1] i2c: designware: Adjust bus_freq_hz when refuse high
> speed mode set
> 
> When hardware doesn't support High Speed Mode, we forget the bus_freq_hz
> timing adjustment. This leaves the timings and the real registers
> unsynchronized. Adjust bus_freq_hz when refusing the High Speed Mode setting.
> 
> Fixes: b6e67145f149 ("i2c: designware: Enable high speed mode")
> Reported-by: "Song Bao Hua (Barry Song)" 
> Signed-off-by: Andy Shevchenko 
> ---

Thanks for fixing that.

Reviewed-by: Barry Song 

>  drivers/i2c/busses/i2c-designware-master.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index 34bb4e21bcc3..9bfa06e31eec 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -129,6 +129,7 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
> 	if ((comp_param1 & DW_IC_COMP_PARAM_1_SPEED_MODE_MASK)
> 	    != DW_IC_COMP_PARAM_1_SPEED_MODE_HIGH) {
> 		dev_err(dev->dev, "High Speed not supported!\n");
> +		t->bus_freq_hz = I2C_MAX_FAST_MODE_FREQ;
> 		dev->master_cfg &= ~DW_IC_CON_SPEED_MASK;
> 		dev->master_cfg |= DW_IC_CON_SPEED_FAST;
> 		dev->hs_hcnt = 0;
> --
> 2.30.2



RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-03-31 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, March 24, 2021 12:15 PM
> To: 'Tim Chen' ; catalin.mari...@arm.com;
> w...@kernel.org; r...@rjwysocki.net; vincent.guit...@linaro.org; 
> b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> > -Original Message-
> > From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> > Sent: Wednesday, March 24, 2021 11:51 AM
> > To: Song Bao Hua (Barry Song) ;
> > catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> > vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> > mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> > dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> > mgor...@suse.de
> > Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> > gre...@linuxfoundation.org; Jonathan Cameron ;
> > juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> > aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> > linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> > xuwei (O) ; Zengtao (B) ;
> > guodong...@linaro.org; yangyicong ; Liguozhu
> (Kenneth)
> > ; linux...@openeuler.org; h...@zytor.com
> > Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for
> x86
> >
> >
> >
> > On 3/18/21 9:16 PM, Barry Song wrote:
> > > From: Tim Chen 
> > >
> > > There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> > > is shared among a cluster of cores instead of being exclusive
> > > to one single core.
> > >
> > > To prevent oversubscription of L2 cache, load should be
> > > balanced between such L2 clusters, especially for tasks with
> > > no shared data.
> > >
> > > Also with cluster scheduling policy where tasks are woken up
> > > in the same L2 cluster, we will benefit from keeping tasks
> > > related to each other and likely sharing data in the same L2
> > > cluster.
> > >
> > > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > > L2 cluster scheduler domain.
> > >
> > > Signed-off-by: Tim Chen 
> > > Signed-off-by: Barry Song 
> >
> >
> > Barry,
> >
> > Can you also add this chunk to the patch.
> > Thanks.
> 
> Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

Hi Tim,
You might want to take a look at this qemu patchset:
https://lore.kernel.org/qemu-devel/20210331095343.12172-1-wangyana...@huawei.com/T/#t

someone is trying to leverage this cluster topology
to improve KVM virtual machines performance.

> 
> >
> > Tim
> >
> >
> > diff --git a/arch/x86/include/asm/topology.h
> > b/arch/x86/include/asm/topology.h
> > index 2a11ccc14fb1..800fa48c9fcd 100644
> > --- a/arch/x86/include/asm/topology.h
> > +++ b/arch/x86/include/asm/topology.h
> > @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
> >
> >  #ifdef CONFIG_SMP
> >  #define topology_die_cpumask(cpu)	(per_cpu(cpu_die_map, cpu))
> > +#define topology_cluster_cpumask(cpu)	(cpu_clustergroup_mask(cpu))
> >  #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
> >  #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
> >
> 

Thanks
Barry


RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-31 Thread Song Bao Hua (Barry Song)

> No, please read the code carefully.
> We can duplicate the conditional, but it brings a bit of inconsistency to how the 
> counters are printed.

Thanks for the clarification. I am still confused: the original
code printed the real mode based on dev->master_cfg, while the new
code prints the mode based on the frequency.

My understanding is that the original code could fall back to a lower
speed when a higher speed mode was not set successfully. For
example, high speed mode falls back to fast mode:

	if ((dev->master_cfg & DW_IC_CON_SPEED_MASK) ==
	    DW_IC_CON_SPEED_HIGH) {
		if ((comp_param1 & DW_IC_COMP_PARAM_1_SPEED_MODE_MASK)
		    != DW_IC_COMP_PARAM_1_SPEED_MODE_HIGH) {
			dev_err(dev->dev, "High Speed not supported!\n");
			dev->master_cfg &= ~DW_IC_CON_SPEED_MASK;
			dev->master_cfg |= DW_IC_CON_SPEED_FAST;
			dev->hs_hcnt = 0;
			dev->hs_lcnt = 0;
		}

The original code printed the mode based on the post-fallback
dev->master_cfg, not the mode calculated from the frequency:

	switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
	case DW_IC_CON_SPEED_STD:
		mode_str = "Standard Mode";
		break;
	case DW_IC_CON_SPEED_HIGH:
		mode_str = "High Speed Mode";
		break;
	default:
		mode_str = "Fast Mode";
	}

> > +     return 0;
> >  }
> >
> >  /**
> > --
> > 2.8.1
> 
> 
> --
> With Best Regards,
> Andy Shevchenko


-- 
With Best Regards,
Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, March 31, 2021 10:54 AM
> To: 'Andy Shevchenko' 
> Cc: yangyicong ; w...@kernel.org;
> andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org;
> sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org;
> dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; John Garry ;
> mika.westerb...@linux.intel.com; Zengtao (B) ;
> Linuxarm 
> Subject: RE: [PATCH 5/5] i2c: designware: Switch over to 
> i2c_freq_mode_string()
> 
> 
> 
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Wednesday, March 31, 2021 10:57 AM
> To: Song Bao Hua (Barry Song) 
> Cc: yangyicong ; w...@kernel.org;
> andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org;
> sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org;
> dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; John Garry ;
> mika.westerb...@linux.intel.com; Zengtao (B) ;
> Linuxarm 
> Subject: Re: [PATCH 5/5] i2c: designware: Switch over to 
> i2c_freq_mode_string()
> 
> 
> 
> On Wednesday, March 31, 2021, Song Bao Hua (Barry Song)
>  wrote:
> 
> 
> > -Original Message-
> > From: yangyicong
> > Sent: Wednesday, March 31, 2021 3:19 AM
> > To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> > linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> > linux-kernel@vger.kernel.org
> > Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> > rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> > ; John Garry ;
> > mika.westerb...@linux.intel.com; yangyicong ;
> Zengtao
> > (B) ; Linuxarm 
> > Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> >
> > From: Andy Shevchenko 
> >
> > Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> >
> > Signed-off-by: Andy Shevchenko 
> > Signed-off-by: Yicong Yang 
> > ---
> >  drivers/i2c/busses/i2c-designware-master.c | 20 
> >  1 file changed, 4 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/i2c/busses/i2c-designware-master.c
> > b/drivers/i2c/busses/i2c-designware-master.c
> > index dd27b9d..b64c4c8 100644
> > --- a/drivers/i2c/busses/i2c-designware-master.c
> > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct
> dw_i2c_dev
> > *dev)
> >
> >  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
> >  {
> > -     const char *mode_str, *fp_str = "";
> >       u32 comp_param1;
> >       u32 sda_falling_time, scl_falling_time;
> >       struct i2c_timings *t = &dev->timings;
> > +     const char *fp_str = "";
> >       u32 ic_clk;
> >       int ret;
> >
> > @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> > *dev)
> >
> >       ret = i2c_dw_set_sda_hold(dev);
> >       if (ret)
> > -             goto out;
> > -
> > -     switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> > -     case DW_IC_CON_SPEED_STD:
> > -             mode_str = "Standard Mode";
> > -             break;
> > -     case DW_IC_CON_SPEED_HIGH:
> > -             mode_str = "High Speed Mode";
> > -             break;
> > -     default:
> > -             mode_str = "Fast Mode";
> > -     }
> > -     dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> > +             return ret;
> >
> > -out:
> > -     return ret;
> > +     dev_dbg(dev->dev, "Bus speed: %s\n",
> > i2c_freq_mode_string(t->bus_freq_hz));
> 
> > Weird: the original code was printing both mode and fp,
> > but you are printing mode only.
> 
> >> Sorry, but I didn’t get what you mean here. The code is equivalent, and
> >> actually it will print even more.
> 
> The original code will print the string fp_str:
> %s%s\n", mode_str, fp_str
> 
> The new code is printing mode_str only:
> %s
> 

Isn't fp_str redundant then? Or do we also need to change
dev_dbg(dev->dev, "Fast Mode:%s HCNT:LCNT = %d:%d\n", fp_str...)

> > +     return 0;
> >  }
> >
> >  /**
> > --
> > 2.8.1
> 
> 
> --
> With Best Regards,
> Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)


From: Andy Shevchenko [mailto:andy.shevche...@gmail.com] 
Sent: Wednesday, March 31, 2021 10:57 AM
To: Song Bao Hua (Barry Song) 
Cc: yangyicong ; w...@kernel.org; 
andriy.shevche...@linux.intel.com; linux-...@vger.kernel.org; 
sergey.se...@baikalelectronics.ru; linux-kernel@vger.kernel.org; 
dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com; 
rmk+ker...@armlinux.org.uk; John Garry ; 
mika.westerb...@linux.intel.com; Zengtao (B) ; 
Linuxarm 
Subject: Re: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()



On Wednesday, March 31, 2021, Song Bao Hua (Barry Song) 
 wrote:


> -Original Message-
> From: yangyicong
> Sent: Wednesday, March 31, 2021 3:19 AM
> To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> linux-kernel@vger.kernel.org
> Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> ; John Garry ;
> mika.westerb...@linux.intel.com; yangyicong ; Zengtao
> (B) ; Linuxarm 
> Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> 
> From: Andy Shevchenko 
> 
> Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> 
> Signed-off-by: Andy Shevchenko 
> Signed-off-by: Yicong Yang 
> ---
>  drivers/i2c/busses/i2c-designware-master.c | 20 
>  1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index dd27b9d..b64c4c8 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct dw_i2c_dev
> *dev)
> 
>  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
>  {
> -     const char *mode_str, *fp_str = "";
>       u32 comp_param1;
>       u32 sda_falling_time, scl_falling_time;
>       struct i2c_timings *t = &dev->timings;
> +     const char *fp_str = "";
>       u32 ic_clk;
>       int ret;
> 
> @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> *dev)
> 
>       ret = i2c_dw_set_sda_hold(dev);
>       if (ret)
> -             goto out;
> -
> -     switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> -     case DW_IC_CON_SPEED_STD:
> -             mode_str = "Standard Mode";
> -             break;
> -     case DW_IC_CON_SPEED_HIGH:
> -             mode_str = "High Speed Mode";
> -             break;
> -     default:
> -             mode_str = "Fast Mode";
> -     }
> -     dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> +             return ret;
> 
> -out:
> -     return ret;
> +     dev_dbg(dev->dev, "Bus speed: %s\n",
> i2c_freq_mode_string(t->bus_freq_hz));

> Weird: the original code was printing both mode and fp,
> but you are printing mode only.

>> Sorry, but I didn’t get what you mean here. The code is equivalent, and 
>> actually it will print even more.

The original code will print the string fp_str:
%s%s\n", mode_str, fp_str

The new code is printing mode_str only:
%s

> +     return 0;
>  }
> 
>  /**
> --
> 2.8.1


-- 
With Best Regards,
Andy Shevchenko



RE: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()

2021-03-30 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: yangyicong
> Sent: Wednesday, March 31, 2021 3:19 AM
> To: w...@kernel.org; andriy.shevche...@linux.intel.com;
> linux-...@vger.kernel.org; sergey.se...@baikalelectronics.ru;
> linux-kernel@vger.kernel.org
> Cc: dig...@gmail.com; tred...@nvidia.com; jarkko.nik...@linux.intel.com;
> rmk+ker...@armlinux.org.uk; Song Bao Hua (Barry Song)
> ; John Garry ;
> mika.westerb...@linux.intel.com; yangyicong ; Zengtao
> (B) ; Linuxarm 
> Subject: [PATCH 5/5] i2c: designware: Switch over to i2c_freq_mode_string()
> 
> From: Andy Shevchenko 
> 
> Use generic i2c_freq_mode_string() helper to print chosen bus speed.
> 
> Signed-off-by: Andy Shevchenko 
> Signed-off-by: Yicong Yang 
> ---
>  drivers/i2c/busses/i2c-designware-master.c | 20 
>  1 file changed, 4 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c
> b/drivers/i2c/busses/i2c-designware-master.c
> index dd27b9d..b64c4c8 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -35,10 +35,10 @@ static void i2c_dw_configure_fifo_master(struct dw_i2c_dev
> *dev)
> 
>  static int i2c_dw_set_timings_master(struct dw_i2c_dev *dev)
>  {
> - const char *mode_str, *fp_str = "";
>   u32 comp_param1;
>   u32 sda_falling_time, scl_falling_time;
>   struct i2c_timings *t = &dev->timings;
> + const char *fp_str = "";
>   u32 ic_clk;
>   int ret;
> 
> @@ -153,22 +153,10 @@ static int i2c_dw_set_timings_master(struct dw_i2c_dev
> *dev)
> 
>   ret = i2c_dw_set_sda_hold(dev);
>   if (ret)
> - goto out;
> -
> - switch (dev->master_cfg & DW_IC_CON_SPEED_MASK) {
> - case DW_IC_CON_SPEED_STD:
> - mode_str = "Standard Mode";
> - break;
> - case DW_IC_CON_SPEED_HIGH:
> - mode_str = "High Speed Mode";
> - break;
> - default:
> - mode_str = "Fast Mode";
> - }
> - dev_dbg(dev->dev, "Bus speed: %s%s\n", mode_str, fp_str);
> + return ret;
> 
> -out:
> - return ret;
> + dev_dbg(dev->dev, "Bus speed: %s\n",
> i2c_freq_mode_string(t->bus_freq_hz));

Weird: the original code was printing both mode and fp,
but you are printing mode only.

> + return 0;
>  }
> 
>  /**
> --
> 2.8.1



RE: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-30 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Muchun Song [mailto:songmuc...@bytedance.com]
> Sent: Tuesday, March 30, 2021 9:09 PM
> To: Michal Hocko 
> Cc: Mike Kravetz ; Linux Memory Management List
> ; LKML ; Roman Gushchin
> ; Shakeel Butt ; Oscar Salvador
> ; David Hildenbrand ; David Rientjes
> ; linmiaohe ; Peter Zijlstra
> ; Matthew Wilcox ; HORIGUCHI NAOYA
> ; Aneesh Kumar K . V ;
> Waiman Long ; Peter Xu ; Mina Almasry
> ; Hillf Danton ; Joonsoo Kim
> ; Song Bao Hua (Barry Song)
> ; Will Deacon ; Andrew Morton
> 
> Subject: Re: [External] Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq 
> safe
> spinlock
> 
> On Tue, Mar 30, 2021 at 4:01 PM Michal Hocko  wrote:
> >
> > On Mon 29-03-21 16:23:55, Mike Kravetz wrote:
> > > Ideally, cma_release could be called from any context.  However,
> > > that is not possible because a mutex is used to protect the per-area 
> > > bitmap.
> > > Change the bitmap to an irq safe spinlock.
> >
> > I would phrase the changelog slightly differently: "
> > cma_release is currently a sleepable operation because the bitmap
> > manipulation is protected by cma->lock mutex. Hugetlb code which
> > relies on cma_release for CMA backed (giga) hugetlb pages, however,
> > needs to be irq safe.
> >
> > The lock doesn't protect any sleepable operation so it can be changed
> > to a (irq aware) spin lock. The bitmap processing should be quite fast
> > in typical case but if cma sizes grow to TB then we will likely need
> > to replace the lock by a more optimized bitmap implementation.
> > "
> >
> > it seems that you are overusing irqsave variants even from contexts
> > which are never called from IRQ context, so they do not need to store
> > flags.
> >
> > [...]
> > > @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
> > >   unsigned long start = 0;
> > >   unsigned long nr_part, nr_total = 0;
> > >   unsigned long nbits = cma_bitmap_maxno(cma);
> > > + unsigned long flags;
> > >
> > > - mutex_lock(&cma->lock);
> > > + spin_lock_irqsave(&cma->lock, flags);
> >
> > spin_lock_irq should be sufficient. This is only called from the
> > allocation context and that is never called from IRQ context.
> 
> This makes me think more. I think that spin_lock should be sufficient. Right?
> 

It seems Mike's point is that cma_release might be called from both
irq context and process context.

If it is running in process context, we need the irq-disabling to
lock out the irq context, which might otherwise interrupt us on the
same CPU and call cma_release at the same time.

We have never actually seen cma_release called in irq context so far,
anyway.
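
To illustrate the classic hazard (a generic sketch only, not the cma
code itself; demo_lock stands in for cma->lock):

	static DEFINE_SPINLOCK(demo_lock);

	void process_context_path(void)
	{
		unsigned long flags;

		/*
		 * With a plain spin_lock(), an interrupt arriving on this
		 * CPU whose handler also takes demo_lock would spin forever:
		 * the handler cannot finish until we unlock, and we cannot
		 * unlock until the handler returns. Disabling local irqs
		 * while holding the lock avoids that.
		 */
		spin_lock_irqsave(&demo_lock, flags);
		/* ... bitmap manipulation ... */
		spin_unlock_irqrestore(&demo_lock, flags);
	}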

> 
> >
> > >   pr_info("number of available pages: ");
> > >   for (;;) {
> > >   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> > > @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
> > >   start = next_zero_bit + nr_zero;
> > >   }
> > >   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> > > - mutex_unlock(&cma->lock);
> > > + spin_unlock_irqrestore(&cma->lock, flags);
> > >  }
> > >  #else
> > >  static inline void cma_debug_show_areas(struct cma *cma) { }
> > > @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> > >   unsigned long pfn = -1;
> > >   unsigned long start = 0;
> > >   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> > > + unsigned long flags;
> > >   size_t i;
> > >   struct page *page = NULL;
> > >   int ret = -ENOMEM;
> > > @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
> > >   goto out;
> > >
> > >   for (;;) {
> > > - mutex_lock(&cma->lock);
> > > + spin_lock_irqsave(&cma->lock, flags);
> > >   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
> > >   bitmap_maxno, start, bitmap_count, mask,
> > >   offset);
> > >   if (bitmap_no >= bitmap_maxno) {
> > > - mutex_unlock(&cma->lock);
> > > + spin_unlock_irqrestore(&cma->lock, flags);
> > >   break;
> > >   }
> > >   bitmap_set(cma->bitma

RE: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-29 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Tuesday, March 30, 2021 12:24 PM
> To: linux...@kvack.org; linux-kernel@vger.kernel.org
> Cc: Roman Gushchin ; Michal Hocko ; Shakeel Butt
> ; Oscar Salvador ; David Hildenbrand
> ; Muchun Song ; David Rientjes
> ; linmiaohe ; Peter Zijlstra
> ; Matthew Wilcox ; HORIGUCHI NAOYA
> ; Aneesh Kumar K . V ;
> Waiman Long ; Peter Xu ; Mina Almasry
> ; Hillf Danton ; Joonsoo Kim
> ; Song Bao Hua (Barry Song)
> ; Will Deacon ; Andrew Morton
> ; Mike Kravetz 
> Subject: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock
> 
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the bitmap mutex to an irq safe spinlock.
> 
> Signed-off-by: Mike Kravetz 

It seems the mutex was only protecting bitmap operations, which
should be safe in atomic context.

Reviewed-by: Barry Song 

> ---
>  mm/cma.c   | 20 +++-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c | 10 ++
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..80875fd4487b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
>  			     unsigned int count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
> 
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
> 
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
> 
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
> 
> - mutex_init(&cma->lock);
> + spin_lock_init(&cma->lock);
> 
>  #ifdef CONFIG_CMA_DEBUGFS
>   INIT_HLIST_HEAD(&cma->mem_head);
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long start = 0;
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
> + unsigned long flags;
> 
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>   unsigned long pfn = -1;
>   unsigned long start = 0;
>   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> + unsigned long flags;
>   size_t i;
>   struct page *page = NULL;
>   int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>   goto out;
> 
>   for (;;) {
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
> 
>   pfn = cma->base_pfn + (bitmap_no

RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-24 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Wednesday, March 24, 2021 8:13 PM
> To: tiantao (H) 
> Cc: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> ; io...@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> On Wed, Mar 24, 2021 at 10:17:38AM +0800, Tian Tao wrote:
> > under some scenarios, it is necessary to compile map_benchmark
> > into module to test iommu, so this patch changed Kconfig and
> > export_symbol to implement map_benchmark compiled into module.
> >
> > On the other hand, map_benchmark is a driver, which is supposed
> > to be able to run as a module.
> >
> > Signed-off-by: Tian Tao 
> 
> Nope, we're not going to export more kthread internals for a test
> module.

The requirement comes from a colleague who is frequently changing
the map-bench code for some customized test purposes, and he doesn't
want to rebuild the kernel image and reboot every time. So I forwarded
the requirement to Tian Tao.

Right now kthread_bind() is exported, while kthread_bind_mask() seems
to be a little bit "internal" as you said. Maybe a wrapper like
kthread_bind_node() wouldn't be that "internal", compared to exposing
the cpumask directly?
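
Something like the below (kthread_bind_node() is only a suggestion and
does not exist in mainline; it would have to live in kernel/kthread.c
where __kthread_bind_mask() is visible):

	/*
	 * Hypothetical wrapper: bind a just-created kthread to the CPUs
	 * of a NUMA node without exposing cpumask internals to modules.
	 */
	void kthread_bind_node(struct task_struct *p, int node)
	{
		__kthread_bind_mask(p, cpumask_of_node(node), TASK_UNINTERRUPTIBLE);
	}
	EXPORT_SYMBOL_GPL(kthread_bind_node);
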
Anyway, we haven't found other driver users for this, so I can hardly
convince you it is worth it.

Thanks
Barry


RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: tiantao (H)
> Sent: Wednesday, March 24, 2021 3:18 PM
> To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> 
> Cc: io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org; tiantao
> (H) 
> Subject: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> under some scenarios, it is necessary to compile map_benchmark
> into module to test iommu, so this patch changed Kconfig and
> export_symbol to implement map_benchmark compiled into module.
> 
> On the other hand, map_benchmark is a driver, which is supposed
> to be able to run as a module.
> 
> Signed-off-by: Tian Tao 
> ---

Acked-by: Barry Song 

Looks sensible to me. I like the idea that map_benchmark is a
driver; it seems unreasonable to always require it to be built in.


>  kernel/dma/Kconfig | 2 +-
>  kernel/kthread.c   | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 77b4055..0468293 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
> If unsure, say N.
> 
>  config DMA_MAP_BENCHMARK
> - bool "Enable benchmarking of streaming DMA mapping"
> + tristate "Enable benchmarking of streaming DMA mapping"
>   depends on DEBUG_FS
>   help
> Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 1578973..fa4736f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct
> cpumask *mask)
>  {
>   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
>  }
> +EXPORT_SYMBOL(kthread_bind_mask);
> 
>  /**
>   * kthread_bind - bind a just-created kthread to a cpu.
> --
> 2.7.4

Thanks
Barry



RE: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86

2021-03-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Wednesday, March 24, 2021 11:51 AM
> To: Song Bao Hua (Barry Song) ;
> catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 4/4] scheduler: Add cluster scheduler level for x86
> 
> 
> 
> On 3/18/21 9:16 PM, Barry Song wrote:
> > From: Tim Chen 
> >
> > There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> > is shared among a cluster of cores instead of being exclusive
> > to one single core.
> >
> > To prevent oversubscription of L2 cache, load should be
> > balanced between such L2 clusters, especially for tasks with
> > no shared data.
> >
> > Also with cluster scheduling policy where tasks are woken up
> > in the same L2 cluster, we will benefit from keeping tasks
> > related to each other and likely sharing data in the same L2
> > cluster.
> >
> > Add CPU masks of CPUs sharing the L2 cache so we can build such
> > L2 cluster scheduler domain.
> >
> > Signed-off-by: Tim Chen 
> > Signed-off-by: Barry Song 
> 
> 
> Barry,
> 
> Can you also add this chunk to the patch.
> Thanks.

Sure, Tim, Thanks. I'll put that into patch 4/4 in v6.

> 
> Tim
> 
> 
> diff --git a/arch/x86/include/asm/topology.h
> b/arch/x86/include/asm/topology.h
> index 2a11ccc14fb1..800fa48c9fcd 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -115,6 +115,7 @@ extern unsigned int __max_die_per_package;
> 
>  #ifdef CONFIG_SMP
>  #define topology_die_cpumask(cpu)	(per_cpu(cpu_die_map, cpu))
> +#define topology_cluster_cpumask(cpu)	(cpu_clustergroup_mask(cpu))
>  #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
>  #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
> 

Thanks
Barry




RE: [Linuxarm] Re: [PATCH] sched/fair: remove redundant test_idle_cores for non-smt

2021-03-21 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Li, Aubrey [mailto:aubrey...@linux.intel.com]
> Sent: Monday, March 22, 2021 5:37 PM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mi...@redhat.com; pet...@infradead.org;
> juri.le...@redhat.com; dietmar.eggem...@arm.com; rost...@goodmis.org;
> bseg...@google.com; mgor...@suse.de
> Cc: valentin.schnei...@arm.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; xuwei (O) ; Zengtao (B)
> ; guodong...@linaro.org; yangyicong
> ; Liguozhu (Kenneth) ;
> linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH] sched/fair: remove redundant test_idle_cores
> for non-smt
> 
> Hi Barry,
> 
> On 2021/3/21 6:14, Barry Song wrote:
> > update_idle_core() is only done for the sched_smt_present case,
> > but test_idle_cores() is done for all machines, even those without
> > SMT.
> 
> The patch looks good to me.
> May I know for what case we need to keep CONFIG_SCHED_SMT for non-smt
> machines?


Hi Aubrey,

I think the arm64 defconfig has always enabled CONFIG_SCHED_SMT:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/configs/defconfig

It is probably true for x86 as well.

I don't think Linux distributions will build a separate kernel
for machines without SMT. So basically the kernel depends on
parsing the topology at runtime to figure out whether SMT is
present, rather than depending on a rebuild.


> 
> Thanks,
> -Aubrey
> 
> 
> > This could contribute up to an 8%+ hackbench performance loss on a
> > machine like Kunpeng 920, which has no SMT. This patch removes the
> > redundant test_idle_cores() for non-SMT machines.
> >
> > we run the below hackbench with different -g parameter from 2 to
> > 14, for each different g, we run the command 10 times and get the
> > average time:
> > $ numactl -N 0 hackbench -p -T -l 2 -g $1
> >
> > hackbench will report the time which is needed to complete a certain
> > number of messages transmissions between a certain number of tasks,
> > for example:
> > $ numactl -N 0 hackbench -p -T -l 2 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)
> > Each sender will pass 2 messages of 100 bytes
> >
> > The below is the result of hackbench w/ and w/o this patch:
> > g=        2      4      6      8     10      12      14
> > w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> > w/ : 1.8428 3.7436 5.4501 6.9522 8.2882  9.9535 11.3367
> >                           +4.1%  +8.3%   +7.3%   +6.3%
> >
> > Signed-off-by: Barry Song 
> > ---
> >  kernel/sched/fair.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2e2ab1e..de42a32 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def)
> >  {
> > struct sched_domain_shared *sds;
> >
> > -   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > -   if (sds)
> > -   return READ_ONCE(sds->has_idle_cores);
> > +   if (static_branch_likely(&sched_smt_present)) {
> > +   sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +   if (sds)
> > +   return READ_ONCE(sds->has_idle_cores);
> > +   }
> >
> > return def;
> >  }

Thanks
Barry



RE: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before scanning the whole llc

2021-03-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, March 19, 2021 5:16 PM
> To: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com; Song Bao Hua
> (Barry Song) 
> Subject: [RFC PATCH v5 3/4] scheduler: scan idle cpu in cluster before 
> scanning
> the whole llc
> 
> On kunpeng920, cpus within one cluster can communicate with each other
> much faster than cpus across different clusters. A simple hackbench
> can prove that.
> hackbench running on 4 cpus in single one cluster and 4 cpus in
> different clusters shows a large contrast:
> (1) within a cluster:
> root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 2 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 2 messages of 100 bytes
> Time: 4.285
> 
> (2) across clusters:
> root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 2 -g 1
> Running in threaded mode with 1 groups using 40 file descriptors each
> (== 40 tasks)
> Each sender will pass 2 messages of 100 bytes
> Time: 5.524
> 
> This inspires us to change the wake_affine path to scan the cluster
> before scanning the whole LLC, trying to gather related tasks into one
> cluster, which is what this patch does.
> 
> To evaluate the performance impact to related tasks talking with each
> other, we run the below hackbench with different -g parameter from 2
> to 14, for each different g, we run the command 10 times and get the
> average time:
> $ numactl -N 0 hackbench -p -T -l 2 -g $1
> 
> hackbench will report the time which is needed to complete a certain number
> of messages transmissions between a certain number of tasks, for example:
> $ numactl -N 0 hackbench -p -T -l 2 -g 10
> Running in threaded mode with 10 groups using 40 file descriptors each
> (== 400 tasks)
> Each sender will pass 2 messages of 100 bytes
> 
> The below is the result of hackbench w/ and w/o cluster patch:
> g=        2      4      6      8     10      12      14
> w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
> w/ : 1.7881 3.7371 5.3301 6.9747 8.6909  9.9235 11.2608
> 
> Obviously some recent commits have improved hackbench, so the change
> in the wake_affine path brings a smaller gain on hackbench compared to
> what we got in RFC v4.
> And it is much more tricky to leverage wake_affine than to leverage
> the spreading of tasks in the previous patch, as load balancing might
> pull apart tasks that have been packed in a cluster, so alternative
> suggestions are welcome.
> 
> In order to figure out how many times cpu is picked from the cluster and
> how many times cpu is picked out of the cluster, a tracepoint for debug
> purpose is added in this patch. And an userspace bcc script to print the
> histogram of the result of select_idle_cpu():
> #!/usr/bin/python
> #
> # selectidlecpu.py	select idle cpu histogram.
> #
> # A Ctrl-C will print the gathered histogram then exit.
> #
> # 18-March-2021 Barry Song Created this.
> 
> from __future__ import print_function
> from bcc import BPF
> from time import sleep
> 
> # load BPF program
> b = BPF(text="""
> 
> BPF_HISTOGRAM(dist);
> 
> TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
> {
>   u32 e;
>   if (args->idle / 4 == args->target/4)
>   e = 0; /* idle cpu from same cluster */

Oops here: since -1/4 = 1/4 = 2/4 = 3/4 = 0 in integer division,
part of the -1 (no idle cpu) case is counted here (local cluster)
incorrectly.

>   else if (args->idle != -1)
>   e = 1; /* idle cpu from different clusters */
>   else
>   e = 2; /* no idle cpu */
> 
>   dist.increment(e);
>   return 0;
> }
> """)

Fixed it to:

TRACEPOINT_PROBE(sched, sched_select_idle_cpu)
{
u32 e;
if (args->idle == -1)
e = 2; /* no idle cpu */
else if (args->idle / 4 == args->target / 4)
e = 0; /* idle cpu from same cluster */
else
e = 1; /* idle cpu fr

RE: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within a die

2021-03-19 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Friday, March 19, 2021 7:35 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com; Jonathan
> Cameron ; juri.le...@redhat.com;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; x...@kernel.org; xuwei (O) ;
> Zengtao (B) ; guodong...@linaro.org; yangyicong
> ; Liguozhu (Kenneth) ;
> linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v5 1/4] topology: Represent clusters of CPUs within
> a die
> 
> On Fri, Mar 19, 2021 at 05:16:15PM +1300, Barry Song wrote:
> > diff --git a/Documentation/admin-guide/cputopology.rst
> b/Documentation/admin-guide/cputopology.rst
> > index b90dafc..f9d3745 100644
> > --- a/Documentation/admin-guide/cputopology.rst
> > +++ b/Documentation/admin-guide/cputopology.rst
> > @@ -24,6 +24,12 @@ core_id:
> > identifier (rather than the kernel's).  The actual value is
> > architecture and platform dependent.
> >
> > +cluster_id:
> > +
> > +   the Cluster ID of cpuX.  Typically it is the hardware platform's
> > +   identifier (rather than the kernel's).  The actual value is
> > +   architecture and platform dependent.
> > +
> >  book_id:
> >
> > the book ID of cpuX. Typically it is the hardware platform's
> > @@ -56,6 +62,14 @@ package_cpus_list:
> > human-readable list of CPUs sharing the same physical_package_id.
> > (deprecated name: "core_siblings_list")
> >
> > +cluster_cpus:
> > +
> > +   internal kernel map of CPUs within the same cluster.
> > +
> > +cluster_cpus_list:
> > +
> > +   human-readable list of CPUs within the same cluster.
> > +
> >  die_cpus:
> >
> > internal kernel map of CPUs within the same die.
> 
> Why are these sysfs files in this file, and not in a Documentation/ABI/
> file which can be correctly parsed and shown to userspace?

Well. Those ABIs have been there for quite a long time. It is like:

[root@ceph1 topology]# ls
core_id  core_siblings  core_siblings_list  physical_package_id  thread_siblings  thread_siblings_list
[root@ceph1 topology]# pwd
/sys/devices/system/cpu/cpu100/topology
[root@ceph1 topology]# cat core_siblings_list
64-127
[root@ceph1 topology]#

> 
> Any chance you can fix that up here as well?

Yes, we will send a separate patch to address this; it won't be in
this patchset. This patchset will be based on that one.

> 
> Also note that "list" is not something that goes in sysfs, sysfs is "one
> value per file", and a list is not "one value".  How do you prevent
> overflowing the buffer of the sysfs file if you have a "list"?
> 

At a glance, the list is using a range ("-") rather than a literal list:
[root@ceph1 topology]# cat core_siblings_list
64-127

Anyway, I will take a look at whether it has any chance to overflow.

> thanks,
> 
> greg k-h

Thanks
Barry



RE: [PATCH] tty: serial: samsung_tty: remove spinlock flags in interrupt handlers

2021-03-19 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Tuesday, March 16, 2021 10:41 PM
> To: Johan Hovold ; Finn Thain ;
> Song Bao Hua (Barry Song) 
> Cc: Krzysztof Kozlowski ; Greg
> Kroah-Hartman ; Jiri Slaby ;
> linux-arm Mailing List ; Linux Samsung
> SOC ; open list:SERIAL DRIVERS
> ; Linux Kernel Mailing List
> ; Hector Martin ; Arnd
> Bergmann 
> Subject: Re: [PATCH] tty: serial: samsung_tty: remove spinlock flags in
> interrupt handlers
> 
> On Tue, Mar 16, 2021 at 11:02 AM Johan Hovold  wrote:
> >
> > On Mon, Mar 15, 2021 at 07:12:12PM +0100, Krzysztof Kozlowski wrote:
> > > Since interrupt handler is called with disabled local interrupts, there
> > > is no need to use the spinlock primitives disabling interrupts as well.
> >
> > This isn't generally true due to "threadirqs" and that can lead to
> > deadlocks if the console code is called from hard irq context.
> >
> > Now, this is *not* the case for this particular driver since it doesn't
> > even bother to take the port lock in console_write(). That should
> > probably be fixed instead.
> >
> > See https://lore.kernel.org/r/X7kviiRwuxvPxC8O@localhost.
> 
> Finn, Barry, something to check I think?

My understanding is that spin_lock_irqsave() can't protect against a
console_write() called in hardirq context in the threaded_irq case,
mainly in preempt-rt scenarios, as spin_lock_irqsave() doesn't disable
irqs at all there.
See:
https://www.kernel.org/doc/html/latest/locking/locktypes.html

  spinlock_t and PREEMPT_RT
  On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
  based on rt_mutex which changes the semantics:
   - Preemption is not disabled.
   - The hard interrupt related suffixes for spin_lock / spin_unlock
     operations (_irq, _irqsave / _irqrestore) do not affect the CPU’s
     interrupt disabled state.

So if console_write() can interrupt our code in hardirq, we should
move to raw_spin_lock_irqsave() for this driver.

I think it is almost always wrong to call spin_lock_irqsave() in hardirq.
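
For illustration, a sketch of what an RT-safe write path could look
like (foo_port_lock and foo_console_write() are placeholders, not the
samsung driver; its console_write() currently takes no port lock at
all, which is the problem Johan points at):

	static DEFINE_RAW_SPINLOCK(foo_port_lock);

	static void foo_console_write(struct console *co, const char *s,
				      unsigned int n)
	{
		unsigned long flags;

		/* a raw lock really disables irqs, even on PREEMPT_RT */
		raw_spin_lock_irqsave(&foo_port_lock, flags);
		/* ... emit the n characters of s to the UART ... */
		raw_spin_unlock_irqrestore(&foo_port_lock, flags);
	}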

> 
> --
> With Best Regards,
> Andy Shevchenko

Thanks
Barry


RE: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters

2021-03-16 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Tuesday, March 2, 2021 11:43 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: Re: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
> 
> On Tue, Mar 02, 2021 at 11:59:39AM +1300, Barry Song wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 88a2e2b..d805e59 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7797,6 +7797,16 @@ int sched_cpu_activate(unsigned int cpu)
> > if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > static_branch_inc_cpuslocked(&sched_smt_present);
> >  #endif
> > +
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +   /*
> > +* When going up, increment the number of cluster cpus with
> > +* cluster present.
> > +*/
> > +   if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > +   static_branch_inc_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > set_cpu_active(cpu, true);
> >
> > if (sched_smp_initialized) {
> > @@ -7873,6 +7883,14 @@ int sched_cpu_deactivate(unsigned int cpu)
> > static_branch_dec_cpuslocked(&sched_smt_present);
> >  #endif
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +   /*
> > +* When going down, decrement the number of cpus with cluster present.
> > +*/
> > +   if (cpumask_weight(cpu_cluster_mask(cpu)) > 1)
> > +   static_branch_dec_cpuslocked(&sched_cluster_present);
> > +#endif
> > +
> > if (!sched_smp_initialized)
> > return 0;
> 
> I don't think that's correct. IIUC this will mean the
> sched_cluster_present thing will be enabled on anything with SMT (very
> much including x86 big cores after the next patch).
> 
> I'm thinking that at the very least you should check a CLS domain
> exists, but that might be hard at this point, because the sched domains
> haven't been build yet.

might be able to achieve the same goal by:

int cls_wt = cpumask_weight(cpu_cluster_mask(cpu));

if ((cls_wt > cpumask_weight(cpu_smt_mask(cpu))) &&
    (cls_wt < cpumask_weight(cpu_coregroup_mask(cpu))))
	/* mark sched_cluster_present ... */

> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b..3db7b07 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6009,6 +6009,11 @@ static inline int __select_idle_cpu(int cpu)
> > return -1;
> >  }
> >
> > +#ifdef CONFIG_SCHED_CLUSTER
> > +DEFINE_STATIC_KEY_FALSE(sched_cluster_present);
> > +EXPORT_SYMBOL_GPL(sched_cluster_present);
> 
> I really rather think this shouldn't be exported

Ok. Makes sense.

> 
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_SMT
> >  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> >  EXPORT_SYMBOL_GPL(sched_smt_present);
> 
> This is a KVM wart, it needs to know because mitigation crap.
> 

Ok.

> > @@ -6116,6 +6121,26 @@ static inline int select_idle_core(struct task_struct
> *p, int core, struct cpuma
> >
> >  #endif /* CONFIG_SCHED_SMT */
> >
> > +static inline int _select_idle_cpu(bool smt, struct task_struct *p, int
> target, struct cpumask *cpus, int *idle_cpu, int *nr)
> > +{
> > +   int cpu, i;
> > +
> > +   for_each_cpu_wrap(cpu, cpus, target) {
> > +   if (smt) {
> > +   i = select_idle_core(p, cpu, cpus, idle_cpu);
> > +   } else {
> > +   if (!--*nr)
> > +   return -1;
> > +   i = __select_idle_cpu(cpu);
> > +   }
> > +
> > +   if ((unsigned int)i < nr_cpumask_bits)
> > +   return i;
> > +   }
> > +
> > +   return -1;
> > +}
> > +
> >  /*
> >   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >   * comparing the

RE: [RFC PATCH v4 1/3] topology: Represent clusters of CPUs within a die.

2021-03-14 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, March 2, 2021 12:00 PM
> To: tim.c.c...@linux.intel.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; b...@alien8.de;
> t...@linutronix.de; mi...@redhat.com; l...@kernel.org; pet...@infradead.org;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de
> Cc: msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com; Jonathan
> Cameron ; Song Bao Hua (Barry Song)
> 
> Subject: [RFC PATCH v4 1/3] topology: Represent clusters of CPUs within a die.
> 
> From: Jonathan Cameron 
> 
> Both ACPI and DT provide the ability to describe additional layers of
> topology between that of individual cores and higher level constructs
> such as the level at which the last level cache is shared.
> In ACPI this can be represented in PPTT as a Processor Hierarchy
> Node Structure [1] that is the parent of the CPU cores and in turn
> has a parent Processor Hierarchy Nodes Structure representing
> a higher level of topology.
> 
> For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> has local L3 tag. On the other hand, each clusters will share some
> internal system bus.
> 
> [ASCII diagram: each cluster contains 4 CPUs and has its own local
> L3 tag block; all clusters in the NUMA node share the L3 data.]

RE: [Linuxarm] Re: [RFC PATCH v4 3/3] scheduler: Add cluster scheduler level for x86

2021-03-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Thursday, March 4, 2021 7:34 AM
> To: Peter Zijlstra ; Song Bao Hua (Barry Song)
> 
> Cc: catalin.mari...@arm.com; w...@kernel.org; r...@rjwysocki.net;
> vincent.guit...@linaro.org; b...@alien8.de; t...@linutronix.de;
> mi...@redhat.com; l...@kernel.org; dietmar.eggem...@arm.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> msys.miz...@gmail.com; valentin.schnei...@arm.com;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> juri.le...@redhat.com; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org; x...@kernel.org;
> xuwei (O) ; Zengtao (B) ;
> guodong...@linaro.org; yangyicong ; Liguozhu (Kenneth)
> ; linux...@openeuler.org; h...@zytor.com
> Subject: [Linuxarm] Re: [RFC PATCH v4 3/3] scheduler: Add cluster scheduler
> level for x86
> 
> 
> 
> On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> > On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> >> From: Tim Chen 
> >>
> >> There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce
> >> is shared among a cluster of cores instead of being exclusive
> >> to one single core.
> >
> > Isn't that most atoms one way or another? Tremont seems to have it per 4
> > cores, but earlier it was per 2 cores.
> >
> 
> Yes, older Atoms have 2 cores sharing L2.  I probably should
> rephrase my comments to not leave the impression that sharing
> L2 among cores is new for Atoms.
> 
> Tremont based Atom CPUs increases the possible load imbalance more
> with 4 cores per L2 instead of 2.  And also with more overall cores on a die,
> the
> chance increases for packing running tasks on a few clusters while leaving
> others empty on light/medium loaded systems.  We did see
> this effect on Jacobsville.
> 
> So load balancing between the L2 clusters is more
> useful on Tremont based Atom CPUs compared to the older Atoms.

It seems sensible that the more CPUs we get in a cluster, the more we
need the kernel to be aware of its existence.

Tim, is it possible for you to bring up cpu_cluster_mask and
cluster_sibling for x86 so that the topology can be represented in
sysfs and used by the scheduler? It seems your patch lacks this part.

BTW, I wonder if x86 could improve your KMP_AFFINITY by leveraging
the cluster topology level:
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

KMP_AFFINITY has thread affinity modes like compact and scatter; it
seems "compact" and "scatter" could also use the cluster information,
since, as you can see, we are struggling with the same "compact" vs.
"scatter" trade-off in this patchset :-)

Thanks
Barry


RE: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters

2021-03-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Vincent Guittot [mailto:vincent.guit...@linaro.org]
> Sent: Tuesday, March 9, 2021 12:26 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Tim Chen ; Catalin Marinas
> ; Will Deacon ; Rafael J. Wysocki
> ; Borislav Petkov ; Thomas Gleixner
> ; Ingo Molnar ; Cc: Len Brown
> ; Peter Zijlstra ; Dietmar Eggemann
> ; Steven Rostedt ; Ben Segall
> ; Mel Gorman ; Juri Lelli
> ; Mark Rutland ; Aubrey Li
> ; H. Peter Anvin ; Zengtao (B)
> ; Guodong Xu ;
> gre...@linuxfoundation.org; Sudeep Holla ; linux-kernel
> ; linux...@openeuler.org; ACPI Devel Maling
> List ; xuwei (O) ; Jonathan
> Cameron ; yangyicong ;
> x86 ; msys.miz...@gmail.com; Liguozhu (Kenneth)
> ; Valentin Schneider ;
> LAK 
> Subject: Re: [RFC PATCH v4 2/3] scheduler: add scheduler level for clusters
> 
> On Tue, 2 Mar 2021 at 00:08, Barry Song  wrote:
> >
> > ARM64 chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > has local L3 tag. On the other hand, each clusters will share some
> > internal system bus. This means cache coherence overhead inside one
> > cluster is much less than the overhead across clusters.
> >
> > This patch adds the sched_domain for clusters. On kunpeng 920, without
> > this patch, domain0 of cpu0 would be MC with cpu0~cpu23 with ; with this
> > patch, MC becomes domain1, a new domain0 "CLS" including cpu0-cpu3.
> >
> > This will help spread unrelated tasks among clusters, thus decrease the
> > contention and improve the throughput, for example, stream benchmark can
> > improve around 4.3%~6.3% by this patch:
> >
> > w/o patch:
> > numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.36 nanoseconds
> > STREAM copy bandwidth: 57072.50 MB/sec
> > STREAM scale latency: 3.40 nanoseconds
> > STREAM scale bandwidth: 56542.52 MB/sec
> > STREAM add latency: 5.10 nanoseconds
> > STREAM add bandwidth: 56482.83 MB/sec
> > STREAM triad latency: 5.14 nanoseconds
> > STREAM triad bandwidth: 56069.52 MB/sec
> >
> > w/ patch:
> > $ numactl -N 0 /usr/lib/lmbench/bin/stream -P 12 -M 1024M -N 5
> > STREAM copy latency: 3.22 nanoseconds
> > STREAM copy bandwidth: 59660.96 MB/sec->  +4.5%
> > STREAM scale latency: 3.25 nanoseconds
> > STREAM scale bandwidth: 59002.29 MB/sec   ->  +4.3%
> > STREAM add latency: 4.80 nanoseconds
> > STREAM add bandwidth: 60036.62 MB/sec ->  +6.3%
> > STREAM triad latency: 4.86 nanoseconds
> > STREAM triad bandwidth: 59228.30 MB/sec   ->  +5.6%
> >
> > On the other hand, while doing WAKE_AFFINE, this patch will try to find
> > a core in the target cluster before scanning the whole llc domain. So it
> > helps gather related tasks within one cluster.
> 
> Could you split this patch in 2 patches ? One for adding a cluster
> sched domain level and one for modifying the wake up path ?

Yes. If this is helpful, I would like to split it into two patches.

> 
> This would ease the review and I would be curious about the impact of
> each feature in the performance. In particular, I'm still not
> convinced that the modification of the wakeup path is the root of the
> hackbench improvement; especially with g=14 where there should not be
> much idle CPUs with 14*40 tasks on at most 32 CPUs.  IIRC, there was

My understanding is that threads can be blocked on the pipes, so CPUs
still have some chance to be idle even with a large g. Also note the
default g of hackbench is 10.

Anyway, I'd like to add some tracepoints to get the percentages of how
many CPUs are picked from within the cluster and how many are selected
from outside it.
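
Something as simple as the below would do; this is a throwaway
debugging sketch, not part of the series, where "i" is the chosen cpu,
"target" the wakeup target, and cpu_cluster_mask() the mask this
series adds:

	if (cpumask_test_cpu(i, cpu_cluster_mask(target)))
		trace_printk("cpu%d picked inside target cluster\n", i);
	else
		trace_printk("cpu%d picked outside target cluster\n", i);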

> no obvious improvement with the changes in select_idle_cpu unless you
> hack the behavior to not fall back to llc domain
> 

You have a good memory. I once mentioned that in a very old version.
But at that time, I didn't decrease nr after scanning the cluster, so
it was scanning at least 8 cpus (4 within the cluster, 4 outside). I
guess that is why my hack of not falling back to the llc domain could
bring an actual hackbench improvement.

> > we run the below hackbench with different -g parameter from 2 to 14, for
> > each different g, we run the command 10 times and get the average time
> > $ numactl -N 0 hackbench -p -T -l 2 -g $1
> >
> > hackbench will report the time which is needed to complete a certain number
> > of messages transmissions between a certain number of tasks, for example:
> > $ numactl -N 0 hackbench -p -T -l 2 -g 10
> > Running in threaded mode with 10 groups using 40 file descriptors each
> > (== 400 tasks)

RE: [PATCH] sched/topology: remove redundant cpumask_and in init_overlap_sched_group

2021-03-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Saturday, March 6, 2021 12:49 AM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mi...@redhat.com; pet...@infradead.org;
> juri.le...@redhat.com; dietmar.eggem...@arm.com; rost...@goodmis.org;
> bseg...@google.com; mgor...@suse.de
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org; Song Bao Hua (Barry
> Song) 
> Subject: Re: [PATCH] sched/topology: remove redundant cpumask_and in
> init_overlap_sched_group
> 
> On 05/03/21 11:29, Barry Song wrote:
> > mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
> > it must be a subset of sched_group_span(sg).
> 
> So we should indeed have
> 
>   cpumask_subset(sched_group_span(sg), mask)
> 
> but that doesn't imply
> 
>   cpumask_first(sched_group_span(sg)) == cpumask_first(mask)
> 
> does it? I'm thinking if in your topology of N CPUs, CPUs 0 and N-1 are the
> furthest away, you will most likely hit

It is true that

cpumask_first(sched_group_span(sg)) != cpumask_first(mask)

but

cpumask_first_and(sched_group_span(sg), mask) == cpumask_first(mask)

since mask is always a subset of sched_group_span(sg).

/**
 * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
 * @src1p: the first input
 * @src2p: the second input
 *
 * Returns >= nr_cpu_ids if no cpus set in both.  See also cpumask_next_and().
 */

*srcp2 is a subset of *srcp1, so *srcp1 & *srcp2 == *srcp2
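
A concrete illustration with made-up masks:

/*
 * Suppose sched_group_span(sg) = 0-7 and mask = 4-5. Then:
 *
 *   cpumask_first(sched_group_span(sg))            == 0
 *   cpumask_first(mask)                            == 4
 *   cpumask_first_and(sched_group_span(sg), mask)  == 4
 *
 * ANDing a subset with its superset never clears a bit of the
 * subset, so the extra AND step can't change the result.
 */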

> 
>   !cpumask_equal(sg_pan, sched_domain_span(sibling->child))
>  ^^^
>  CPUN-1CPU0
> 
> which should be the case on your Kunpeng920 system.
> 
> > Though cpumask_first_and
> > doesn't lead to a wrong result of balance cpu, it is pointless to do
> > cpumask_and again.
> >
> > Signed-off-by: Barry Song 
> > ---
> >  kernel/sched/topology.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 12f8058..45f3db2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -934,7 +934,7 @@ static void init_overlap_sched_group(struct sched_domain
> *sd,
> > int cpu;
> >
> > build_balance_mask(sd, sg, mask);
> > -   cpu = cpumask_first_and(sched_group_span(sg), mask);
> > +   cpu = cpumask_first(mask);
> >
> > sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> > if (atomic_inc_return(&sg->sgc->ref) == 1)
> > --
> > 1.8.3.1

Thanks
Barry



RE: [Linuxarm] [PATCH v1] drm/nouveau/device: append a NUL-terminated character for the string which filled by strncpy()

2021-02-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Luo Jiaxing [mailto:luojiax...@huawei.com]
> Sent: Friday, February 26, 2021 12:39 AM
> To: nouv...@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> bske...@redhat.com
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org; luojiaxing
> 
> Subject: [Linuxarm] [PATCH v1] drm/nouveau/device: append a NUL-terminated
> character for the string which filled by strncpy()
> 
> Following warning is found when using W=1 to build kernel:
> 
> In function ‘nvkm_udevice_info’,
> inlined from ‘nvkm_udevice_mthd’ at
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:195:10:
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:164:2: warning: ‘strncpy’
> specified bound 16 equals destination size [-Wstringop-truncation]
>   164 |  strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip));
> drivers/gpu/drm/nouveau/nvkm/engine/device/user.c:165:2: warning: ‘strncpy’
> specified bound 64 equals destination size [-Wstringop-truncation]
>   165 |  strncpy(args->v0.name, device->name, sizeof(args->v0.name));
> 
> The reason of this warning is strncpy() does not guarantee that the
> destination buffer will be NUL terminated. If the length of source string
> is bigger than number we set by third input parameter, only first [number]
> of characters is copied to the destination, and no NUL-terminated is
> automatically added. There are some potential risks.
> 
> Signed-off-by: Luo Jiaxing 
> ---
>  drivers/gpu/drm/nouveau/nvkm/engine/device/user.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> b/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> index fea9d8f..2a32fe0 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/user.c
> @@ -161,8 +161,10 @@ nvkm_udevice_info(struct nvkm_udevice *udev, void *data,
> u32 size)
>   if (imem && args->v0.ram_size > 0)
>   args->v0.ram_user = args->v0.ram_user - imem->reserved;
> 
> - strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip));
> - strncpy(args->v0.name, device->name, sizeof(args->v0.name));
> + strncpy(args->v0.chip, device->chip->name, sizeof(args->v0.chip) - 1);
> + args->v0.chip[sizeof(args->v0.chip) - 1] = '\0';
> + strncpy(args->v0.name, device->name, sizeof(args->v0.name) - 1);
> + args->v0.name[sizeof(args->v0.name) - 1] = '\0';


Isn't it better to use snprintf()?
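
For instance, something like the sketch below; snprintf() always
NUL-terminates and also avoids the -Wstringop-truncation warning:

	snprintf(args->v0.chip, sizeof(args->v0.chip), "%s",
		 device->chip->name);
	snprintf(args->v0.name, sizeof(args->v0.name), "%s",
		 device->name);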

>   return 0;
>  }
> 
Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-24 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 24, 2021 6:21 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization
> for SCSI drivers
> 
> On Tue, 23 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > >
> > > Regarding m68k, your analysis overlooks the timing issue. E.g. patch
> > > 11/32 could be a problem because removing the irqsave would allow PDMA
> > > transfers to be interrupted. Aside from the timing issues, I agree
> > > with your analysis above regarding m68k.
> >
> > You mentioned you need realtime so you want an interrupt to be able to
> > preempt another one.
> 
> That's not what I said. But for the sake of discussion, yes, I do know
> people who run Linux on ARM hardware (if Android vendor kernels can be
> called "Linux") and who would benefit from realtime support on those
> devices.

Realtime is definitely a genuine requirement on ARM Linux.

I once talked/worked with some people who were using ARM for realtime
systems. The feasible approaches include:
1. Dual OS (RTOS + Linux): e.g. QNX+Linux, XENOMAI+Linux, L4+Linux
2. preempt-rt, which is continuously maintained:
https://lore.kernel.org/lkml/20210218201041.65fknr7bdplwq...@linutronix.de/
3. bootargs isolcpus= to isolate a cpu for a specific realtime task or
interrupt:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime
4. ARM FIQ, which has a separate fiq API; an example is in fsl sound:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/sound/soc/fsl/imx-pcm-fiq.c
5. Making one core invisible to Linux and running a bare-metal program
or an RTOS on that core

Honestly, I've never seen anyone depend on irq priorities to support
realtime in ARM Linux, though ARM RTOSes use them quite commonly.

> 
> > Now you said you want an interrupt not to be preempted as it will make a
> > timing issue.
> 
> mac_esp deliberately constrains segment sizes so that it can harmlessly
> disable interrupts for the duration of the transfer.
> 
> Maybe the irqsave in this driver is over-cautious. Who knows? The PDMA
> timing problem relates to SCSI bus signalling and the tolerance of real-
> world SCSI devices to same. The other problem is that the PDMA logic
> circuit is undocumented hardware. So there may be further timing
> requirements lurking there. Therefore, patch 11/32 is too risky.
> 
> > If this PDMA transfer will have some problem when it is preempted, I
> > believe we need some enhanced ways to handle this, otherwise, once we
> > enable preempt_rt or threaded_irq, it will get the timing issue. so here
> > it needs a clear comment and IRQF_NO_THREAD if this is the case.
> >
> 
> People who require fast response times cannot expect random drivers or
> platforms to meet such requirements. I fear you may be asking too much
> from Mac Quadra machines.

Once preempt_rt is enabled, those who want a fast irq environment need
the IRQF_NO_THREAD flag, or need to raise the irq thread's
SCHED_FIFO/SCHED_RR priority.
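
For example, a latency-critical handler can opt out of forced
threading at request time (my_dev_isr and my_dev are hypothetical
names):

	/* stay in hard irq context even under threadirqs/PREEMPT_RT */
	ret = request_irq(irq, my_dev_isr, IRQF_NO_THREAD,
			  "my-dev", my_dev);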

> 
> > >
> > > With regard to other architectures and platforms, in specific cases,
> > > e.g. where there's never more than one IRQ involved, then I could
> > > agree that your assumptions probably hold and an irqsave would be
> > > probably redundant.
> > >
> > > When you find a redundant irqsave, to actually patch it would bring a
> > > risk of regression with little or no reward. It's not my place to veto
> > > this entire patch series on that basis but IMO this kind of churn is
> > > misguided.
> >
> > Nope.
> >
> > I would say the real misguidance is that the code adds one lock while it
> > doesn't need the lock. Easily we can add redundant locks or exaggerate
> > the coverage range of locks, but the smarter way is that people add
> > locks only when they really need the lock by considering concurrency and
> > realtime performance.
> >
> 
> You appear to be debating a strawman. No-one is advocating excessive
> locking in new code.
> 

I actually meant that most irqsave(s) in hardirq were added carelessly.
When an irq handler and threads can access the same data, people add
irqsave in the threads; that is perfectly good, as it blocks the irq.
But people were likely to put an irqsave in the irq handler without
any thinking.

We do have some drivers doing that with a clear intention, as in your
sonic_interrupt(), but I bet most were done aimlessly.

Anyway, the debate is long enough, let's move to some more important
things. I appreciate that you shared a lot of knowledge of m68k.

Thanks
Barry


RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, February 23, 2021 9:30 PM
> To: 'Jan Kiszka' ; kieran.bing...@ideasonboard.com;
> cor...@lwn.net; linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: RE: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> 
> 
> > -Original Message-
> > From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> > Sent: Tuesday, February 23, 2021 8:27 PM
> > To: Song Bao Hua (Barry Song) ;
> > kieran.bing...@ideasonboard.com; cor...@lwn.net; linux-...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> > Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by
> x86
> >
> > On 22.02.21 22:18, Song Bao Hua (Barry Song) wrote:
> > >
> > >
> > >> -Original Message-
> > >> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> > >> Sent: Tuesday, February 23, 2021 12:06 AM
> > >> To: Song Bao Hua (Barry Song) ; 
> > >> cor...@lwn.net;
> > >> linux-...@vger.kernel.org; jan.kis...@siemens.com
> > >> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> > >> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported
> by
> > x86
> > >>
> > >> Hi Barry
> > >>
> > >> On 21/02/2021 21:35, Barry Song wrote:
> > >>> lx_current depends on the per_cpu current_task which exists on x86 only:
> > >>>
> > >>> arch$ git grep current_task | grep -i per_cpu
> > >>> x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> > >> current_task);
> > >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> > current_task)
> > >> cacheline_aligned =
> > >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> > current_task)
> > >> = &init_task;
> > >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > >>> x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;
> > >>>
> > >>> On other architectures, lx_current() will lead to a python exception:
> > >>> (gdb) p $lx_current().pid
> > >>> Python Exception  No symbol "current_task" in current
> > >> context.:
> > >>> Error occurred in Python: No symbol "current_task" in current context.
> > >>>
> > >>> To avoid more people struggling and wasting time in other architectures,
> > >>> document it.
> > >>>
> > >>> Cc: Jan Kiszka 
> > >>> Signed-off-by: Barry Song 
> > >>> ---
> > >>>  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> > >>>  scripts/gdb/linux/cpus.py| 10 --
> > >>>  2 files changed, 9 insertions(+), 3 deletions(-)
> > >>>
> > >>> diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> index 4756f6b3a04e..1586901b683c 100644
> > >>> --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > >>> @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> > >>>  [ 0.00] BIOS-e820: [mem
> > 0x0009fc00-0x0009]
> > >> reserved
> > >>>  
> > >>>
> > >>> -- Examine fields of the current task struct::
> > >>> +- Examine fields of the current task struct(supported by x86 only)::
> > >>>
> > >>>  (gdb) p $lx_current().pid
> > >>>  $1 = 4998
> > >>> diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> > >>> index 008e62f3190d..f382762509d3 100644
> > >>> --- a/scripts/gdb/linux/cpus.py
> > >>> +++ b/scripts/gdb/linux/cpus.py
> > >>> @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> > >>>
> > >>>  PerCpu()
> > >>>
> > >>> +def get_current_task(cpu):
> > >>> +if utils.is_target_arch("x86"):
> > >>> + var_ptr = gdb.parse_and_eval("&current_task")
> > >>> +  

RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-23 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> Sent: Tuesday, February 23, 2021 8:27 PM
> To: Song Bao Hua (Barry Song) ;
> kieran.bing...@ideasonboard.com; cor...@lwn.net; linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> On 22.02.21 22:18, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> >> Sent: Tuesday, February 23, 2021 12:06 AM
> >> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> >> linux-...@vger.kernel.org; jan.kis...@siemens.com
> >> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> >> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by
> x86
> >>
> >> Hi Barry
> >>
> >> On 21/02/2021 21:35, Barry Song wrote:
> >>> lx_current depends on the per_cpu current_task which exists on x86 only:
> >>>
> >>> arch$ git grep current_task | grep -i per_cpu
> >>> x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> >> current_task);
> >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> current_task)
> >> cacheline_aligned =
> >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> >>> x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *,
> current_task)
> >> = &init_task;
> >>> x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> >>> x86/kernel/smpboot.c: per_cpu(current_task, cpu) = idle;
> >>>
> >>> On other architectures, lx_current() will lead to a python exception:
> >>> (gdb) p $lx_current().pid
> >>> Python Exception  No symbol "current_task" in current
> >> context.:
> >>> Error occurred in Python: No symbol "current_task" in current context.
> >>>
> >>> To avoid more people struggling and wasting time in other architectures,
> >>> document it.
> >>>
> >>> Cc: Jan Kiszka 
> >>> Signed-off-by: Barry Song 
> >>> ---
> >>>  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> >>>  scripts/gdb/linux/cpus.py| 10 --
> >>>  2 files changed, 9 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> >> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> index 4756f6b3a04e..1586901b683c 100644
> >>> --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> >>> @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> >>>  [ 0.00] BIOS-e820: [mem
> 0x0009fc00-0x0009]
> >> reserved
> >>>  
> >>>
> >>> -- Examine fields of the current task struct::
> >>> +- Examine fields of the current task struct(supported by x86 only)::
> >>>
> >>>  (gdb) p $lx_current().pid
> >>>  $1 = 4998
> >>> diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> >>> index 008e62f3190d..f382762509d3 100644
> >>> --- a/scripts/gdb/linux/cpus.py
> >>> +++ b/scripts/gdb/linux/cpus.py
> >>> @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> >>>
> >>>  PerCpu()
> >>>
> >>> +def get_current_task(cpu):
> >>> +if utils.is_target_arch("x86"):
> >>> + var_ptr = gdb.parse_and_eval("&current_task")
> >>> + return per_cpu(var_ptr, cpu).dereference()
> >>> +else:
> >>> +raise gdb.GdbError("Sorry, obtaining the current task is not yet
> "
> >>> +   "supported with this arch")
> >>
> >> I've wondered in the past how we should handle the architecture specific
> >> layers.
> >>
> >> Perhaps we need to have an interface of functionality to implement on
> >> each architecture so that we can create a per-arch set of helpers.
> >>
> >> or break it up into arch specific subdirs at least...
> >>
> >>
> >>>  class LxCurrentFunc(gdb.Function):
> >>>  """Return current task.
> >>> @@ -167

RE: [Linuxarm] Re: [PATCH] Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64

2021-02-22 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Anshuman Khandual [mailto:anshuman.khand...@arm.com]
> Sent: Tuesday, February 23, 2021 7:10 PM
> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> linux-...@vger.kernel.org; a...@linux-foundation.org; linux...@kvack.org
> Cc: linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux...@openeuler.org; Mel Gorman ; Andy Lutomirski
> ; Catalin Marinas ; Will Deacon
> 
> Subject: [Linuxarm] Re: [PATCH] Documentation/features: mark
> BATCHED_UNMAP_TLB_FLUSH doesn't apply to ARM64
> 
> 
> 
> On 2/23/21 6:02 AM, Barry Song wrote:
> > BATCHED_UNMAP_TLB_FLUSH is used on x86 to do batched tlb shootdown by
> > sending one IPI to TLB flush all entries after unmapping pages rather
> > than sending an IPI to flush each individual entry.
> > On arm64, tlb shootdown is done by hardware. Flush instructions are
> > innershareable. The local flushes are limited to the boot (1 per CPU)
> > and when a task is getting a new ASID.
> 
> Is there any previous discussion around this ?

I copied the description of the local flushes from:

"ARM64 Linux kernel is SMP-aware (no possibility to build only for UP).
Most of the flush instructions are innershareable. The local flushes are
limited to the boot (1 per CPU) and when a task is getting a new ASIC."

https://patchwork.kernel.org/project/xen-devel/patch/1461756173-10300-1-git-send-email-julien.gr...@arm.com/

I am not sure whether getting a new ASID and booting are the only two
cases of local flushes, though I think this is probably true.

But even if we find more corner cases, the conclusion that arm64
doesn't need BATCHED_UNMAP_TLB_FLUSH is hardly going to change.

> 
> > So marking this feature as "TODO" is not proper. ".." isn't good as
> > well. So this patch adds a "N/A" for this kind of features which are
> > not needed on some architectures.
> >
> > Cc: Mel Gorman 
> > Cc: Andy Lutomirski 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Signed-off-by: Barry Song 
> > ---
> >  Documentation/features/arch-support.txt| 1 +
> >  Documentation/features/vm/TLB/arch-support.txt | 2 +-
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/features/arch-support.txt
> b/Documentation/features/arch-support.txt
> > index d22a1095e661..118ae031840b 100644
> > --- a/Documentation/features/arch-support.txt
> > +++ b/Documentation/features/arch-support.txt
> > @@ -8,4 +8,5 @@ The meaning of entries in the tables is:
> >  | ok |  # feature supported by the architecture
> >  |TODO|  # feature not yet supported by the architecture
> >  | .. |  # feature cannot be supported by the hardware
> > +| N/A|  # feature doesn't apply to the architecture
> 
> NA might be better here. s/doesn't apply/not applicable/ in order to match NA.
> Still wondering if NA is really needed when there is already ".." ? Regardless
> either way should be fine.

I don't think ".." is proper here. ".." means hardware doesn't support
the feature. But here it is just opposite, arm64 has the hardware
support of tlb shootdown rather than depending on a software IPI.

> 
> >
> > diff --git a/Documentation/features/vm/TLB/arch-support.txt
> b/Documentation/features/vm/TLB/arch-support.txt
> > index 30f75a79ce01..0d070f9f98d8 100644
> > --- a/Documentation/features/vm/TLB/arch-support.txt
> > +++ b/Documentation/features/vm/TLB/arch-support.txt
> > @@ -9,7 +9,7 @@
> >  |   alpha: | TODO |
> >  | arc: | TODO |
> >  | arm: | TODO |
> > -|   arm64: | TODO |
> > +|   arm64: | N/A  |
> >  | c6x: |  ..  |
> >  |csky: | TODO |
> >  |   h8300: |  ..  |
> >
Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Tuesday, February 23, 2021 6:25 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Mon, 22 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Thu, 18 Feb 2021, Xiaofei Tan wrote:
> > >
> > > > On 2021/2/9 13:06, Finn Thain wrote:
> > > > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > > > >
> > > > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > > > drivers. There are no function changes, but may speed up if
> > > > > > > > interrupt happen too often.
> > > > > > >
> > > > > > > This change doesn't necessarily work on platforms that support
> > > > > > > nested interrupts.
> > > > > > >
> > > > > > > Were you able to measure any benefit from this change on some
> > > > > > > other platform?
> > > > > >
> > > > > > I think the code disabling irq in hardIRQ is simply wrong. Since
> > > > > > this commit
> > > > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > > > genirq: Run irq handlers with interrupts disabled
> > > > > >
> > > > > > interrupt handlers are definitely running in a irq-disabled
> > > > > > context unless irq handlers enable them explicitly in the
> > > > > > handler to permit other interrupts.
> > > > > >
> > > > >
> > > > > Repeating the same claim does not somehow make it true. If you put
> > > > > your claim to the test, you'll see that that interrupts are not
> > > > > disabled on m68k when interrupt handlers execute.
> > > > >
> > > > > The Interrupt Priority Level (IPL) can prevent any given irq
> > > > > handler from being re-entered, but an irq with a higher priority
> > > > > level may be handled during execution of a lower priority irq
> > > > > handler.
> > > > >
> > > > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > > > avoid issues relating to this. This kind of locking may be needed
> > > > > in the drivers you are trying to patch. Or it might not.
> > > > > Apparently, no-one has looked.
> > > > >
> > > >
> > > > According to your discussion with Barry, it seems that m68k is a
> > > > little different from other architecture, and this kind of
> > > > modification of this patch cannot be applied to m68k. So, could help
> > > > to point out which driver belong to m68k architecture in this patch
> > > > set of SCSI? I can remove them.
> > > >
> > >
> > > If you would claim that "there are no function changes" in your
> > > patches (as above) then the onus is on you to support that claim.
> > >
> > > I assume that there are some platforms on which your assumptions hold.
> > >
> > > With regard to drivers for those platforms, you might want to explain
> > > why your patches should be applied there, given that the existing code
> > > is superior for being more portable.
> >
I don't think it has anything to do with portability. In the case of
sonic_interrupt() you pointed out, on m68k a high-priority interrupt
can preempt a low-priority one, and both can access the same critical
data. M68k's spin_lock_irqsave() can disable the high-priority
interrupt and avoid the race condition on the data. So that case
should not be touched. I'd like to accept the reality and leave
sonic_interrupt() alone.
> >
However, even on m68k, spin_lock_irqsave() is not needed in other,
ordinary cases. If no other irq handler will come in and access the
same critical data, it is pointless to hold a redundant irqsave lock
in an irq handler, even on m68k.
> >
> > In thread conte

RE: [PATCH] scripts/gdb: document lx_current is only supported by x86

2021-02-22 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Kieran Bingham [mailto:kieran.bing...@ideasonboard.com]
> Sent: Tuesday, February 23, 2021 12:06 AM
> To: Song Bao Hua (Barry Song) ; cor...@lwn.net;
> linux-...@vger.kernel.org; jan.kis...@siemens.com
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] scripts/gdb: document lx_current is only supported by x86
> 
> Hi Barry
> 
> On 21/02/2021 21:35, Barry Song wrote:
> > lx_current depends on the per_cpu current_task which exists on x86 only:
> >
> > arch$ git grep current_task | grep -i per_cpu
> > x86/include/asm/current.h:DECLARE_PER_CPU(struct task_struct *,
> current_task);
> > x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task)
> cacheline_aligned =
> > x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > x86/kernel/cpu/common.c:DEFINE_PER_CPU(struct task_struct *, current_task)
> = &init_task;
> > x86/kernel/cpu/common.c:EXPORT_PER_CPU_SYMBOL(current_task);
> > x86/kernel/smpboot.c:   per_cpu(current_task, cpu) = idle;
> >
> > On other architectures, lx_current() will lead to a python exception:
> > (gdb) p $lx_current().pid
> > Python Exception  No symbol "current_task" in current
> context.:
> > Error occurred in Python: No symbol "current_task" in current context.
> >
> > To avoid more people struggling and wasting time in other architectures,
> > document it.
> >
> > Cc: Jan Kiszka 
> > Signed-off-by: Barry Song 
> > ---
> >  Documentation/dev-tools/gdb-kernel-debugging.rst |  2 +-
> >  scripts/gdb/linux/cpus.py| 10 --
> >  2 files changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst
> b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > index 4756f6b3a04e..1586901b683c 100644
> > --- a/Documentation/dev-tools/gdb-kernel-debugging.rst
> > +++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
> > @@ -114,7 +114,7 @@ Examples of using the Linux-provided gdb helpers
> >  [ 0.00] BIOS-e820: [mem 0x0009fc00-0x0009]
> reserved
> >  
> >
> > -- Examine fields of the current task struct::
> > +- Examine fields of the current task struct(supported by x86 only)::
> >
> >  (gdb) p $lx_current().pid
> >  $1 = 4998
> > diff --git a/scripts/gdb/linux/cpus.py b/scripts/gdb/linux/cpus.py
> > index 008e62f3190d..f382762509d3 100644
> > --- a/scripts/gdb/linux/cpus.py
> > +++ b/scripts/gdb/linux/cpus.py
> > @@ -156,6 +156,13 @@ Note that VAR has to be quoted as string."""
> >
> >  PerCpu()
> >
> > +def get_current_task(cpu):
> > +if utils.is_target_arch("x86"):
> > + var_ptr = gdb.parse_and_eval("&current_task")
> > + return per_cpu(var_ptr, cpu).dereference()
> > +else:
> > +raise gdb.GdbError("Sorry, obtaining the current task is not yet "
> > +   "supported with this arch")
> 
> I've wondered in the past how we should handle the architecture specific
> layers.
> 
> Perhaps we need to have an interface of functionality to implement on
> each architecture so that we can create a per-arch set of helpers.
> 
> or break it up into arch specific subdirs at least...
> 
> 
> >  class LxCurrentFunc(gdb.Function):
> >  """Return current task.
> > @@ -167,8 +174,7 @@ number. If CPU is omitted, the CPU of the current 
> > context
> is used."""
> >  super(LxCurrentFunc, self).__init__("lx_current")
> >
> >  def invoke(self, cpu=-1):
> > -var_ptr = gdb.parse_and_eval("&current_task")
> > -return per_cpu(var_ptr, cpu).dereference()
> > +return get_current_task(cpu)
> >
> 
> And then perhaps we simply shouldn't even expose commands which can not
> be supported on those architectures?

I feel it is better to tell users that the function is not supported
on their arch than to simply hide it.

If we hide it, users still have many chances to try it, as they may
have learned about lx_current from Google or elsewhere. They will try
it; when lx_current turns out not to be in the list and they get an
error like "invalid data type for function to be called", they will
probably suspect their gdb/python environment is not set up correctly
and continue wasting time checking it. Finally they figure out the
function is not exposed because their arch doesn't support it. But
they have wasted a couple of

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-21 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Saturday, February 20, 2021 6:18 PM
> To: tanxiaofei 
> Cc: Song Bao Hua (Barry Song) ; 
> j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: Re: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Thu, 18 Feb 2021, Xiaofei Tan wrote:
> 
> > On 2021/2/9 13:06, Finn Thain wrote:
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > >
> > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > drivers. There are no function changes, but may speed up if
> > > > > > interrupt happen too often.
> > > > >
> > > > > This change doesn't necessarily work on platforms that support
> > > > > nested interrupts.
> > > > >
> > > > > Were you able to measure any benefit from this change on some
> > > > > other platform?
> > > >
> > > > I think the code disabling irq in hardIRQ is simply wrong.
> > > > Since this commit
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > genirq: Run irq handlers with interrupts disabled
> > > >
> > > > interrupt handlers are definitely running in a irq-disabled context
> > > > unless irq handlers enable them explicitly in the handler to permit
> > > > other interrupts.
> > > >
> > >
> > > Repeating the same claim does not somehow make it true. If you put
> > > your claim to the test, you'll see that that interrupts are not
> > > disabled on m68k when interrupt handlers execute.
> > >
> > > The Interrupt Priority Level (IPL) can prevent any given irq handler
> > > from being re-entered, but an irq with a higher priority level may be
> > > handled during execution of a lower priority irq handler.
> > >
> > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > avoid issues relating to this. This kind of locking may be needed in
> > > the drivers you are trying to patch. Or it might not. Apparently,
> > > no-one has looked.
> > >
> >
> > According to your discussion with Barry, it seems that m68k is a little
> > different from other architecture, and this kind of modification of this
> > patch cannot be applied to m68k. So, could help to point out which
> > driver belong to m68k architecture in this patch set of SCSI? I can
> > remove them.
> >
> 
> If you would claim that "there are no function changes" in your patches
> (as above) then the onus is on you to support that claim.
> 
> I assume that there are some platforms on which your assumptions hold.
> 
> With regard to drivers for those platforms, you might want to explain why
> your patches should be applied there, given that the existing code is
> superior for being more portable.

I don't think it has anything to do with portability. In the case of
sonic_interrupt() you pointed out, on m68k a high-priority interrupt
can preempt a low-priority one, and both can access the same critical
data. M68k's spin_lock_irqsave() can disable the high-priority
interrupt and avoid the race condition on the data. So that case
should not be touched. I'd like to accept the reality and leave
sonic_interrupt() alone.

However, even on m68k, spin_lock_irqsave() is not needed in other,
ordinary cases. If no other irq handler will come in and access the
same critical data, it is pointless to hold a redundant irqsave lock
in an irq handler, even on m68k.

In thread context, we always need irqsave locking if an irq handler
can preempt those threads and access the same data. In hardirq
context, if there is a high-priority interrupt which can jump in on
m68k and access the critical data which needs protection, we use
spin_lock_irqsave() as you have done in sonic_interrupt(). Otherwise,
the irqsave is redundant even on m68k.
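
A sketch of the pattern sonic_interrupt() uses, with hypothetical
names; this is only needed when a higher-priority handler really does
share the data:

static DEFINE_SPINLOCK(my_shared_lock);

static irqreturn_t my_isr(int irq, void *dev_id)
{
	unsigned long flags;

	/* on m68k this also blocks the higher-IPL interrupt */
	spin_lock_irqsave(&my_shared_lock, flags);
	/* ... touch state shared with the higher-priority handler ... */
	spin_unlock_irqrestore(&my_shared_lock, flags);

	return IRQ_HANDLED;
}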

> 
> > BTW, sonic_interrupt() is from net driver natsemi, right?  It would be
> > appreciative if only discuss SCSI drivers in this patch set. thanks.
> >
> 
> The 'net' subsystem does have some different requirements than the 'scsi'
> subsystem. But I don't see how that's relevant. Perhaps you can explain
> it. Thanks.

The difference is whether there are two co-existing interrupts which
can access the same critical data on m68k. I don't think net vs. scsi
matters; what really matters is the specific driver.

Thanks
Barry



RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-18 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, February 19, 2021 1:41 AM
> To: Song Bao Hua (Barry Song) ; Peter Zijlstra
> 
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
> 
> 
> Hi Barry,
> 
> On 18/02/21 09:17, Song Bao Hua (Barry Song) wrote:
> > Hi Valentin,
> >
> > I understand Peter's concern is that the local group has different
> > size with remote groups. Is this patch resolving Peter's concern?
> > To me, it seems not :-)
> >
> 
> If you remove the '&& i != cpu' in build_overlap_sched_groups() you get that,
> but then you also get some extra warnings :-)
> 
> Now yes, should_we_balance() only matters for the local group. However I'm
> somewhat wary of messing with the local groups; for one it means you would 
> have
> more than one tl now accessing the same sgc->next_update, sgc->{min,
> max}capacity, sgc->group_imbalance (as Vincent had pointed out).
> 
> By ensuring only remote (i.e. !local) groups are modified (which is what your
> patch does), we absolve ourselves of this issue, which is why I prefer this
> approach ATM.

Yep. The grandchild approach still seems to be the feasible way for the moment.

> 
> > Though I don’t understand why different group sizes will be harmful
> > since all groups are calculating avg_load and group_type based on
> > their own capacities. Thus, for a smaller group, its capacity would be
> > smaller.
> >
> > Is it because a bigger group has relatively less chance to pull, so
> > load balancing will be completed more slowly while small groups have
> > high load?
> >
> 
> Peter's point is that, if at a given tl you have groups that look like
> 
> g0: 0-4, g1: 5-6, g2: 7-8
> 
> Then g0 is half as likely to pull tasks with load_balance() than g1 or g2 (due
> to the group size vs should_we_balance())

Yep. The difference is that g1 and g2 won't be the local groups of any
CPU in this tl. The smaller groups g1 and g2 are only ever remote
groups, so should_we_balance() doesn't matter for them here.
> 
> 
> However, I suppose one "trick" to be aware of here is that since your patch
> *doesn't* change the local group, we do have e.g. on CPU0:
> 
> [0.374840]domain-2: span=0-5 level=NUMA
> [0.375054] groups: 0:{ span=0-3 cap=4003 }, 4:{ span=4-5 cap=1988 }
> 
> *but* on CPU4 we get:
> 
> [0.387019]domain-2: span=0-1,4-7 level=NUMA
> [0.387211] groups: 4:{ span=4-7 cap=3984 }, 0:{ span=0-1 cap=2013 }
> 
> IOW, at a given tl, all *local* groups have /roughly/ the same size and thus
> similar pull probability (it took me writing this mail to see it that way).
> So perhaps this is all fine already?

Yep. Since should_we_balance() only matters for local groups and we
haven't changed the size of local groups in the grandchild approach,
all local groups still get a similar pull probability at this
topology level.

Since we still prefer the grandchild approach ATM, if Peter has no
more concerns about the unequal size between local and remote groups,
I would be glad to send v4 of the grandchild approach, rewriting the
changelog to explain the update issue of sgc->next_update,
sgc->{min,max}capacity and sgc->group_imbalance that Vincent pointed
out, and also to describe that the local groups are not touched and
thus remain equal in size.

Thanks
Barry



RE: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-18 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, February 12, 2021 8:55 AM
> To: Peter Zijlstra ; Song Bao Hua (Barry Song)
> 
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: [Linuxarm] Re: [PATCH v2] sched/topology: fix the issue groups don't
> span domain->span for NUMA diameter > 2
> 
> On 10/02/21 12:21, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2021 at 08:58:15PM +, Song Bao Hua (Barry Song) wrote:
> >> So historically, the code has never tried to make sched_groups result
> >> in equal size. And it permits the overlapping of local group and remote
> >> groups.
> >
> > Histrorically groups have (typically) always been the same size though.
> >
> > The reason I did ask is because when you get one large and a bunch of
> > smaller groups, the load-balancing 'pull' is relatively smaller to the
> > large groups.
> >
> > That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> > continues the load-balancing pass. So if, for example, we have one group
> > of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> > 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> >
> > By making sure all groups are of the same level, and thus of equal size,
> > this doesn't happen.
> 
> So I hacked something that tries to do this, with the notable exception
> that it doesn't change the way the local group is generated. Breaking the
> assumption that the local group always spans the child domain doesn't sound
> like the best thing to do.
> 
> Anywho, the below makes it so all !local NUMA groups are built using the
> same sched_domain_topology_level. Some of it is absolutely disgusting
> (esp. the sched_domains_curr_level stuff), I didn't bother with handling
> domain degeneration yet, and I trigger the WARN_ON in get_group()... But at
> least the topology gets built!
> 
> AFAICT Barry's topology is handled the same. On that sunfire topology, it
> pretty much turns all remote groups into groups spanning a single
> node. That does almost double the number of groups for some domains,
> compared to Barry's current v3.
> 
> If that is really a route we want to go down, I'll try to clean the below.
> 
Hi Valentin,

I understand Peter's concern is that the local group has a different
size from the remote groups. Does this patch resolve Peter's concern?
To me, it seems not :-)

Though I don't understand why different group sizes would be harmful,
since all groups calculate avg_load and group_type based on their own
capacities; thus a smaller group simply has a smaller capacity.

Is it because a bigger group has a relatively smaller chance to pull,
so load balancing completes more slowly while the small groups carry
a high load?

> (deposit your drinks before going any further)
> --->8---
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8f0f778b7c91..e74f48bdd226 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -187,7 +187,10 @@ struct sched_domain_topology_level {
>   sched_domain_mask_f mask;
>   sched_domain_flags_f sd_flags;
>   int flags;
> +#ifdef CONFIG_NUMA
>   int numa_level;
> + int numa_sibling_level;
> +#endif
>   struct sd_data  data;
>  #ifdef CONFIG_SCHED_DEBUG
>   char*name;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3c50cc7285c9..5a9e6a4d5d89 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -742,6 +742,34 @@ enum s_alloc {
>   sa_none,
>  };
> 
> +/*
> + * Topology list, bottom-up.
> + */
> +static struct sched_domain_topology_level default_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> + { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> + { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +#endif
> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> + { NULL, },
> +};
> +
> +static struct sched_domain_topology_level *sched_domain_topology =
> + default_topology;
> +
> +#define for_each_sd_topology(tl) \
> + for (tl = sched_domain_topology; tl->mask; tl++)
> +
> +void set_sched_topology(struct sched_domain_topology_level *tl)
> +{
> + if (WARN_ON_ONCE(s

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-17 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Sunday, February 14, 2021 6:11 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Arnd Bergmann ; t...@linutronix.de;
> gre...@linuxfoundation.org; a...@arndb.de; ge...@linux-m68k.org;
> fun...@jurai.org; ph...@gnu.org; cor...@lwn.net; mi...@redhat.com;
> linux-m...@lists.linux-m68k.org; linux-kernel@vger.kernel.org
> Subject: RE: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, 13 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > So what is really confusing and a pain to me is that:
> > For years people like me have been writing device drivers
> > with the idea that irq handlers run with interrupts
> > disabled after those commits in genirq. So I don't need
> > to care about if some other IRQs on the same cpu will
> > jump out to access the data the current IRQ handler
> > is accessing.
> >
> > but it turns out the assumption is not true on some platform.
> > So should I start to program devices driver with the new idea
> > interrupts can actually come while irqhandler is running?
> >
> > That's the question which really bothers me.
> >
> 
> That scenario seems a little contrived to me (drivers for two or more
> devices sharing state through their interrupt handlers). Is it real?
> I suppose every platform has its quirks. The irq lock in sonic_interrupt()
> is only there because of a platform quirk (the same device can trigger
> either of two IRQs). Anyway, no-one expects all drivers to work on all
> platforms; I don't know why it bothers you so much when platforms differ.

Basically, we write drivers with the assumption that they will be
cross-platform. (Of course, some drivers can only work on one
platform, for example when the device IP is used on a single platform
as an internal component of a specific SoC.)

So once a device has two or more interrupts, we need to consider that
one interrupt might preempt another on the same CPU on m68k if we also
want to support the driver on m68k. This usually doesn't matter on
other platforms.

On the other hand, there are more than 400 callers of irqs_disabled()
in the kernel. I am really not sure whether they run with the
knowledge that irqs_disabled() being true can mean, on m68k, that some
interrupts are off while others are still open, or with the assumption
that irqs_disabled() being true means IRQs are totally quiet. If the
latter is true, those drivers might fail to work on m68k as well.
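
As a contrived sketch of what cross-platform code then has to do (all
identifiers here are made up), a driver whose two IRQ lines share
state cannot rely on hardIRQ context alone on m68k and has to use the
_irqsave variant, just like sonic_interrupt():

/* Hypothetical two-IRQ driver: on m68k the higher-priority line can
 * preempt the lower-priority handler, so a plain spin_lock would not
 * be enough even though both handlers run in hardIRQ context.
 */
static irqreturn_t foo_irq(int irq, void *dev_id)
{
	struct foo_priv *priv = dev_id;
	unsigned long flags;

	spin_lock_irqsave(&priv->lock, flags);	/* raises the IPL to 7 */
	foo_handle_events(priv);		/* state shared by both IRQs */
	spin_unlock_irqrestore(&priv->lock, flags);
	return IRQ_HANDLED;
}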

Thanks
Barry


RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Sunday, February 14, 2021 11:13 AM
> To: 'Arnd Bergmann' 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: RE: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> 
> 
> > -Original Message-
> > From: Arnd Bergmann [mailto:a...@kernel.org]
> > Sent: Sunday, February 14, 2021 5:32 AM
> > To: Song Bao Hua (Barry Song) 
> > Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> > ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> > mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> > fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> > Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not
> NMI)
> > enabled on some platform
> >
> > On Sat, Feb 13, 2021 at 12:50 AM Song Bao Hua (Barry Song)
> >  wrote:
> >
> > > So I was actually trying to warn this unusual case - interrupts
> > > get nested while both in_hardirq() and irqs_disabled() are true.
> > >
> > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > > index 7c9d6a2d7e90..b8ca27555c76 100644
> > > --- a/include/linux/hardirq.h
> > > +++ b/include/linux/hardirq.h
> > > @@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
> > >   */
> > >  #define __irq_enter()  \
> > > do {\
> > > +   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
> > > preempt_count_add(HARDIRQ_OFFSET);  \
> >
> > That seems to be a rather heavyweight change in a critical path.
> >
> > A more useful change might be to implement lockdep support for m68k
> > and see if that warns about any actual problems. I'm not sure
> > what is actually missing for that, but these are the commits that
> > added it for other architectures in the past:
> >
> > 3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup
> TRACE_IRQFLAGS_SUPPORT")
> > 000591f1ca33 ("csky: Enable LOCKDEP_SUPPORT")
> > 78cdfb5cf15e ("openrisc: enable LOCKDEP_SUPPORT and irqflags tracing")
> > 8f371c752154 ("xtensa: enable lockdep support")
> > bf2d80966890 ("microblaze: Lockdep support")
> >
> 
> Yes. M68k lacks lockdep support which might be added.

BTW, m68k probably won't run into any problem with lockdep, as it has
been running this way for decades. It is just like how interrupts were
widely allowed to preempt irq handlers on all platforms before
IRQF_DISABLED was dropped and commit e58aa3d2d0cc ("genirq: Run irq
handlers with interrupts disabled") was merged: only rarely did we
really run into the stack overflow issue that commit mentioned. Before
those commits we had already shipped thousands of successful Linux
products running irq handlers with interrupts enabled.

So what is really confusing and painful to me is this: for years,
people like me have been writing device drivers with the idea that irq
handlers run with interrupts disabled, after those commits in genirq.
So I don't need to care whether some other IRQ on the same CPU will
jump in to access the data the current IRQ handler is accessing.

But it turns out the assumption is not true on some platforms. So
should I start to program device drivers with the new idea that
interrupts can actually come in while an irq handler is running?

That's the question which really bothers me.

> 
> > > And I also think it is better for m68k's arch_irqs_disabled() to
> > > return true only when both low and high priority interrupts are
> > > disabled rather than try to mute this warn in genirq by a weaker
> > > condition:
> > >  if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled
> > interrupts\n",
> > >  irq, action->handler))
> > >local_irq_disable();
> > > }
> > >
> > > This warn is not activated on m68k because its arch_irqs_disabled() return
> > > true though its high-priority interrupts are still enabled.
> >
> > Then it would just end up always warning when a nested hardirq happens,
> > right? That seems no different to dropping support for nested hardirqs
> > on 

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-13 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Sunday, February 14, 2021 5:32 AM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, Feb 13, 2021 at 12:50 AM Song Bao Hua (Barry Song)
>  wrote:
> 
> > So I was actually trying to warn this unusual case - interrupts
> > get nested while both in_hardirq() and irqs_disabled() are true.
> >
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index 7c9d6a2d7e90..b8ca27555c76 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
> >   */
> >  #define __irq_enter()  \
> > do {\
> > +   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
> > preempt_count_add(HARDIRQ_OFFSET);  \
> 
> That seems to be a rather heavyweight change in a critical path.
> 
> A more useful change might be to implement lockdep support for m68k
> and see if that warns about any actual problems. I'm not sure
> what is actually missing for that, but these are the commits that
> added it for other architectures in the past:
> 
> 3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup TRACE_IRQFLAGS_SUPPORT")
> 000591f1ca33 ("csky: Enable LOCKDEP_SUPPORT")
> 78cdfb5cf15e ("openrisc: enable LOCKDEP_SUPPORT and irqflags tracing")
> 8f371c752154 ("xtensa: enable lockdep support")
> bf2d80966890 ("microblaze: Lockdep support")
> 

Yes. M68k lacks lockdep support, which might be added.

> > And I also think it is better for m68k's arch_irqs_disabled() to
> > return true only when both low and high priority interrupts are
> > disabled rather than try to mute this warn in genirq by a weaker
> > condition:
> >  if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled
> interrupts\n",
> >  irq, action->handler))
> >local_irq_disable();
> > }
> >
> > This warn is not activated on m68k because its arch_irqs_disabled() return
> > true though its high-priority interrupts are still enabled.
> 
> Then it would just end up always warning when a nested hardirq happens,
> right? That seems no different to dropping support for nested hardirqs
> on m68k altogether, which of course is what you suggested already.

This won't end up warning on other architectures like arm, arm64, x86,
etc., as interrupts can't come in while arch_irqs_disabled() is true in
hardIRQ. For example, on ARM the I bit of the CPSR is set:
static inline int arch_irqs_disabled_flags(unsigned long flags)
{
	return flags & IRQMASK_I_BIT;
}

So it would only give a backtrace on platforms whose
arch_irqs_disabled() returns true while only some interrupts are
disabled and others are still open, so that nested interrupts can come
in without any explicit code enabling interrupts.
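
By contrast, m68k's check (from arch/m68k/include/asm/irqflags.h, if I
read it correctly) treats any non-zero priority mask as "disabled":

/* m68k: true whenever the IPL mask is non-zero, even though levels
 * above the mask can still be delivered.
 */
static inline int arch_irqs_disabled_flags(unsigned long flags)
{
	return (flags & ~ALLOWINT) != 0;
}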

This warning seems to give a consistent interpretation of what "run
irq handlers with interrupts disabled" means in commit e58aa3d2d0cc
("genirq: Run irq handlers with interrupts disabled").

> 
>Arnd

Thanks
Barry


RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 12:06 PM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Sat, Feb 13, 2021 at 12:00 AM Song Bao Hua (Barry Song)
>  wrote:
> > > -Original Message-
> > > From: Arnd Bergmann [mailto:a...@kernel.org]
> > > Sent: Saturday, February 13, 2021 11:34 AM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> > > ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> > > mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> > > fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> > > Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not
> NMI)
> > > enabled on some platform
> > >
> > > On Fri, Feb 12, 2021 at 2:18 AM Song Bao Hua (Barry Song)
> > >  wrote:
> > >
> > > > So I am requesting comments on:
> > > > 1. are we expecting all interrupts except NMI to be disabled in irq 
> > > > handler,
> > > > or do we actually allow some high-priority interrupts between low and
> NMI
> > > to
> > > > come in some platforms?
> > >
> > > I tried to come to an answer but this does not seem particularly 
> > > well-defined.
> > > There are a few things I noticed:
> > >
> > > - going through the local_irq_save()/restore() implementations on all
> > >   architectures, I did not find any other ones besides m68k that leave
> > >   high-priority interrupts enabled. I did see that at least alpha and 
> > > openrisc
> > >   are designed to support that in hardware, but the code just leaves the
> > >   interrupts disabled.
> >
> > The case is a little different. Explicit local_irq_save() does disable all
> > high priority interrupts on m68k. The only difference is 
> > arch_irqs_disabled()
> > of m68k will return true while low-priority interrupts are masked and high
> > -priority are still open. M68k's hardIRQ also runs in this context with high
> > priority interrupts enabled.
> 
> My point was that on most other architectures, local_irq_save()/restore()
> always disables/enables all interrupts, while on m68k it restores the
> specific level they were on before. On alpha, it does the same as on m68k,
> but then the top-level interrupt handler just disables them all before calling
> into any other code.

That's what I think m68k would be better off doing.

It looks weird that nested interrupts can come in while
arch_irqs_disabled() is true on m68k: masking only the low-priority
interrupts, with high-priority ones still enabled, is enough to make
m68k's arch_irqs_disabled() return true, and that is exactly the
environment m68k's irq handlers run in.

So I was actually trying to warn about this unusual case - interrupts
getting nested while both in_hardirq() and irqs_disabled() are true.

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c9d6a2d7e90..b8ca27555c76 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
account_hardirq_enter(current); \
@@ -44,6 +45,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter_raw()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
} while (0)

And I also think it is better for m68k's arch_irqs_disabled() to
return true only when both low- and high-priority interrupts are
disabled, rather than trying to mute this warning in genirq with a
weaker condition:

irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags)
{
...

trace_irq_handler_entry(irq, action)

RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 11:34 AM
> To: Song Bao Hua (Barry Song) 
> Cc: t...@linutronix.de; gre...@linuxfoundation.org; a...@arndb.de;
> ge...@linux-m68k.org; fun...@jurai.org; ph...@gnu.org; cor...@lwn.net;
> mi...@redhat.com; linux-m...@lists.linux-m68k.org;
> fth...@telegraphics.com.au; linux-kernel@vger.kernel.org
> Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not 
> NMI)
> enabled on some platform
> 
> On Fri, Feb 12, 2021 at 2:18 AM Song Bao Hua (Barry Song)
>  wrote:
> 
> > So I am requesting comments on:
> > 1. are we expecting all interrupts except NMI to be disabled in irq handler,
> > or do we actually allow some high-priority interrupts between low and NMI
> to
> > come in some platforms?
> 
> I tried to come to an answer but this does not seem particularly well-defined.
> There are a few things I noticed:
> 
> - going through the local_irq_save()/restore() implementations on all
>   architectures, I did not find any other ones besides m68k that leave
>   high-priority interrupts enabled. I did see that at least alpha and openrisc
>   are designed to support that in hardware, but the code just leaves the
>   interrupts disabled.

The case is a little different. An explicit local_irq_save() does
disable all high-priority interrupts on m68k. The only difference is
that m68k's arch_irqs_disabled() will return true while low-priority
interrupts are masked and high-priority ones are still open. M68k's
hardIRQ also runs in this context, with high-priority interrupts
enabled.

> 
> - The generic code is clearly prepared to handle nested hardirqs, and
>the irq_enter()/irq_exit() functions have a counter in preempt_count
>for the nesting level, using a 4-bit number for hardirq, plus another
>4-bit number for NMI.

Yes, I understand nested interrupts are supported by an explicit
local_irq_enable_in_hardirq(). M68k's case is different: nested
interrupts can come in while arch_irqs_disabled() is true and nobody
has called local_irq_enable_in_hardirq() in the preceding hardIRQ,
because hardIRQ keeps high-priority interrupts open.

> 
> - There are a couple of (ancient) drivers that enable interrupts in their
>interrupt handlers, see the four callers of local_irq_enable_in_hardirq()
>(all in the old drivers/ide stack) and arch/ia64/kernel/time.c, which
>enables interupts in its timer function (I recently tried removing this
>and my patch broke ia64 timers, but I'm not sure if the cause was
>the local_irq_enable() or something else).
> 
> - The local_irq_enable_in_hardirq() function itself turns into a nop
>   when lockdep is enabled, since d7e9629de051 ("[PATCH] lockdep:
>   add local_irq_enable_in_hardirq() API"). According to the comment
>   in there, lockdep already enforces the behavior you suggest. Note that
>   lockdep support is missing on m68k (and also alpha, h8300, ia64, nios2,
>   and parisc).
> 
> > 2. If either side is true, I think we need to document it somewhere as there
> > is always confusion about this.
> >
> > Personally, I would expect all interrupts to be disabled and I like the way
> > of ARM64 to only use high-priority interrupt as pseudo NMI:
> > https://lwn.net/Articles/755906/
> > Though Finn argued that this will contribute to lose hardware feature of 
> > m68k.
> 
> Regardless of what is documented, I would argue that any platform
> that relies on this is at the minimum doing something risky because at
> the minimum this runs into hard to debug code paths that are not
> exercised on any of the common architectures.
> 
> Arnd


Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Saturday, February 13, 2021 9:23 AM
> To: Grygorii Strashko 
> Cc: Song Bao Hua (Barry Song) ; Andy Shevchenko
> ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 12:53 PM Grygorii Strashko
>  wrote:
> >
> > The worst RT case I can imagine is when gpio API is still called from hard
> IRQ context by some
> > other device driver - some toggling for example.
> > Note. RT or "threadirqs" does not mean gpiochip become sleepable.
> >
> > In this case:
> >   threaded handler
> > raw_spin_lock
> > IRQ from other device
> >hard_irq handler
> >  gpiod_x()
> > raw_spin_lock_irqsave() -- oops
> >
> 
> Good point, I had missed the fact that drivers can call gpio functions from
> hardirq context when I replied earlier, gpio is clearly special here.


Yes. GPIO provides APIs; thus, other drivers can go directly into the
gpio driver's territory.

Another case which is even more special might be m68k, on which I cc-ed
you yesterday:
https://lore.kernel.org/lkml/c46ddb954cfe45d9849c911271d7e...@hisilicon.com/

> 
>   Arnd

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Saturday, February 13, 2021 3:09 AM
> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> 
> Cc: Arnd Bergmann ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 12/02/2021 15:12, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >> Sent: Saturday, February 13, 2021 12:53 AM
> >> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> >> 
> >> Cc: Arnd Bergmann ; luojiaxing ;
> Linus
> >> Walleij ; Santosh Shilimkar
> ;
> >> Kevin Hilman ; open list:GPIO SUBSYSTEM
> >> ; linux-kernel@vger.kernel.org;
> >> linux...@openeuler.org
> >> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>
> >>
> >>
> >> On 12/02/2021 13:29, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> >>>> Sent: Friday, February 12, 2021 11:57 PM
> >>>> To: Song Bao Hua (Barry Song) 
> >>>> Cc: Grygorii Strashko ; Arnd Bergmann
> >>>> ; luojiaxing ; Linus Walleij
> >>>> ; Santosh Shilimkar ;
> Kevin
> >>>> Hilman ; open list:GPIO SUBSYSTEM
> >>>> ; linux-kernel@vger.kernel.org;
> >>>> linux...@openeuler.org
> >>>> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >>>> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>>>
> >>>> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) 
> >>>> wrote:
> >>>>>> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >>>>>> Sent: Friday, February 12, 2021 11:28 PM
> >>>>>> On 12/02/2021 11:45, Arnd Bergmann wrote:
> >>>>>>> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >>>>>>>  wrote:
> >>>>
> >>>>>>>>> Note. there is also generic_handle_irq() call inside.
> >>>>>>>>
> >>>>>>>> So generic_handle_irq() is not safe to run in thread thus requires
> >>>>>>>> an interrupt-disabled environment to run? If so, I'd rather this
> >>>>>>>> irqsave moved into generic_handle_irq() rather than asking everyone
> >>>>>>>> calling it to do irqsave.
> >>>>>>>
> >>>>>>> In a preempt-rt kernel, interrupts are run in task context, so they
> clearly
> >>>>>>> should not be called with interrupts disabled, that would defeat the
> >>>>>>> purpose of making them preemptible.
> >>>>>>>
> >>>>>>> generic_handle_irq() does need to run with in_irq()==true though,
> >>>>>>> but this should be set by the caller of the gpiochip's handler, and
> >>>>>>> it is not set by raw_spin_lock_irqsave().
> >>>>>>
> >>>>>> It will produce warning from __handle_irq_event_percpu(), as this is
> IRQ
> >>>>>> dispatcher
> >>>>>> and generic_handle_irq() will call one of handle_level_irq or
> >>>> handle_edge_irq.
> >>>>>>
> >>>>>> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert
> to
> >>>> use
> >>>>>> generic irq handler").
> >>>>>>
> >>>>>> The resent related discussion:
> >>>>>> https://lkml.org/lkml/2020/12/5/208
> >>>>>
> >>>>> Ok, second thought. irqsave before generic_handle_irq() won't defeat
> >>>>> the purpose of preemption too much as the dispatched irq handlers by
> >>>>> gpiochip will run in their own threads but not in the thread of
> >>>>> gpiochip's handler.
> >>>>>
> >>>>> so looks like this patch ca

RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Saturday, February 13, 2021 12:53 AM
> To: Song Bao Hua (Barry Song) ; Andy Shevchenko
> 
> Cc: Arnd Bergmann ; luojiaxing ; Linus
> Walleij ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 12/02/2021 13:29, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> >> Sent: Friday, February 12, 2021 11:57 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: Grygorii Strashko ; Arnd Bergmann
> >> ; luojiaxing ; Linus Walleij
> >> ; Santosh Shilimkar ; Kevin
> >> Hilman ; open list:GPIO SUBSYSTEM
> >> ; linux-kernel@vger.kernel.org;
> >> linux...@openeuler.org
> >> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> >> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> >>
> >> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) wrote:
> >>>> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> >>>> Sent: Friday, February 12, 2021 11:28 PM
> >>>> On 12/02/2021 11:45, Arnd Bergmann wrote:
> >>>>> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >>>>>  wrote:
> >>
> >>>>>>> Note. there is also generic_handle_irq() call inside.
> >>>>>>
> >>>>>> So generic_handle_irq() is not safe to run in thread thus requires
> >>>>>> an interrupt-disabled environment to run? If so, I'd rather this
> >>>>>> irqsave moved into generic_handle_irq() rather than asking everyone
> >>>>>> calling it to do irqsave.
> >>>>>
> >>>>> In a preempt-rt kernel, interrupts are run in task context, so they 
> >>>>> clearly
> >>>>> should not be called with interrupts disabled, that would defeat the
> >>>>> purpose of making them preemptible.
> >>>>>
> >>>>> generic_handle_irq() does need to run with in_irq()==true though,
> >>>>> but this should be set by the caller of the gpiochip's handler, and
> >>>>> it is not set by raw_spin_lock_irqsave().
> >>>>
> >>>> It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> >>>> dispatcher
> >>>> and generic_handle_irq() will call one of handle_level_irq or
> >> handle_edge_irq.
> >>>>
> >>>> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to
> >> use
> >>>> generic irq handler").
> >>>>
> >>>> The resent related discussion:
> >>>> https://lkml.org/lkml/2020/12/5/208
> >>>
> >>> Ok, second thought. irqsave before generic_handle_irq() won't defeat
> >>> the purpose of preemption too much as the dispatched irq handlers by
> >>> gpiochip will run in their own threads but not in the thread of
> >>> gpiochip's handler.
> >>>
> >>> so looks like this patch can improve by:
> >>> * move other raw_spin_lock_irqsave to raw_spin_lock;
> >>> * keep the raw_spin_lock_irqsave before generic_handle_irq() to mute
> >>> the warning in genirq.
> >>
> >> Isn't the idea of irqsave is to prevent dead lock from the process context
> when
> >> we get interrupt on the *same* CPU?
> >
> > Anyway, gpiochip is more tricky as it is also a irq dispatcher. Moving
> > spin_lock_irq to spin_lock in the irq handler of non-irq dispatcher
> > driver is almost always correct.
> >
> > But for gpiochip, would the below be true though it is almost alway true
> > for non-irq dispatcher?
> >
> > 1. While gpiochip's handler runs in hardIRQ, interrupts are disabled, so no
> more
> > interrupt on the same cpu -> No deadleak.
> >
> > 2. While gpiochip's handler runs in threads
> > * other non-threaded interrupts such as timer tick might come on same cpu,
> > but they are an irrelevant driver and thus they are not going to get the
> > lock gpiochip's handler has held. -> no deadlock.
> > * other devices attached to this gpiochip might get interrupts, s

RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Friday, February 12, 2021 11:57 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Grygorii Strashko ; Arnd Bergmann
> ; luojiaxing ; Linus Walleij
> ; Santosh Shilimkar ; Kevin
> Hilman ; open list:GPIO SUBSYSTEM
> ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 10:42:19AM +, Song Bao Hua (Barry Song) wrote:
> > > From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> > > Sent: Friday, February 12, 2021 11:28 PM
> > > On 12/02/2021 11:45, Arnd Bergmann wrote:
> > > > On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> > > >  wrote:
> 
> > > >>> Note. there is also generic_handle_irq() call inside.
> > > >>
> > > >> So generic_handle_irq() is not safe to run in thread thus requires
> > > >> an interrupt-disabled environment to run? If so, I'd rather this
> > > >> irqsave moved into generic_handle_irq() rather than asking everyone
> > > >> calling it to do irqsave.
> > > >
> > > > In a preempt-rt kernel, interrupts are run in task context, so they 
> > > > clearly
> > > > should not be called with interrupts disabled, that would defeat the
> > > > purpose of making them preemptible.
> > > >
> > > > generic_handle_irq() does need to run with in_irq()==true though,
> > > > but this should be set by the caller of the gpiochip's handler, and
> > > > it is not set by raw_spin_lock_irqsave().
> > >
> > > It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> > > dispatcher
> > > and generic_handle_irq() will call one of handle_level_irq or
> handle_edge_irq.
> > >
> > > The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to
> use
> > > generic irq handler").
> > >
> > > The resent related discussion:
> > > https://lkml.org/lkml/2020/12/5/208
> >
> > Ok, second thought. irqsave before generic_handle_irq() won't defeat
> > the purpose of preemption too much as the dispatched irq handlers by
> > gpiochip will run in their own threads but not in the thread of
> > gpiochip's handler.
> >
> > so looks like this patch can improve by:
> > * move other raw_spin_lock_irqsave to raw_spin_lock;
> > * keep the raw_spin_lock_irqsave before generic_handle_irq() to mute
> > the warning in genirq.
> 
> Isn't the idea of irqsave is to prevent dead lock from the process context 
> when
> we get interrupt on the *same* CPU?

Anyway, gpiochip is more tricky, as it is also an irq dispatcher.
Moving spin_lock_irqsave to spin_lock in the irq handler of a
non-dispatcher driver is almost always correct.

But for gpiochip, would the below be true, though it is almost always
true for a non-dispatcher? (A sketch follows the two cases.)

1. While gpiochip's handler runs in hardIRQ, interrupts are disabled,
so no more interrupts can come in on the same CPU -> no deadlock.

2. While gpiochip's handler runs in a thread:
* other non-threaded interrupts, such as the timer tick, might come in
on the same CPU, but they belong to an unrelated driver and are not
going to take the lock gpiochip's handler holds -> no deadlock.
* other devices attached to this gpiochip might get interrupts; since
gpiochip's handler is running in a thread, raw_spin_lock can help keep
two threads from messing up the critical data -> still no deadlock.
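
Here is the sketch; every identifier in it is invented:

/* Case 1: in hardIRQ, local interrupts are already off, so the plain
 * lock cannot deadlock against an interrupt on this CPU.
 * Case 2: with threaded irqs, the demux handler and the per-device
 * handlers are ordinary threads and the lock simply serializes them.
 */
static irqreturn_t demo_gpio_demux(int irq, void *d)
{
	struct demo_chip *chip = d;

	raw_spin_lock(&chip->lock);
	demo_read_and_ack_bank(chip);	/* invented helper */
	raw_spin_unlock(&chip->lock);
	return IRQ_HANDLED;
}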

> 
> --
> With Best Regards,
> Andy Shevchenko
> 

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Friday, February 12, 2021 11:28 PM
> To: Arnd Bergmann ; Song Bao Hua (Barry Song)
> 
> Cc: luojiaxing ; Linus Walleij
> ; Andy Shevchenko ; Andy
> Shevchenko ; Santosh Shilimkar
> ; Kevin Hilman ; open list:GPIO
> SUBSYSTEM ; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> Hi Arnd,
> 
> On 12/02/2021 11:45, Arnd Bergmann wrote:
> > On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
> >  wrote:
> >>> -Original Message-
> >
> >>>
> >>> Note. there is also generic_handle_irq() call inside.
> >>
> >> So generic_handle_irq() is not safe to run in thread thus requires
> >> an interrupt-disabled environment to run? If so, I'd rather this
> >> irqsave moved into generic_handle_irq() rather than asking everyone
> >> calling it to do irqsave.
> >
> > In a preempt-rt kernel, interrupts are run in task context, so they clearly
> > should not be called with interrupts disabled, that would defeat the
> > purpose of making them preemptible.
> >
> > generic_handle_irq() does need to run with in_irq()==true though,
> > but this should be set by the caller of the gpiochip's handler, and
> > it is not set by raw_spin_lock_irqsave().
> 
> It will produce warning from __handle_irq_event_percpu(), as this is IRQ
> dispatcher
> and generic_handle_irq() will call one of handle_level_irq or handle_edge_irq.
> 
> The history behind this is commit 450fa54cfd66 ("gpio: omap: convert to use
> generic irq handler").
> 
> The resent related discussion:
> https://lkml.org/lkml/2020/12/5/208

OK, second thought: irqsave before generic_handle_irq() won't defeat
the purpose of preemption too much, as the irq handlers dispatched by
the gpiochip will run in their own threads, not in the thread of the
gpiochip's handler.

So it looks like this patch can be improved by:
* moving the other raw_spin_lock_irqsave calls to raw_spin_lock;
* keeping the raw_spin_lock_irqsave before generic_handle_irq() to mute
the warning in genirq, as sketched below.
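
Roughly what I have in mind, as a sketch rather than the actual omap
code (all names here are invented):

static irqreturn_t demo_demux_handler(int irq, void *d)
{
	struct demo_chip *chip = d;
	unsigned long flags, pending;
	unsigned int bit;

	raw_spin_lock(&chip->lock);		/* plain lock for chip state */
	pending = demo_read_and_ack(chip);	/* invented helper */
	raw_spin_unlock(&chip->lock);

	for_each_set_bit(bit, &pending, BITS_PER_LONG) {
		/* keep IRQs masked across the dispatch so that the
		 * !irqs_disabled() warning in genirq stays quiet
		 */
		raw_spin_lock_irqsave(&chip->dispatch_lock, flags);
		generic_handle_irq(irq_find_mapping(chip->domain, bit));
		raw_spin_unlock_irqrestore(&chip->dispatch_lock, flags);
	}
	return IRQ_HANDLED;
}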

> 
> 
> 
> --
> Best regards,
> Grygorii

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-12 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Arnd Bergmann [mailto:a...@kernel.org]
> Sent: Friday, February 12, 2021 10:45 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Grygorii Strashko ; luojiaxing
> ; Linus Walleij ; Andy
> Shevchenko ; Andy Shevchenko
> ; Santosh Shilimkar ;
> Kevin Hilman ; open list:GPIO SUBSYSTEM
> , linux-kernel@vger.kernel.org
> ; linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> On Fri, Feb 12, 2021 at 6:05 AM Song Bao Hua (Barry Song)
>  wrote:
> > > -Original Message-
> 
> > >
> > > Note. there is also generic_handle_irq() call inside.
> >
> > So generic_handle_irq() is not safe to run in thread thus requires
> > an interrupt-disabled environment to run? If so, I'd rather this
> > irqsave moved into generic_handle_irq() rather than asking everyone
> > calling it to do irqsave.
> 
> In a preempt-rt kernel, interrupts are run in task context, so they clearly
> should not be called with interrupts disabled, that would defeat the
> purpose of making them preemptible.

Yes. Sounds sensible. Irqsave in generic_handle_irq() will defeat
the purpose of RT.

> 
> generic_handle_irq() does need to run with in_irq()==true though,
> but this should be set by the caller of the gpiochip's handler, and
> it is not set by raw_spin_lock_irqsave().
> 

So it sounds like this issue of in_irq() == true is still irrelevant
to this patch.

I guess this should have been set by the caller of the gpiochip's
handler somewhere; otherwise, gpiochip's irq handler wouldn't be able
to be threaded. Has it been set somewhere?

>Arnd

Thanks
Barry


RE: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)

2021-02-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Oleksandr Natalenko [mailto:oleksa...@natalenko.name]
> Sent: Friday, February 12, 2021 8:43 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Mikhail Gavrilov ;
> sjenn...@linux.vnet.ibm.com; Linux List Kernel Mailing
> ; Linux Memory Management List
> 
> Subject: Re: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)
> 
> Hello.
> 
> On Thu, Feb 11, 2021 at 10:43:18AM +, Song Bao Hua (Barry Song) wrote:
> > Are you using zsmalloc? There is a known bug on the combination
> > of zsmalloc and zswap, fixed by patches of tiantao:
> >
> > mm: set the sleep_mapped to true for zbud and z3fold
> > mm/zswap: fix variable 'entry' is uninitialized when used
> > mm/zswap: fix potential memory leak
> > mm/zswap: add the flag can_sleep_mapped
> >
> > at Linux-next:
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=author&q=tiantao6%40hisilicon.com
> 
> Is this a future stable-5.11 material (and/or, potentially, older stable
> branches
> as well)?

I believe this should be put into 5.11. I will ask Andrew.

> 
> --
>   Oleksandr Natalenko (post-factum)

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Grygorii Strashko [mailto:grygorii.stras...@ti.com]
> Sent: Friday, February 12, 2021 9:17 AM
> To: Arnd Bergmann 
> Cc: luojiaxing ; Linus Walleij
> ; Andy Shevchenko ; Andy
> Shevchenko ; Santosh Shilimkar
> ; Kevin Hilman ; open list:GPIO
> SUBSYSTEM , linux-kernel@vger.kernel.org
> ; linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH for next v1 1/2] gpio: omap: Replace
> raw_spin_lock_irqsave with raw_spin_lock in omap_gpio_irq_handler()
> 
> 
> 
> On 11/02/2021 21:39, Arnd Bergmann wrote:
> > On Thu, Feb 11, 2021 at 7:25 PM Grygorii Strashko
> >  wrote:
> >> On 08/02/2021 10:56, Luo Jiaxing wrote:
> >>> There is no need to use API with _irqsave in omap_gpio_irq_handler(),
> >>> because it already be in a irq-disabled context.
> >>
> >> NACK.
> >> Who said that this is always hard IRQ handler?
> >> What about RT-kernel or boot with "threadirqs"?
> >
> > In those cases, the interrupt handler is just a normal thread, so the
> > preempt_disable() that is implied by raw_spin_lock() is sufficient.
> >
> > Disabling interrupts inside of an interrupt handler is always incorrect,
> > the patch looks like a useful cleanup to me, if only for readability.
> 
> Note. there is also generic_handle_irq() call inside.

So generic_handle_irq() is not safe to run in a thread and thus
requires an interrupt-disabled environment to run? If so, I'd rather
the irqsave be moved into generic_handle_irq() than ask everyone
calling it to do the irqsave.

On the other hand, the author changed a couple of spin_lock_irqsave
calls to spin_lock; if only this one is incorrect, it seems worth a
new version to fix it.

> 
> --
> Best regards,
> grygorii

Thanks
Barry



RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 1:09 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI
> drivers
> 
> On Fri, 12 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > > -Original Message-
> > > From: Finn Thain [mailto:fth...@telegraphics.com.au]
> > > Sent: Friday, February 12, 2021 12:57 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: tanxiaofei ; j...@linux.ibm.com;
> > > martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; linux...@openeuler.org;
> > > linux-m...@vger.kernel.org
> > > Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for
> SCSI
> > > drivers
> > >
> > >
> > > On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > >
> > > > Actually in m68k, I also saw its IRQ entry disabled interrupts by
> > > > ' move  #0x2700,%sr /* disable intrs */'
> > > >
> > > > arch/m68k/include/asm/entry.h:
> > > >
> > > > .macro SAVE_ALL_SYS
> > > > move#0x2700,%sr /* disable intrs */
> > > > btst#5,%sp@(2)  /* from user? */
> > > > bnes6f  /* no, skip */
> > > > movel   %sp,sw_usp  /* save user sp */
> > > > ...
> > > >
> > > > .macro SAVE_ALL_INT
> > > > SAVE_ALL_SYS
> > > > moveq   #-1,%d0 /* not system call entry */
> > > > movel   %d0,%sp@(PT_OFF_ORIG_D0)
> > > > .endm
> > > >
> > > > arch/m68k/kernel/entry.S:
> > > >
> > > > /* This is the main interrupt handler for autovector interrupts */
> > > >
> > > > ENTRY(auto_inthandler)
> > > > SAVE_ALL_INT
> > > > GET_CURRENT(%d0)
> > > > |  put exception # in d0
> > > > bfextu  %sp@(PT_OFF_FORMATVEC){#4,#10},%d0
> > > > subw#VEC_SPUR,%d0
> > > >
> > > > movel   %sp,%sp@-
> > > > movel   %d0,%sp@-   |  put vector # on stack
> > > > auto_irqhandler_fixup = . + 2
> > > > jsr do_IRQ  |  process the IRQ
> > > > addql   #8,%sp  |  pop parameters off stack
> > > > jra ret_from_exception
> > > >
> > > > So my question is that " move   #0x2700,%sr" is actually disabling
> > > > all interrupts? And is m68k actually running irq handlers
> > > > with interrupts disabled?
> > > >
> > >
> > > When sonic_interrupt() executes, the IPL is 2 or 3 (since either IRQ may
> > > be involved). That is, SR & 0x700 is 0x200 or 0x300. The level 3 interrupt
> > > may interrupt execution of the level 2 handler so an irq lock is used to
> > > avoid re-entrance.
> > >
> > > This patch,
> > >
> > > diff --git a/drivers/net/ethernet/natsemi/sonic.c
> > > b/drivers/net/ethernet/natsemi/sonic.c
> > > index d17d1b4f2585..041354647bad 100644
> > > --- a/drivers/net/ethernet/natsemi/sonic.c
> > > +++ b/drivers/net/ethernet/natsemi/sonic.c
> > > @@ -355,6 +355,8 @@ static irqreturn_t sonic_interrupt(int irq, void *dev_id)
> > >  */
> > > spin_lock_irqsave(&lp->lock, flags);
> > >
> > > +   printk_once(KERN_INFO "%s: %08lx\n", __func__, flags);
> > > +
> > > status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
> > > if (!status) {
> > > spin_unlock_irqrestore(&lp->lock, flags);
> > >
> > > produces this output,
> > >
> > > [3.80] sonic_interrupt: 2300
> >
> > I actually hope you can directly read the register rather than reading
> > a flag which might be a software one not from register.
> >
> 
> Again, the implementation of arch_local_irq_save() may be found in
> arch/m68k/include/asm/irqflags.h

Yes. I have read it. Anyway, I started a discussion in genirq
with you cc-ed:
https://lore.kernel.org/lkml/c46ddb954cfe45d9849c911271d7e...@hisilicon.com/

And thanks very much for all your efforts to help me understand m68k.
Let's get this clarified thoroughly at the genirq level.

On ARM, we also have some special high-priority interrupts which are
not NMIs but are able to preempt normal IRQs. They are managed by
arch-specific extended APIs rather than the common APIs.

Neither arch_irqs_disabled() nor the local_irq_disable() API can touch
this kind of interrupt. They use things specific to ARM like:
local_fiq_disable()
local_fiq_enable()
set_fiq_handler()
disable_fiq()
enable_fiq()
...

So FIQ doesn't bother us at all in genirq.

> 
> > >
> > > I ran that code in QEMU, but experience shows that Apple hardware works
> > > exactly the same. Please do confirm this for yourself, if you still think
> > > the code and comments in sonic_interrupt are wrong.
> > >
> > > > Best Regards
> > > > Barry
> > > >
> >

Thanks
Barry



[RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

2021-02-11 Thread Song Bao Hua (Barry Song)
Hi,

I am having a very long debate with Finn in this thread:
https://lore.kernel.org/lkml/1612697823-8073-1-git-send-email-tanxiao...@huawei.com/

In short, the debate is about whether high-priority IRQs (*not NMIs*)
are allowed to preempt a running IRQ handler in hardIRQ context.

In my understanding, right now IRQ handlers run with *all* interrupts
disabled, ever since IRQF_DISABLED was dropped and this commit:
e58aa3d2d0cc
genirq: Run irq handlers with interrupts disabled

b738a50a2026
genirq: Warn when handler enables interrupts
We run all handlers with interrupts disabled and expect them not to
enable them. Warn when we catch one who does.

While this seems to be true on almost all platforms, it seems to be
false on m68k.

According to Finn, while IRQ handlers are running, high-priority
interrupts can still come in on m68k. A driver handling this issue is
drivers/net/ethernet/natsemi/sonic.c; you can read the comment:
static irqreturn_t sonic_interrupt(int irq, void *dev_id)
{
struct net_device *dev = dev_id;
struct sonic_local *lp = netdev_priv(dev);
int status;
unsigned long flags;

/* The lock has two purposes. Firstly, it synchronizes sonic_interrupt()
 * with sonic_send_packet() so that the two functions can share state.
 * Secondly, it makes sonic_interrupt() re-entrant, as that is required
 * by macsonic which must use two IRQs with different priority levels.
 */
spin_lock_irqsave(&lp->lock, flags);

status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
if (!status) {
spin_unlock_irqrestore(&lp->lock, flags);

return IRQ_NONE;
}
}

So m68k does allow a high-priority interrupt to preempt a hardIRQ, and
the code needs to call irqsave to protect against this risk. That is
to say, some interrupts are not disabled during hardIRQ on m68k.

But m68k doesn't trigger any warning for !irqs_disabled() in
genirq:
irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags)
{
...

trace_irq_handler_entry(irq, action);
res = action->handler(irq, action->dev_id);
trace_irq_handler_exit(irq, action, res);

if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled interrupts\n",
  irq, action->handler))
local_irq_disable();
}

The reason is:
* arch_irqs_disabled() returns true while low-priority interrupts are
disabled, though high-priority interrupts are still open;
* local_irq_disable(), spin_lock_irqsave(), etc. will disable
high-priority interrupts as well (IPL 7);
* arch_irqs_disabled() also returns true while both low- and
high-priority interrupts are disabled.
Note that m68k has several interrupt levels, but in the above
description I simply abstract them as high and low to aid
understanding.
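
To make the abstraction concrete (simplified numbers, from my
understanding of the 68k priority scheme):

IPL mask 0: everything enabled;
IPL mask 3: levels 1-3 ("low") masked, levels 4-6 ("high") still
delivered, yet arch_irqs_disabled() already returns true;
IPL mask 7: levels 1-6 masked, only the NMI-like level 7 gets through;
this is what local_irq_disable()/spin_lock_irqsave() set.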

I think m68k lets arch_irqs_disabled() return true under a relatively
weaker condition, pretending all IRQs are disabled while high-priority
IRQs are still open, and thus passes all the sanity checks in genirq
and the core kernel. But Finn strongly disagreed.

I am not saying I am right and Finn is wrong. But I think we need
somewhere to clarify this problem.

Personally, I would prefer "interrupts disabled" to mean "all except
NMI", so I'd like to guard this by:

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c9d6a2d7e90..b8ca27555c76 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
account_hardirq_enter(current); \
@@ -44,6 +45,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter_raw()  \
do {\
+   WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
} while (0)

Though Finn thought it lacks any justification.

So I am requesting comments on:
1. Are we expecting all interrupts except NMI to be disabled in irq
handlers, or do we actually allow some high-priority interrupts,
between normal IRQs and NMI, to come in on some platforms?

2. If either side is true, I think we need to document it somewhere as there
is always confusion about this.

Personally, I would expect all interrupts to be disabled, and I like
the way ARM64 uses high-priority interrupts only as pseudo-NMIs:
https://lwn.net/Articles/755906/
Though Finn argued that this will contribute to losing a hardware
feature of m68k.

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 12:58 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > > > should just reflect the status of all interrupts have been
> > > > > > disabled except NMI.
> > > > > >
> > > > > > irqs_disabled() should be consistent with the calling of APIs
> > > > > > such as local_irq_disable, local_irq_save, spin_lock_irqsave
> > > > > > etc.
> > > > > >
> > > > >
> > > > > When irqs_disabled() returns true, we cannot infer that
> > > > > arch_local_irq_disable() was called. But I have not yet found
> > > > > driver code or core kernel code attempting that inference.
> > > > >
> > > > > > >
> > > > > > > > Isn't arch_irqs_disabled() a status reflection of irq
> > > > > > > > disable API?
> > > > > > > >
> > > > > > >
> > > > > > > Why not?
> > > > > >
> > > > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > > > masked except NMI as NMI is unmaskable.
> > > > > >
> > > > >
> > > > > Can you support that claim with a reference to core kernel code or
> > > > > documentation? (If some arch code agrees with you, that's neither
> > > > > here nor there.)
> > > >
> > > > I think those links I share you have supported this. Just you don't
> > > > believe :-)
> > > >
> > >
> > > Your links show that the distinction between fast and slow handlers
> > > was removed. Your links don't support your claim that
> > > "arch_irqs_disabled() should mean all interrupts have been masked".
> > > Where is the code that makes that inference? Where is the
> > > documentation that supports your claim?
> >
> > (1)
> > https://lwn.net/Articles/380931/
> > Looking at all these worries, one might well wonder if a system which
> > *disabled interrupts for all handlers* would function well at all. So it
> > is interesting to note one thing: any system which has the lockdep
> > locking checker enabled has been running all handlers that way for some
> > years now. Many developers and testers run lockdep-enabled kernels, and
> > they are available for some of the more adventurous distributions
> > (Rawhide, for example) as well. So we have quite a bit of test coverage
> > for this mode of operation already.
> >
> 
> IIUC, your claim is that CONFIG_LOCKDEP involves code that contains the
> inference, "arch_irqs_disabled() means all interrupts have been masked".
> 
> Unfortunately, m68k lacks CONFIG_LOCKDEP support so I can't easily confirm
> this. I suppose there may be other architectures that support both LOCKDEP
> and nested interrupts (?)
> 
> Anyway, if you would point to the code that contains said inference, that
> would help a lot.
> 
> > (2)
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> >
> > "We run all handlers *with interrupts disabled* and expect them not to
> > enable them. Warn when we catch one who does."
> >
> 
> Again, you don't see that warning because irqs_disabled() correctly
> returns true. You can confirm this fact in QEMU or Aranym if you are
> interested.
> 
> > (3)
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > genirq: Run irq handlers *with interrupts disabled*
> >
> > Running interrupt handlers with interrupts enabled can cause stack
> > overflows. That has been observed with multiqueue NICs delivering all
> > their interrupts to a single core. We might band aid that somehow by
> > checking the interrupt stack

RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Friday, February 12, 2021 12:57 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI
> drivers
> 
> 
> On Thu, 11 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> >
> > Actually in m68k, I also saw its IRQ entry disabled interrupts by
> > ' move  #0x2700,%sr /* disable intrs */'
> >
> > arch/m68k/include/asm/entry.h:
> >
> > .macro SAVE_ALL_SYS
> > move#0x2700,%sr /* disable intrs */
> > btst#5,%sp@(2)  /* from user? */
> > bnes6f  /* no, skip */
> > movel   %sp,sw_usp  /* save user sp */
> > ...
> >
> > .macro SAVE_ALL_INT
> > SAVE_ALL_SYS
> > moveq   #-1,%d0 /* not system call entry */
> > movel   %d0,%sp@(PT_OFF_ORIG_D0)
> > .endm
> >
> > arch/m68k/kernel/entry.S:
> >
> > /* This is the main interrupt handler for autovector interrupts */
> >
> > ENTRY(auto_inthandler)
> > SAVE_ALL_INT
> > GET_CURRENT(%d0)
> > |  put exception # in d0
> > bfextu  %sp@(PT_OFF_FORMATVEC){#4,#10},%d0
> > subw#VEC_SPUR,%d0
> >
> > movel   %sp,%sp@-
> > movel   %d0,%sp@-   |  put vector # on stack
> > auto_irqhandler_fixup = . + 2
> > jsr do_IRQ  |  process the IRQ
> > addql   #8,%sp  |  pop parameters off stack
> > jra ret_from_exception
> >
> > So my question is that " move   #0x2700,%sr" is actually disabling
> > all interrupts? And is m68k actually running irq handlers
> > with interrupts disabled?
> >
> 
> When sonic_interrupt() executes, the IPL is 2 or 3 (since either IRQ may
> be involved). That is, SR & 0x700 is 0x200 or 0x300. The level 3 interrupt
> may interrupt execution of the level 2 handler so an irq lock is used to
> avoid re-entrance.
> 
> This patch,
> 
> diff --git a/drivers/net/ethernet/natsemi/sonic.c
> b/drivers/net/ethernet/natsemi/sonic.c
> index d17d1b4f2585..041354647bad 100644
> --- a/drivers/net/ethernet/natsemi/sonic.c
> +++ b/drivers/net/ethernet/natsemi/sonic.c
> @@ -355,6 +355,8 @@ static irqreturn_t sonic_interrupt(int irq, void *dev_id)
>  */
> spin_lock_irqsave(&lp->lock, flags);
> 
> +   printk_once(KERN_INFO "%s: %08lx\n", __func__, flags);
> +
> status = SONIC_READ(SONIC_ISR) & SONIC_IMR_DEFAULT;
> if (!status) {
> spin_unlock_irqrestore(&lp->lock, flags);
> 
> produces this output,
> 
> [3.80] sonic_interrupt: 2300

I actually hope you can directly read the register rather than reading
a flag, which might be a software value rather than the register
contents.

> 
> I ran that code in QEMU, but experience shows that Apple hardware works
> exactly the same. Please do confirm this for yourself, if you still think
> the code and comments in sonic_interrupt are wrong.
> 
> > Best Regards
> > Barry
> >

Thanks
Barry



RE: Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-11 Thread Song Bao Hua (Barry Song)
> >
> > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> >
> > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > >
> > > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > > >
> > > > > > > > There is no warning from m68k builds. That's because
> > > > > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > > > > >
> > > > > > > So for m68k, the case is arch_irqs_disabled() is true, but
> > > > > > > interrupts can still come?
> > > > > > >
> > > > > > > Then it seems it is very confusing. If prioritized interrupts
> > > > > > > can still come while arch_irqs_disabled() is true,
> > > > > >
> > > > > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > > > > present priority mask will get serviced.
> > > > > >
> > > > > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > > > > serviced regardless.
> > > > > >
> > > > > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > > > > >
> > > > > > It raises the the mask level to 7. Again, please see
> > > > > > arch/m68k/include/asm/irqflags.h
> > > > >
> > > > > Hi Finn,
> > > > > Thanks for your explanation again.
> > > > >
> > > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > > should just reflect the status of all interrupts have been disabled
> > > > > except NMI.
> > > > >
> > > > > irqs_disabled() should be consistent with the calling of APIs such
> > > > > as local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> > > > >
> > > >
> > > > When irqs_disabled() returns true, we cannot infer that
> > > > arch_local_irq_disable() was called. But I have not yet found driver
> > > > code or core kernel code attempting that inference.
> > > >
> > > > > >
> > > > > > > Isn't arch_irqs_disabled() a status reflection of irq disable
> > > > > > > API?
> > > > > > >
> > > > > >
> > > > > > Why not?
> > > > >
> > > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > > masked except NMI as NMI is unmaskable.
> > > > >
> > > >
> > > > Can you support that claim with a reference to core kernel code or
> > > > documentation? (If some arch code agrees with you, that's neither here
> > > > nor there.)
> > >
> > > I think those links I share you have supported this. Just you don't
> > > believe :-)
> > >
> >
> > Your links show that the distinction between fast and slow handlers was
> > removed. Your links don't support your claim that "arch_irqs_disabled()
> > should mean all interrupts have been masked". Where is the code that makes
> > that inference? Where is the documentation that supports your claim?
> 
> (1)
> https://lwn.net/Articles/380931/
> Looking at all these worries, one might well wonder if a system which 
> *disabled
> interrupts for all handlers* would function well at all. So it is interesting
> to note one thing: any system which has the lockdep locking checker enabled
> has been running all handlers that way for some years now. Many developers
> and testers run lockdep-enabled kernels, and they are available for some of
> the more adventurous distributions (Rawhide, for example) as well. So we
> have quite a bit of test coverage for this mode of operation already.
> 
> (2)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> 
> "We run all handlers *with interrupts disabled* and expect them not to
> enable them. Warn when we catch one who does."
> 
> (3)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> genirq: Run irq handlers *with interrupts disabled*
> 
> Running interrupt handlers with interrupts enabled can cause stack
> overflows. That has been observed with multiqueue NICs delivering all
> their interrupts to a single core. We might band aid that somehow by
> checking the interrupt stacks, but the real safe fix is to *run the irq
> handlers with interrupts disabled*.

RE: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)

2021-02-11 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Mikhail Gavrilov [mailto:mikhail.v.gavri...@gmail.com]
> Sent: Thursday, February 11, 2021 9:58 PM
> To: sjenn...@linux.vnet.ibm.com; Song Bao Hua (Barry Song)
> 
> Cc: Linux List Kernel Mailing ; Linux Memory
> Management List 
> Subject: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)
> 
> Hi folks.
> During the 5.11 test cycle I caught a rare but repeatable problem when
> after a day uptime happens "BUG at mm/zswap.c:1275!". I am still not
> having an idea how to reproduce it, but maybe the authors of this code
> could explain what happens here?

Are you using zsmalloc? There is a known bug in the combination
of zsmalloc and zswap, fixed by tiantao's patches:

mm: set the sleep_mapped to true for zbud and z3fold
mm/zswap: fix variable 'entry' is uninitialized when used
mm/zswap: fix potential memory leak
mm/zswap: add the flag can_sleep_mapped

at Linux-next:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=author=tiantao6%40hisilicon.com


> 
> $ grep "mm/zswap.c" dmesg*.txt
> dmesg101.txt:[127850.513201] kernel BUG at mm/zswap.c:1275!
> dmesg11.txt:[52211.962861] kernel BUG at mm/zswap.c:1275!
> dmesg8.txt:[46610.641843] kernel BUG at mm/zswap.c:1275!
> 
> [127850.513193] [ cut here ]
> [127850.513201] kernel BUG at mm/zswap.c:1275!
> [127850.513210] invalid opcode:  [#1] SMP NOPTI
> [127850.513214] CPU: 6 PID: 485132 Comm: brave Tainted: GW
>- ---  5.11.0-0.rc6.20210204git61556703b610.145.fc34.x86_64
> #1
> [127850.513218] Hardware name: System manufacturer System Product
> Name/ROG STRIX X570-I GAMING, BIOS 3402 01/13/2021
> [127850.513221] RIP: 0010:zswap_frontswap_load+0x258/0x260
> [127850.513228] Code: ab 83 aa f0 2f 00 00 01 65 ff 0d c3 73 cd 54 eb
> 88 48 8d 7b 10 e8 78 b9 9f 00 c7 43 10 00 00 00 00 44 8b 63 70 e9 4a
> ff ff ff <0f> 0b 0f 0b 0f 0b 66 90 0f 1f 44 00 00 41 57 31 c0 b9 0c 00
> 00 00
> [127850.513231] RSP: :a92e866c7c48 EFLAGS: 00010282
> [127850.513235] RAX: 0006 RBX: c92e7ca61830 RCX:
> 0001
> [127850.513238] RDX:  RSI: ab3429fe RDI:
> 97f4d0393010
> [127850.513240] RBP: 97ee5544d1c0 R08: 0001 R09:
> 
> [127850.513242] R10:  R11:  R12:
> ffea
> [127850.513244] R13: 97ee016800c8 R14: 97ee016800c0 R15:
> c0d54020
> [127850.513247] FS:  7fcbe628de40() GS:97f50760()
> knlGS:
> [127850.513249] CS:  0010 DS:  ES:  CR0: 80050033
> [127850.513252] CR2: 381208c29250 CR3: 0001c54ea000 CR4:
> 00350ee0
> [127850.513254] Call Trace:
> [127850.513261]  __frontswap_load+0xc3/0x160
> [127850.513265]  swap_readpage+0x1ca/0x3a0
> [127850.513270]  swapin_readahead+0x2ee/0x4e0
> [127850.513276]  do_swap_page+0x4a4/0x900
> [127850.513279]  ? lock_release+0x1e9/0x400
> [127850.513283]  ? trace_hardirqs_on+0x1b/0xe0
> [127850.513288]  handle_mm_fault+0xe7d/0x19d0
> [127850.513294]  do_user_addr_fault+0x1c7/0x4c0
> [127850.513299]  exc_page_fault+0x67/0x2a0
> [127850.513304]  ? asm_exc_page_fault+0x8/0x30
> [127850.513307]  asm_exc_page_fault+0x1e/0x30
> [127850.513310] RIP: 0033:0x560297642f44
> [127850.513314] Code: 64 75 07 45 8b 76 03 4d 03 f5 45 8b 56 ff 4d 03
> d5 66 41 81 7a 07 83 00 0f 85 4f 01 00 00 8b 5f 13 49 03 dd 8b 5b 03
> 49 03 dd <8b> 4b ff 49 03 cd 66 81 79 07 a5 00 0f 85 0f 00 00 00 8b 4b
> 0f f6
> [127850.513317] RSP: 002b:7ffc04cd4b30 EFLAGS: 00010202
> [127850.513320] RAX:  RBX: 381208c29251 RCX:
> 560297642f00
> [127850.513322] RDX: 3812080423b1 RSI: 381209b11231 RDI:
> 381209b1141d
> [127850.513323] RBP: 7ffc04cd4b90 R08: 0043 R09:
> 0024
> [127850.513325] R10: 381208042a1d R11: 381209b1141f R12:
> 09b1141d
> [127850.513327] R13: 3812 R14: 381208b368ed R15:
> 3d2fb6b7da10
> [127850.51] Modules linked in: tun snd_seq_dummy snd_hrtimer
> uinput rfcomm nft_objref nf_conntrack_netbios_ns
> nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw
> ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set
> nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac
> bnep zstd sunrpc vfat fat hid_logitech_hidpp hid_logitech_dj
> snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio
> snd_hda_codec_hdmi snd_hda_intel snd

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 2:12 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > > > >
> > > > > > > There is no warning from m68k builds. That's because
> > > > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > > > >
> > > > > > So for m68k, the case is arch_irqs_disabled() is true, but
> > > > > > interrupts can still come?
> > > > > >
> > > > > > Then it seems it is very confusing. If prioritized interrupts
> > > > > > can still come while arch_irqs_disabled() is true,
> > > > >
> > > > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > > > present priority mask will get serviced.
> > > > >
> > > > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > > > serviced regardless.
> > > > >
> > > > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > > > >
> > > > > It raises the the mask level to 7. Again, please see
> > > > > arch/m68k/include/asm/irqflags.h
> > > >
> > > > Hi Finn,
> > > > Thanks for your explanation again.
> > > >
> > > > TBH, that is why m68k is so confusing. irqs_disabled() on m68k
> > > > should just reflect the status of all interrupts have been disabled
> > > > except NMI.
> > > >
> > > > irqs_disabled() should be consistent with the calling of APIs such
> > > > as local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> > > >
> > >
> > > When irqs_disabled() returns true, we cannot infer that
> > > arch_local_irq_disable() was called. But I have not yet found driver
> > > code or core kernel code attempting that inference.
> > >
> > > > >
> > > > > > Isn't arch_irqs_disabled() a status reflection of irq disable
> > > > > > API?
> > > > > >
> > > > >
> > > > > Why not?
> > > >
> > > > If so, arch_irqs_disabled() should mean all interrupts have been
> > > > masked except NMI as NMI is unmaskable.
> > > >
> > >
> > > Can you support that claim with a reference to core kernel code or
> > > documentation? (If some arch code agrees with you, that's neither here
> > > nor there.)
> >
> > I think those links I share you have supported this. Just you don't
> > believe :-)
> >
> 
> Your links show that the distinction between fast and slow handlers was
> removed. Your links don't support your claim that "arch_irqs_disabled()
> should mean all interrupts have been masked". Where is the code that makes
> that inference? Where is the documentation that supports your claim?

(1)
https://lwn.net/Articles/380931/
Looking at all these worries, one might well wonder if a system which *disabled
interrupts for all handlers* would function well at all. So it is interesting
to note one thing: any system which has the lockdep locking checker enabled
has been running all handlers that way for some years now. Many developers
and testers run lockdep-enabled kernels, and they are available for some of
the more adventurous distributions (Rawhide, for example) as well. So we
have quite a bit of test coverage for this mode of operation already.

(2)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b738a50a

"We run all handlers *with interrupts disabled* and expect them not to
enable them. Warn when we catch one who does."

(3) 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e58aa3d2d0cc
genirq: Run irq handlers *with interrupts disabled*

Running interrupt handlers with interrupts enabled can cause stack
overflows. That has been observed with multiqueue NICs delivering all
their interrupts to a single core. We might band aid that somehow by
checking the interrupt stacks, but the real safe fix is to run the irq
handlers with interrupts disabled.

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 11:35 AM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > >
> > > > > There is no warning from m68k builds. That's because
> > > > > arch_irqs_disabled() returns true when the IPL is non-zero.
> > > >
> > > > So for m68k, the case is
> > > > arch_irqs_disabled() is true, but interrupts can still come?
> > > >
> > > > Then it seems it is very confusing. If prioritized interrupts can
> > > > still come while arch_irqs_disabled() is true,
> > >
> > > Yes, on m68k CPUs, an IRQ having a priority level higher than the
> > > present priority mask will get serviced.
> > >
> > > Non-Maskable Interrupt (NMI) is not subject to this rule and gets
> > > serviced regardless.
> > >
> > > > how could spin_lock_irqsave() block the prioritized interrupts?
> > >
> > > It raises the the mask level to 7. Again, please see
> > > arch/m68k/include/asm/irqflags.h
> >
> > Hi Finn,
> > Thanks for your explanation again.
> >
> > TBH, that is why m68k is so confusing. irqs_disabled() on m68k should
> > just reflect the status of all interrupts have been disabled except NMI.
> >
> > irqs_disabled() should be consistent with the calling of APIs such as
> > local_irq_disable, local_irq_save, spin_lock_irqsave etc.
> >
> 
> When irqs_disabled() returns true, we cannot infer that
> arch_local_irq_disable() was called. But I have not yet found driver code
> or core kernel code attempting that inference.
> 
> > >
> > > > Isn't arch_irqs_disabled() a status reflection of irq disable API?
> > > >
> > >
> > > Why not?
> >
> > If so, arch_irqs_disabled() should mean all interrupts have been masked
> > except NMI as NMI is unmaskable.
> >
> 
> Can you support that claim with a reference to core kernel code or
> documentation? (If some arch code agrees with you, that's neither here nor
> there.)

I think the links I shared with you support this. You just don't
believe them :-)

> 
> > >
> > > Are all interrupts (including NMI) masked whenever
> > > arch_irqs_disabled() returns true on your platforms?
> >
> > On my platform, once irqs_disabled() is true, all interrupts are masked
> > except NMI. NMI just ignore spin_lock_irqsave or local_irq_disable.
> >
> > On ARM64, we also have high-priority interrupts, but they are running as
> > PESUDO_NMI:
> > https://lwn.net/Articles/755906/
> >
> 
> A glance at the ARM GIC specification suggests that your hardware works
> much like 68000 hardware.
> 
>When enabled, a CPU interface takes the highest priority pending
>interrupt for its connected processor and determines whether the
>interrupt has sufficient priority for it to signal the interrupt
>request to the processor. [...]
> 
>When the processor acknowledges the interrupt at the CPU interface, the
>Distributor changes the status of the interrupt from pending to either
>active, or active and pending. At this point the CPU interface can
>signal another interrupt to the processor, to preempt interrupts that
>are active on the processor. If there is no pending interrupt with
>sufficient priority for signaling to the processor, the interface
>deasserts the interrupt request signal to the processor.
> 
> https://developer.arm.com/documentation/ihi0048/b/
> 
> Have you considered that Linux/arm might benefit if it could fully exploit
> hardware features already available, such as the interrupt priority
> masking feature in the GIC in existing arm systems?

I guess not :-) There are only two levels: IRQ and NMI. Injecting a
high-priority IRQ level between them makes no sense.

To me, arm64's design is quite clear and not confusing at all.

> 
> > On m68k, it seems you mean:
> > irq_disabled() is true, but high-priority interrupts can still come;
> > local_irq_disable() can disable high-priority interrupts, and at that
> > time, irq_disabled() is also true.
> >
> > TBH, this is wrong and confusing on m68k.

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Thursday, February 11, 2021 7:04 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 10:22:47PM +, Song Bao Hua (Barry Song) wrote:
> 
> > The problem is that SVA declares we can use any memory of a process
> > to do I/O. And in real scenarios, we are unable to customize most
> > applications to make them use the pool. So we are looking for some
> > extension generically for applications such as Nginx, Ceph.
> 
> But those applications will suffer jitter even if their are using CPU
> to do the same work. I fail to see why adding an accelerator suddenly
> means the application owner will care about jitter introduced by
> migration/etc.

The only point here is that when migration occurs during accelerator access,
the impact/jitter is much bigger than it is on the CPU. At that point the
accelerator might become unhelpful.

> 
> Again in proper SVA it should be quite unlikely to take a fault caused
> by something like migration, on the same likelyhood as the CPU. If
> things are faulting so much this is a problem then I think it is a
> system level problem with doing too much page motion.

My point is that a single SVA application shouldn't require the system
to make global changes, such as disabling NUMA balancing or disabling
THP, to decrease page-fault frequency at the expense of other applications.

Anyway, people are away for Lunar New Year. Hopefully we will get more
real benchmark data afterwards to make the discussion more targeted.

> 
> Jason

Thanks
Barry


RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Thursday, February 11, 2021 10:07 AM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> 
> On Wed, 10 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > > sonic_interrupt() uses an irq lock within an interrupt handler
> > > > > > to avoid issues relating to this. This kind of locking may be
> > > > > > needed in the drivers you are trying to patch. Or it might not.
> > > > > > Apparently, no-one has looked.
> > > >
> > > > Is the comment in sonic_interrupt() outdated according to this:
> > > > m68k: irq: Remove IRQF_DISABLED
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=77a4279
> > > > http://lkml.iu.edu/hypermail/linux/kernel/1109.2/01687.html
> > > >
> > >
> > > The removal of IRQF_DISABLED isn't relevant to this driver. Commit
> > > 77a42796786c ("m68k: Remove deprecated IRQF_DISABLED") did not disable
> > > interrupts, it just removed some code to enable them.
> > >
> > > The code and comments in sonic_interrupt() are correct. You can
> > > confirm this for yourself quite easily using QEMU and a
> > > cross-compiler.
> > >
> > > > and this: genirq: Warn when handler enables interrupts
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> > > >
> > > > wouldn't genirq report a warning on m68k?
> > > >
> > >
> > > There is no warning from m68k builds. That's because
> > > arch_irqs_disabled() returns true when the IPL is non-zero.
> >
> >
> > So for m68k, the case is
> > arch_irqs_disabled() is true, but interrupts can still come?
> >
> > Then it seems it is very confusing. If prioritized interrupts can still
> > come while arch_irqs_disabled() is true,
> 
> Yes, on m68k CPUs, an IRQ having a priority level higher than the present
> priority mask will get serviced.
> 
> Non-Maskable Interrupt (NMI) is not subject to this rule and gets serviced
> regardless.
> 
> > how could spin_lock_irqsave() block the prioritized interrupts?
> 
> It raises the the mask level to 7. Again, please see
> arch/m68k/include/asm/irqflags.h

Hi Finn,
Thanks for your explanation again.

TBH, that is why m68k is so confusing. irqs_disabled() on m68k should just
reflect whether all interrupts except NMI have been disabled.

irqs_disabled() should be consistent with the calling of APIs such
as local_irq_disable(), local_irq_save(), spin_lock_irqsave() etc.

> 
> > Isn't arch_irqs_disabled() a status reflection of irq disable API?
> >
> 
> Why not?

If so, arch_irqs_disabled() should mean that all interrupts except NMI
have been masked, as NMI is unmaskable.

> 
> Are all interrupts (including NMI) masked whenever arch_irqs_disabled()
> returns true on your platforms?

On my platform, once irqs_disabled() is true, all interrupts are masked
except NMI. NMI simply ignores spin_lock_irqsave() and local_irq_disable().

On ARM64, we also have high-priority interrupts, but they run as
PSEUDO_NMI:
https://lwn.net/Articles/755906/

On m68k, it seems you mean:
irqs_disabled() is true, but high-priority interrupts can still come;
local_irq_disable() can disable high-priority interrupts, and at that
time, irqs_disabled() is also true.

TBH, this is wrong and confusing on m68k.
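
For reference, my understanding of the m68k primitives boils down to the
condensed sketch below (based on my reading of
arch/m68k/include/asm/irqflags.h; simplified, not the literal kernel source):

static inline unsigned long arch_local_save_flags(void)
{
	unsigned long flags;

	/* read SR, which includes the 3-bit IPL mask */
	asm volatile ("movew %%sr,%0" : "=d" (flags) : : "memory");
	return flags;
}

static inline void arch_local_irq_disable(void)
{
	/* raise IPL to 7: all maskable interrupts blocked */
	asm volatile ("oriw #0x0700,%%sr" : : : "memory");
}

static inline bool arch_irqs_disabled_flags(unsigned long flags)
{
	/*
	 * true for any non-zero IPL, even though IPL 1..6 still
	 * lets higher-priority interrupts through
	 */
	return (flags & 0x0700) != 0;
}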

> 
> > Thanks
> > Barry
> >

Thanks
Barry


RE: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to replace spin_lock_irqsave with spin_lock

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Thursday, February 11, 2021 3:57 AM
> To: Song Bao Hua (Barry Song) 
> Cc: luojiaxing ; Linus Walleij
> ; Grygorii Strashko ;
> Santosh Shilimkar ; Kevin Hilman ;
> open list:GPIO SUBSYSTEM ; Linux Kernel Mailing
> List ; linux...@openeuler.org
> Subject: Re: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches
> to replace spin_lock_irqsave with spin_lock
> 
> On Wed, Feb 10, 2021 at 11:50:45AM +, Song Bao Hua (Barry Song) wrote:
> > > -Original Message-
> > > From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> > > Sent: Wednesday, February 10, 2021 11:51 PM
> > > On Wed, Feb 10, 2021 at 5:43 AM luojiaxing  wrote:
> > > > On 2021/2/9 17:42, Andy Shevchenko wrote:
> 
> ...
> 
> > > > Between IRQ handler A and IRQ handle A, it's no need for a SLIS.
> > >
> > > Right, but it's not the case in the patches you provided.
> >
> > The code still holds the spin lock. So if two CPUs call the same IRQ handler,
> > spin_lock() makes the later one spin; and if interrupts are threaded,
> > spin_lock() makes the two threads run the same handler one after another.
> 
> If you run on an SMP system and it happens that spin_lock_irqsave() just
> immediately after spin_unlock(), you will get into the troubles. Am I 
> mistaken?

Hi Andy,
Thanks for your reply.

But I don't agree that a spin_lock_irqsave() immediately after spin_unlock()
could be a problem on SMP.
When the first CPU releases the spinlock via spin_unlock(), it has completed
its section of accessing the critical data; only then does the second CPU get
the spinlock. The two CPUs won't overlap in accessing the same data.

> 
> I think this entire activity is a carefully crafted mine field for the future
> syzcaller and fuzzers alike. I don't believe there are no side effects in a
> long
> term on all possible systems and configurations (including forced threaded IRQ
> handlers).

Also, I don't understand why forced threaded IRQs could be a problem. Once an
IRQ handler runs as a thread, the situation actually becomes much simpler than
with non-threaded IRQs: since all threads, including the IRQ threads, must
hold the spin lock, they won't access the same critical data at the same time
either.
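
To make the pattern concrete, the kind of conversion being discussed looks
like the sketch below (the foo_* names are hypothetical, not from any real
driver):

/* Hypothetical hard-IRQ handler showing the proposed clean-up. */
static irqreturn_t foo_irq_handler(int irq, void *dev_id)
{
	struct foo_dev *foo = dev_id;

	/*
	 * Previously: spin_lock_irqsave(&foo->lock, flags), even though
	 * genirq already runs hard IRQ handlers with local interrupts
	 * disabled. A plain spin_lock still serializes against other
	 * CPUs running this handler and against threaded handlers.
	 */
	spin_lock(&foo->lock);
	/* touch the critical data */
	spin_unlock(&foo->lock);

	return IRQ_HANDLED;
}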

> 
> I would love to see a better explanation in the commit message of such patches
> which makes it clear that there are *no* side effects.
> 

People had the same questions before, but I guess the discussion in this
commit has led to a better commit log:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4eb7d0cd59

> For time being, NAK to the all patches of this kind.

Fair enough. If you expect a better explanation, I agree the commit log is
too short.

> 
> --
> With Best Regards,
> Andy Shevchenko
> 

Thanks
Barry



RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-10 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Thursday, February 11, 2021 12:22 AM
> To: Song Bao Hua (Barry Song) 
> Cc: valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On Tue, Feb 09, 2021 at 08:58:15PM +, Song Bao Hua (Barry Song) wrote:
> 
> > > I've finally had a moment to think about this, would it make sense to
> > > also break up group: node0+1, such that we then end up with 3 groups of
> > > equal size?
> >
> 
> > Since the sched_domain[n-1] of a part of node[m]'s siblings are able
> > to cover the whole span of sched_domain[n] of node[m], there is no
> > necessity to scan over all siblings of node[m], once sched_domain[n]
> > of node[m] has been covered, we can stop making more sched_groups. So
> > the number of sched_groups is small.
> >
> > So historically, the code has never tried to make sched_groups result
> > in equal size. And it permits the overlapping of local group and remote
> > groups.
> 
> Histrorically groups have (typically) always been the same size though.

This is probably true for other platforms. But unfortunately it has never
been true on my platform :-)

node   0   1   2   3 
  0:  10  12  20  22 
  1:  12  10  22  24 
  2:  20  22  10  12 
  3:  22  24  12  10

Here we have only two CPUs in each NUMA node.

CPU0's domain-3 has no overflowed sched_group, but its first group
covers CPUs 0-5 (node0-node2), while the second group covers CPUs 4-7
(node2-node3):

[0.802139] CPU0 attaching sched-domain(s):
[0.802193]  domain-0: span=0-1 level=MC
[0.802443]   groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[0.802693]   domain-1: span=0-3 level=NUMA
[0.802731]groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[0.802811]domain-2: span=0-5 level=NUMA
[0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[0.802881] ERROR: groups don't span domain->span
[0.803058] domain-3: span=0-7 level=NUMA
[0.803080]  groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 
mask=6-7 cap=4077 }


> 
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
> 
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
> 
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.

As you can see, even if we give all groups of domain-2 equal size
by breaking up both the local group and the remote groups, we will run
into the same problem in domain-3. What's more tricky is that domain-3
never reports "groups don't span domain->span" at all.

So it seems we would then need to change both domain-2 and domain-3,
even though domain-3 raises no "groups don't span domain->span" error.

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to replace spin_lock_irqsave with spin_lock

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Andy Shevchenko [mailto:andy.shevche...@gmail.com]
> Sent: Wednesday, February 10, 2021 11:51 PM
> To: luojiaxing 
> Cc: Linus Walleij ; Andy Shevchenko
> ; Grygorii Strashko
> ; Santosh Shilimkar ; Kevin
> Hilman ; open list:GPIO SUBSYSTEM
> ; Linux Kernel Mailing List
> ; linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH for next v1 0/2] gpio: few clean up patches to
> replace spin_lock_irqsave with spin_lock
> 
> On Wed, Feb 10, 2021 at 5:43 AM luojiaxing  wrote:
> > On 2021/2/9 17:42, Andy Shevchenko wrote:
> > > On Tue, Feb 9, 2021 at 11:24 AM luojiaxing  wrote:
> > >> On 2021/2/8 21:28, Andy Shevchenko wrote:
> > >>> On Mon, Feb 8, 2021 at 11:11 AM luojiaxing  
> > >>> wrote:
> >  On 2021/2/8 16:56, Luo Jiaxing wrote:
> > > There is no need to use API with _irqsave in hard IRQ handler, So 
> > > replace
> > > those with spin_lock.
> > >>> How do you know that another CPU in the system can't serve the
> > > The keyword here is: *another*.
> >
> > ooh, sorry, now I got your point.
> >
> > As to me, I don't think another CPU can serve the IRQ when one CPU
> > runing hard IRQ handler,
> 
> Why is it so?
> Each CPU can serve IRQs separately.
> 
> > except it's a per CPU interrupts.
> 
> I didn't get how it is related.
> 
> > The following is a simple call logic when IRQ come.
> >
> > elx_irq -> handle_arch_irq -> __handle_domain_irq -> desc->handle_irq ->
> > handle_irq_event
> 
> What is `elx_irq()`? I haven't found any mention of this in the kernel
> source tree.
> But okay, it shouldn't prevent our discussion.
> 
> > Assume that two CPUs receive the same IRQ and enter the preceding
> > process. Both of them will go to desc->handle_irq().
> 
> Ah, I'm talking about the same IRQ by number (like Linux IRQ number,
> means from the same source), but with different sequence number (means
> two consequent events).
> 
> > In handle_irq(), raw_spin_lock(>lock) always be called first.
> > Therefore, even if two CPUs are running handle_irq(),
> >
> > only one can get the spin lock. Assume that CPU A obtains the spin lock.
> > Then CPU A will sets the status of irq_data to
> >
> > IRQD_IRQ_INPROGRESS in handle_irq_event() and releases the spin lock.
> > Even though CPU B gets the spin lock later and
> >
> > continue to run handle_irq(), but the check of irq_may_run(desc) causes
> > it to exit.
> >
> >
> > so, I think we don't own the situation that two CPU server the hard IRQ
> > handler at the same time.
> 
> Okay. Assuming your analysis is correct, have you considered the case
> when all IRQ handlers are threaded? (There is a kernel command line
> option to force this)
> 
> > >>> following interrupt from the hardware at the same time?
> > >> Yes, I have some question before.
> > >>
> > >> There are some similar discussion here,  please take a look, Song baohua
> > >> explained it more professionally.
> > >>
> > >>
> https://lore.kernel.org/lkml/e949a474a9284ac6951813bfc8b34...@hisilicon.co
> m/
> > >>
> > >> Here are some excerpts from the discussion:
> > >>
> > >> I think the code disabling irq in hardIRQ is simply wrong.
> > > Why?
> >
> >
> > I mention the following call before.
> >
> > elx_irq -> handle_arch_irq -> __handle_domain_irq -> desc->handle_irq ->
> > handle_irq_event
> >
> >
> > __handle_domain_irq() will call irq_enter(), it ensures that the IRQ
> > processing of the current CPU can not be preempted.
> >
> > So I think this is the reason why Song baohua said it's not need to
> > disable IRQ in hardIRQ handler.
> >
> > >> Since this commit
> > >>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > >> genirq: Run irq handlers with interrupts disabled
> > >>
> > >> interrupt handlers are definitely running in a irq-disabled context
> > >> unless irq handlers enable them explicitly in the handler to permit
> > >> other interrupts.
> > > This doesn't explain any changes in the behaviour on SMP.
> > > IRQ line can be disabled on a few stages:
> > >   a) on the source (IP that generates an event)
> > >   b) on IRQ router / controller
> > >   c) on CPU side
> >
> > yes, you are right.
> >
> > > The commit above is discussing (rightfully!) the problem when all
> > > interrupts are being served by a *single* core. Nobody prevents them
> > > from being served by *different* cores simultaneously. Also, see [1].
> > >
> > > [1]: https://www.kernel.org/doc/htmldocs/kernel-locking/cheatsheet.html
> >
> > I check [1], quite useful description about locking, thanks. But you can
> > see Table of locking Requirements
> >
> > Between IRQ handler A and IRQ handle A, it's no need for a SLIS.
> 
> Right, but it's not the case in the patches you provided.

The code still holds the spin lock. So if two CPUs call the same IRQ handler,
spin_lock() makes the later one spin; and if interrupts are threaded,
spin_lock() makes the two threads run the same handler one after another.
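
For reference, the genirq serialization mentioned earlier in this thread is
roughly the following (a heavily condensed sketch of the flow-handler logic
in kernel/irq/chip.c, not the literal source):

/* Condensed sketch: how genirq keeps a handler from being re-entered. */
void handle_fasteoi_irq(struct irq_desc *desc)
{
	raw_spin_lock(&desc->lock);

	if (!irq_may_run(desc))	/* IRQD_IRQ_INPROGRESS set on another  */
		goto out;	/* CPU: bail out instead of re-entering */

	handle_irq_event(desc);	/* marks INPROGRESS, runs the actions,
				 * then clears INPROGRESS again */
out:
	raw_spin_unlock(&desc->lock);
}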

> 
> --
> With Best Regards,
> Andy Shevchenko

Thanks
Barry



RE: [PATCH v3] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Meelis Roos [mailto:mr...@linux.ee]
> Sent: Wednesday, February 10, 2021 1:40 AM
> To: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; pet...@infradead.org; dietmar.eggem...@arm.com;
> morten.rasmus...@arm.com; linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org
> Subject: Re: [PATCH v3] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> I did a rudimentary benchmark on the same 8-node Sun Fire X4600-M2, on top of
> todays  5.11.0-rc7-2-ge0756cfc7d7c.
> 
> The test: building clean kernel with make -j64 after make clean and 
> drop_caches.
> 
> While running clean kernel (3 tries):
> 
> real    2m38.574s
> user    46m18.387s
> sys     6m8.724s
> 
> real    2m37.647s
> user    46m34.171s
> sys     6m11.993s
> 
> real    2m37.832s
> user    46m34.910s
> sys     6m12.013s
> 
> 
> While running patched kernel:
> 
> real    2m40.072s
> user    46m22.610s
> sys     6m6.658s
> 
> 
> for real time, seems to be 1.5s-2s slower out of 160s (noise?) User and system
> time are slightly less, on the other hand, so seems good to me.

I ran the same test on the machine with the below topology:
numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0-31
node 0 size: 64144 MB
node 0 free: 62356 MB
node 1 cpus: 32-63
node 1 size: 64509 MB
node 1 free: 62996 MB
node 2 cpus: 64-95
node 2 size: 64509 MB
node 2 free: 63020 MB
node 3 cpus: 96-127
node 3 size: 63991 MB
node 3 free: 62647 MB
node distances:
node   0   1   2   3 
  0:  10  12  20  22 
  1:  12  10  22  24 
  2:  20  22  10  12 
  3:  22  24  12  10

Basically, the influence on the kernel build is within the noise, using
the commands below, which I ran for a couple of rounds:

make clean
echo 3 > /proc/sys/vm/drop_caches
make Image -j100

w/ patch:          w/o patch:

real   1m17.644s   real   1m19.510s
user   32m12.074s  user   32m14.133s
sys    4m35.827s   sys    4m38.198s

real   1m15.855s   real   1m17.303s
user   32m7.700s   user   32m14.128s
sys    4m35.868s   sys    4m40.094s

real   1m18.918s   real   1m19.583s
user   32m13.352s  user   32m13.205s
sys    4m40.161s   sys    4m40.696s

real   1m20.329s   real   1m17.819s
user   32m7.255s   user   32m11.753s
sys    4m36.706s   sys    4m41.371s

real   1m17.773s   real   1m16.763s
user   32m19.912s  user   32m15.607s
sys    4m36.989s   sys    4m41.297s

real   1m14.943s   real   1m18.551s
user   32m14.549s  user   32m18.521s
sys    4m38.670s   sys    4m41.392s

real   1m16.439s   real   1m18.154s
user   32m12.864s  user   32m14.540s
sys    4m39.424s   sys    4m40.364s

Our team members who used the 3-hops-fix patch to run unixbench
reported the following unixbench scores (3 rounds):

w/o patch:   w/ patch:
1228.6       1254.9
1231.4       1265.7
1226.1       1266.1

One interesting thing is that if we change the kernel to
disallow the below BALANCING flags for the last hop,
sd->flags &= ~(SD_BALANCE_EXEC |
	       SD_BALANCE_FORK |
	       SD_WAKE_AFFINE);

we see a further increase in the unixbench score. So it sounds like
those balancing operations shouldn't go that far. But that is a
different topic.

> 
> --
> Meelis Roos 

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 10, 2021 5:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization
> for SCSI drivers
> 
> On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > > avoid issues relating to this. This kind of locking may be needed in
> > > > the drivers you are trying to patch. Or it might not. Apparently,
> > > > no-one has looked.
> >
> > Is the comment in sonic_interrupt() outdated according to this:
> > m68k: irq: Remove IRQF_DISABLED
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=77a4279
> > http://lkml.iu.edu/hypermail/linux/kernel/1109.2/01687.html
> >
> 
> The removal of IRQF_DISABLED isn't relevant to this driver. Commit
> 77a42796786c ("m68k: Remove deprecated IRQF_DISABLED") did not disable
> interrupts, it just removed some code to enable them.
> 
> The code and comments in sonic_interrupt() are correct. You can confirm
> this for yourself quite easily using QEMU and a cross-compiler.
> 
> > and this:
> > genirq: Warn when handler enables interrupts
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=b738a50a
> >
> > wouldn't genirq report a warning on m68k?
> >
> 
> There is no warning from m68k builds. That's because arch_irqs_disabled()
> returns true when the IPL is non-zero.


So for m68k, the case is that arch_irqs_disabled() is true, but interrupts
can still come?

Then it seems very confusing. If prioritized interrupts can still come
while arch_irqs_disabled() is true, how could spin_lock_irqsave() block the
prioritized interrupts? Isn't arch_irqs_disabled() a status reflection of
the irq-disable API?

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Wednesday, February 10, 2021 1:29 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > > >
> > > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI
> > > > > > drivers. There are no function changes, but may speed up if
> > > > > > interrupt happen too often.
> > > > >
> > > > > This change doesn't necessarily work on platforms that support
> > > > > nested interrupts.
> > > > >
> > > > > Were you able to measure any benefit from this change on some
> > > > > other platform?
> > > >
> > > > I think the code disabling irq in hardIRQ is simply wrong. Since
> > > > this commit
> > > >
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > > > genirq: Run irq handlers with interrupts disabled
> > > >
> > > > interrupt handlers are definitely running in a irq-disabled context
> > > > unless irq handlers enable them explicitly in the handler to permit
> > > > other interrupts.
> > > >
> > >
> > > Repeating the same claim does not somehow make it true.
> >
> > Sorry for I didn't realize xiaofei had replied.
> >
> 
> I was referring to the claim in patch 00/32, i.e. that interrupt handlers
> only run when irqs are disabled.
> 
> > > If you put your claim to the test, you'll see that that interrupts are
> > > not disabled on m68k when interrupt handlers execute.
> >
> > Sounds like an implementation issue of m68k since IRQF_DISABLED has been
> > totally removed.
> >
> 
> It's true that IRQF_DISABLED could be used to avoid the need for irq locks
> in interrupt handlers. So, if you want to remove irq locks from interrupt
> handlers, today you can't use IRQF_DISABLED to help you. So what?
> 
> > >
> > > The Interrupt Priority Level (IPL) can prevent any given irq handler
> > > from being re-entered, but an irq with a higher priority level may be
> > > handled during execution of a lower priority irq handler.
> > >
> >
> > We used to have IRQF_DISABLED to support so-called "fast interrupt" to
> > avoid this.
> >
> > But the concept has been totally removed. That is interesting if m68k
> > still has this issue.
> >
> 
> Prioritized interrupts are beneficial. Why would you want to avoid them?
> 

I doubt this is true, as it has already been considered unnecessary
in Linux:
https://lwn.net/Articles/380931/

> Moreover, there's no reason to believe that m68k is the only platform that
> supports nested interrupts.

I doubt that is true, as genirq runs under the assumption that hardIRQ
handlers run in an irq-disabled context:
"We run all handlers with interrupts disabled and expect them not to
enable them. Warn when we catch one who does."
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b738a50a

If it does, m68k is violating genirq's assumption.

> 
> > > sonic_interrupt() uses an irq lock within an interrupt handler to
> > > avoid issues relating to this. This kind of locking may be needed in
> > > the drivers you are trying to patch. Or it might not. Apparently,
> > > no-one has looked.
> >

Thanks
Barry


RE: [PATCH v4 01/12] genirq: add IRQF_NO_AUTOEN for request_irq

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, January 29, 2021 11:35 AM
> To: t...@linutronix.de; dmitry.torok...@gmail.com; m...@kernel.org;
> gre...@linuxfoundation.org; linux-in...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; Song Bao Hua (Barry Song)
> 
> Subject: [PATCH v4 01/12] genirq: add IRQF_NO_AUTOEN for request_irq
> 
> Many drivers don't want interrupts enabled automatically due to
> request_irq(). So they are handling this issue by either way of
> the below two:
> (1)
> irq_set_status_flags(irq, IRQ_NOAUTOEN);
> request_irq(dev, irq...);
> (2)
> request_irq(dev, irq...);
> disable_irq(irq);
> 
> The code in the second way is silly and unsafe. In the small time
> gap between request_irq() and disable_irq(), interrupts can still
> come.
> The code in the first way is safe though we might be able to do it
> in the generic irq code.
> 
> With this patch, drivers can request_irq with IRQF_NO_AUTOEN flag.
> They will need neither irq_set_status_flags() nor disable_irq().
> Hundreds of drivers with this problem will be handled afterwards.
> 
> Cc: Dmitry Torokhov 
> Signed-off-by: Barry Song 
> ---
>  -v4: remove the irq_settings magic for NOAUTOEN

Hi Thomas,
Any further comments on this? Is there any chance for it to hit
5.12 so that we can begin to handle those drivers in 5.12?

Thanks
Barry

> 
>  include/linux/interrupt.h | 3 +++
>  kernel/irq/manage.c   | 8 +++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index bb8ff9083e7d..0f22d277078c 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -61,6 +61,8 @@
>   *interrupt handler after suspending interrupts. For system
>   *wakeup devices users need to implement wakeup detection in
>   *their interrupt handlers.
> + * IRQF_NO_AUTOEN - Don't enable IRQ automatically when users request it. 
> Users
> + *will enable it explicitly by enable_irq() later.
>   */
>  #define IRQF_SHARED  0x0080
>  #define IRQF_PROBE_SHARED0x0100
> @@ -74,6 +76,7 @@
>  #define IRQF_NO_THREAD   0x0001
>  #define IRQF_EARLY_RESUME0x0002
>  #define IRQF_COND_SUSPEND0x0004
> +#define IRQF_NO_AUTOEN   0x0008
> 
>  #define IRQF_TIMER   (__IRQF_TIMER | IRQF_NO_SUSPEND | 
> IRQF_NO_THREAD)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index dec3f73e8db9..95014073bd2e 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1693,7 +1693,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc,
> struct irqaction *new)
>   irqd_set(>irq_data, IRQD_NO_BALANCING);
>   }
> 
> - if (irq_settings_can_autoenable(desc)) {
> + if (!(new->flags & IRQF_NO_AUTOEN) &&
> + irq_settings_can_autoenable(desc)) {
>   irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
>   } else {
>   /*
> @@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int irq,
> irq_handler_t handler,
>* which interrupt is which (messes up the interrupt freeing
>* logic etc).
>*
> +  * Also shared interrupts do not go well with disabling auto enable.
> +  * The sharing interrupt might request it while it's still disabled
> +  * and then wait for interrupts forever.
> +  *
>* Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
>* it cannot be set along with IRQF_NO_SUSPEND.
>*/
>   if (((irqflags & IRQF_SHARED) && !dev_id) ||
> + ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
>   (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
>   ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
>   return -EINVAL;
> --
> 2.25.1
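
For illustration, a converted driver would then do something like this
sketch (the foo_* names are hypothetical):

	/* no interrupt can fire until we explicitly enable it */
	err = request_irq(irq, foo_irq_handler, IRQF_NO_AUTOEN,
			  "foo", foo_dev);
	if (err)
		return err;

	/* ... later, once the device is fully initialized ... */
	enable_irq(irq);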



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, February 10, 2021 2:54 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Tue, Feb 09, 2021 at 03:01:42AM +, Song Bao Hua (Barry Song) wrote:
> 
> > On the other hand, wouldn't it be the benefit of hardware accelerators
> > to have a lower and more stable latency zip/encryption than CPU?
> 
> No, I don't think so.

Fortunately or unfortunately, my team does have the target of achieving
lower-latency and more stable zip/encryption by using accelerators;
otherwise, they will just use the CPU directly if the accelerators
offer no advantage.

> 
> If this is an important problem then it should apply equally to CPU
> and IO jitter.
> 
> Honestly I find the idea that occasional migration jitters CPU and DMA
> to not be very compelling. Such specialized applications should
> allocate special pages to avoid this, not adding an API to be able to
> lock down any page

That is exactly what we have done: provide a hugeTLB pool so that
applications can allocate memory from it.

 +---------------------------------+
 |                                 |
 | applications using accelerators |
 +---------------------------------+
      |                     ^
      | alloc from pool     | free to pool
      |                     |
      v                     |
 +---------------------------------+
 |                                 |
 |       HugeTLB memory pool       |
 |                                 |
 +---------------------------------+

The problem is that SVA declares we can use any memory of a process
to do I/O, and in real scenarios we are unable to customize most
applications to make them use the pool. So we are looking for some
generic extension for applications such as Nginx and Ceph.
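
For context, the pool customization we applied to the applications we could
modify boils down to a sketch like this (standard mmap flags; error handling
condensed):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: carve an I/O buffer out of the hugeTLB pool rather than
 * ordinary anonymous memory, so it is not subject to the same
 * migration/compaction as regular pages.
 */
static void *alloc_from_hugetlb_pool(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	return buf == MAP_FAILED ? NULL : buf;
}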

I am also thinking about leveraging vm.compact_unevictable_allowed,
which David suggested, and making an extension of it, for example
permitting users to disable compaction and NUMA balancing on the
unevictable pages of an SVA process, which might be a smaller change.

> 
> Jason

Thanks
Barry



RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-09 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Wednesday, February 10, 2021 1:56 AM
> To: Song Bao Hua (Barry Song) 
> Cc: valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Liguozhu (Kenneth) ; tiantao (H)
> ; wanghuiqiang ; Zengtao (B)
> ; Jonathan Cameron ;
> guodong...@linaro.org; Meelis Roos 
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
> > As long as NUMA diameter > 2, building sched_domain by sibling's child
> > domain will definitely create a sched_domain with sched_group which will
> > span out of the sched_domain:
> >
> >    +------+          +------+        +------+          +------+
> >    | node |    12    | node |   20   | node |    12    | node |
> >    |  0   +----------+  1   +--------+  2   +----------+  3   |
> >    +------+          +------+        +------+          +------+
> >
> > domain0        node0      node1      node2      node3
> >
> > domain1        node0+1    node0+1    node2+3    node2+3
> >                                                  +
> > domain2        node0+1+2                         |
> >                  group: node0+1                  |
> >                  group: node2+3 <----------------+
> >
> > when node2 is added into the domain2 of node0, kernel is using the child
> > domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> > the span of the domain including node0+1+2.
> >
> > This will make load_balance() run based on screwed avg_load and group_type
> > in the sched_group spanning out of the sched_domain, and it also makes
> > select_task_rq_fair() pick an idle CPU out of the sched_domain.
> >
> > Real servers which suffer from this problem include Kunpeng920 and 8-node
> > Sun Fire X4600-M2, at least.
> >
> > Here we move to use the *child* domain of the *child* domain of node2's
> > domain2 as the new added sched_group. At the same time, we re-use the
> > lower level sgc directly.
> >
> >    +------+          +------+        +------+          +------+
> >    | node |    12    | node |   20   | node |    12    | node |
> >    |  0   +----------+  1   +--------+  2   +----------+  3   |
> >    +------+          +------+        +------+          +------+
> >
> > domain0        node0      node1    +- node2     node3
> >                                    |
> > domain1        node0+1    node0+1  |  node2+3   node2+3
> >                                    |
> > domain2        node0+1+2           |
> >                  group: node0+1    |
> >                  group: node2 <----+
> >
> 
> I've finally had a moment to think about this, would it make sense to
> also break up group: node0+1, such that we then end up with 3 groups of
> equal size?

We used to create the sched_groups of sched_domain[n] of node[m] from:
1. the local group: sched_domain[n-1] of node[m]
2. the remote groups: sched_domain[n-1] of node[m]'s siblings at the
same level.
Since the sched_domain[n-1] of just a part of node[m]'s siblings is able
to cover the whole span of sched_domain[n] of node[m], there is no need
to scan over all siblings of node[m]; once sched_domain[n] of node[m]
has been covered, we can stop making more sched_groups. So the number
of sched_groups stays small.

So historically, the code has never tried to make the sched_groups equal
in size. And it permits the local group and the remote groups to overlap.
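
Condensed pseudocode of that historical construction (loosely based on
build_overlap_sched_groups() in kernel/sched/topology.c; heavily simplified,
not the literal source):

/*
 * Sketch: walk the domain span, take each uncovered sibling's
 * child-domain span as a new group, and stop as soon as the whole
 * domain span is covered.
 */
for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
	if (cpumask_test_cpu(i, covered))
		continue;

	sibling = *per_cpu_ptr(sdd->sd, i);
	sg = build_group_from_child_sched_domain(sibling, cpu);

	cpumask_or(covered, covered, sched_group_span(sg));
	/* groups may overlap and need not be equal in size */
}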

One issue we are facing in the original code is that once the topology
gets to 3-hops NUMA, the sched_domain[n-1] of node[m]'s siblings might
span beyond the range of sched_domain[n] of node[m]. My approach here
is to find a descendant sibling to build the remote groups and fix this
issue on the machines that have it, leaving machines without the 3-hops
issue untouched.

Valentin sent another RFC that breaks up all remote groups to include
only the remote node, instead of using the sched_domain[n-1] of the
siblings; this would eliminate the problem from the very beginning. One
side effect is that it changes all machines, including those without
the 3-hops issue, by creating many more remote sched_groups. So we both
agree we can start from the descendant-sibling (grandchild) approach
first.

What you ar

RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, February 9, 2021 6:28 PM
> To: 'Finn Thain' 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> 
> 
> > -Original Message-
> > From: Finn Thain [mailto:fth...@telegraphics.com.au]
> > Sent: Tuesday, February 9, 2021 6:06 PM
> > To: Song Bao Hua (Barry Song) 
> > Cc: tanxiaofei ; j...@linux.ibm.com;
> > martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; linux...@openeuler.org;
> > linux-m...@vger.kernel.org
> > Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> > optimization
> > for SCSI drivers
> >
> > On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> >
> > > > -Original Message-
> > > > From: Finn Thain [mailto:fth...@telegraphics.com.au]
> > > > Sent: Monday, February 8, 2021 8:57 PM
> > > > To: tanxiaofei 
> > > > Cc: j...@linux.ibm.com; martin.peter...@oracle.com;
> > > > linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > linux...@openeuler.org
> > > > Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> > > > optimization
> > > > for SCSI drivers
> > > >
> > > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > > >
> > > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI drivers.
> > > > > There are no function changes, but may speed up if interrupt happen
> too
> > > > > often.
> > > >
> > > > This change doesn't necessarily work on platforms that support nested
> > > > interrupts.
> > > >
> > > > Were you able to measure any benefit from this change on some other
> > > > platform?
> > >
> > > I think the code disabling irq in hardIRQ is simply wrong.
> > > Since this commit
> > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > ?id=e58aa3d2d0cc
> > > genirq: Run irq handlers with interrupts disabled
> > >
> > > interrupt handlers are definitely running in a irq-disabled context
> > > unless irq handlers enable them explicitly in the handler to permit
> > > other interrupts.
> > >
> >
> > Repeating the same claim does not somehow make it true. If you put your
> 
> Sorry for I didn't realize xiaofei had replied.
> 
> > claim to the test, you'll see that that interrupts are not disabled on
> > m68k when interrupt handlers execute.
> 
> Sounds like an implementation issue of m68k since IRQF_DISABLED has
> been totally removed.
> 
> >
> > The Interrupt Priority Level (IPL) can prevent any given irq handler from
> > being re-entered, but an irq with a higher priority level may be handled
> > during execution of a lower priority irq handler.
> >
> 
> We used to have IRQF_DISABLED to support so-called "fast interrupt" to avoid
> this. But the concept has been totally removed. That is interesting if m68k
> still has this issue.
> 
> > sonic_interrupt() uses an irq lock within an interrupt handler to avoid
> > issues relating to this. This kind of locking may be needed in the drivers
> > you are trying to patch. Or it might not. Apparently, no-one has looked.

Is the comment in sonic_interrupt() outdated, given this commit:
m68k: irq: Remove IRQF_DISABLED
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=77a4279
http://lkml.iu.edu/hypermail/linux/kernel/1109.2/01687.html

and this one:
genirq: Warn when handler enables interrupts
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b738a50a

If interrupt handlers on m68k really run with interrupts enabled, wouldn't genirq report a warning?
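
For reference, the check that second commit added looks roughly like
this (paraphrased from the commit, so a sketch rather than the exact
current code in kernel/irq/handle.c); whether it fires on m68k depends
on what irqs_disabled() reports under IPL masking:

	res = action->handler(irq, action->dev_id);

	/* Complain once, then force interrupts back off, if a
	 * handler returned with interrupts enabled. */
	if (WARN_ONCE(!irqs_disabled(),
		      "irq %u handler %pF enabled interrupts\n",
		      irq, action->handler))
		local_irq_disable();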

> 
> Thanks
> Barry

Thanks
Barry



RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Tuesday, February 9, 2021 6:06 PM
> To: Song Bao Hua (Barry Song) 
> Cc: tanxiaofei ; j...@linux.ibm.com;
> martin.peter...@oracle.com; linux-s...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org;
> linux-m...@vger.kernel.org
> Subject: RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> optimization
> for SCSI drivers
> 
> On Tue, 9 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > > -Original Message-
> > > From: Finn Thain [mailto:fth...@telegraphics.com.au]
> > > Sent: Monday, February 8, 2021 8:57 PM
> > > To: tanxiaofei 
> > > Cc: j...@linux.ibm.com; martin.peter...@oracle.com;
> > > linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > linux...@openeuler.org
> > > Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage 
> > > optimization
> > > for SCSI drivers
> > >
> > > On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> > >
> > > > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI drivers.
> > > > There are no function changes, but may speed up if interrupt happen too
> > > > often.
> > >
> > > This change doesn't necessarily work on platforms that support nested
> > > interrupts.
> > >
> > > Were you able to measure any benefit from this change on some other
> > > platform?
> >
> > I think the code disabling irq in hardIRQ is simply wrong.
> > Since this commit
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=e58aa3d2d0cc
> > genirq: Run irq handlers with interrupts disabled
> >
> > interrupt handlers are definitely running in an irq-disabled context
> > unless irq handlers enable them explicitly in the handler to permit
> > other interrupts.
> >
> 
> Repeating the same claim does not somehow make it true. If you put your

Sorry, I didn't realize Xiaofei had already replied.

> claim to the test, you'll see that that interrupts are not disabled on
> m68k when interrupt handlers execute.

That sounds like an m68k implementation issue, since IRQF_DISABLED has
been totally removed.

> 
> The Interrupt Priority Level (IPL) can prevent any given irq handler from
> being re-entered, but an irq with a higher priority level may be handled
> during execution of a lower priority irq handler.
> 

We used to have IRQF_DISABLED to support so-called "fast interrupts" and avoid
exactly this. But the concept has been totally removed, so it is interesting
that m68k still has this behaviour.

> sonic_interrupt() uses an irq lock within an interrupt handler to avoid
> issues relating to this. This kind of locking may be needed in the drivers
> you are trying to patch. Or it might not. Apparently, no-one has looked.

Thanks
Barry



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 10:30 AM
> To: Song Bao Hua (Barry Song) 
> Cc: David Hildenbrand ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; kevin.t...@intel.com; jean-phili...@linaro.org;
> eric.au...@redhat.com; Liguozhu (Kenneth) ;
> zhangfei@linaro.org; chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 08:35:31PM +, Song Bao Hua (Barry Song) wrote:
> >
> >
> > > From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> > > Sent: Tuesday, February 9, 2021 7:34 AM
> > > To: David Hildenbrand 
> > > Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> > > io...@lists.linux-foundation.org; linux...@kvack.org;
> > > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> > > Morton ; Alexander Viro
> ;
> > > gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> > > ; kevin.t...@intel.com;
> > > jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> > > ; zhangfei@linaro.org; chensihang (A)
> > > 
> > > Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> > > memory
> > > pin
> > >
> > > On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> > >
> > > > People are constantly struggling with the effects of long term pinnings
> > > > under user space control, like we already have with vfio and RDMA.
> > > >
> > > > And here we are, adding yet another, easier way to mess with core MM in
> the
> > > > same way. This feels like a step backwards to me.
> > >
> > > Yes, this seems like a very poor candidate to be a system call in this
> > > format. Much too narrow, poorly specified, and possibly security
> > > implications to allow any process whatsoever to pin memory.
> > >
> > > I keep encouraging people to explore a standard shared SVA interface
> > > that can cover all these topics (and no, uaccel is not that
> > > interface), that seems much more natural.
> > >
> > > I still haven't seen an explanation why DMA is so special here,
> > > migration and so forth jitter the CPU too, environments that care
> > > about jitter have to turn this stuff off.
> >
> > This paper has a good explanation:
> > https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091
> >
> > mainly because page fault can go directly to the CPU and we have
> > many CPUs. But IO Page Faults go a different way, thus mean much
> > higher latency 3-80x slower than page fault:
> > events in hardware queue -> Interrupts -> cpu processing page fault
> > -> return events to iommu/device -> continue I/O.
> 
> The justifications for this was migration scenarios and migration is
> short. If you take a fault on what you are migrating only then does it
> slow down the CPU.

I agree this can slow down the CPU, but not as much as an IO page fault does.

On the other hand, isn't a lower and more stable latency for
compression/encryption than the CPU exactly the benefit of hardware
accelerators?

> 
> Are you also working with HW where the IOMMU becomes invalidated after
> a migration and doesn't reload?
> 
> ie not true SVA but the sort of emulated SVA we see in a lot of
> places?

Yes. It is true SVA, not emulated SVA.

> 
> It would be much better to work improve that to have closer sync with the
> CPU page table than to use pinning.

Absolutely, I agree that improving IOPF and making it catch up with
the performance of CPU page faults is the best way. But it will take a
long time to optimize both hardware and software. While waiting for
them to mature, some way to minimize IOPF should probably be used in
the meantime.

> 
> Jason

Thanks
Barry


RE: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization for SCSI drivers

2021-02-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Finn Thain [mailto:fth...@telegraphics.com.au]
> Sent: Monday, February 8, 2021 8:57 PM
> To: tanxiaofei 
> Cc: j...@linux.ibm.com; martin.peter...@oracle.com;
> linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: [Linuxarm] Re: [PATCH for-next 00/32] spin lock usage optimization
> for SCSI drivers
> 
> On Sun, 7 Feb 2021, Xiaofei Tan wrote:
> 
> > Replace spin_lock_irqsave with spin_lock in hard IRQ of SCSI drivers.
> > There are no function changes, but may speed up if interrupt happen too
> > often.
> 
> This change doesn't necessarily work on platforms that support nested
> interrupts.
> 
> Were you able to measure any benefit from this change on some other
> platform?

I think the code disabling irq in hardIRQ is simply wrong.
Since this commit
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e58aa3d2d0cc
genirq: Run irq handlers with interrupts disabled

interrupt handlers are definitely running in an irq-disabled context
unless irq handlers enable them explicitly in the handler to permit
other interrupts.
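
For context, what that commit removed from handle_IRQ_event() was
roughly the following (a sketch from the commit, not a verbatim quote
of today's code):

	/* Removed by e58aa3d2d0cc: handlers without IRQF_DISABLED
	 * used to run with interrupts re-enabled. */
	if (!(action->flags & IRQF_DISABLED))
		local_irq_enable_in_hardirq();

With that gone, genirq enters every handler with interrupts off.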

> 
> Please see also,
> https://lore.kernel.org/linux-scsi/89c5cb05cb844939ae684db0077f6...@h3c.co
> m/
> ___
> Linuxarm mailing list -- linux...@openeuler.org
> To unsubscribe 

Thanks
Barry



RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Tuesday, February 9, 2021 12:48 AM
> To: Song Bao Hua (Barry Song) ;
> linux-kernel@vger.kernel.org
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> pet...@infradead.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org; Meelis Roos
> 
> Subject: RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping 
> groups
> 
> Hi Barry,
> 
> On 08/02/21 10:04, Song Bao Hua (Barry Song) wrote:
> >> -Original Message-
> >> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> 
> >
> > Hi Valentin,
> >
> > While I like your approach, this will require more time
> > to evaluate possible influence as the approach also affects
> > all machines without 3-hops issue. So x86 platforms need to
> > be tested and benchmark is required.
> >
> > What about we firstly finish the review of "grandchild" approach
> > v2 and have a solution for kunpeng920 and Sun Fire X4600-M2
> > while not impacting other machines which haven't 3-hops issues
> > first?
> >
> 
> I figured I'd toss this out while the iron was hot (and I had the topology
> crud paged in), but I ultimately agree that it's better to first go with
> something that fixes the diameter > 2 topologies and leaves the other ones
> untouched, which is exactly what you have.
> 
> > I would appreciate very much if you could comment on v2:
> >
> https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao.hua@hisilicon
> .com/
> >
> 
> See my comment below on domain degeneration; with that taken care of I
> would say it's good to go. Have a look at what patch1+patch3 squashed
> together looks like, passing the right sd to init_overlap_sched_group()
> looks a bit neater IMO.
> 
> >> +static struct sched_domain *find_node_domain(struct sched_domain *sd)
> >> +{
> >> +  struct sched_domain *parent;
> >> +
> >> +  BUG_ON(!(sd->flags & SD_NUMA));
> >> +
> >> +  /* Get to the level above NODE */
> >> +  while (sd && sd->child) {
> >> +  parent = sd;
> >> +  sd = sd->child;
> >> +
> >> +  if (!(sd->flags & SD_NUMA))
> >> +  break;
> >> +  }
> >> +  /*
> >> +   * We're going to create cross topology level sched_group_capacity
> >> +   * references. This can only work if the domains resulting from said
> >> +   * levels won't be degenerated, as we need said sgc to be periodically
> >> +   * updated: it needs to be attached to the local group of a domain
> >> +   * that didn't get degenerated.
> >> +   *
> >> +   * Of course, groups aren't available yet, so we can't call the usual
> >> +   * sd_degenerate(). Checking domain spans is the closest we get.
> >> +   * Start from NODE's parent, and keep going up until we get a domain
> >> +   * we're sure won't be degenerated.
> >> +   */
> >> +  while (sd->parent &&
> >> + cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
> {
> >> +  sd = parent;
> >> +  parent = sd->parent;
> >> +  }
> >
> > So this is because the sched_domain which doesn't contribute to scheduler
> > will be destroyed during cpu_attach_domain() since sd and parent span
> > the seam mask?
> >
> 
> Yes; let's take your topology for instance:
> 
> node   0   1   2   3
> 0:  10  12  20  22
> 1:  12  10  22  24
> 2:  20  22  10  12
> 3:  22  24  12  10
> 
>   2   10  2
>   0 <---> 1 <---> 2 <---> 3

I guess you actually mean:
       2       10      2
   1 <---> 0 <---> 2 <---> 3

> 
> 
> Domains for node1 will look like (before any fixes are applied):
> 
> NUMA<=10: span=1   groups=(1)
> NUMA<=12: span=0-1 groups=(1)->(0)
> NUMA<=20: span=0-1 groups=(0,1)
> NUMA<=22: span=0-2 groups=(0,1)->(0,2-3)
> NUMA<=24: span=0-3 groups=(0-2)->(0,2-3)
> 
> As you can see, the domain representing distance <= 20 will be degenerated
> (it has a single group). If we were to e.g. add some more nodes to the left
> of node0, then we would trigger the "grandchildren logic" for node1 and
> would end up creating a reference to node1 NUMA<=20's sgc, which is a
> mistake: that domain will

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: David Hildenbrand [mailto:da...@redhat.com]
> Sent: Monday, February 8, 2021 11:37 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> David Hildenbrand
> >> Sent: Monday, February 8, 2021 9:22 PM
> >> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> >> 
> >> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> >> io...@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>>> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf
> >> Of
> >>>> Matthew Wilcox
> >>>> Sent: Monday, February 8, 2021 2:31 PM
> >>>> To: Song Bao Hua (Barry Song) 
> >>>> Cc: Wangzhou (B) ;
> linux-kernel@vger.kernel.org;
> >>>> io...@lists.linux-foundation.org; linux...@kvack.org;
> >>>> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >>>> Morton ; Alexander Viro
> >> ;
> >>>> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >>>> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >>>> ; zhangfei@linaro.org; chensihang (A)
> >>>> 
> >>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide 
> >>>> memory
> >>>> pin
> >>>>
> >>>> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) 
> >>>> wrote:
> >>>>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>>>> I/O on a memory without IO page faults which can result in 
> >>>>>>> dramatically
> >>>>>>> increased latency. Current memory related APIs could not achieve this
> >>>>>>> requirement, e.g. mlock can only avoid memory to swap to backup 
> >>>>>>> device,
> >>>>>>> page migration can still trigger IO page fault.
> >>>>>>
> >>>>>> Well ... we have two requirements.  The application wants to not take
> >>>>>> page faults.  The system wants to move the application to a different
> >>>>>> NUMA node in order to optimise overall performance.  Why should the
> >>>>>> application's desires take precedence over the kernel's desires?  And
> why
> >>>>>> should it be done this way rather than by the sysadmin using numactl
> to
> >>>>>> lock the application to a particular node?
> >>>>>
> >>>>> NUMA balancer is just one of many reasons for page migration. Even one
> >>>>> simple alloc_pages() can cause memory migration in just single NUMA
> >>>>> node or UMA system.
> >>>>>
> >>>>> The other reasons for page migration include but are not limited to:
> >>>>> * memory move due to CMA
> >>>>> * memory move due to huge pages creation
> >>>>>
> >>>>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> >>>>> in the whole system.
> >>>>
> >>>> You're dodging the question.  Should the CMA allocation fail because
> >>>> another application is usin

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 9, 2021 7:34 AM
> To: David Hildenbrand 
> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; Song Bao Hua (Barry Song)
> ; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Mon, Feb 08, 2021 at 09:14:28AM +0100, David Hildenbrand wrote:
> 
> > People are constantly struggling with the effects of long term pinnings
> > under user space control, like we already have with vfio and RDMA.
> >
> > And here we are, adding yet another, easier way to mess with core MM in the
> > same way. This feels like a step backwards to me.
> 
> Yes, this seems like a very poor candidate to be a system call in this
> format. Much too narrow, poorly specified, and possibly security
> implications to allow any process whatsoever to pin memory.
> 
> I keep encouraging people to explore a standard shared SVA interface
> that can cover all these topics (and no, uaccel is not that
> interface), that seems much more natural.
> 
> I still haven't seen an explanation why DMA is so special here,
> migration and so forth jitter the CPU too, environments that care
> about jitter have to turn this stuff off.

This paper has a good explanation:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091

mainly because CPU page faults can be handled directly on the CPU, and
we have many CPUs. But IO page faults take a different path and thus
have much higher latency, 3-80x slower than CPU page faults:
events in hardware queue -> interrupt -> CPU processes page fault
-> return events to iommu/device -> continue I/O.

Copied from the paper:

If the IOMMU's page table walker fails to find the desired
translation in the page table, it sends an ATS response to
the GPU notifying it of this failure. This in turn corresponds
to a page fault. In response, the GPU sends another request to
the IOMMU called a Peripheral Page Request (PPR). The IOMMU
places this request in a memory-mapped queue and raises an
interrupt on the CPU. Multiple PPR requests can be queued
before the CPU is interrupted. The OS must have a suitable
IOMMU driver to process this interrupt and the queued PPR
requests. In Linux, while in an interrupt context, the driver
pulls PPR requests from the queue and places them in a work-queue
for later processing. Presumably this design decision was made
to minimize the time spent executing in an interrupt context,
where lower priority interrupts would be disabled. At a later
time, an OS worker-thread calls back into the driver to process
page fault requests in the work-queue. Once the requests are
serviced, the driver notifies the IOMMU. In turn, the IOMMU
notifies the GPU. The GPU then sends another ATS request to
retry the translation for the original faulting address.

Comparison with CPU: On the CPU, a hardware exception is
raised on a page fault, which immediately switches to the
OS. In most cases in Linux, this routine services the page
fault directly, instead of queuing it for later processing.
Contrast this with a page fault from an accelerator, where
the IOMMU has to interrupt the CPU to request service on
its behalf, and also note the several back-and-forth messages
between the accelerator, the IOMMU, and the CPU. Furthermore,
page faults on the CPU are generally handled one at a time
on the CPU, while for the GPU they are batched by the IOMMU
and OS work-queue mechanism.
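
A minimal sketch of that host-side service path, assuming a
workqueue-based IOMMU driver like the one the paper describes
(dequeue_ppr()/handle_fault()/complete_ppr() and fault_wq are
illustrative placeholders, not real driver symbols):

	struct ppr_request {
		struct work_struct work;
		/* device and faulting-address info ... */
	};

	/* Interrupt context: only drain the hardware PPR queue. */
	static irqreturn_t ppr_irq_handler(int irq, void *data)
	{
		struct ppr_request *req;

		while ((req = dequeue_ppr()))		/* placeholder */
			queue_work(fault_wq, &req->work);
		return IRQ_HANDLED;
	}

	/* Process context, later: resolve the fault, then let the
	 * IOMMU tell the device to retry the translation. */
	static void ppr_work_fn(struct work_struct *work)
	{
		struct ppr_request *req =
			container_of(work, struct ppr_request, work);

		handle_fault(req);	/* fix up the page table */
		complete_ppr(req);	/* IOMMU -> device retry */
	}

Each fault crosses an interrupt, a workqueue hand-off and two IOMMU
round trips, which is where the 3-80x latency gap comes from.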

> 
> Jason

Thanks
Barry



RE: [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Thursday, February 4, 2021 4:55 AM
> To: linux-kernel@vger.kernel.org
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> pet...@infradead.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org; Song Bao Hua
> (Barry Song) ; Meelis Roos 
> Subject: [RFC PATCH 2/2] Revert "sched/topology: Warn when NUMA diameter > 2"
> 
> The scheduler topology code can now figure out what to do with such
> topologies.
> 
> This reverts commit b5b217346de85ed1b03fdecd5c5076b34fbb2f0b.
> 
> Signed-off-by: Valentin Schneider 

Yes, this is fine. I have actually seen some other problems we need
to consider.

The current code is probably well consolidated for machines with
2 hops or fewer. Thus, even after we fix the 3-hops span issue, I
can still see other issues.

For example, if we change the sd flags and remove the SD_BALANCE
flags for the last hop in sd_init(), we see a large score increase
in unixbench:

	if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance ||
	    is_3rd_hops_domain(...)) {
		sd->flags &= ~(SD_BALANCE_EXEC |
			       SD_BALANCE_FORK |
			       SD_WAKE_AFFINE);
	}

So I guess something needs to be tuned for machines with 3 hops or more.

But we need a kernel with the 3-hops fix before we can do more work.

> ---
>  kernel/sched/topology.c | 33 -
>  1 file changed, 33 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a8f69f234258..0fa41aab74e0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -688,7 +688,6 @@ cpu_attach_domain(struct sched_domain *sd, struct
> root_domain *rd, int cpu)
>  {
>   struct rq *rq = cpu_rq(cpu);
>   struct sched_domain *tmp;
> - int numa_distance = 0;
> 
>   /* Remove the sched domains which do not contribute to scheduling. */
>   for (tmp = sd; tmp; ) {
> @@ -720,38 +719,6 @@ cpu_attach_domain(struct sched_domain *sd, struct
> root_domain *rd, int cpu)
>   sd->child = NULL;
>   }
> 
> - for (tmp = sd; tmp; tmp = tmp->parent)
> - numa_distance += !!(tmp->flags & SD_NUMA);
> -
> - /*
> -  * FIXME: Diameter >=3 is misrepresented.
> -  *
> -  * Smallest diameter=3 topology is:
> -  *
> -  *   node   0   1   2   3
> -  * 0:  10  20  30  40
> -  * 1:  20  10  20  30
> -  * 2:  30  20  10  20
> -  * 3:  40  30  20  10
> -  *
> -  *   0 --- 1 --- 2 --- 3
> -  *
> -  * NUMA-3   0-3 N/A N/A 0-3
> -  *  groups: {0-2},{1-3} 
> {1-3},{0-2}
> -  *
> -  * NUMA-2   0-2 0-3 0-3 1-3
> -  *  groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} 
> {2-3},{0-2}
> -  *
> -  * NUMA-1   0-1 0-2 1-3 2-3
> -  *  groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2}
> -  *
> -  * NUMA-0   0   1   2   3
> -  *
> -  * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
> -  * group span isn't a subset of the domain span.
> -  */
> - WARN_ONCE(numa_distance > 2, "Shortest NUMA path spans too many 
> nodes\n");
> -
>   sched_domain_debug(sd, cpu);
> 
>   rq_attach_root(rq, rd);
> --
> 2.27.0

Thanks
Barry



RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> David Hildenbrand
> Sent: Monday, February 8, 2021 9:22 PM
> To: Song Bao Hua (Barry Song) ; Matthew Wilcox
> 
> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
> >
> >
> >> -Original Message-
> >> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf
> Of
> >> Matthew Wilcox
> >> Sent: Monday, February 8, 2021 2:31 PM
> >> To: Song Bao Hua (Barry Song) 
> >> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> >> io...@lists.linux-foundation.org; linux...@kvack.org;
> >> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> >> Morton ; Alexander Viro
> ;
> >> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> >> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> >> ; zhangfei@linaro.org; chensihang (A)
> >> 
> >> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> >> pin
> >>
> >> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:
> >>>>> In high-performance I/O cases, accelerators might want to perform
> >>>>> I/O on a memory without IO page faults which can result in dramatically
> >>>>> increased latency. Current memory related APIs could not achieve this
> >>>>> requirement, e.g. mlock can only avoid memory to swap to backup device,
> >>>>> page migration can still trigger IO page fault.
> >>>>
> >>>> Well ... we have two requirements.  The application wants to not take
> >>>> page faults.  The system wants to move the application to a different
> >>>> NUMA node in order to optimise overall performance.  Why should the
> >>>> application's desires take precedence over the kernel's desires?  And why
> >>>> should it be done this way rather than by the sysadmin using numactl to
> >>>> lock the application to a particular node?
> >>>
> >>> NUMA balancer is just one of many reasons for page migration. Even one
> >>> simple alloc_pages() can cause memory migration in just single NUMA
> >>> node or UMA system.
> >>>
> >>> The other reasons for page migration include but are not limited to:
> >>> * memory move due to CMA
> >>> * memory move due to huge pages creation
> >>>
> >>> Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> >>> in the whole system.
> >>
> >> You're dodging the question.  Should the CMA allocation fail because
> >> another application is using SVA?
> >>
> >> I would say no.
> >
> > I would say no as well.
> >
> > While IOMMU is enabled, CMA almost has one user only: IOMMU driver
> > as other drivers will depend on iommu to use non-contiguous memory
> > though they are still calling dma_alloc_coherent().
> >
> > In iommu driver, dma_alloc_coherent is called during initialization
> > and there is no new allocation afterwards. So it wouldn't cause
> > runtime impact on SVA performance. Even there is new allocations,
> > CMA will fall back to general alloc_pages() and iommu drivers are
> > almost allocating small memory for command queues.
> >
> > So I would say general compound pages, huge pages, especially
> > transparent huge pages, would be bigger concerns than CMA for
> > internal page migration within one NUMA.
> >
> > Not like CMA, general alloc_pages() can get memory by moving
> > pages other than those pinned.
> >
> > And there is no guarantee we can always bind the memory of
> > SVA applications to single one NUMA, so NUMA balancing is
> > still a concern.
> >
> > But I agree we need a way to make CMA success while the userspace
> > pages are pinned. Since pin has been viral in many drivers, I
> > assume there is a way to handle this. Otherwise, APIs like
> > V4L2_MEMORY_USERPTR[1] will

RE: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups

2021-02-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Thursday, February 4, 2021 4:55 AM
> To: linux-kernel@vger.kernel.org
> Cc: vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> pet...@infradead.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org; Song Bao Hua
> (Barry Song) ; Meelis Roos 
> Subject: [RFC PATCH 1/2] sched/topology: Get rid of NUMA overlapping groups
> 
> As pointed out in commit
> 
>   b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
> 
> overlapping groups result in broken topology data structures whenever the
> underlying system has a NUMA diameter greater than 2. This stems from
> overlapping groups being built from sibling domain's spans, yielding bogus
> transitivity relations the like of:
> 
>   distance(A, B) <= 30 && distance(B, C) <= 20
> =>
>   distance(A, C) <= 30
> 
> As discussed with Barry, a feasible approach is to catch bogus overlapping
> groups and fix them after the fact [1].
> 
> A more proactive approach would be to prevent aforementioned bogus
> relations from being built altogether, implies departing from the
> "group span is sibling domain child's span" strategy. Said strategy only
> works for diameter <= 2, which fortunately or unfortunately is currently
> the most common case.
> 
> The chosen approach is, for NUMA domains:
> a) have the local group be the child domain's span, as before
> b) have all remote groups span only their respective node
> 
> This boils down to getting rid of overlapping groups.
> 

Hi Valentin,

While I like your approach, it will require more time to evaluate
its possible influence, as it also affects machines without the
3-hops issue. So x86 platforms need to be tested and benchmarks
are required.

How about we first finish the review of the "grandchild" approach
v2, so we have a solution for Kunpeng 920 and the Sun Fire X4600-M2
without impacting machines that don't have the 3-hops issue?

I would very much appreciate it if you could comment on v2:
https://lore.kernel.org/lkml/20210203111201.20720-1-song.bao@hisilicon.com/


> Note that b) requires introducing cross sched_domain_topology_level
> references for sched_group_capacity. This is a somewhat prickly matter as
> we need to ensure whichever group we hook into won't see its domain
> degenerated (which was never an issue when such references were bounded
> within a single topology level).
> 
> This lifts the NUMA diameter restriction, although yields more groups in
> the NUMA domains. As an example, here is the distance matrix for
> an AMD Epyc:
> 
>   node   0   1   2   3   4   5   6   7
> 0:  10  16  16  16  32  32  32  32
> 1:  16  10  16  16  32  32  32  32
> 2:  16  16  10  16  32  32  32  32
> 3:  16  16  16  10  32  32  32  32
> 4:  32  32  32  32  10  16  16  16
> 5:  32  32  32  32  16  10  16  16
> 6:  32  32  32  32  16  16  10  16
> 7:  32  32  32  32  16  16  16  10
> 
> Emulating this on QEMU yields, before the patch:
>   [0.386745] CPU0 attaching sched-domain(s):
>   [0.386969]  domain-0: span=0-3 level=NUMA
>   [0.387708]   groups: 0:{ span=0 cap=1008 }, 1:{ span=1 cap=1007 },
> 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }
>   [0.388505]   domain-1: span=0-7 level=NUMA
>   [0.388700]groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
>   [0.389861] CPU1 attaching sched-domain(s):
>   [0.390020]  domain-0: span=0-3 level=NUMA
>   [0.390200]   groups: 1:{ span=1 cap=1007 }, 2:{ span=2 cap=1007 },
> 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }
>   [0.390701]   domain-1: span=0-7 level=NUMA
>   [0.390874]groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
>   [0.391460] CPU2 attaching sched-domain(s):
>   [0.391664]  domain-0: span=0-3 level=NUMA
>   [0.392750]   groups: 2:{ span=2 cap=1007 }, 3:{ span=3 cap=998 }, 0:{ 
> span=0
> cap=1008 }, 1:{ span=1 cap=1007 }
>   [0.393672]   domain-1: span=0-7 level=NUMA
>   [0.393961]groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
>   [0.394645] CPU3 attaching sched-domain(s):
>   [0.394792]  domain-0: span=0-3 level=NUMA
>   [0.394961]   groups: 3:{ span=3 cap=998 }, 0:{ span=0 cap=1008 }, 1:{ 
> span=1
> cap=1007 }, 2:{ span=2 cap=1007 }
>   [0.395749]   domain-1: span=0-7 level=NUMA
>   [0.396098]groups: 0:{ span=0-3 cap=4020 }, 4:{ span=4-7 cap=4014 }
>   [0.396455] CPU4 attaching sched-domain(s):
>   [0.396603]  doma

RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: David Rientjes [mailto:rient...@google.com]
> Sent: Monday, February 8, 2021 3:18 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Matthew Wilcox ; Wangzhou (B)
> ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, 7 Feb 2021, Song Bao Hua (Barry Song) wrote:
> 
> > NUMA balancer is just one of many reasons for page migration. Even one
> > simple alloc_pages() can cause memory migration in just single NUMA
> > node or UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge pages creation
> >
> > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> > in the whole system.
> >
> 
> What about only for mlocked memory, i.e. disable
> vm.compact_unevictable_allowed?
> 
> Adding syscalls is a big deal, we can make a reasonable inference that
> we'll have to support this forever if it's merged.  I haven't seen mention
> of what other unevictable memory *should* be migratable that would be
> adversely affected if we disable that sysctl.  Maybe that gets you part of
> the way there and there are some other deficiencies, but it seems like a
> good start would be to describe how CONFIG_NUMA_BALANCING=n +
> vm.compact_unevcitable_allowed + mlock() doesn't get you mostly there and
> then look into what's missing.
> 

I believe the performance problem of SVA applications can be resolved
if we disable vm.compact_unevictable_allowed and NUMA balancing, and
use mlock().

The problem is that it is unreasonable to ask users to disable
compact_unevictable_allowed or NUMA balancing for the whole system
only because there is one SVA application in it.

SVA, in itself, is a mechanism to let the CPU and devices share the
same address space. A typical server system runs many processes; the
better way would be to change the behavior of only the specific
process rather than the whole system. It is hard to ask users to do
that just because of one SVA monster.
Plus, this might negatively affect applications not using SVA.

> If it's a very compelling case where there simply are no alternatives, it
> would make sense.  Alternative is to find a more generic way, perhaps in
> combination with vm.compact_unevictable_allowed, to achieve what you're
> looking to do that can be useful even beyond your originally intended use
> case.

Sensible. Actually, pinning is exactly the way to disable migration
for specific pages, i.e. to disable "vm.compact_unevictable_allowed"
on just those pages.

It is hard to tell which pages should not be migrated. Only the
application knows, as even SVA applications can allocate many non-IO
pages which should remain movable.

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Matthew Wilcox
> Sent: Monday, February 8, 2021 2:31 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org; Andrew
> Morton ; Alexander Viro ;
> gre...@linuxfoundation.org; j...@ziepe.ca; kevin.t...@intel.com;
> jean-phili...@linaro.org; eric.au...@redhat.com; Liguozhu (Kenneth)
> ; zhangfei@linaro.org; chensihang (A)
> 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 10:24:28PM +, Song Bao Hua (Barry Song) wrote:
> > > > In high-performance I/O cases, accelerators might want to perform
> > > > I/O on a memory without IO page faults which can result in dramatically
> > > > increased latency. Current memory related APIs could not achieve this
> > > > requirement, e.g. mlock can only avoid memory to swap to backup device,
> > > > page migration can still trigger IO page fault.
> > >
> > > Well ... we have two requirements.  The application wants to not take
> > > page faults.  The system wants to move the application to a different
> > > NUMA node in order to optimise overall performance.  Why should the
> > > application's desires take precedence over the kernel's desires?  And why
> > > should it be done this way rather than by the sysadmin using numactl to
> > > lock the application to a particular node?
> >
> > NUMA balancer is just one of many reasons for page migration. Even one
> > simple alloc_pages() can cause memory migration in just single NUMA
> > node or UMA system.
> >
> > The other reasons for page migration include but are not limited to:
> > * memory move due to CMA
> > * memory move due to huge pages creation
> >
> > Hardly we can ask users to disable the COMPACTION, CMA and Huge Page
> > in the whole system.
> 
> You're dodging the question.  Should the CMA allocation fail because
> another application is using SVA?
> 
> I would say no.  

I would say no as well.

While an IOMMU is enabled, CMA has almost only one user: the IOMMU
driver, as other drivers depend on the iommu to use non-contiguous
memory even though they still call dma_alloc_coherent().

In the iommu driver, dma_alloc_coherent() is called during
initialization and there are no new allocations afterwards, so it
wouldn't cause a runtime impact on SVA performance. Even if there are
new allocations, CMA will fall back to the general alloc_pages(), and
iommu drivers mostly allocate small memory for command queues.

So I would say general compound pages, huge pages, and especially
transparent huge pages, would be bigger concerns than CMA for
internal page migration within one NUMA node.

Unlike CMA, the general alloc_pages() can get memory by moving
pages other than the pinned ones.

And there is no guarantee we can always bind the memory of
SVA applications to a single NUMA node, so NUMA balancing is
still a concern.

But I agree we need a way to make CMA succeed while userspace pages
are pinned. Since pinning has spread virally through many drivers, I
assume there is a way to handle this. Otherwise, APIs like
V4L2_MEMORY_USERPTR[1] could make CMA fail, as there is no guarantee
that userspace will allocate unmovable memory, nor that the fallback
path, alloc_pages(), can succeed when allocating big memory.

Will investigate more.
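
For what it's worth, the existing long-term pinning path in GUP
already tries to address exactly this conflict: FOLL_LONGTERM makes
the kernel migrate pages out of the CMA area before pinning them. A
rough sketch of that driver-side pattern (real API, error handling
trimmed):

	/* Pin a long-lived user buffer; CMA pages are migrated to
	 * ordinary memory first, so the pin can't block future CMA
	 * allocations. */
	int pinned = pin_user_pages_fast(uaddr, nr_pages,
					 FOLL_WRITE | FOLL_LONGTERM,
					 pages);
	if (pinned < 0)
		return pinned;
	...
	unpin_user_pages(pages, pinned);	/* on teardown */

Presumably a mempinfd-style interface would need the same
FOLL_LONGTERM treatment internally.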

> The application using SVA should take the one-time
> performance hit from having its memory moved around.

Sometimes I also feel SVA is doomed to suffer a performance impact
from page migration. But we are still trying to extend its use cases
to high-performance I/O.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/media/v4l2-core/videobuf-dma-sg.c

Thanks
Barry


RE: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin

2021-02-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Matthew Wilcox [mailto:wi...@infradead.org]
> Sent: Monday, February 8, 2021 10:34 AM
> To: Wangzhou (B) 
> Cc: linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org;
> linux...@kvack.org; linux-arm-ker...@lists.infradead.org;
> linux-...@vger.kernel.org; Andrew Morton ;
> Alexander Viro ; gre...@linuxfoundation.org; Song
> Bao Hua (Barry Song) ; j...@ziepe.ca;
> kevin.t...@intel.com; jean-phili...@linaro.org; eric.au...@redhat.com;
> Liguozhu (Kenneth) ; zhangfei@linaro.org;
> chensihang (A) 
> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory
> pin
> 
> On Sun, Feb 07, 2021 at 04:18:03PM +0800, Zhou Wang wrote:
> > SVA(share virtual address) offers a way for device to share process virtual
> > address space safely, which makes more convenient for user space device
> > driver coding. However, IO page faults may happen when doing DMA
> > operations. As the latency of IO page fault is relatively big, DMA
> > performance will be affected severely when there are IO page faults.
> > >From a long term view, DMA performance will be not stable.
> >
> > In high-performance I/O cases, accelerators might want to perform
> > I/O on a memory without IO page faults which can result in dramatically
> > increased latency. Current memory related APIs could not achieve this
> > requirement, e.g. mlock can only avoid memory to swap to backup device,
> > page migration can still trigger IO page fault.
> 
> Well ... we have two requirements.  The application wants to not take
> page faults.  The system wants to move the application to a different
> NUMA node in order to optimise overall performance.  Why should the
> application's desires take precedence over the kernel's desires?  And why
> should it be done this way rather than by the sysadmin using numactl to
> lock the application to a particular node?

The NUMA balancer is just one of many causes of page migration. Even
a simple alloc_pages() can cause memory migration within a single
NUMA node or on a UMA system.

The other causes of page migration include, but are not limited to:
* memory movement due to CMA
* memory movement due to huge page creation

We can hardly ask users to disable COMPACTION, CMA and huge pages
for the whole system.

On the other hand, numactl doesn't always bind memory to a single
NUMA node; sometimes, when an application requires many CPUs, it may
bind more than one memory node.

Thanks
Barry



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 11:36 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; io...@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 10:32:26AM +, Song Bao Hua (Barry Song) wrote:
> > I can keep the struct size unchanged by changing the struct to
> >
> > struct map_benchmark {
> > __u64 avg_map_100ns; /* average map latency in 100ns */
> > __u64 map_stddev; /* standard deviation of map latency */
> > __u64 avg_unmap_100ns; /* as above */
> > __u64 unmap_stddev;
> > __u32 threads; /* how many threads will do map/unmap in parallel */
> > __u32 seconds; /* how long the test will last */
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > __u32 dma_trans_ns; /* time for DMA transmission in ns */
> >
> > __u32 exp; /* For future use */
> > __u64 expansion[9]; /* For future use */
> > };
> >
> > But the code is really ugly now.
> 
> Thats why we usually use __u8 fields for reserved field.  You might
> consider just switching to that instead while you're at it. I guess
> we'll just have to get the addition into 5.11 then to make sure we
> don't release a kernel with the alignment fix.

I assume there is no need to keep the same size as 5.11-rc, so we
could change the struct to:

struct map_benchmark {
	__u64 avg_map_100ns; /* average map latency in 100ns */
	__u64 map_stddev; /* standard deviation of map latency */
	__u64 avg_unmap_100ns; /* as above */
	__u64 unmap_stddev;
	__u32 threads; /* how many threads will do map/unmap in parallel */
	__u32 seconds; /* how long the test will last */
	__s32 node; /* which numa node this benchmark will run on */
	__u32 dma_bits; /* DMA addressing capability */
	__u32 dma_dir; /* DMA data direction */
	__u8 expansion[84]; /* For future use */
};

This won't increase the size on 64-bit systems, but it adds 4 bytes
on 32-bit systems compared to 5.11-rc. What do you think about it?

Thanks
Barry



RE: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, February 5, 2021 10:21 PM
> To: Song Bao Hua (Barry Song) 
> Cc: m.szyprow...@samsung.com; h...@lst.de; robin.mur...@arm.com;
> io...@lists.linux-foundation.org; linux-kernel@vger.kernel.org;
> linux...@openeuler.org
> Subject: Re: [PATCH v2] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On Fri, Feb 05, 2021 at 03:00:35PM +1300, Barry Song wrote:
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> 
> We need to keep the struct size, so the expansion field needs to
> shrink by the equivalent amount of data that is added in dma_trans_ns.

Unfortunately I didn't put a reserved u32 field after dma_dir
in the original patch.
There were five 32-bit fields before expansion[]:

struct map_benchmark {
	__u64 avg_map_100ns; /* average map latency in 100ns */
	__u64 map_stddev; /* standard deviation of map latency */
	__u64 avg_unmap_100ns; /* as above */
	__u64 unmap_stddev;
	__u32 threads; /* how many threads will do map/unmap in parallel */
	__u32 seconds; /* how long the test will last */
	__s32 node; /* which numa node this benchmark will run on */
	__u32 dma_bits; /* DMA addressing capability */
	__u32 dma_dir; /* DMA data direction */
	__u64 expansion[10];/* For future use */
};

My bad. That was really silly. I should have done the below from
the very beginning:
struct map_benchmark {
	__u64 avg_map_100ns; /* average map latency in 100ns */
	__u64 map_stddev; /* standard deviation of map latency */
	__u64 avg_unmap_100ns; /* as above */
	__u64 unmap_stddev;
	__u32 threads; /* how many threads will do map/unmap in parallel */
	__u32 seconds; /* how long the test will last */
	__s32 node; /* which numa node this benchmark will run on */
	__u32 dma_bits; /* DMA addressing capability */
	__u32 dma_dir; /* DMA data direction */
	__u32 rsv;
	__u64 expansion[10];/* For future use */
};

So on 64-bit systems, this patch doesn't change the length of the
struct, as the newly added u32 just fills the gap between dma_dir and
expansion. On 32-bit systems, this patch increases the length by 4
bytes.
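
A small userspace sketch (not part of the patch, field names
simplified) makes the layout difference visible:

	#include <stdio.h>
	#include <stdint.h>

	struct old_layout {		/* five 32-bit fields */
		uint64_t stats[4];
		uint32_t threads, seconds;
		int32_t  node;
		uint32_t dma_bits, dma_dir;
		uint64_t expansion[10];
	};

	struct new_layout {		/* dma_trans_ns added */
		uint64_t stats[4];
		uint32_t threads, seconds;
		int32_t  node;
		uint32_t dma_bits, dma_dir, dma_trans_ns;
		uint64_t expansion[10];
	};

	int main(void)
	{
		printf("old=%zu new=%zu\n", sizeof(struct old_layout),
		       sizeof(struct new_layout));
		return 0;
	}

On LP64 this should print 136 for both, since expansion[] is 8-byte
aligned and the old layout already had a 4-byte hole after dma_dir;
on an ABI that aligns u64 to 4 bytes (e.g. i386) it should print 132
and 136, i.e. the 4-byte growth mentioned above.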

I can keep the struct size unchanged by changing the struct to

struct map_benchmark {
	__u64 avg_map_100ns; /* average map latency in 100ns */
	__u64 map_stddev; /* standard deviation of map latency */
	__u64 avg_unmap_100ns; /* as above */
	__u64 unmap_stddev;
	__u32 threads; /* how many threads will do map/unmap in parallel */
	__u32 seconds; /* how long the test will last */
	__s32 node; /* which numa node this benchmark will run on */
	__u32 dma_bits; /* DMA addressing capability */
	__u32 dma_dir; /* DMA data direction */
	__u32 dma_trans_ns; /* time for DMA transmission in ns */

	__u32 exp; /* For future use */
	__u64 expansion[9]; /* For future use */
};

But the code is really ugly now.

Thanks
Barry


RE: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, February 5, 2021 12:51 PM
> To: Song Bao Hua (Barry Song) ;
> m.szyprow...@samsung.com; h...@lst.de; io...@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org; linux...@openeuler.org
> Subject: Re: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting
> 
> On 2021-02-04 22:58, Barry Song wrote:
> > In a real dma mapping user case, after dma_map is done, data will be
> > transmit. Thus, in multi-threaded user scenario, IOMMU contention
> > should not be that severe. For example, if users enable multiple
> > threads to send network packets through 1G/10G/100Gbps NIC, usually
> > the steps will be: map -> transmission -> unmap.  Transmission delay
> > reduces the contention of IOMMU. Here a delay is added to simulate
> > the transmission for TX case so that the tested result could be
> > more accurate.
> >
> > RX case would be much more tricky. It is not supported yet.
> 
> I guess it might be a reasonable approximation to map several pages,
> then unmap them again after a slightly more random delay. Or maybe
> divide the threads into pairs of mappers and unmappers respectively
> filling up and draining proper little buffer pools.

Yes, good suggestions. I am actually thinking about how to support
cases like networking. There is a pre-mapped list of pages, and each
page is bound to a hardware DMA block descriptor (BD). If Linux can
consume the packets in time, those buffers are always re-used. Only
when the page bound to a BD is full and the OS can't consume it in
time is another temporary page allocated and mapped; the BD switches
to this temporary page, which is finally unmapped once it is no longer
needed. The pre-mapped pages, on the other hand, are never unmapped;
see the sketch below.
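
Very roughly (a sketch of the driver pattern using the generic DMA
API; next_full_bd()/consume() stand in for driver specifics):

	/* Setup: pages mapped once, bound to BDs, never unmapped. */
	for (i = 0; i < RING_SIZE; i++)
		bd[i].dma = dma_map_page(dev, bd[i].page, 0,
					 PAGE_SIZE, DMA_FROM_DEVICE);

	while (rx_running) {
		struct bd *b = next_full_bd();	/* placeholder */

		dma_sync_single_for_cpu(dev, b->dma, PAGE_SIZE,
					DMA_FROM_DEVICE);
		if (consume(b->page)) {		/* consumed in time */
			dma_sync_single_for_device(dev, b->dma,
					PAGE_SIZE, DMA_FROM_DEVICE);
		} else {			/* overflow path */
			struct page *tmp = alloc_page(GFP_ATOMIC);

			b->dma = dma_map_page(dev, tmp, 0, PAGE_SIZE,
					      DMA_FROM_DEVICE);
			/* dma_unmap_page() once tmp is drained */
		}
	}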

For things like filesystems and disk drivers, RX is always requested
by users, so the model is simpler: map -> rx -> unmap. For networking,
RX can arrive spontaneously.

Anyway, I'll put this into TBD. For the moment, mainly handle the TX
path. Or maybe the current code can already handle a simple RX model :-)

> 
> > Signed-off-by: Barry Song 
> > ---
> >   kernel/dma/map_benchmark.c  | 11 +++
> >   tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
> >   2 files changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> > index 1b1b8ff875cb..1976db7e34e4 100644
> > --- a/kernel/dma/map_benchmark.c
> > +++ b/kernel/dma/map_benchmark.c
> > @@ -21,6 +21,7 @@
> >   #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
> >   #define DMA_MAP_MAX_THREADS   1024
> >   #define DMA_MAP_MAX_SECONDS   300
> > +#define DMA_MAP_MAX_TRANS_DELAY(10 * 1000 * 1000) /* 10ms */
> 
> Using MSEC_PER_SEC might be sufficiently self-documenting?

Yes, I guess you mean NSEC_PER_MSEC. Will switch to it.

> 
> >   #define DMA_MAP_BIDIRECTIONAL 0
> >   #define DMA_MAP_TO_DEVICE 1
> > @@ -36,6 +37,7 @@ struct map_benchmark {
> > __s32 node; /* which numa node this benchmark will run on */
> > __u32 dma_bits; /* DMA addressing capability */
> > __u32 dma_dir; /* DMA data direction */
> > +   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> > __u64 expansion[10];/* For future use */
> >   };
> >
> > @@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)
> > map_etime = ktime_get();
> > map_delta = ktime_sub(map_etime, map_stime);
> >
> > +   /* Pretend DMA is transmitting */
> > +   if (map->dir != DMA_FROM_DEVICE)
> > +   ndelay(map->bparam.dma_trans_ns);
> 
> TBH I think the option of a fixed delay between map and unmap might be a
> handy thing in general, so having the direction check at all seems
> needlessly restrictive. As long as the driver implements all the basic
> building blocks, combining them to simulate specific traffic patterns
> can be left up to the benchmark tool.

Sensible, will remove the condition check.

> 
> Robin.
> 
> > +
> > unmap_stime = ktime_get();
> > dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> > unmap_etime = ktime_get();
> > @@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file,
> unsigned int cmd,
> > return -EINVAL;
> > }
> >
> > +   if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {
> > +   pr_err("invalid transmission delay\n");
> > +  

RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-03 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Meelis Roos [mailto:mr...@linux.ee]
> Sent: Thursday, February 4, 2021 12:58 AM
> To: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; vincent.guit...@linaro.org; mgor...@suse.de;
> mi...@kernel.org; pet...@infradead.org; dietmar.eggem...@arm.com;
> morten.rasmus...@arm.com; linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> 03.02.21 13:12 Barry Song wrote:
> > kernel/sched/topology.c | 85 +
> >   1 file changed, 53 insertions(+), 32 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5d3675c7a76b..964ed89001fe 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> 
> This one still works on the Sun X4600-M2, on top of 
> v5.11-rc6-55-g3aaf0a27ffc2.
> 
> 
> Performance-wise - is the some simple benhmark to run to meaure the impact?
> Compared to what - 5.10.0 or the kernel with the warning?

Hi Meelis,
Thanks for retesting.

Comparing against the kernel with the warning is enough. As I mentioned here:
https://lore.kernel.org/lkml/20210115203632.34396-1-song.bao@hisilicon.com/

I have seen two major issues with the broken sched_group:

* In load_balance() and find_busiest_group(), the kernel calculates
the avg_load and group_type as:

	sum(load of cpus within sched_domain)
	-------------------------------------
	capacity of the whole sched_group

Since the sched_group isn't a subset of the sched_domain, the load of
the problematic group is severely underestimated.

     sched_domain
   +---------------------+
   |  +------+  +------+ |
   |  | cpu0 |  | cpu1 | |
   |  +------+  +------+ |
 +-|---------------------|-+
 | |                     | |
 | +---------------------+ |
 |                         |
 |   +------+  +------+    |
 |   | cpu2 |  | cpu3 |    |
 |   +------+  +------+    |
 +-------------------------+
    problematic sched_group


For the above example, the kernel will divide "the sum of the load of
cpu0 and cpu1" by "the capacity of the whole group including cpu0, 1,
2 and 3"; see the example below.
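
As a concrete example: if every cpu has capacity 1024 and cpu0 and
cpu1 are both fully loaded while cpu2 and cpu3 are idle, the group's
avg_load is computed from a load of 2048 against a capacity of 4096
rather than 2048, so the group looks only half as busy as it really
is.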

* In select_task_rq_fair() and find_idlest_group(), the kernel could
push a forked/exec-ed task outside the sched_domain but still inside
the sched_group. For the above diagram, while the kernel wants to
find the idlest cpu within the sched_domain, it can end up picking
cpu2 or cpu3.

I guess these two issues can potentially affect many benchmarks.
Our team has seen a 5% unixbench score increase with the fix on some
machines, though the real impact will be case-by-case.

> 
> drop caches and time the build time of linux kernel with make -j64?
> 
> --
> Meelis Roos

Thanks
Barry


RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-02-03 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Tim Chen [mailto:tim.c.c...@linux.intel.com]
> Sent: Friday, January 8, 2021 12:17 PM
> To: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> mgor...@suse.de; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> aubrey...@linux.intel.com
> Cc: linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Zengtao (B) ; tiantao (H)
> 
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> 
> 
> On 1/6/21 12:30 AM, Barry Song wrote:
> > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data while each cluster
> > has local L3 tag. On the other hand, each cluster will share some
> > internal system bus. This means cache is much more affine inside one cluster
> > than across clusters.
> >
> > +-----------------------------------+            +---------+
> > |  +------+    +------+             |            |         |
> > |  | CPU0 |    | cpu1 |             |  +-----+   |         |
> > |  +------+    +------+             |  |     |   |         |
> > |                          cluster  +--+ L3  |   |         |
> > |  +------+    +------+             |  | tag |   |         |
> > |  | CPU2 |    | CPU3 |             |  |     |   |         |
> > |  +------+    +------+             |  +-----+   |         |
> > +-----------------------------------+            |         |
> >                                                  |         |
> > +-----------------------------------+            |         |
> > |  +------+    +------+             |  +-----+   |         |
> > |  |      |    |      |             |  |     |   |         |
> > |  +------+    +------+             |  | L3  |   |   L3    |
> > |                                   +--+ tag |   |  data   |
> > |  +------+    +------+             |  |     |   |         |
> > |  |      |    |      |             |  +-----+   |         |
> > |  +------+    +------+             |            |         |
> > +-----------------------------------+            |         |
> >                                                  |         |
> > +-----------------------------------+            |         |
> > |  +------+    +------+             |  +-----+   |         |
> > |  |      |    |      |             |  |     |   |         |
> > |  +------+    +------+             |  | L3  |   |         |
> > |                                   +--+ tag |   |         |
> > |  +------+    +------+             |  |     |   |         |
> > |  |      |    |      |             |  +-----+   |         |
> > |  +------+    +------+             |            +---------+
> > +-----------------------------------+

RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-03 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Wednesday, February 3, 2021 4:17 AM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> pet...@infradead.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org; Song Bao Hua
> (Barry Song) ; Meelis Roos 
> Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> On 01/02/21 16:38, Barry Song wrote:
> > @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct 
> > sched_domain
> *sd,
> >
> >   build_balance_mask(sd, sg, mask);
> >   cpu = cpumask_first_and(sched_group_span(sg), mask);
> > +   /*
> > +* for the group generated by grandchild, use the sgc of 2nd cpu
> > +* because the 1st cpu might be used by another sched_group
> > +*/
> > +   if (from_grandchild && cpumask_weight(mask) > 1)
> > +   cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
> >
> >   sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> 
> So you are getting a (hopefully) unique ID for this group span at this
> given topology level (i.e. sd->private) but as I had stated in that list of
> issues, this creates an sgc that isn't attached to the local group of any
> sched_domain, and thus won't get its capacity values updated.
> 
> This can actually be seen via the capacity values you're getting at build
> time:
> 
> > [0.868907] CPU0 attaching sched-domain(s):
> ...
> > [0.869542]domain-2: span=0-5 level=NUMA
> > [0.869559] groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
>   
> > [0.871177] CPU4 attaching sched-domain(s):
> ...
> > [0.871200]   groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> > [0.871243]   domain-1: span=4-7 level=NUMA
> > [0.871257]groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
> 
> 

Yes. I could see this issue.  We could hack update_group_capacity to let
it scan both local_group  and sched_group generated by grandchild, but it
seems your edit is much better.

> IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
> of CPU4-domain1-group4. I've done that in the below diff - this gives us
> groups with sgc's owned at lower topology levels, but this will only ever
> be true for non-local groups. This has the added benefit of working with
> single-CPU nodes. Briefly tested on your topology and the sunfire's (via
> QEMU), and I didn't get screamed at.
> 
> Before the fun police comes and impounds my keyboard, I'd like to point out
> that we could leverage this cross-level sgc referencing hack to further
> change the NUMA domains and pretty much get rid of overlapping groups
> (that's what I was fumbling with in [1]).
> 
> [1]: http://lore.kernel.org/r/jhjwnw11ak2.mog...@arm.com
> 
> That is, rather than building overlapping groups and fixing them whenever
> that breaks (distance > 2), we could have:
> - the local group being the child domain's span (as always)
> - all non-local NUMA groups spanning a single node each, with the right sgc
>   cross-referencing.
> 
> Thoughts?

I guess the original purpose of overlapping groups was to create as few
groups as possible. If we totally remove overlapping groups, it seems we
will create many more groups?
For example, when node0 begins to build its sched_domain for distance 20,
it will add node2; and since the distance between node2 and node3 is 15,
when node2 is added, node3 is also added, as node2's lower domain has
already covered node3. So we need only two groups for node0's
sched_domain at distance level 20.
 +-------+  15  +-------+
 | node0 +------+ node1 |
 +---+---+      +--+----+
     |            /
  20 |        15 /
     |          /
 +---+---+     /    +-------+
 | node2 +----'     | node3 |
 |       |    15    |       |
 |       +----------+       |
 +-------+          +-------+

If we remove overlapping groups, we will add one group for node2 and
another group for node3. Then we get three groups.

I am not sure if it is always positive for performance.

> 
> --->8---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b

RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Tuesday, February 2, 2021 3:52 PM
> To: Jason Gunthorpe 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, February 2, 2021 7:44 AM
> >
> > On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > > we would get both sharing address and stable I/O latency.
> > >
> > > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > > cpu_va of the memory pool as the iova?
> >
> > I think their issue is the HW can't do the cpu_va trick without also
> > involving the system IOMMU in a SVA mode
> >
> 
> This is the part that I didn't understand. Using cpu_va in a MAP_DMA
> interface doesn't require device support. It's just an user-specified
> address to be mapped into the IOMMU page table. On the other hand,

The background is that uacce is based on SVA and we are building
applications on uacce:
https://www.kernel.org/doc/html/v5.10/misc-devices/uacce.html
so the IOMMU simply uses the page table of the MMU and doesn't do any
special mapping to a user-specified address. We don't break the basic
assumption that uacce is using SVA; otherwise, we would need to re-build
uacce and the whole base.

> sharing CPU page table through a SVA interface for an usage where I/O
> page faults must be completely avoided seems a misleading attempt.

That is not about completely avoiding IO page faults; it is just an
extension for the high-performance I/O case, providing a way to avoid
IO latency jitter. Using it or not is totally up to users.

> Even if people do want this model (e.g. mix pinning+fault), it should be
> a mm syscall as Greg pointed out, not specific to sva.
> 

We are glad to make it a syscall if people are happy with
it. The simplest way would be a syscall similar to userfaultfd,
if we don't want to mess up mm_struct.

> Thanks
> Kevin

Thanks
Barry


RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, February 2, 2021 12:44 PM
> To: Tian, Kevin 
> Cc: Song Bao Hua (Barry Song) ; chensihang (A)
> ; Arnd Bergmann ; Greg
> Kroah-Hartman ; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org; Zhangfei Gao
> ; Liguozhu (Kenneth) ;
> linux-accelerat...@lists.ozlabs.org
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Fri, Jan 29, 2021 at 10:09:03AM +, Tian, Kevin wrote:
> > > SVA is not doom to work with IO page fault only. If we have SVA+pin,
> > > we would get both sharing address and stable I/O latency.
> >
> > Isn't it like a traditional MAP_DMA API (imply pinning) plus specifying
> > cpu_va of the memory pool as the iova?
> 
> I think their issue is the HW can't do the cpu_va trick without also
> involving the system IOMMU in a SVA mode
> 
> It really is something that belongs under some general /dev/sva as we
> talked on the vfio thread

AFAIK, there is no such /dev/sva yet, so /dev/uacce is a uAPI
which belongs to SVA.

Another option is that we add a system call like
fs/userfaultfd.c, and move the file_operations and ioctl
to an anon inode by creating the fd via anon_inode_getfd().
Then nothing will be buried in uacce.
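
A minimal sketch of that option, assuming a hypothetical sva_ctx /
sva_pin_fops pair (only anon_inode_getfd() and SYSCALL_DEFINE1() are
real kernel interfaces here):

	static const struct file_operations sva_pin_fops = {
		.owner		= THIS_MODULE,
		.unlocked_ioctl	= sva_pin_ioctl,	/* hypothetical */
		.release	= sva_pin_release,	/* hypothetical */
	};

	SYSCALL_DEFINE1(sva_pin, unsigned int, flags)
	{
		struct sva_ctx *ctx = sva_ctx_alloc();	/* hypothetical */

		if (!ctx)
			return -ENOMEM;

		/* the returned fd owns ctx; pin/unpin ioctls operate on it */
		return anon_inode_getfd("[sva_pin]", &sva_pin_fops, ctx,
					O_RDWR | O_CLOEXEC);
	}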

> 
> Jason

Thanks
Barry



RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

2021-02-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Tuesday, February 2, 2021 7:11 AM
> To: Song Bao Hua (Barry Song) ;
> vincent.guit...@linaro.org; mgor...@suse.de; mi...@kernel.org;
> pet...@infradead.org; dietmar.eggem...@arm.com; morten.rasmus...@arm.com;
> linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; xuwei (O) ; Liguozhu (Kenneth)
> ; tiantao (H) ; wanghuiqiang
> ; Zengtao (B) ; Jonathan
> Cameron ; guodong...@linaro.org; Song Bao Hua
> (Barry Song) ; Meelis Roos 
> Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
> 
> 
> Hi,
> 
> On 01/02/21 16:38, Barry Song wrote:
> > A tricky thing is that we shouldn't use the sgc of the 1st CPU of node2
> > for the sched_group generated by grandchild, otherwise, when this cpu
> > becomes the balance_cpu of another sched_group of cpus other than node0,
> > our sched_group generated by grandchild will access the same sgc with
> > the sched_group generated by child of another CPU.
> >
> > So in init_overlap_sched_group(), sgc's capacity be overwritten:
> > build_balance_mask(sd, sg, mask);
> > cpu = cpumask_first_and(sched_group_span(sg), mask);
> >
> > sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> >
> > And WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask)) will
> > also be triggered:
> > static void init_overlap_sched_group(struct sched_domain *sd,
> >  struct sched_group *sg)
> > {
> > if (atomic_inc_return(>sgc->ref) == 1)
> > cpumask_copy(group_balance_mask(sg), mask);
> > else
> > WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask));
> > }
> >
> > > So here we move to use the sgc of the 2nd cpu. For the corner case, if a NUMA
> > > node has only one CPU, we will still trigger this WARN_ON_ONCE. But it is
> > > really unlikely to be a real case for one NUMA node to have only one CPU.
> >
> 
> Well, it's trivial to boot this with QEMU, and it's actually the example
> the comment atop that WARN_ONCE() is based on. Also, you could end up with
> a single CPU on a node during hotplug operations...

Hi Valentin,

The QEMU topology is just a reflection of the real Kunpeng 920 case, and
please also note that Meelis has also tested on other real hardware, an
"8-node Sun Fire X4600-M2", and gave his Tested-by.

It might not be a perfect fix, but it is the simplest way to fix this
for the moment and for real cases. A "perfect" fix would require major
refactoring of topology.c.

I don't think hotplug is very relevant: even if some CPUs are unplugged
and only one CPU is left in the sched_group of the sched_domain, the
related domain and group still get the right settings.

On the other hand, the corner case could literally be fixed, but it
would involve some very ugly code. I mean, two sched_groups can end up
using the same sgc:
1. the sched_group generated by a grandchild with only one NUMA node
2. the sched_group generated by a child with more than one NUMA node

Right now, I'm moving to the 2nd cpu for sched_group 1; if we also moved
to the 2nd cpu for sched_group 2, then having only one cpu in one NUMA
node wouldn't be a problem anymore. But the code would be very ugly.
So I would prefer to keep this assumption and just ignore the unreal
corner case.

> 
> I am not entirely sure whether having more than one CPU per node is a
> sufficient condition. I'm starting to *think* it is, but I'm not entirely
> convinced yet - and now I need a new notebook.

Me too. Some extremely complicated topology might break the assumption.
I really need a new notebook to draw the kind of complicated topology
that would break the assumption :-)

But it is sufficient for the existing real cases which need fixing. When,
someday, a real case in which each NUMA node has more than one CPU wakes
up the below warning:
WARN_ON_ONCE(!cpumask_equal(group_balance_mask(sg), mask))
it might be the right time to consider major refactoring of topology.c.

Thanks
Barry


RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

2021-02-01 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Monday, February 1, 2021 10:54 PM
> To: Valentin Schneider ;
> linux-kernel@vger.kernel.org
> Cc: mi...@kernel.org; pet...@infradead.org; vincent.guit...@linaro.org;
> morten.rasmus...@arm.com; mgor...@suse.de; Song Bao Hua (Barry Song)
> 
> Subject: Re: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for
> the deduplicating sort
> 
> On 22/01/2021 13:39, Valentin Schneider wrote:
> 
> [...]
> 
> > @@ -1705,7 +1702,7 @@ void sched_init_numa(void)
> > /* Compute default topology size */
> > for (i = 0; sched_domain_topology[i].mask; i++);
> >
> > -   tl = kzalloc((i + level + 1) *
> > +   tl = kzalloc((i + nr_levels) *
> > sizeof(struct sched_domain_topology_level), GFP_KERNEL);
> > if (!tl)
> > return;
> 
> This hunk creates issues during startup on my Arm64 juno board on 
> tip/sched/core.

I also reported this kernel panic here:
https://lore.kernel.org/lkml/bfb703294b234e1e926a68fcb73db...@hisilicon.com/#t

> 
> ---8<---
> 
> From: Dietmar Eggemann 
> Date: Mon, 1 Feb 2021 09:58:04 +0100
> Subject: [PATCH] sched/topology: Fix sched_domain_topology_level alloc in
>  sched_init_numa
> 
> Commit "sched/topology: Make sched_init_numa() use a set for the
> deduplicating sort" allocates 'i + nr_levels (level)' instead of
> 'i + nr_levels + 1' sched_domain_topology_level.
> 
> This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):
> 
> sched_init_domains
>   build_sched_domains()
> __free_domain_allocs()
>   __sdt_free() {
>   ...
> for_each_sd_topology(tl)
> ...
>   sd = *per_cpu_ptr(sdd->sd, j); <--
> ...
>   }
> 
> Signed-off-by: Dietmar Eggemann 
> ---

This patch also resolved my panic. So:

Tested-by: Barry Song 

Thanks
Barry



RE: [PATCH v3 01/12] genirq: add IRQF_NO_AUTOEN for request_irq

2021-01-31 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Friday, January 29, 2021 8:55 AM
> To: Song Bao Hua (Barry Song) ;
> dmitry.torok...@gmail.com; m...@kernel.org; gre...@linuxfoundation.org;
> linux-in...@vger.kernel.org; linux-kernel@vger.kernel.org
> Cc: linux...@openeuler.org; Song Bao Hua (Barry Song)
> 
> Subject: Re: [PATCH v3 01/12] genirq: add IRQF_NO_AUTOEN for request_irq
> 
> Barry,
> 
> On Fri, Jan 08 2021 at 11:39, Barry Song wrote:
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index ab8567f32501..2b28314e2572 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -1693,6 +1693,9 @@ __setup_irq(unsigned int irq, struct irq_desc *desc,
> struct irqaction *new)
> > irqd_set(>irq_data, IRQD_NO_BALANCING);
> > }
> >
> > +   if (new->flags & IRQF_NO_AUTOEN)
> > +   irq_settings_set_noautoen(desc);
> 
> If we move this to request time flags, then setting the noautoen bit on
> the irq descriptor is pretty pointless. See below.
> 
> I rather get rid of the irq_settings magic for NOAUTOEN completely.

Thanks for your comment, Thomas.

Got this issue fixed in v4:
https://lore.kernel.org/lkml/20210128223538.20272-1-song.bao@hisilicon.com/

btw, for those drivers which are using the first pattern:
irq_set_status_flags(irq, IRQ_NOAUTOEN);
request_irq(dev, irq...);

Simply running "git grep IRQ_NOAUTOEN" will help figure out where to fix them.

For those drivers which are using the second pattern:
request_irq(dev, irq...);
disable_irq(irq);
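
In both patterns the intended conversion is the same; a sketch with a
hypothetical handler and device name:

	/* before (either pattern):
	 *	irq_set_status_flags(irq, IRQ_NOAUTOEN);
	 *	ret = request_irq(irq, my_handler, 0, "mydev", dev);
	 * or:
	 *	ret = request_irq(irq, my_handler, 0, "mydev", dev);
	 *	disable_irq(irq);
	 *
	 * after - the IRQ stays off until an explicit enable_irq():
	 */
	ret = request_irq(irq, my_handler, IRQF_NO_AUTOEN, "mydev", dev);

	/* ... later, once the device is ready: */
	enable_irq(irq);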

I wrote a script as below:

#!/bin/bash
# Find drivers which call disable_irq() shortly after request_irq() and
# friends - candidates for conversion to IRQF_NO_AUTOEN.
if [ $# != 1 -o ! -d "$1" ] ; then
	echo "USAGE: $0 dir"
	exit 1
fi

find "$1" -iname "*.c" | while read i
do
	if [ -d "$i" ]; then
		break
	fi

	# a disable_irq() within 10 lines of a request_irq() variant
	irq=`grep -n -A 10 -E "request_irq|request_threaded_irq|request_any_context_irq" "$i" | grep disable_irq`
	if [ "$irq" != "" ]; then
		echo "$i":"$irq"
	fi
done

The script says there are more than 70 cases in 5.11-rc6.
We are going to fix all of them after this one settles down.

Thanks
Barry

> 
> Thanks,
> 
> tglx
> ---
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -61,6 +61,8 @@
>   *interrupt handler after suspending interrupts. For system
>   *wakeup devices users need to implement wakeup detection in
>   *their interrupt handlers.
> + * IRQF_NO_AUTOEN - Don't enable IRQ automatically when users request it. 
> Users
> + *will enable it explicitly by enable_irq() later.
>   */
>  #define IRQF_SHARED  0x0080
>  #define IRQF_PROBE_SHARED0x0100
> @@ -74,6 +76,7 @@
>  #define IRQF_NO_THREAD   0x0001
>  #define IRQF_EARLY_RESUME0x0002
>  #define IRQF_COND_SUSPEND0x0004
> +#define IRQF_NO_AUTOEN   0x0008
> 
>  #define IRQF_TIMER   (__IRQF_TIMER | IRQF_NO_SUSPEND | 
> IRQF_NO_THREAD)
> 
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1693,7 +1693,8 @@ static int
>   irqd_set(>irq_data, IRQD_NO_BALANCING);
>   }
> 
> - if (irq_settings_can_autoenable(desc)) {
> + if (!(new->flags & IRQF_NO_AUTOEN) &&
> + irq_settings_can_autoenable(desc)) {
>   irq_startup(desc, IRQ_RESEND, IRQ_START_COND);
>   } else {
>   /*
> @@ -2086,10 +2087,15 @@ int request_threaded_irq(unsigned int ir
>* which interrupt is which (messes up the interrupt freeing
>* logic etc).
>*
> +  * Also shared interrupts do not go well with disabling auto enable.
> +  * The sharing interrupt might request it while it's still disabled
> +  * and then wait for interrupts forever.
> +  *
>* Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
>* it cannot be set along with IRQF_NO_SUSPEND.
>*/
>   if (((irqflags & IRQF_SHARED) && !dev_id) ||
> + ((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
>   (!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
>   ((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
>   return -EINVAL;



RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

2021-01-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Friday, January 29, 2021 3:47 AM
> To: Song Bao Hua (Barry Song) ;
> linux-kernel@vger.kernel.org
> Cc: mi...@kernel.org; pet...@infradead.org; vincent.guit...@linaro.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com; mgor...@suse.de
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set
> for the deduplicating sort
> 
> On 25/01/21 21:35, Song Bao Hua (Barry Song) wrote:
> > I was using 5.11-rc1. One thing I'd like to mention is that:
> >
> > For the below topology:
> > +-------+  20  +-------+
> > | node1 +------+ node2 |
> > +---+---+      +---+---+
> >     |              |
> >     | 12           | 12
> >     |              |
> > +---+---+      +---+---+
> > | node0 |      | node3 |
> > +-------+      +-------+
> >
> > with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 22, 24
> > As you can see there is *no* 20. So the node1 and node2 will
> > only get two-level numa sched_domain:
> >
> 
> 
> So that's
> 
> -numa node,cpus=0-1,nodeid=0 -numa node,cpus=2-3,nodeid=1, \
> -numa node,cpus=4-5,nodeid=2, -numa node,cpus=6-7,nodeid=3, \
> -numa dist,src=0,dst=1,val=12, \
> -numa dist,src=0,dst=2,val=22, \
> -numa dist,src=0,dst=3,val=24, \
> -numa dist,src=1,dst=2,val=20, \
> -numa dist,src=1,dst=3,val=22, \
> -numa dist,src=2,dst=3,val=12
> 
> but running this still doesn't get me a splat. Debugging
> sched_domains_numa_distance[] still gives me
> {10, 12, 20, 22, 24}
> 
> >
> > But for the below topology:
> > +-------+  20  +-------+
> > | node0 +------+ node2 |
> > +---+---+      +---+---+
> >     |              |
> >     | 12           | 12
> >     |              |
> > +---+---+      +---+---+
> > | node1 |      | node3 |
> > +-------+      +-------+
> >
> > with node1-node2 as 22, node1-node3 as 24,node0-node3 as 22.
> >
> > I will get the below sched_domains_numa_distance[]:
> > 10, 12, 20, 22, 24
> >
> > What I have seen is the performance will be better if we
> > drop the 20 as we will get a sched_domain hierarchy with less
> > levels, and two intermediate nodes won't have the group span
> > issue.
> >
> 
> That is another thing that's worth considering. Morten was arguing that if
> the distance between two nodes is so tiny, it might not be worth
> representing it at all in the scheduler topology.

Yes. I agree it is a different thing. Anyway, I saw your patch has been
merged into the sched tree. One side effect of your patch is that one
more sched_domain level is introduced for this topology:
   +----------------------- 24 -----------------------+
   |                                                   |
   |         +-------------- 22 --------------+        |
   |         |                                |        |
+--+--+ 12 +-+---+     20     +-----+   12  +-+---+
|  0  +----+  1  +------------+  2  +-------+  3  |
+--+--+    +-----+            +--+--+       +-----+
   |                             |
   +------------- 22 ------------+
Without the patch, Linux will use 10,12,22,24 to build sched_domain;
With your patch, Linux will use 10,12,20,22,24 to build sched_domain.

So one more layer is added. What I have seen is that:

For node0, the sched_domain <=12 and the sched_domain <=20 span the same
range (node0, node1), so one of them is redundant. Then, in
cpu_attach_domain(), the redundant one is dropped due to "remove the
sched domains which do not contribute to scheduling".
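
A very simplified sketch of that pruning (the real check,
sd_parent_degenerate(), also compares flags and group counts, not just
the spans):

	for (tmp = sd; tmp; ) {
		struct sched_domain *parent = tmp->parent;

		if (!parent)
			break;

		if (cpumask_equal(sched_domain_span(tmp),
				  sched_domain_span(parent)))
			tmp->parent = parent->parent;	/* splice parent out */
		else
			tmp = parent;
	}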

For node1 and node2, the original code had no "20", and thus built one
less sched_domain level.

What is really interesting is that removing the 20 actually gives a
better benchmark result in speccpu :-)


> 
> > Thanks
> > Barry

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-27 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Wednesday, January 27, 2021 7:20 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Tue, Jan 26, 2021 at 01:26:45AM +, Song Bao Hua (Barry Song) wrote:
> > > On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote:
> > >
> > > > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song)
> wrote:
> > > > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > > > be able to stop page moving due to:
> > > > > > * memory compaction in alloc_pages()
> > > > > > * making huge pages
> > > > > > * numa balance
> > > > > > * memory compaction in CMA
> > > > >
> > > > > Enabling those things is a major reason to have SVA device in the
> > > > > first place, providing a SW API to turn it all off seems like the
> > > > > wrong direction.
> > > >
> > > > I wouldn't say this is a major reason to have SVA. If we read the
> > > > history of SVA and papers, people would think easy programming due
> > > > to data struct sharing between cpu and device, and process space
> > > > isolation in device would be the major reasons for SVA. SVA also
> > > > declares it supports zero-copy while zero-copy doesn't necessarily
> > > > depend on SVA.
> > >
> > > Once you have to explicitly make system calls to declare memory under
> > > IO, you loose all of that.
> > >
> > > Since you've asked the app to be explicit about the DMAs it intends to
> > > do, there is not really much reason to use SVA for those DMAs anymore.
> >
> > Let's see a non-SVA case. We are not using SVA, we can have
> > a memory pool by hugetlb or pin, and app can allocate memory
> > from this pool, and get stable I/O performance on the memory
> > from the pool. But device has its separate page table which
> > is not bound with this process, thus lacking the protection
> > of process space isolation. Plus, CPU and device are using
> > different address.
> 
> So you are relying on the platform to do the SVA for the device?
> 

Sorry for the late response.

uacce and its userspace framework UADK depend on SVA, leveraging
the enhanced security of an isolated process address space.

This patch is mainly an extension for performance optimization, to
get stable high-performance I/O on pinned memory even though the
hardware supports IO page faults to get pages back after swap-out
or page migration.
But IO page faults can cause serious latency jitter for high-speed
I/O.
Slow devices don't need to use this extension.

> This feels like it goes back to another topic where I felt the SVA
> setup uAPI should be shared and not buried into every driver's unique
> ioctls.
> 
> Having something like this in a shared SVA system is somewhat less
> strange.

Sounds reasonable. On the other hand, uacce seems to be a common
uAPI for SVA, and probably the only one at this moment.

uacce is a framework, not a specific driver: any accelerator can
hook into this framework as long as the device provides uacce_ops
and registers itself via uacce_register(). uacce itself doesn't
bind to any specific hardware, so the uacce interfaces are a kind
of common uAPI :-)
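
A sketch of how a driver hooks in (cf. include/linux/uacce.h; the my_*
callbacks and device name are placeholders):

	static const struct uacce_ops my_uacce_ops = {
		.get_queue	= my_get_queue,		/* placeholder */
		.put_queue	= my_put_queue,		/* placeholder */
		.mmap		= my_mmap,		/* placeholder */
	};

	static int my_probe(struct platform_device *pdev)
	{
		struct uacce_interface interface = {
			.name	= "my_accel",
			.flags	= UACCE_DEV_SVA,
			.ops	= &my_uacce_ops,
		};
		struct uacce_device *uacce;

		uacce = uacce_alloc(&pdev->dev, &interface);
		if (IS_ERR(uacce))
			return PTR_ERR(uacce);

		return uacce_register(uacce);
	}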

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Wednesday, January 13, 2021 1:53 AM
> To: Song Bao Hua (Barry Song) ; Morten Rasmussen
> ; Tim Chen 
> Cc: valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Zengtao (B) ; tiantao (H)
> 
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -Original Message-
> >> From: Morten Rasmussen [mailto:morten.rasmus...@arm.com]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> To: Tim Chen 
> >> Cc: Song Bao Hua (Barry Song) ;
> >> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> >> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> >> gre...@linuxfoundation.org; Jonathan Cameron
> ;
> >> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> >> dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
> >> mgor...@suse.de; mark.rutl...@arm.com; sudeep.ho...@arm.com;
> >> aubrey...@linux.intel.com; linux-arm-ker...@lists.infradead.org;
> >> linux-kernel@vger.kernel.org; linux-a...@vger.kernel.org;
> >> linux...@openeuler.org; xuwei (O) ; Zengtao (B)
> >> ; tiantao (H) 
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> and
> >> add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>>> has local L3 tag. On the other hand, each cluster will share some
> >>>> internal system bus. This means cache is much more affine inside one 
> >>>> cluster
> >>>> than across clusters.
> >>>
> >>> There is a similar need for clustering in x86.  Some x86 cores could share
> >> L2 caches that
> >>> is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6
> clusters
> >>> of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing
> >> L3).
> >>> Having a sched domain at the L2 cluster helps spread load among
> >>> L2 domains.  This will reduce L2 cache contention and help with
> >>> performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches while Barry is after consolidating tasks within the
> >> boundaries of a L3 tag cache. One helps cache utilization, the other
> >> communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > we are both actually looking for the same behavior. My patch also
> > has done the exact same behavior of spreading with Tim's patch.
> 
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
> 
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I put a note for the 1st case:
"Case 1. we have two tasks *without* any relationship running in a system
with 2 clusters and 8 cpus"

so for tasks without a wake-up relationship, the current patch will only
result in spreading.

Anyway, I will also test Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER
to see what will happen. Till now, the benchmarks have only covered the
case that figures out the benefit of changing the wake-up path.
I would also be interested in figuring out what we have gained from the
change to load_balance().

> 
> It looks like that Tim's workload (SPECrate mcf) shows a performance
> boost solely because of the changes the additional MC_L2 SD introduces
> in load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we
> have to carefully distinguish between packing vs. spreading in wakeup
> and load-balance here.
> 
> > Considering the below two cases:
> >

RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 2:13 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 11:35:22PM +, Song Bao Hua (Barry Song) wrote:
> 
> > > On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote:
> > > > mlock, while certainly be able to prevent swapping out, it won't
> > > > be able to stop page moving due to:
> > > > * memory compaction in alloc_pages()
> > > > * making huge pages
> > > > * numa balance
> > > > * memory compaction in CMA
> > >
> > > Enabling those things is a major reason to have SVA device in the
> > > first place, providing a SW API to turn it all off seems like the
> > > wrong direction.
> >
> > I wouldn't say this is a major reason to have SVA. If we read the
> > history of SVA and papers, people would think easy programming due
> > to data struct sharing between cpu and device, and process space
> > isolation in device would be the major reasons for SVA. SVA also
> > declares it supports zero-copy while zero-copy doesn't necessarily
> > depend on SVA.
> 
> Once you have to explicitly make system calls to declare memory under
> IO, you loose all of that.
> 
> Since you've asked the app to be explicit about the DMAs it intends to
> do, there is not really much reason to use SVA for those DMAs anymore.

Let's look at a non-SVA case first. When we are not using SVA, we can
have a memory pool backed by hugetlb or pinning, and the app can
allocate memory from this pool and get stable I/O performance on it.
But the device has its own separate page table which is not bound to
this process, thus lacking the protection of process address space
isolation. Plus, the CPU and the device are using different addresses.

Then let's move to the SVA case. We can still have a memory pool backed
by hugetlb or pinning, and the app can allocate memory from this pool
since the pool is mapped into the address space of the process, and we
are able to get stable I/O performance since the memory is always
there. But in this case, the device is using the page table of the
process with full permission control.
And they are using the same address, and can possibly enjoy the easy
programming if the HW supports it.

SVA is not doomed to work with IO page faults only. If we have SVA+pin,
we get both address sharing and stable I/O latency.
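
As a sketch of that combination (iommu_sva_bind_device() and
pin_user_pages_fast() are real interfaces; the surrounding variables
are placeholders, and error handling is trimmed):

	/* share the process page table with the device (SVA)... */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);

	/* ...and additionally pin the hot I/O buffer, so no IO page
	 * fault (and thus no latency jitter) happens on this range: */
	ret = pin_user_pages_fast(uaddr & PAGE_MASK, nr_pages,
				  FOLL_WRITE | FOLL_LONGTERM, pages);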

> 
> Jason

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of
> Jason Gunthorpe
> Sent: Tuesday, January 26, 2021 12:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Wangzhou (B) ; Greg Kroah-Hartman
> ; Arnd Bergmann ; Zhangfei Gao
> ; linux-accelerat...@lists.ozlabs.org;
> linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org;
> linux...@kvack.org; Liguozhu (Kenneth) ; chensihang
> (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 10:21:14PM +, Song Bao Hua (Barry Song) wrote:
> > mlock, while certainly be able to prevent swapping out, it won't
> > be able to stop page moving due to:
> > * memory compaction in alloc_pages()
> > * making huge pages
> > * numa balance
> > * memory compaction in CMA
> 
> Enabling those things is a major reason to have SVA device in the
> first place, providing a SW API to turn it all off seems like the
> wrong direction.

I wouldn't say this is a major reason to have SVA. If we read the
history of SVA and the papers, people would say that easy programming,
due to data structure sharing between CPU and device, and process
address space isolation on the device are the major reasons for SVA.
SVA also claims to support zero-copy, while zero-copy doesn't
necessarily depend on SVA.

Page migration and I/O page fault overhead, on the other hand, are
probably the major problems which block SVA from becoming a
high-performance and more popular solution.

> 
> If the device doesn't want to use SVA then don't use it, use normal
> DMA pinning like everything else.
> 

If we disable SVA, we won't get the benefits of SVA: address sharing
and process address space isolation.

> Jason

Thanks
Barry


RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for the deduplicating sort

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Tuesday, January 26, 2021 5:46 AM
> To: Song Bao Hua (Barry Song) ;
> linux-kernel@vger.kernel.org
> Cc: mi...@kernel.org; pet...@infradead.org; vincent.guit...@linaro.org;
> dietmar.eggem...@arm.com; morten.rasmus...@arm.com; mgor...@suse.de
> Subject: RE: [PATCH 1/1] sched/topology: Make sched_init_numa() use a set for
> the deduplicating sort
> 
> On 25/01/21 09:26, Valentin Schneider wrote:
> > On 25/01/21 02:23, Song Bao Hua (Barry Song) wrote:
> >
> >> with the below topology:
> >> qemu-system-aarch64 -M virt -nographic \
> >>  -smp cpus=8 \
> >>  -numa node,cpus=0-1,nodeid=0 \
> >>  -numa node,cpus=2-3,nodeid=1 \
> >>  -numa node,cpus=4-5,nodeid=2 \
> >>  -numa node,cpus=6-7,nodeid=3 \
> >>  -numa dist,src=0,dst=1,val=12 \
> >>  -numa dist,src=0,dst=2,val=20 \
> >>  -numa dist,src=0,dst=3,val=22 \
> >>  -numa dist,src=1,dst=2,val=22 \
> >>  -numa dist,src=2,dst=3,val=12 \
> >>  -numa dist,src=1,dst=3,val=24 \
> >>
> >>
> >> The panic address is *1294:
> >>
> >> if (sdd->sd) {
> >> 1280:   f9400e61    ldr   x1, [x19, #24]
> >> 1284:   b4000201    cbz   x1, 12c4
> >>
> >> sd = *per_cpu_ptr(sdd->sd, j);
> >> 1288:   93407eb7    sxtw  x23, w21
> >> 128c:   aa0103e0    mov   x0, x1
> >> 1290:   f8777ac2    ldr   x2, [x22, x23, lsl #3]
> >> *1294:  f8626800    ldr   x0, [x0, x2]
> >> if (sd && (sd->flags & SD_OVERLAP))
> >> 1298:   b4000120    cbz   x0, 12bc
> >> 129c:   b9403803    ldr   w3, [x0, #56]
> >> 12a0:   365800e3    tbz   w3, #11, 12bc
> >>
> >> free_sched_groups(sd->groups, 0);
> >> 12a4:   f9400800    ldr   x0, [x0, #16]
> >> if (!sg)
> >>
> >
> > Thanks for giving it a shot, let me run that with your topology and see
> > where I end up.
> >
> 
> I can't seem to reproduce this - your topology is actually among the ones
> I tested this against.
> 
> Applying this patch obviously doesn't get rid of the group span issue, but
> it does remove this warning:
> 
> [0.354806] ERROR: Node-0 not representative
> [0.354806]
> [0.355223]   10 12 20 22
> [0.355475]   12 10 22 24
> [0.355648]   20 22 10 12
> [0.355814]   22 24 12 10
> 
> I'm running this based on tip/sched/core:
> 
>   65bcf072e20e ("sched: Use task_current() instead of 'rq->curr == p'")
I was using 5.11-rc1. One thing I'd like to mention is that:

For the below topology:
+-------+  20  +-------+
| node1 +------+ node2 |
+---+---+      +---+---+
    |              |
    | 12           | 12
    |              |
+---+---+      +---+---+
| node0 |      | node3 |
+-------+      +-------+

with node0-node2 as 22, node0-node3 as 24, node1-node3 as 22.

I will get the below sched_domains_numa_distance[]:
10, 12, 22, 24
As you can see, there is *no* 20. So node1 and node2 will
only get a two-level NUMA sched_domain.


But for the below topology:
+-------+  20  +-------+
| node0 +------+ node2 |
+---+---+      +---+---+
    |              |
    | 12           | 12
    |              |
+---+---+      +---+---+
| node1 |      | node3 |
+-------+      +-------+

with node1-node2 as 22, node1-node3 as 24,node0-node3 as 22.

I will get the below sched_domains_numa_distance[]:
10, 12, 20, 22, 24

What I have seen is that performance is better if we drop the 20,
as we get a sched_domain hierarchy with fewer levels, and the two
intermediate nodes won't have the group-span issue.

Thanks
Barry



RE: [RFC PATCH v2] uacce: Add uacce_ctrl misc device

2021-01-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Jason Gunthorpe [mailto:j...@ziepe.ca]
> Sent: Tuesday, January 26, 2021 4:47 AM
> To: Wangzhou (B) 
> Cc: Greg Kroah-Hartman ; Arnd Bergmann
> ; Zhangfei Gao ;
> linux-accelerat...@lists.ozlabs.org; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linux...@kvack.org; Song Bao Hua (Barry 
> Song)
> ; Liguozhu (Kenneth) ;
> chensihang (A) 
> Subject: Re: [RFC PATCH v2] uacce: Add uacce_ctrl misc device
> 
> On Mon, Jan 25, 2021 at 04:34:56PM +0800, Zhou Wang wrote:
> 
> > +static int uacce_pin_page(struct uacce_pin_container *priv,
> > + struct uacce_pin_address *addr)
> > +{
> > +   unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> > +   unsigned long first, last, nr_pages;
> > +   struct page **pages;
> > +   struct pin_pages *p;
> > +   int ret;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   pages = vmalloc(nr_pages * sizeof(struct page *));
> > +   if (!pages)
> > +   return -ENOMEM;
> > +
> > +   p = kzalloc(sizeof(*p), GFP_KERNEL);
> > +   if (!p) {
> > +   ret = -ENOMEM;
> > +   goto free;
> > +   }
> > +
> > +   ret = pin_user_pages_fast(addr->addr & PAGE_MASK, nr_pages,
> > + flags | FOLL_LONGTERM, pages);
> 
> This needs to copy the RLIMIT_MEMLOCK and can_do_mlock() stuff from
> other places, like ib_umem_get
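
(A minimal sketch of that accounting, for reference, modeled on what
ib_umem_get() does; nr_pages as in the patch above:)

	unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long new_pinned;

	new_pinned = atomic64_add_return(nr_pages, &current->mm->pinned_vm);
	if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
		atomic64_sub(nr_pages, &current->mm->pinned_vm);
		return -ENOMEM;
	}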
> 
> > +   ret = xa_err(xa_store(>array, p->first, p, GFP_KERNEL));
> 
> And this is really weird, I don't think it makes sense to make handles
> for DMA based on the starting VA.
> 
> > +static int uacce_unpin_page(struct uacce_pin_container *priv,
> > +   struct uacce_pin_address *addr)
> > +{
> > +   unsigned long first, last, nr_pages;
> > +   struct pin_pages *p;
> > +
> > +   first = (addr->addr & PAGE_MASK) >> PAGE_SHIFT;
> > +   last = ((addr->addr + addr->size - 1) & PAGE_MASK) >> PAGE_SHIFT;
> > +   nr_pages = last - first + 1;
> > +
> > +   /* find pin_pages */
> > +   p = xa_load(>array, first);
> > +   if (!p)
> > +   return -ENODEV;
> > +
> > +   if (p->nr_pages != nr_pages)
> > +   return -EINVAL;
> > +
> > +   /* unpin */
> > +   unpin_user_pages(p->pages, p->nr_pages);
> 
> And unpinning without guaranteeing there is no ongoing DMA is really
> weird

In the SVA case, the kernel has no idea whether accelerators are
accessing the memory, so I would assume SVA has a method to prevent
pages from being migrated or released. Otherwise, SVA would crash
easily in a system with high memory pressure.

Anyway, this is a problem worth further investigation.

> 
> Are you abusing this in conjunction with a SVA scheme just to prevent
> page motion? Why wasn't mlock good enough?

Page migration won't cause any malfunction in the SVA case, as an IO
page fault will get a valid page again. It is only a performance issue,
as an IO page fault has much larger latency than the usual page fault;
it can be 3-80x slower than a page fault [1].

mlock, while certainly able to prevent swapping out, won't
be able to stop page movement due to:
* memory compaction in alloc_pages()
* making huge pages
* numa balance
* memory compaction in CMA
etc.

[1] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7482091
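
To illustrate the mlock point, a minimal userspace sketch (assuming a
multi-node box and libnuma's move_pages(); build with -lnuma): even an
mlock()ed page can still be migrated to another node:

	#include <numaif.h>
	#include <sys/mman.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		void *buf = aligned_alloc(psz, psz);
		int dest = 1, status = -1;

		mlock(buf, psz);	/* guards against swap-out only */

		/* ...yet the page can still change physical location: */
		if (move_pages(0 /* self */, 1, &buf, &dest, &status,
			       MPOL_MF_MOVE))
			perror("move_pages");
		else
			printf("page now on node %d\n", status);
		return 0;
	}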
> 
> Jason

Thanks
Barry



RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2

2021-01-25 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Valentin Schneider [mailto:valentin.schnei...@arm.com]
> Sent: Tuesday, January 26, 2021 1:11 AM
> To: Song Bao Hua (Barry Song) ; Vincent Guittot
> ; Mel Gorman 
> Cc: Ingo Molnar ; Peter Zijlstra ;
> Dietmar Eggemann ; Morten Rasmussen
> ; linux-kernel ;
> linux...@openeuler.org
> Subject: RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact
> of NUMA diameter > 2
> 
> On 25/01/21 03:13, Song Bao Hua (Barry Song) wrote:
> > As long as NUMA diameter > 2, building sched_domain by sibling's child 
> > domain
> > will definitely create a sched_domain with sched_group which will span
> > out of the sched_domain
> >    +------+  12  +------+  20  +------+  12  +------+
> >    | node +------+ node +------+ node +------+ node |
> >    |  0   |      |  1   |      |  2   |      |  3   |
> >    +------+      +------+      +------+      +------+
> >
> > domain0    node0       node1       node2       node3
> >
> > domain1    node0+1     node0+1     node2+3     node2+3
> >                                       +
> > domain2    node0+1+2                  |
> >            group: node0+1             |
> >            group: node2+3  <----------+
> >
> > when node2 is added into the domain2 of node0, kernel is using the child
> > domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> > the span of node0+1+2.
> >
> > Will we move to use the *child* domain of the *child* domain of node2's
> > domain2 to build the sched_group?
> >
> > I mean:
> >    +------+  12  +------+  20  +------+  12  +------+
> >    | node +------+ node +------+ node +------+ node |
> >    |  0   |      |  1   |      |  2   |      |  3   |
> >    +------+      +------+      +------+      +------+
> >
> > domain0    node0       node1    +- node2       node3
> >                                 |
> > domain1    node0+1     node0+1  |  node2+3     node2+3
> >                                 |
> > domain2    node0+1+2            |
> >            group: node0+1       |
> >            group: node2  <------+
> >
> > In this way, it seems we don't have to create a new group as we are just
> > reusing the existing group?
> >
> 
> One thing I've been musing over is pretty much this; that is to say we
> would make all non-local NUMA sched_groups span a single node. This would
> let us reuse an existing span+sched_group_capacity: the local group of that
> node at its first NUMA topology level.
> 
> Essentially this means getting rid of the overlapping groups, and the
> balance mask is handled the same way as for !NUMA, i.e. it's the local
> group span. I've not gone far enough through the thought experiment to see
> where does it miserably fall apart... It is at the very least violating the
> expectation that a group span is a child domain's span - here it can be a
> grand^x children domain's span.
> 
> 
> If we take your topology, we currently have:
> 
> | tl\node | 0| 1 | 2 | 3|
> |-+--+---+---+--|
> | NUMA0   | (0)->(1) | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2) |
> | NUMA1   | (0-1)->(1-3) | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(0-2) |
> | NUMA2   | (0-2)->(1-3) | N/A   | N/A   | (1-3)->(0-2) |
> 
> With the current overlapping group scheme, we would need to make it look
> like so:
> 
> | tl\node | 0 | 1 | 2 | 3 |
> |-+---+---+---+---
> |
> | NUMA0   | (0)->(1)  | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)  |
> | NUMA1   | (0-1)->(1-2)* | (0-2)->(2-3)  | (1-3)->(0-1)  | (2-3)->(1-2)* |
> | NUMA2   | (0-2)->(1-3)  | N/A   | N/A   | (1-3)->(0-2)  |
> 
> But as already discussed, that's tricky to make work. With the node-span
> groups thing, we would turn this into:
> 
> | tl\node | 0  | 1 | 2 | 3  |
> |-++---+---+|
> | NUMA0   | (0)->(1)   | (1)->(2)->(0) | (2)->(3)->(1) | (3)->(2)   |
> | NUMA1   | (0-1)->(2) | (

RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Dietmar Eggemann [mailto:dietmar.eggem...@arm.com]
> Sent: Wednesday, January 13, 2021 12:00 AM
> To: Morten Rasmussen ; Tim Chen
> 
> Cc: Song Bao Hua (Barry Song) ;
> valentin.schnei...@arm.com; catalin.mari...@arm.com; w...@kernel.org;
> r...@rjwysocki.net; vincent.guit...@linaro.org; l...@kernel.org;
> gre...@linuxfoundation.org; Jonathan Cameron ;
> mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> mark.rutl...@arm.com; sudeep.ho...@arm.com; aubrey...@linux.intel.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-a...@vger.kernel.org; linux...@openeuler.org; xuwei (O)
> ; Zengtao (B) ; tiantao (H)
> 
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 11/01/2021 10:28, Morten Rasmussen wrote:
> > On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> >>
> >>
> >> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> >>> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>>> On 1/6/21 12:30 AM, Barry Song wrote:
> 
> [...]
> 
> >> I think it is going to depend on the workload.  If there are dependent
> >> tasks that communicate with one another, putting them together
> >> in the same cluster will be the right thing to do to reduce communication
> >> costs.  On the other hand, if the tasks are independent, putting them 
> >> together
> on the same cluster
> >> will increase resource contention and spreading them out will be better.
> >
> > Agree. That is exactly where I'm coming from. This is all about the task
> > placement policy. We generally tend to spread tasks to avoid resource
> > contention, SMT and caches, which seems to be what you are proposing to
> > extend. I think that makes sense given it can produce significant
> > benefits.
> >
> >>
> >> Any thoughts on what is the right clustering "tag" to use to clump
> >> related tasks together?
> >> Cgroup? Pid? Tasks with same mm?
> >
> > I think this is the real question. I think the closest thing we have at
> > the moment is the wakee/waker flip heuristic. This seems to be related.
> > Perhaps the wake_affine tricks can serve as starting point?
> 
> wake_wide() switches between packing (select_idle_sibling(), llc_size
> CPUs) and spreading (find_idlest_cpu(), all CPUs).
> 
> AFAICS, since none of the sched domains set SD_BALANCE_WAKE, currently
> all wakeups are (llc-)packed.

Sorry for the late response. I was struggling with some other topology
issues recently.

For "all wakeups are (llc-)packed":
it seems you mean the current want_affine only affects new_cpu,
and for the wake-up path we will always go to select_idle_sibling()
rather than find_idlest_cpu(), since nobody sets SD_BALANCE_WAKE in any
sched_domain?

> 
>  select_task_rq_fair()
> 
>for_each_domain(cpu, tmp)
> 
>  if (tmp->flags & sd_flag)
>sd = tmp;
> 
> 
> In case we would like to further distinguish between llc-packing and
> even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
> packing vs. spreading heuristic further down in sis().

I didn't get your point on "2nd-level packing". Would you like
to describe it more? It seems you mean we need a separate
calculation of avg_scan_cost and sched_feat(SIS_) for the cluster
(or MC-L2) level, since cluster and LLC are not at the same level
physically?

> 
> IMHO, Barry's current implementation doesn't do this right now. Instead
> he's trying to pack on cluster first and if not successful look further
> among the remaining llc CPUs for an idle CPU.

Yes. That is exactly what the current patch is doing.

Thanks
Barry


RE: [RFC PATCH] sched/fair: first try to fix the scheduling impact of NUMA diameter > 2

2021-01-24 Thread Song Bao Hua (Barry Song)
> >
> >
> > Hi,
> >
> > On 18/01/21 11:25, Song Bao Hua wrote:
> > >> -Original Message-
> > >> From: Vincent Guittot [mailto:vincent.guit...@linaro.org]
> > >> Sent: Tuesday, January 19, 2021 12:14 AM
> > >> To: Song Bao Hua (Barry Song) 
> > >> Cc: Ingo Molnar ; Peter Zijlstra 
> > >> ;
> > >> Dietmar Eggemann ; Morten Rasmussen
> > >> ; Valentin Schneider
> > ;
> > >> linux-kernel ; Mel Gorman
> ;
> > >> linux...@openeuler.org
> > >> Subject: Re: [RFC PATCH] sched/fair: first try to fix the scheduling 
> > >> impact
> > >> of NUMA diameter > 2
> > >>
> > >> On Fri, 15 Jan 2021 at 21:42, Barry Song  
> > >> wrote:
> > >> >
> > >> > This patch is a follow-up of the 3-hops issue reported by Valentin
> Schneider:
> > >> > [1] https://lore.kernel.org/lkml/jhjtux5edo2.mog...@arm.com/
> > >> > [2]
> > >>
> >
> https://lore.kernel.org/lkml/20201110184300.15673-1-valentin.schneider@arm
> > >> .com/
> > >> >
> > >> > Here is a brief summary of the background:
> > >> > For a NUMA system with 3-hops, sched_group for NUMA 2-hops could be not
> > a
> > >> > subset of sched_domain.
> > >> > For example, for a system with the below topology(two cpus in each NUMA
> > >> > node):
> > >> > node   0   1   2   3
> > >> >   0:  10  12  20  22
> > >> >   1:  12  10  22  24
> > >> >   2:  20  22  10  12
> > >> >   3:  22  24  12  10
> > >> >
> > >> > For CPU0, domain-2 will span 0-5, but its group will span 0-3, 4-7.
> > >> > 4-7 isn't a subset of 0-5.
> > >> >
> > >> > CPU0 attaching sched-domain(s):
> > >> >  domain-0: span=0-1 level=MC
> > >> >   groups: 0:{ span=0 cap=989 }, 1:{ span=1 cap=1016 }
> > >> >   domain-1: span=0-3 level=NUMA
> > >> >groups: 0:{ span=0-1 cap=2005 }, 2:{ span=2-3 cap=2028 }
> > >> >domain-2: span=0-5 level=NUMA
> > >> > groups: 0:{ span=0-3 cap=4033 }, 4:{ span=4-7 cap=3909 }
> > >> >  ERROR: groups don't span domain->span
> > >> > domain-3: span=0-7 level=NUMA
> > >> >  groups: 0:{ span=0-5 mask=0-1 cap=6062 }, 6:{ span=4-7 mask=6-7
> > cap=3928 }
> > >> >
> > >> > All other cpus also have the same issue: sched_group could be not a 
> > >> > subset
> > >> > of sched_domain.
> > >> >
> > >> > Here I am trying to figure out the scheduling impact of this issue from
> > >> > two aspects:
> > >> > 1. find busiest cpu in load_balance
> > >> > 2. find idlest cpu in fork/exec/wake balance
> > >>
> > >> Would be better to fix the error in the sched domain topology instead
> > >> of hacking the load balance to compensate the topology problem
> > >
> > > I think Valentin Schneider has tried to do that before, but failed. This
> will
> > > add some new groups which won't be managed by current
> update_group_capacity()?
> > > @Valentine, would you like to share more details?
> > >
> >
> > Sorry for being late to the party, this is gnarly stuff and I can't dive
> > back into it without spending some time staring at my notes and diagrams...
> > I did indeed try to fix the group construction, thinking it would "just" be
> > a matter of changing one mask into another, but it turned out to be quite
> > trickier.
> >
> > Let's go back to https://lore.kernel.org/lkml/jhjtux5edo2.mog...@arm.com/
> >
> > Right now, for that #Case study w/ QEMU platform, we get:
> >
> >   CPU0-domain1: span=0-2
> > group0: span=0-2, mask=0
> > group2: span=1-3, mask=2 # CPU3 shouldn't be included
> >   CPU1-domain1: span=0-3
> > group1: span=0-2, mask=1
> > group3: span=2-3, mask=3
> >   CPU2-domain1: span=0-3
> > group2: span=1-3, mask=2
> > group0: span=0-1, mask=0
> >   CPU3-domain1: span=0-2
> > group3: span=2-3, mask=3
> > group1: span=0-2, mask=1 # CPU0 shouldn't be included
> >
> > We would want to "fix" this into:
> >
> >   CPU0-domain1
> > group0: span=0-2, mask=0
> >   - group2: span=1-3, mask=2
> >   + group?: span=1-2, mask=??
> >
