Re: [PATCH] locking: remove spin_lock_flags() etc

2021-10-25 Thread Peter Zijlstra
On Mon, Oct 25, 2021 at 03:06:24PM +0200, Arnd Bergmann wrote:
> On Mon, Oct 25, 2021 at 11:57 AM Peter Zijlstra  wrote:
> > On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> > > On Sat, Oct 23, 2021 at 3:37 AM Waiman Long  wrote:
> > > >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > > > From: Arnd Bergmann 
> > > > >
> > > > > As this is all dead code, just remove it and the helper functions 
> > > > > built
> > > > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > > > it seems safer to leave it untouched.
> > > > >
> > > > > Signed-off-by: Arnd Bergmann 
> > > >
> > > > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > > > from the Kconfig files as well?
> > >
> > >  I couldn't figure this out.
> > >
> > > What I see is that the only architectures setting GENERIC_LOCKBREAK are
> > > nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> > > implementing arch_spin_is_contended() are arm32, csky and ia64.
> > >
> > > The part I don't understand is whether the option actually does anything
> > > useful any more after commit d89c70356acf ("locking/core: Remove 
> > > break_lock
> > > field when CONFIG_GENERIC_LOCKBREAK=y").
> >
> > Urgh, what a mess.. AFAICT there's still code in
> > kernel/locking/spinlock.c that relies on it. Specifically when
> > GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
> > basically TaS locks which drop preempt/irq disable while spinning.
> >
> > Anybody having this on and not having native TaS locks is in for a rude
> > surprise I suppose... sparc64 being the obvious candidate there :/
> 
> Is this a problem on s390 and powerpc, those two being the ones
> that matter in practice?
> 
> On s390, we pick between the cmpxchg() based directed-yield when
> running on virtualized CPUs, and a normal qspinlock when running on a
> dedicated CPU.
> 
> On PowerPC, we pick at compile-time between either the qspinlock
> (default-enabled on Book3S-64, i.e. all server chips) or a ll/sc based
> spinlock plus vm_yield() (default on embedded and 32-bit mac).

Urgh, yeah, so this crud undermines the whole point of having a fair
lock. I'm thinking s390 and Power want to have this fixed.
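
For reference, the code in question is generated by the BUILD_LOCK_OPS() macro in
kernel/locking/spinlock.c when GENERIC_LOCKBREAK=y. A simplified sketch of the
spin_lock() variant it expands to (paraphrased, not the literal macro expansion):

void __lockfunc __raw_spin_lock(raw_spinlock_t *lock)
{
	for (;;) {
		preempt_disable();
		if (likely(do_raw_spin_trylock(lock)))
			break;
		preempt_enable();

		arch_spin_relax(&lock->raw_lock);
	}
}

Because the lock is only ever acquired through a trylock in a loop (the _irq and
_irqsave variants likewise re-enable interrupts between attempts), any fairness
guarantee of the underlying ticket/queued lock is lost, which is what undermines
the fair lock as noted above.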


Re: [PATCH] locking: remove spin_lock_flags() etc

2021-10-25 Thread Peter Zijlstra
On Mon, Oct 25, 2021 at 11:57:28AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> > On Sat, Oct 23, 2021 at 3:37 AM Waiman Long  wrote:
> > >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > > From: Arnd Bergmann 
> > > >
> > > > As this is all dead code, just remove it and the helper functions built
> > > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > > it seems safer to leave it untouched.
> > > >
> > > > Signed-off-by: Arnd Bergmann 
> > >
> > > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > > from the Kconfig files as well?
> > 
> >  I couldn't figure this out.
> > 
> > What I see is that the only architectures setting GENERIC_LOCKBREAK are
> > nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> > implementing arch_spin_is_contended() are arm32, csky and ia64.
> > 
> > The part I don't understand is whether the option actually does anything
> > useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> > field when CONFIG_GENERIC_LOCKBREAK=y").
> 
> Urgh, what a mess.. AFAICT there's still code in
> kernel/locking/spinlock.c that relies on it. Specifically when
> GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
> basically TaS locks which drop preempt/irq disable while spinning.
> 
> Anybody having this on and not having native TaS locks is in for a rude
> surprise I suppose... sparc64 being the obvious candidate there :/

Something like the *totally*untested* patch below would rip it all out.

---
 arch/ia64/Kconfig|  3 --
 arch/nds32/Kconfig   |  4 --
 arch/parisc/Kconfig  |  5 ---
 arch/powerpc/Kconfig |  5 ---
 arch/s390/Kconfig|  3 --
 arch/sh/Kconfig  |  4 --
 arch/sparc/Kconfig   |  6 ---
 include/linux/rwlock_api_smp.h   |  4 +-
 include/linux/spinlock_api_smp.h |  4 +-
 kernel/Kconfig.locks | 26 ++--
 kernel/locking/spinlock.c| 90 
 11 files changed, 17 insertions(+), 137 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 1e33666fa679..5ec3abba3c81 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -81,9 +81,6 @@ config MMU
 config STACKTRACE_SUPPORT
def_bool y
 
-config GENERIC_LOCKBREAK
-   def_bool n
-
 config GENERIC_CALIBRATE_DELAY
bool
default y
diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
index aea26e739543..699008dbd6c2 100644
--- a/arch/nds32/Kconfig
+++ b/arch/nds32/Kconfig
@@ -59,10 +59,6 @@ config GENERIC_CSUM
 config GENERIC_HWEIGHT
def_bool y
 
-config GENERIC_LOCKBREAK
-   def_bool y
-   depends on PREEMPTION
-
 config STACKTRACE_SUPPORT
def_bool y
 
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 27a8b49af11f..afe70bcdde2c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -86,11 +86,6 @@ config ARCH_DEFCONFIG
default "arch/parisc/configs/generic-32bit_defconfig" if !64BIT
default "arch/parisc/configs/generic-64bit_defconfig" if 64BIT
 
-config GENERIC_LOCKBREAK
-   bool
-   default y
-   depends on SMP && PREEMPTION
-
 config ARCH_HAS_ILOG2_U32
bool
default n
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..e782c9ea3f81 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -98,11 +98,6 @@ config LOCKDEP_SUPPORT
bool
default y
 
-config GENERIC_LOCKBREAK
-   bool
-   default y
-   depends on SMP && PREEMPTION
-
 config GENERIC_HWEIGHT
bool
default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index b86de61b8caa..e4ff05f5393b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -26,9 +26,6 @@ config GENERIC_BUG
 config GENERIC_BUG_RELATIVE_POINTERS
def_bool y
 
-config GENERIC_LOCKBREAK
-   def_bool y if PREEMPTION
-
 config PGSTE
def_bool y if KVM
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 6904f4bdbf00..26f1cf2c69a3 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -86,10 +86,6 @@ config GENERIC_HWEIGHT
 config GENERIC_CALIBRATE_DELAY
bool
 
-config GENERIC_LOCKBREAK
-   def_bool y
-   depends on SMP && PREEMPTION
-
 config ARCH_SUSPEND_POSSIBLE
def_bool n
 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index b120ed947f50..e77e7254eaa0 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -246,12 +246,6 @@ config US3_MC
 
  If in doubt, say Y, as this information can be very useful.
 
-# Global things across all Sun machines.
-config GENERIC_LOCKBREAK
-   bool
-  

Re: [PATCH] locking: remove spin_lock_flags() etc

2021-10-25 Thread Peter Zijlstra
On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> On Sat, Oct 23, 2021 at 3:37 AM Waiman Long  wrote:
> >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > From: Arnd Bergmann 
> > >
> > > As this is all dead code, just remove it and the helper functions built
> > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > it seems safer to leave it untouched.
> > >
> > > Signed-off-by: Arnd Bergmann 
> >
> > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > from the Kconfig files as well?
> 
>  I couldn't figure this out.
> 
> What I see is that the only architectures setting GENERIC_LOCKBREAK are
> nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> implementing arch_spin_is_contended() are arm32, csky and ia64.
> 
> The part I don't understand is whether the option actually does anything
> useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> field when CONFIG_GENERIC_LOCKBREAK=y").

Urgh, what a mess.. AFAICT there's still code in
kernel/locking/spinlock.c that relies on it. Specifically when
GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
basically TaS locks which drop preempt/irq disable while spinning.

Anybody having this on and not having native TaS locks is in for a rude
surprise I suppose... sparc64 being the obvious candidate there :/





Re: [PATCH v3 0/4] Add mem_hops field in perf_mem_data_src structure

2021-10-18 Thread Peter Zijlstra
On Mon, Oct 18, 2021 at 02:46:18PM +1100, Michael Ellerman wrote:
> Peter Zijlstra  writes:
> > On Wed, Oct 06, 2021 at 07:36:50PM +0530, Kajol Jain wrote:
> >
> >> Kajol Jain (4):
> >>   perf: Add comment about current state of PERF_MEM_LVL_* namespace and
> >> remove an extra line
> >>   perf: Add mem_hops field in perf_mem_data_src structure
> >>   tools/perf: Add mem_hops field in perf_mem_data_src structure
> >>   powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
> >> 
> >>  arch/powerpc/perf/isa207-common.c | 26 +-
> >>  arch/powerpc/perf/isa207-common.h |  2 ++
> >>  include/uapi/linux/perf_event.h   | 19 ---
> >>  tools/include/uapi/linux/perf_event.h | 19 ---
> >>  tools/perf/util/mem-events.c  | 20 ++--
> >>  5 files changed, 73 insertions(+), 13 deletions(-)
> >
> > Acked-by: Peter Zijlstra (Intel) 
> >
> > How do we want this routed? Shall I take it, or does Michael want it in
> > the Power tree?
> 
> It's mostly non-powerpc, so I think you should take it.
> 
> There's a slim chance we could end up with a conflict in the powerpc
> part, but that's no big deal.

Sure thing, into perf/core it goes. Thanks!


Re: [PATCH 2/2] sched: Centralize SCHED_{SMT, MC, CLUSTER} definitions

2021-10-15 Thread Peter Zijlstra
On Fri, Oct 08, 2021 at 04:22:27PM +0100, Valentin Schneider wrote:

> So x86 has it default yes, and a lot of others (e.g. arm64) have it default
> no.
> 
> IMO you don't gain much by disabling them. SCHED_MC and SCHED_CLUSTER only
> control the presence of a sched_domain_topology_level - if it's useless it
> gets degenerated at domain build time. Some valid reasons for not using
> them is if the architecture defines its own topology table (e.g. powerpc
> has CACHE and MC levels which are not gated behind any CONFIG).
> 
> SCHED_SMT has an impact on code generated in sched/core.c, but that is also
> gated by a static key.
> 
> So I'd say having them default yes is sensible. I'd even say we should
> change the "If unsure say N here." to "Y".

Right, so I tend to agree (and also that we should fix that Kconfig help
text). But it would be very nice to have feedback from the affected arch
maintainers.



Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Peter Zijlstra
On Fri, Oct 15, 2021 at 02:20:33PM -0400, Steven Rostedt wrote:
> On Fri, 15 Oct 2021 20:04:29 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Oct 15, 2021 at 01:58:06PM -0400, Steven Rostedt wrote:
> > > Something like this:  
> > 
> > I think having one copy of that in a header is better than having 3
> > copies. But yes, something along them lines.
> 
> I was just about to ask you about this patch ;-)

Much better :-)

> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 228801e20788..c91711f20cf8 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -206,11 +206,7 @@ DEFINE_OUTPUT_COPY(__output_copy_user, 
> arch_perf_out_copy_user)
>  static inline int get_recursion_context(int *recursion)
>  {
>   unsigned int pc = preempt_count();

Although I think we can do without that ^ line as well :-)

> - unsigned char rctx = 0;
> -
> - rctx += !!(pc & (NMI_MASK));
> - rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
> - rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
> + unsigned char rctx = interrupt_context_level();
>  
>   if (recursion[rctx])
>   return -1;
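
For context, the shared helper referenced here, interrupt_context_level(), is the
three preempt_count() tests from the removed lines folded into one place; a sketch
of what it looks like (assumed to live in include/linux/preempt.h):

static __always_inline unsigned char interrupt_context_level(void)
{
	unsigned long pc = preempt_count();
	unsigned char level = 0;

	level += !!(pc & (NMI_MASK));
	level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
	level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

	return level;
}

With that helper, get_recursion_context() can indeed also drop its local 'pc'
variable, as noted above.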


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Peter Zijlstra
On Fri, Oct 15, 2021 at 01:58:06PM -0400, Steven Rostedt wrote:
> Something like this:

I think having one copy of that in a header is better than having 3
copies. But yes, something along them lines.


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Peter Zijlstra
On Fri, Oct 15, 2021 at 11:00:35AM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> While writing an email explaining the "bit = 0" logic for a discussion on

>   bit = trace_get_context_bit() + start;

While there, you were also going to update that function to match/use
get_recursion_context(). Because your version is still branch hell.


Re: [PATCH 2/2] ftrace: prevent preemption in perf_ftrace_function_call()

2021-10-12 Thread Peter Zijlstra
On Tue, Oct 12, 2021 at 01:40:31PM +0800, 王贇 wrote:

> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index 6aed10e..33c2f76 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -441,12 +441,19 @@ void perf_trace_buf_update(void *record, u16 type)
>   if (!rcu_is_watching())
>   return;
> 
> + /*
> +  * Prevent CPU changing from now on. rcu must
> +  * be in watching if the task was migrated and
> +  * scheduled.
> +  */
> + preempt_disable_notrace();
> +
>   if ((unsigned long)ops->private != smp_processor_id())
> - return;
> + goto out;
> 
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (bit < 0)
> - return;
> + goto out;
> 
>   event = container_of(ops, struct perf_event, ftrace_ops);
> 

This seems rather daft, wouldn't it be easier to just put that check
under the recursion thing?


Re: [PATCH v3 0/4] Add mem_hops field in perf_mem_data_src structure

2021-10-07 Thread Peter Zijlstra
On Wed, Oct 06, 2021 at 07:36:50PM +0530, Kajol Jain wrote:

> Kajol Jain (4):
>   perf: Add comment about current state of PERF_MEM_LVL_* namespace and
> remove an extra line
>   perf: Add mem_hops field in perf_mem_data_src structure
>   tools/perf: Add mem_hops field in perf_mem_data_src structure
>   powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
> 
>  arch/powerpc/perf/isa207-common.c | 26 +-
>  arch/powerpc/perf/isa207-common.h |  2 ++
>  include/uapi/linux/perf_event.h   | 19 ---
>  tools/include/uapi/linux/perf_event.h | 19 ---
>  tools/perf/util/mem-events.c  | 20 ++--
>  5 files changed, 73 insertions(+), 13 deletions(-)

Acked-by: Peter Zijlstra (Intel) 

How do we want this routed? Shall I take it, or does Michael want it in
the Power tree?


Re: [PATCH 2/4] perf: Add mem_hops field in perf_mem_data_src structure

2021-10-05 Thread Peter Zijlstra
On Tue, Oct 05, 2021 at 02:48:35PM +0530, Kajol Jain wrote:
> Going forward, future generation systems can have more hierarchy
> within the chip/package level but currently we don't have any data source
> encoding field in perf, which can be used to represent this level of data.
> 
> Add a new field called 'mem_hops' in the perf_mem_data_src structure
> which can be used to represent intra-chip/package or inter-chip/off-package
> details. This field is 3 bits wide, where the PERF_MEM_HOPS_{NA, 0..6} values
> can be used to represent different hop-level data.
> 
> Also add corresponding macros to define mem_hop field values
> and shift value.
> 
> Currently we define macro for HOPS_0 which corresponds
> to data coming from another core but same chip.
> 
> For ex: Encodings for mem_hops fields with L2 cache:
> 
> L2- local L2
> L2 | REMOTE | HOPS_0  - remote core, same chip L2

Can we do s/chip/node/ ? Hops are something NUMA related, while chips
come in a bag or something :-)

> +/* hop level */
> +#define PERF_MEM_HOPS_0  0x01 /* remote core, same chip */
> +/* 2-7 available */
> +#define PERF_MEM_HOPS_SHIFT  43
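
As an illustration (not taken from the patch), a PMU driver could compose such an
encoding with the existing PERF_MEM_S() helper from perf_event.h together with the
HOPS macros quoted above:

	union perf_mem_data_src dsrc = { .val = 0 };

	/* a load that hit the L2 of another core on the same chip, zero hops away */
	dsrc.val |= PERF_MEM_S(OP, LOAD);
	dsrc.val |= PERF_MEM_S(LVLNUM, L2);
	dsrc.val |= PERF_MEM_S(REMOTE, REMOTE);
	dsrc.val |= (u64)PERF_MEM_HOPS_0 << PERF_MEM_HOPS_SHIFT;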


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-17 Thread Peter Zijlstra
On Fri, Sep 17, 2021 at 05:25:02PM +0200, Vincent Guittot wrote:

> With the removal of the condition !sds->local_stat.sum_nr_running
> which seems useless because dst_cpu is idle and not SMT, this patch
> looks good to me

I've made it look like this. Thanks!

---
Subject: sched/fair: Consider SMT in ASYM_PACKING load balance
From: Ricardo Neri 
Date: Fri, 10 Sep 2021 18:18:19 -0700

From: Ricardo Neri 

When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.

If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.

Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
siblings of both dst_cpu and the CPUs in the candidate busiest group.

Signed-off-by: Ricardo Neri 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Joel Fernandes (Google) 
Reviewed-by: Len Brown 
Link: https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calde...@linux.intel.com
---
 kernel/sched/fair.c |   92 
 1 file changed, 92 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8538,10 +8538,96 @@ group_type group_classify(unsigned int i
return group_has_spare;
 }
 
+/**
+ * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
+ * @dst_cpu:   Destination CPU of the load balancing
+ * @sds:   Load-balancing data with statistics of the local group
+ * @sgs:   Load-balancing statistics of the candidate busiest group
+ * @sg:The candidate busiest group
+ *
+ * Check the state of the SMT siblings of both @sds::local and @sg and decide
+ * if @dst_cpu can pull tasks.
+ *
+ * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
+ * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
+ * only if @dst_cpu has higher priority.
+ *
+ * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
+ * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
+ * Bigger imbalances in the number of busy CPUs will be dealt with in
+ * update_sd_pick_busiest().
+ *
+ * If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
+ * of @dst_cpu are idle and @sg has lower priority.
+ */
+static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
+   struct sg_lb_stats *sgs,
+   struct sched_group *sg)
+{
+#ifdef CONFIG_SCHED_SMT
+   bool local_is_smt, sg_is_smt;
+   int sg_busy_cpus;
+
+   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
+   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
+
+   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
+
+   if (!local_is_smt) {
+   /*
+* If we are here, @dst_cpu is idle and does not have SMT
+* siblings. Pull tasks if candidate group has two or more
+* busy CPUs.
+*/
+   if (sg_busy_cpus >= 2) /* implies sg_is_smt */
+   return true;
+
+   /*
+* @dst_cpu does not have SMT siblings. @sg may have SMT
+* siblings and only one is busy. In such case, @dst_cpu
+* can help if it has higher priority and is idle (i.e.,
+* it has no running tasks).
+*/
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+   }
+
+   /* @dst_cpu has SMT siblings. */
+
+   if (sg_is_smt) {
+   int local_busy_cpus = sds->local->group_weight -
+ sds->local_stat.idle_cpus;
+   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
+
+   if (busy_cpus_delta == 1)
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+   return false;
+   }
+
+   /*
+* @sg does not have SMT siblings. Ensure that @sds::local does not end
+* up with more than one busy SMT sibling and only pull tasks if there
+* are not busy CPUs (i.e., no CPU has running tasks).
+*/
+   if (!sds->local_stat.sum_nr_running)
+   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
+
+   return false;
+#else
+   /* Always return false so that callers deal with non-SMT cases. */
+   return false;
+#endif
+}
+
 static inline bool
 sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats *sgs,
   struct sched_group *group)
 {
+   /* Only do SMT checks if either local or candidate have SMT siblings */
+   if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
+   (group->flags & SD_SHARE_CPUCAPACITY))

Re: [PATCH 1/3] perf: Add macros to specify onchip L2/L3 accesses

2021-09-14 Thread Peter Zijlstra
On Tue, Sep 14, 2021 at 08:40:38PM +1000, Michael Ellerman wrote:
> Peter Zijlstra  writes:

> > I'm thinking we ought to keep hops as steps along the NUMA fabric, with
> > 0 hops being the local node. That only gets us:
> >
> >  L2, remote=0, hops=HOPS_0 -- our L2
> >  L2, remote=1, hops=HOPS_0 -- L2 on the local node but not ours
> >  L2, remote=1, hops!=HOPS_0 -- L2 on a remote node
> 
> Hmm. I'm not sure about tying it directly to NUMA hops. I worry we're
> going to see more and more systems where there's a hierarchy within the
> chip/package, in addition to the traditional NUMA hierarchy.
> 
> Although then I guess it becomes a question of what exactly is a NUMA
> hop, maybe the answer is that on those future systems those
> intra-chip/package hops should be represented as NUMA hops.
> 
> It's not like we have a hard definition of what a NUMA hop is?

Not really, typically whatever the BIOS/DT/whatever tables tell us. I
think in case of Power you're mostly making things up in software :-)

But yeah, I think we have plenty wriggle room there.


Re: [PATCH 1/3] perf: Add macros to specify onchip L2/L3 accesses

2021-09-09 Thread Peter Zijlstra
On Thu, Sep 09, 2021 at 10:45:54PM +1000, Michael Ellerman wrote:

> > The 'new' composite doesn't have a hops field because the hardware that
> > necessitated that change doesn't report it, but we could easily add a
> > field there.
> >
> > Suppose we add, mem_hops:3 (would 6 hops be too small?) and the
> > corresponding PERF_MEM_HOPS_{NA, 0..6}
> 
> It's really 7 if we use remote && hop = 0 to mean the first hop.

I don't think we can do that, becaus of backward compat. Currently:

  lvl_num=DRAM, remote=1

denotes: "Remote DRAM of any distance". Effectively it would have the new
hops field filled with zeros though, so if you then decode with the hops
field added it suddenly becomes:

 lvl_num=DRAM, remote=1, hops=0

and reads like: "Remote DRAM of 0 hops" which is quite daft. Therefore 0
really must denote a 'N/A'.

> If we're wanting to use some of the hop levels to represent
> intra-chip/package hops then we could possibly use them all on a really
> big system.
> 
> eg. you could imagine something like:
> 
>  L2 | - local L2
>  L2 | REMOTE | HOPS_0 - L2 of neighbour core
>  L2 | REMOTE | HOPS_1 - L2 of near core on same chip (same 1/2 of chip)
>  L2 | REMOTE | HOPS_2 - L2 of far core on same chip (other 1/2 of chip)
>  L2 | REMOTE | HOPS_3 - L2 of sibling chip in same package
>  L2 | REMOTE | HOPS_4 - L2 on separate package 1 hop away
>  L2 | REMOTE | HOPS_5 - L2 on separate package 2 hops away
>  L2 | REMOTE | HOPS_6 - L2 on separate package 3 hops away
> 
> 
> Whether it's useful to represent all those levels I'm not sure, but it's
> probably good if we have the ability.

I'm thinking we ought to keep hops as steps along the NUMA fabric, with
0 hops being the local node. That only gets us:

 L2, remote=0, hops=HOPS_0 -- our L2
 L2, remote=1, hops=HOPS_0 -- L2 on the local node but not ours
 L2, remote=1, hops!=HOPS_0 -- L2 on a remote node

> I guess I'm 50/50 on whether that's enough levels, or whether we want
> another bit to allow for future growth.

Right, possibly safer to add one extra bit while we can I suppose.



Re: [PATCH 1/3] perf: Add macros to specify onchip L2/L3 accesses

2021-09-08 Thread Peter Zijlstra
On Wed, Sep 08, 2021 at 05:17:53PM +1000, Michael Ellerman wrote:
> Kajol Jain  writes:

> > diff --git a/include/uapi/linux/perf_event.h 
> > b/include/uapi/linux/perf_event.h
> > index f92880a15645..030b3e990ac3 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -1265,7 +1265,9 @@ union perf_mem_data_src {
> >  #define PERF_MEM_LVLNUM_L2 0x02 /* L2 */
> >  #define PERF_MEM_LVLNUM_L3 0x03 /* L3 */
> >  #define PERF_MEM_LVLNUM_L4 0x04 /* L4 */
> > -/* 5-0xa available */
> > +#define PERF_MEM_LVLNUM_OC_L2  0x05 /* On Chip L2 */
> > +#define PERF_MEM_LVLNUM_OC_L3  0x06 /* On Chip L3 */
> 
> The obvious use for 5 is for "L5" and so on.
> 
> I'm not sure adding new levels is the best idea, because these don't fit
> neatly into the hierarchy, they are off to the side.
> 
> 
> I wonder if we should use the remote field.
> 
> ie. for another core's L2 we set:
> 
>   mem_lvl = PERF_MEM_LVL_L2
>   mem_remote = 1

This mixes APIs (see below), IIUC the correct usage would be something
like: lvl_num=L2 remote=1

> Which would mean "remote L2", but not remote enough to be
> lvl = PERF_MEM_LVL_REM_CCE1.
> 
> It would be printed by the existing tools/perf code as "Remote L2", vs
> "Remote cache (1 hop)", which seems OK.
> 
> 
> ie. we'd be able to express:
> 
>   Current core's L2: LVL_L2
>   Other core's L2:   LVL_L2 | REMOTE
>   Other chip's L2:   LVL_REM_CCE1 | REMOTE
> 
> And similarly for L3.
> 
> I think that makes sense? Unless people think remote should be reserved
> to mean on another chip, though we already have REM_CCE1 for that.

IIRC the PERF_MEM_LVL_* namespace is somewhat deprecated in favour of
the newer composite PERF_MEM_{LVLNUM_,REMOTE_,SNOOPX_} fields. Of
course, ABIs being what they are, we get to support both :/ But I'm not
sure mixing them is a great idea.

Also, clearly this could use a comment...

The 'new' composite doesn't have a hops field because the hardware that
necessitated that change doesn't report it, but we could easily add a
field there.

Suppose we add, mem_hops:3 (would 6 hops be too small?) and the
corresponding PERF_MEM_HOPS_{NA, 0..6}

Then I suppose you can encode things like:

L2  - local L2
L2 | REMOTE - remote L2 at an unspecified distance (NA)
L2 | REMOTE | HOPS_0- remote L2 on the same node
L2 | REMOTE | HOPS_1- remote L2 on a node 1 removed

Would that work?


Re: [PATCH v3] powerpc/32: Add support for out-of-line static calls

2021-09-01 Thread Peter Zijlstra
On Wed, Sep 01, 2021 at 08:30:21AM +, Christophe Leroy wrote:
> Add support for out-of-line static calls on PPC32. This change
> improve performance of calls to global function pointers by
> using direct calls instead of indirect calls.
> 
> The trampoline is initially populated with a 'blr' or branch to target,
> followed by an unreachable long jump sequence.
> 
> In order to cater for parallel execution, the trampoline needs to
> be updated in a way that ensures it remains consistent at all times.
> This means we can't use the traditional lis/addi to load r12 with
> the target address, otherwise there would be a window during which
> the first instruction contains the upper part of the new target
> address while the second instruction still contains the lower part of
> the old target address. To avoid that the target address is stored
> just after the 'bctr' and loaded from there with a single instruction.
> 
> Then, depending on the target distance, arch_static_call_transform()
> will either replace the first instruction by a direct 'bl ' or
> 'nop' in order to have the trampoline fall through the long jump
> sequence.
> 
> For the special case of __static_call_return0(), to avoid the risk of
> a far branch, a version of it is inlined at the end of the trampoline.

(also, it's in the same cache line, so it avoids another cache miss and it
nicely fills the hole you had in your 32-byte chunk)

> Performancewise the long jump sequence is probably not better than
> the indirect calls set by GCC when we don't use static calls, but
> such calls are unlikely to be required on powerpc32: With most
> configurations the kernel size is far below 32 Mbytes so only
> modules may happen to be too far. And even modules are likely to
> be close enough as they are allocated below the kernel core and
> as close as possible of the kernel text.
> 
> static_call selftest is running successfully with this change.

Nice!, I'd ask if you'd tried PREEMPT_DYNAMIC, since that should really
stress the thing, but I see that also requires GENERIC_ENTRY and you
don't have that. Alas.

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v2] powerpc/32: Add support for out-of-line static calls

2021-08-31 Thread Peter Zijlstra
On Tue, Aug 31, 2021 at 01:12:26PM +, Christophe Leroy wrote:
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 36b72d972568..a0fe69d8ec83 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -247,6 +247,7 @@ config PPC
>   select HAVE_SOFTIRQ_ON_OWN_STACK
>   select HAVE_STACKPROTECTOR  if PPC32 && 
> $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
>   select HAVE_STACKPROTECTOR  if PPC64 && 
> $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
> + select HAVE_STATIC_CALL if PPC32
>   select HAVE_SYSCALL_TRACEPOINTS
>   select HAVE_VIRT_CPU_ACCOUNTING
>   select HUGETLB_PAGE_SIZE_VARIABLE   if PPC_BOOK3S_64 && HUGETLB_PAGE
> diff --git a/arch/powerpc/include/asm/static_call.h 
> b/arch/powerpc/include/asm/static_call.h
> new file mode 100644
> index ..2402c6d32439
> --- /dev/null
> +++ b/arch/powerpc/include/asm/static_call.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_STATIC_CALL_H
> +#define _ASM_POWERPC_STATIC_CALL_H
> +
> +#define __POWERPC_SCT(name, inst)\
> + asm(".pushsection .text, \"ax\" \n" \
> + ".align 5   \n" \
> + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ":  \n" \
> + inst "  \n" \
> + "   lis 12,1f@ha\n" \
> + "   lwz 12,1f@l(12) \n" \
> + "   mtctr   12  \n" \
> + "   bctr\n" \
> + "1: .long 0 \n" \
> + "   nop \n" \
> + "   nop \n" \
> + ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" \
> + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " 
> STATIC_CALL_TRAMP_STR(name) " \n" \
> + ".popsection\n")
> +
> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)__POWERPC_SCT(name, "b 
> " #func)
> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) __POWERPC_SCT(name, 
> "blr")
> +
> +#endif /* _ASM_POWERPC_STATIC_CALL_H */

> diff --git a/arch/powerpc/kernel/static_call.c 
> b/arch/powerpc/kernel/static_call.c
> new file mode 100644
> index ..e5e78205ccb4
> --- /dev/null
> +++ b/arch/powerpc/kernel/static_call.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include 
> +
> +#include 
> +
> +void arch_static_call_transform(void *site, void *tramp, void *func, bool 
> tail)
> +{
> + int err;
> + unsigned long target = (long)func;
> + bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
> +
> + if (!tramp)
> + return;
> +
> + mutex_lock(&text_mutex);
> +
> + if (func && !is_short) {
> + err = patch_instruction(tramp + 20, ppc_inst(target));
> + if (err)
> + goto out;
> + }
> +
> + if (!func)
> + err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
> + else if (is_short)
> + err = patch_branch(tramp, target, 0);
> + else
> + err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
> +out:
> + mutex_unlock(&text_mutex);
> +
> + if (err)
> + panic("%s: patching failed %pS at %pS\n", __func__, func, 
> tramp);
> +}
> +EXPORT_SYMBOL_GPL(arch_static_call_transform);

Yes, this should work nicely!

Since you have the two nop's at the end, you could frob in an
optimization for __static_call_return0 without too much issue.

Replace the two nops with (excuse my ppc asm):

li r3, 0
blr

and augment arch_static_call_transform() with something like:

if (func == &__static_call_return0)
err = patch_branch(tramp, tramp+24, 0);



Re: [PATCH] powerpc/32: Add support for out-of-line static calls

2021-08-31 Thread Peter Zijlstra
On Tue, Aug 31, 2021 at 08:05:21AM +, Christophe Leroy wrote:

> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
> + asm(".pushsection .text, \"ax\" \n" \
> + ".align 4   \n" \
> + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ":  \n" \
> + "   blr \n" \
> + "   nop \n" \
> + "   nop \n" \
> + "   nop \n" \
> + ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" \
> + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " 
> STATIC_CALL_TRAMP_STR(name) " \n" \
> + ".popsection\n")

> +static int patch_trampoline_32(u32 *addr, unsigned long target)
> +{
> + int err;
> +
> + err = patch_instruction(addr++, ppc_inst(PPC_RAW_LIS(_R12, 
> PPC_HA(target;
> + err |= patch_instruction(addr++, ppc_inst(PPC_RAW_ADDI(_R12, _R12, 
> PPC_LO(target;
> + err |= patch_instruction(addr++, ppc_inst(PPC_RAW_MTCTR(_R12)));
> + err |= patch_instruction(addr, ppc_inst(PPC_RAW_BCTR()));
> +
> + return err;
> +}

There can be concurrent execution and modification; the above doesn't
look safe in that regard. What happens if you've say, done the first
two, but not the latter two and execution happens (on a different
CPU or through IRQ context, etc..)?

> +void arch_static_call_transform(void *site, void *tramp, void *func, bool 
> tail)
> +{
> + int err;
> + unsigned long target = (long)func;
> +
> + if (!tramp)
> + return;
> +
> + mutex_lock(&text_mutex);
> +
> + if (!func)
> + err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
> + else if (is_offset_in_branch_range((long)target - (long)tramp))
> + err = patch_branch(tramp, target, 0);

These two are single instruction modifications and I'm assuming the
hardware is sane enough that execution sees either the old or the new
instruction. So this should work.

> + else if (IS_ENABLED(CONFIG_PPC32))
> + err = patch_trampoline_32(tramp, target);
> + else
> + BUILD_BUG();
> +
> + mutex_unlock(&text_mutex);
> +
> + if (err)
> + panic("%s: patching failed %pS at %pS\n", __func__, func, 
> tramp);
> +}
> +EXPORT_SYMBOL_GPL(arch_static_call_transform);

One possible solution that we explored on ARM64, was having the
trampoline be in 2 slots:


b 1f

1:  blr
nop
nop
nop

2:  blr
nop
nop
nop

Where initially the first slot is active (per "b 1f"), then you write
the second slot, and as a final act, re-write the initial branch to
point to slot 2.

Then you execute synchronize_rcu_tasks() under your text mutex
(careful!) to ensure all users of your slot1 are gone and the next
modification repeats the whole thing, except for using slot1 etc..

Eventually I think Ard came up with the latest ARM64 proposal which puts
a literal in a RO section (could be the text section I suppose) and
loads and branches to that.

Anyway, the thing is, you can really only modify a single instruction at
the time and need to ensure concurrent execution is correct.


Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Peter Zijlstra
On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > +/**
> > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> > tasks
> > + * @dst_cpu:   Destination CPU of the load balancing
> > + * @sds:   Load-balancing data with statistics of the local group
> > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > + * @sg:The candidate busiest group
> > + *
> > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > decide
> > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it 
> > can
> > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only 
> > one
> > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > + *
> > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> > + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> > siblings
> > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> > + * has lower priority.
> > + */
> > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > +   struct sg_lb_stats *sgs,
> > +   struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > +   bool local_is_smt, sg_is_smt;
> > +   int sg_busy_cpus;
> > +
> > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > +
> > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > +   if (!local_is_smt) {
> > +   /*
> > +* If we are here, @dst_cpu is idle and does not have SMT
> > +* siblings. Pull tasks if candidate group has two or more
> > +* busy CPUs.
> > +*/
> > +   if (sg_is_smt && sg_busy_cpus >= 2)
> > +   return true;
> > +
> > +   /*
> > +* @dst_cpu does not have SMT siblings. @sg may have SMT
> > +* siblings and only one is busy. In such case, @dst_cpu
> > +* can help if it has higher priority and is idle.
> > +*/
> > +   return !sds->local_stat.group_util &&
> 
> sds->local_stat.group_util can't be used to decide if a CPU or group
> of CPUs is idle. util_avg is usually not null when a CPU becomes idle
> and you can have to wait  more than 300ms before it becomes Null
> At the opposite, the utilization of a CPU can be null but a task with
> null utilization has just woken up on it.
> Utilization is used to reflect the average work of the CPU or group of
> CPUs but not the current state

If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
sgs->group_weight come to mind.

> > +  sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > +   }
> > +
> > +   /* @dst_cpu has SMT siblings. */
> > +
> > +   if (sg_is_smt) {
> > +   int local_busy_cpus = sds->local->group_weight -
> > + sds->local_stat.idle_cpus;
> > +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > +
> > +   /* Local can always help to even the number busy CPUs. */
> 
> default behavior of the load balance already tries to even the number
> of idle CPUs.

Right, but I suppose this is because we're trapped here and have to deal
with the SMT-SMT case too. Ricardo, can you clarify?

> > +   if (busy_cpus_delta >= 2)
> > +   return true;
> > +
> > +   if (busy_cpus_delta == 1)
> > +   return sched_asym_prefer(dst_cpu,
> > +sg->asym_prefer_cpu);
> > +
> > +   return false;
> > +   }
> > +
> > +   /*
> > +* @sg does not have SMT siblings. Ensure that @sds::local does not 
> > end
> > +* up with more than one busy SMT sibling and only pull tasks if 
> > there
> > +* are not busy CPUs. As CPUs move in and out of idle state 
> > frequently,
> > +* also check the group utilization to smoother the decision.
> > +*/
> > +   if (!sds->local_stat.group_util)
> 
> same comment as above about the meaning of group_util == 0
> 
> > +   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > +
> > +   return false;
> > +#else
> > +   /* Always return false so that callers deal with non-SMT cases. */
> > +   return false;
> > +#endif
> > +}
> > +
> >  static inline bool
> >  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct 
> > sg_lb_stats *sgs,
> >struct sched_group *group)
> >  {
> > +   /* Only do SMT checks if either local or candidate have 

Re: [RFC PATCH] powerpc: Investigate static_call concept

2021-08-27 Thread Peter Zijlstra
On Fri, Aug 27, 2021 at 09:45:37AM +, Christophe Leroy wrote:
> This RFC is to validate the concept of static_call on powerpc.
> 
> Highly copied from x86.
> 
> It replaces ppc_md.get_irq() which is called at every IRQ, by
> a static call.

The code looks saner, but does it actually improve performance? I'm
thinking the double branch also isn't free.

> When updating the call, we just replace the instruction at the
> trampoline address by a relative jump to the function.
> 
> For the time being, the case of out-of-range functions is not handled.

The paranoid in me would've made it:

BUG_ON(patch_branch(...));

just to make sure to notice the target not fitting. Ohh, patch_branch()
doesn't return the create_branch() error, perhaps that wants to be
fixed?

Did you see the arm64 variant that deals with out-of-range functions in
their trampoline?

  
https://lore.kernel.org/linux-arm-kernel/20201120082103.4840-1-a...@kernel.org/

Not exactly 'nice' but supposedly that works.

> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
> + asm(".pushsection .text, \"ax\" \n" \
> + ".align 4   \n" \
> + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ":  \n" \
> + "   blr \n" \
> + ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" \
> + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " 
> STATIC_CALL_TRAMP_STR(name) " \n" \
> + ".popsection\n")
> +

Since you support CALL_NULL_TRAMP, your patch function below:

> +void arch_static_call_transform(void *site, void *tramp, void *func, bool 
> tail)
> +{
> + mutex_lock(&text_mutex);
> +
> + if (tramp)
> + patch_branch(tramp, (unsigned long)func, 0);
> +
> + mutex_unlock(&text_mutex);
> +}
> +EXPORT_SYMBOL_GPL(arch_static_call_transform);

Ought to patch in "blr" when !func to be consistent :-)

I'm thinking that your core kernel text all fits in the native range and
only modules need out-of-range ?
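
For readers new to the API under discussion, a minimal usage sketch of out-of-line
static calls; the names below are made up for illustration and are not from the
RFC patch:

#include <linux/static_call.h>

static int default_get_irq(void)
{
	return 0;	/* placeholder default implementation */
}

/* Defines the key and the arch trampoline, initially bound to the default. */
DEFINE_STATIC_CALL(ppc_get_irq, default_get_irq);

/* Call site: a direct call through the trampoline, no indirect branch. */
static void do_irq(void)
{
	int irq = static_call(ppc_get_irq)();

	(void)irq;
}

/* Rebinding at runtime; this is what ends up in arch_static_call_transform(). */
static void install_get_irq(int (*fn)(void))
{
	static_call_update(ppc_get_irq, fn);
}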


Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline

2021-08-23 Thread Peter Zijlstra
On Mon, Aug 23, 2021 at 03:04:37PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra  [2021-08-23 10:33:30]:
> 
> > On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> > > Scheduler expects unique number of node distances to be available
> > > at boot. It uses node distance to calculate this unique node
> > > distances. On Power Servers, node distances for offline nodes is not
> > > available. However, Power Servers already knows unique possible node
> > > distances. Fake the offline node's distance_lookup_table entries so
> > > that all possible node distances are updated.
> > > 
> > > For example distance info from numactl from a fully populated 8 node
> > > system at boot may look like this.
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > > 
> > > However the same system when only two nodes are online at boot, then
> > > distance info from numactl will look like
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > With the faked numa distance at boot, the node distance table will look
> > > like
> > > node   0   1   2
> > >   0:  10  20  40
> > >   1:  20  10  40
> > >   2:  40  40  10
> > > 
> > > The actual distance will be populated once the nodes are onlined.
> > 
> > How did you want all this merged? I picked up Valentin's patch, do you
> > want me to pick up these PowerPC patches in the same tree, or do you
> > want to route them seperately?
> 
> While both (the patch you accepted and this series) together help solve the
> problem, I think there is no hard dependency between the two. Hence I would
> think it should be okay to go through the powerpc tree.
> 

OK, works for me, thanks!


Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline

2021-08-23 Thread Peter Zijlstra
On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> Scheduler expects unique number of node distances to be available
> at boot. It uses node distance to calculate this unique node
> distances. On Power Servers, node distances for offline nodes is not
> available. However, Power Servers already knows unique possible node
> distances. Fake the offline node's distance_lookup_table entries so
> that all possible node distances are updated.
> 
> For example distance info from numactl from a fully populated 8 node
> system at boot may look like this.
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
> 
> However the same system when only two nodes are online at boot, then
> distance info from numactl will look like
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> With the faked numa distance at boot, the node distance table will look
> like
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  40
>   2:  40  40  10
> 
> The actual distance will be populated once the nodes are onlined.

How did you want all this merged? I picked up Valentin's patch, do you
want me to pick up these PowerPC patches in the same tree, or do you
want to route them seperately?


Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE

2021-08-04 Thread Peter Zijlstra
On Wed, Aug 04, 2021 at 04:39:44PM +1000, Nicholas Piggin wrote:
> For that matter, I wonder if we shouldn't do something like this 
> (untested) so the low level batch flush has visibility to the high
> level flush range.
> 
> x86 could use this too AFAIKS, just needs to pass the range a bit
> further down, but in practice I'm not sure it would ever really
> matter for them because the 2MB level exceeds the single page flush
> ceiling for 4k pages unlike powerpc with 64k pages. But in corner
> cases where the unmap crossed a bunch of small vmas or the ceiling
> was increased then in theory it could be of use.

x86 can do single invalidates for 2M pages if that's the only type
encountered in the range, at which point we can do 33*2M = 66M, which is
well below the 1G range for the next level of huge.

> Subject: [PATCH v1] mm/mmu_gather: provide archs with the entire range that is
>  to be flushed, not just the particular gather
> 
> This allows archs to optimise flushing heuristics better, in the face of
> flush operations forcing smaller flush batches. For example, an
> architecture may choose a more costly per-page invalidation for small
> ranges of pages with the assumption that the full TLB flush cost would
> be more expensive in terms of refills. However if a very large range is
> forced into flushing small ranges, the faster full-process flush may
> have been the better choice.

What is splitting up the flushes for you? That page_size power-special?
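
For reference, the x86 behaviour referred to above (up to 33 invalidations at the
gather's stride before falling back to a full flush) is, condensed and paraphrased
from arch/x86/mm/tlb.c:

	/* tlb_single_page_flush_ceiling defaults to 33 */
	if (end != TLB_FLUSH_ALL &&
	    ((end - start) >> stride_shift) <= tlb_single_page_flush_ceiling) {
		unsigned long addr = start;

		while (addr < end) {
			flush_tlb_one_user(addr);	/* INVLPG */
			addr += 1UL << stride_shift;
		}
	} else {
		flush_tlb_local();			/* full flush of non-global entries */
	}

With a 2M stride that is 33 * 2M = 66M before the full-flush fallback, which is
where the number above comes from.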


Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-25 Thread Peter Zijlstra
On Fri, Jun 25, 2021 at 02:23:16PM +0530, Bharata B Rao wrote:
> On Fri, Jun 25, 2021 at 09:28:09AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> > > * Bharata B Rao  [2021-06-24 21:25:09]:
> > > 
> > > > A PowerPC KVM guest gets the following BUG message when booting
> > > > linux-next-20210623:
> > > > 
> > > > smp: Bringing up secondary CPUs ...
> > > > BUG: scheduling while atomic: swapper/1/0/0x
> > 
> > 'funny', your preempt_count is actually too low. The check here is for
> > preempt_count() == DISABLE_OFFSET (aka. 1 when PREEMPT=y), but you have
> > 0.
> > 
> > > > no locks held by swapper/1/0.
> > > > Modules linked in:
> > > > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > > > Call Trace:
> > > > [cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 
> > > > (unreliable)
> > > > [cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
> > > > [cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
> > > > [cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
> > > > [cae5be50] [c022984c] do_idle+0x2bc/0x420
> > > > [cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
> > > > [cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
> > > > [cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14
> > > > 
> > > > 
> > > > 
> > > > smp: Brought up 2 nodes, 16 CPUs
> > > > numa: Node 0 CPUs: 0-7
> > > > numa: Node 1 CPUs: 8-15
> > > > 
> > > > This seems to have started from next-20210521 and isn't seen on
> > > > next-20210511.
> > > > 
> > > 
> > > Bharata,
> > > 
> > > I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> > > Initialize the idle task with preemption disabled")
> > 
> > So that extra preempt_disable() that got removed would've incremented it
> > to 1 and then things would've been fine.
> > 
> > Except.. Valentin changed things such that preempt_count() should've
> > been initialized to 1, instead of 0, but for some raisin that didn't
> > stick.. what gives.
> > 
> > So we have init_idle(p) -> init_idle_preempt_count(p) ->
> > task_thread_info(p)->preempt_count = PREEMPT_DISABLED;
> > 
> > But somehow, by the time you're running start_secondary(), that's gotten
> > to be 0 again. Does DEBUG_PREEMPT give more clues?
> 
> PREEMPTION is off here.

You mean: CONFIG_PREEMPTION=n, what about CONFIG_PREEMPT_COUNT?

Because if both are =n, then I don't see how that warning could trigger.
in_atomic_preempt_off() would then result in preempt_count() == 0, and
per the print above, it *is* 0.
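
For reference, the check that produces the splat, paraphrased from
include/linux/preempt.h and kernel/sched/core.c:

	/*
	 * With CONFIG_PREEMPT_COUNT=y, PREEMPT_DISABLE_OFFSET is PREEMPT_OFFSET (1);
	 * with CONFIG_PREEMPT_COUNT=n it is 0, so a preempt_count() of 0 (as printed
	 * above) cannot trip the test below.
	 */
	#define in_atomic_preempt_off()	(preempt_count() != PREEMPT_DISABLE_OFFSET)

	/* kernel/sched/core.c, schedule_debug(): */
	if (unlikely(in_atomic_preempt_off())) {
		__schedule_bug(prev);		/* "BUG: scheduling while atomic" */
		preempt_count_set(PREEMPT_DISABLED);
	}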


Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-25 Thread Peter Zijlstra
On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> * Bharata B Rao  [2021-06-24 21:25:09]:
> 
> > A PowerPC KVM guest gets the following BUG message when booting
> > linux-next-20210623:
> > 
> > smp: Bringing up secondary CPUs ...
> > BUG: scheduling while atomic: swapper/1/0/0x

'funny', your preempt_count is actually too low. The check here is for
preempt_count() == DISABLE_OFFSET (aka. 1 when PREEMPT=y), but you have
0.

> > no locks held by swapper/1/0.
> > Modules linked in:
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > Call Trace:
> > [cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> > [cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
> > [cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
> > [cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
> > [cae5be50] [c022984c] do_idle+0x2bc/0x420
> > [cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
> > [cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
> > [cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14
> > 
> > 
> > 
> > smp: Brought up 2 nodes, 16 CPUs
> > numa: Node 0 CPUs: 0-7
> > numa: Node 1 CPUs: 8-15
> > 
> > This seems to have started from next-20210521 and isn't seen on
> > next-20210511.
> > 
> 
> Bharata,
> 
> I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> Initialize the idle task with preemption disabled")

So that extra preempt_disable() that got removed would've incremented it
to 1 and then things would've been fine.

Except.. Valentin changed things such that preempt_count() should've
been initialized to 1, instead of 0, but for some raisin that didn't
stick.. what gives.

So we have init_idle(p) -> init_idle_preempt_count(p) ->
task_thread_info(p)->preempt_count = PREEMPT_DISABLED;

But somehow, by the time you're running start_secondary(), that's gotten
to be 0 again. Does DEBUG_PREEMPT give more clues?


Re: [PATCH v3 0/4] Add perf interface to expose nvdimm

2021-06-23 Thread Peter Zijlstra
On Wed, Jun 23, 2021 at 01:40:38PM +0530, kajoljain wrote:
> 
> 
> On 6/22/21 6:44 PM, Peter Zijlstra wrote:
> > On Thu, Jun 17, 2021 at 06:56:13PM +0530, Kajol Jain wrote:
> >> ---
> >> Kajol Jain (4):
> >>   drivers/nvdimm: Add nvdimm pmu structure
> >>   drivers/nvdimm: Add perf interface to expose nvdimm performance stats
> >>   powerpc/papr_scm: Add perf interface support
> >>   powerpc/papr_scm: Document papr_scm sysfs event format entries
> > 
> > Don't see anything obviously wrong with this one.
> > 
> > Acked-by: Peter Zijlstra (Intel) 
> > 
> 
> Hi Peter,
> Thanks for reviewing the patch. Can you help me on how to take 
> these patches to linus tree or can you take it?

I would expect either the NVDIMM or PPC maintainers to take this. Dan,
Michael ?


Re: [PATCH v3 0/4] Add perf interface to expose nvdimm

2021-06-22 Thread Peter Zijlstra
On Thu, Jun 17, 2021 at 06:56:13PM +0530, Kajol Jain wrote:
> ---
> Kajol Jain (4):
>   drivers/nvdimm: Add nvdimm pmu structure
>   drivers/nvdimm: Add perf interface to expose nvdimm performance stats
>   powerpc/papr_scm: Add perf interface support
>   powerpc/papr_scm: Document papr_scm sysfs event format entries

Don't see anything obviously wrong with this one.

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH 2/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-06-08 Thread Peter Zijlstra
On Tue, Jun 08, 2021 at 05:26:58PM +0530, Kajol Jain wrote:
> +static int nvdimm_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
> +{
> + struct nvdimm_pmu *nd_pmu;
> + u32 target;
> + int nodeid;
> + const struct cpumask *cpumask;
> +
> + nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
> +
> + /* Clear it, in case the given cpu is set in nd_pmu->arch_cpumask */
> + cpumask_test_and_clear_cpu(cpu, &nd_pmu->arch_cpumask);
> +
> + /*
> +  * If given cpu is not same as current designated cpu for
> +  * counter access, just return.
> +  */
> + if (cpu != nd_pmu->cpu)
> + return 0;
> +
> + /* Check for any active cpu in nd_pmu->arch_cpumask */
> + target = cpumask_any(&nd_pmu->arch_cpumask);
> + nd_pmu->cpu = target;
> +
> + /*
> +  * Incase we don't have any active cpu in nd_pmu->arch_cpumask,
> +  * check in given cpu's numa node list.
> +  */
> + if (target >= nr_cpu_ids) {
> + nodeid = cpu_to_node(cpu);
> + cpumask = cpumask_of_node(nodeid);
> + target = cpumask_any_but(cpumask, cpu);
> + nd_pmu->cpu = target;
> +
> + if (target >= nr_cpu_ids)
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> +static int nvdimm_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
> +{
> + struct nvdimm_pmu *nd_pmu;
> +
> + nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
> +
> + if (nd_pmu->cpu >= nr_cpu_ids)
> + nd_pmu->cpu = cpu;
> +
> + return 0;
> +}

> +static int nvdimm_pmu_cpu_hotplug_init(struct nvdimm_pmu *nd_pmu)
> +{
> + int nodeid, rc;
> + const struct cpumask *cpumask;
> +
> + /*
> +  * In case cpu hotplug is not handled by arch specific code,
> +  * it can still provide the required cpumask which can be used
> +  * to get the designated cpu for counter access.
> +  * Check for any active cpu in nd_pmu->arch_cpumask.
> +  */
> + if (!cpumask_empty(&nd_pmu->arch_cpumask)) {
> + nd_pmu->cpu = cpumask_any(&nd_pmu->arch_cpumask);
> + } else {
> + /* pick active cpu from the cpumask of device numa node. */
> + nodeid = dev_to_node(nd_pmu->dev);
> + cpumask = cpumask_of_node(nodeid);
> + nd_pmu->cpu = cpumask_any(cpumask);
> + }
> +
> + rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "perf/nvdimm:online",
> +  nvdimm_pmu_cpu_online, 
> nvdimm_pmu_cpu_offline);
> +

Did you actually test this hotplug stuff?

That is, create a counter, unplug the CPU the counter was on, and
continue counting? "perf stat -I" is a good option for this, concurrent
with a hotplug.

Because I don't think it's actually correct. The thing is perf core is
strictly per-cpu, and it will place the event on a specific CPU context.
If you then unplug that CPU, nothing will touch the events on that CPU
anymore.

What drivers that span CPUs need to do is call
perf_pmu_migrate_context() whenever the CPU they were assigned to goes
away. Please have a look at arch/x86/events/rapl.c or
arch/x86/events/amd/power.c for relatively simple drivers that have this
property.
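
A minimal sketch of the suggested pattern, modeled on the x86 RAPL driver; it
assumes struct nvdimm_pmu embeds its struct pmu as nd_pmu->pmu:

static int nvdimm_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
{
	struct nvdimm_pmu *nd_pmu = hlist_entry_safe(node, struct nvdimm_pmu, node);
	unsigned int target;

	if (cpu != nd_pmu->cpu)
		return 0;

	/* Pick any other online CPU and move the existing events to it. */
	target = cpumask_any_but(cpu_online_mask, cpu);
	if (target >= nr_cpu_ids)
		return 0;

	perf_pmu_migrate_context(&nd_pmu->pmu, cpu, target);
	nd_pmu->cpu = target;

	return 0;
}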




Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-28 Thread Peter Zijlstra
On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider  [2021-05-24 15:16:09]:

> > I suppose one way to avoid the hook would be to write some "fake" distance
> > values into your distance_lookup_table[] for offline nodes using your
> > distance_ref_point_depth thing, i.e. ensure an iteration of
> > node_distance(a, b) covers all distance values [1]. You can then keep patch
> > 3 around, and that should roughly be it.
> > 
> 
> Yes, this would suffice but to me it's not very clean.
> static int found[distance_ref_point_depth];
> 
> for_each_node(node){
>   int i, nd, distance = LOCAL_DISTANCE;
>   goto out;
> 
>   nd = node_distance(node, first_online_node)
>   for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
>   if (node_online) {
>   if (distance != nd)
>   continue;
>   found[i] ++;
>   break;
>   }
>   if (found[i])
>   continue;
>   distance_lookup_table[node][i] = 
> distance_lookup_table[first_online_node][i];
>   found[i] ++;
>   break;
>   }
> }
> 
> But do note: We are setting a precedent for node distance between two nodes
> to change.

Not really; or rather not more than already is the case AFAICT. Because
currently your distance table will have *something* in it
(LOCAL_DISTANCE afaict) for nodes that have never been online, which is
what triggered the whole problem to begin with.

Only after the node has come online for the first time, will it contain
the right value.

So both before and after this proposal the actual distance value changes
after the first time a node goes online.

Yes that's unfortunate, but I don't see a problem with pre-filling it
with something useful in order to avoid additional arch hooks.




Re: [RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device

2021-05-26 Thread Peter Zijlstra
On Wed, May 26, 2021 at 12:56:58PM +0530, kajoljain wrote:
> On 5/25/21 7:46 PM, Peter Zijlstra wrote:
> > On Tue, May 25, 2021 at 06:52:16PM +0530, Kajol Jain wrote:

> >> It adds cpumask to designate a cpu to make HCALL to
> >> collect the counter data for the nvdimm device and
> >> update ABI documentation accordingly.
> >>
> >> Result in power9 lpar system:
> >> command:# cat /sys/devices/nmem0/cpumask
> >> 0
> > 
> > Is this specific to the papr thing, or should this be in generic nvdimm
> > code?
> 
> This code is not specific to papr device and we can move it to
> generic nvdimm interface. But do we need to add some checks on whether
> any arch/platform specific driver wants that support, or can we assume
> that this will be something needed by all platforms?

I'm a complete NVDIMM n00b, but to me it would appear they would have to
conform to the normal memory hierarchy and would thus always be
per-node.

Also, if/when deviation from this rule is observed, we can always
rework/extend this. For now I think it would make sense to have the
per-node ness of the thing expressed in the generic layer.


Re: [RFC v2 4/4] powerpc/papr_scm: Add cpu hotplug support for nvdimm pmu device

2021-05-25 Thread Peter Zijlstra
On Tue, May 25, 2021 at 06:52:16PM +0530, Kajol Jain wrote:
> Patch here adds cpu hotplug functions to nvdimm pmu.

I'm thinking "Patch here" qualifies for "This patch", see
Documentation/process/submitting-patches.rst .

> It adds cpumask to designate a cpu to make HCALL to
> collect the counter data for the nvdimm device and
> update ABI documentation accordingly.
> 
> Result in power9 lpar system:
> command:# cat /sys/devices/nmem0/cpumask
> 0

Is this specific to the papr thing, or should this be in generic nvdimm
code?


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-21 Thread Peter Zijlstra
On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra  [2021-05-20 20:56:31]:
> 
> > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > Currently scheduler populates the distance map by looking at distance
> > > of each node from all other nodes. This should work for most
> > > architectures and platforms.
> > > 
> > > However there are some architectures like POWER that may not expose
> > > the distance of nodes that are not yet onlined because those resources
> > > are not yet allocated to the OS instance. Such architectures have
> > > other means to provide valid distance data for the current platform.
> > > 
> > > For example distance info from numactl from a fully populated 8 node
> > > system at boot may look like this.
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5   6   7
> > >   0:  10  20  40  40  40  40  40  40
> > >   1:  20  10  40  40  40  40  40  40
> > >   2:  40  40  10  20  40  40  40  40
> > >   3:  40  40  20  10  40  40  40  40
> > >   4:  40  40  40  40  10  20  40  40
> > >   5:  40  40  40  40  20  10  40  40
> > >   6:  40  40  40  40  40  40  10  20
> > >   7:  40  40  40  40  40  40  20  10
> > > 
> > > However the same system when only two nodes are online at boot, then the
> > > numa topology will look like
> > > node distances:
> > > node   0   1
> > >   0:  10  20
> > >   1:  20  10
> > > 
> > > It may be implementation dependent what node_distance(0,3) returns when
> > > node 0 is online and node 3 is offline. In the POWER case, it returns
> > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > distance between nodes is 20. However that would not be true.
> > > 
> > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > the max node distance would be 40.
> > > 
> > > To handle such scenarios, let scheduler allow architectures to populate
> > > the distance map. Architectures that like to populate the distance map
> > > can overload arch_populate_distance_map().
> > 
> > Why? Why can't your node_distance() DTRT? The arch interface is
> > nr_node_ids and node_distance(), I don't see why we need something new
> > and then replace one special use of it.
> > 
> > By virtue of you being able to actually implement this new hook, you
> > supposedly can actually do node_distance() right too.
> 
> Since for an offline node, arch interface code doesn't have the info.
> As far as I know/understand, in POWER, unless there is an active memory or
> CPU that's getting onlined, arch can't fetch the correct node distance.
> 
> Taking the above example: node 3 is offline, then node_distance of (3,X)
> where X is anything other than 3, is not reliable. The moment node 3 is
> onlined, the node distance is reliable.
> 
> This problem will not happen even on POWER if all the nodes have either
> memory or CPUs active at the time of boot.

But then how can you implement this new hook? Going by the fact that
both nr_node_ids and distance_ref_points_depth are fixed, how many
possible __node_distance() configurations are there left?

The example provided above does not suggest there's much room for
alternatives, and hence for actual need of this new interface.



Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-20 Thread Peter Zijlstra
On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> Currently scheduler populates the distance map by looking at distance
> of each node from all other nodes. This should work for most
> architectures and platforms.
> 
> However there are some architectures like POWER that may not expose
> the distance of nodes that are not yet onlined because those resources
> are not yet allocated to the OS instance. Such architectures have
> other means to provide valid distance data for the current platform.
> 
> For example distance info from numactl from a fully populated 8 node
> system at boot may look like this.
> 
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  20  40  40  40  40  40  40
>   1:  20  10  40  40  40  40  40  40
>   2:  40  40  10  20  40  40  40  40
>   3:  40  40  20  10  40  40  40  40
>   4:  40  40  40  40  10  20  40  40
>   5:  40  40  40  40  20  10  40  40
>   6:  40  40  40  40  40  40  10  20
>   7:  40  40  40  40  40  40  20  10
> 
> However the same system when only two nodes are online at boot, then the
> numa topology will look like
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> It may be implementation dependent what node_distance(0,3) returns when
> node 0 is online and node 3 is offline. In the POWER case, it returns
> LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> distance between nodes is 20. However that would not be true.
> 
> When Nodes are onlined and CPUs from those nodes are hotplugged,
> the max node distance would be 40.
> 
> To handle such scenarios, let scheduler allow architectures to populate
> the distance map. Architectures that like to populate the distance map
> can overload arch_populate_distance_map().

Why? Why can't your node_distance() DTRT? The arch interface is
nr_node_ids and node_distance(), I don't see why we need something new
and then replace one special use of it.

By virtue of you being able to actually implement this new hook, you
supposedly can actually do node_distance() right too.


Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-19 Thread Peter Zijlstra
On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:

> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > level, rather than some arch special. Wouldn't something like this be
> > > more appropriate?

> > Thanks Peter for the quick review! This makes sense to me. The only
> > reason we proposed arch_asym_check_smt_siblings() is because we were
> > about breaking powerpc (I need to study how they set priorities for SMT,
> > if applicable). If you think this is not an issue I can post a
> > v4 with this update.
> 
> As far as I can see, priorities in powerpc are set by the CPU number.
> However, I am not sure how CPUs are enumerated. If CPUs in brackets are
> SMT siblings, does an enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> core would need to be busy before a new core is used.
> 
> Still, I think the issue described in the cover letter may be
> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> able to help since CPU0 has the highest priority.
> 
> I am cc'ing the linuxppc list to get some feedback.

IIRC the concern with Power is that their Cores can go faster if the
higher SMT siblings are unused.

That is, suppose you have an SMT4 Core with only a single active task,
then if only SMT0 is used it can reach max performance, but if the
active sibling is SMT1 it can not reach max performance, and if the only
active sibling is SMT2 it goes slower still.

So they need to pack the tasks to the lowest SMT siblings, and have the
highest SMT siblings idle (where possible) in order to increase
performance.
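
For reference, if an architecture does not override it, the priority that
ASYM_PACKING compares is just the negated CPU number, roughly the weak
default in kernel/sched/fair.c, so lower-numbered siblings win:

/* weak default; x86 overrides this for ITMT */
int __weak arch_asym_cpu_priority(int cpu)
{
	return -cpu;
}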




Re: [RFC 1/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-05-14 Thread Peter Zijlstra
On Thu, May 13, 2021 at 05:56:14PM +0530, kajoljain wrote:

> But yes the current read/add/del functions are not adding value. We
> could  add an arch/platform specific function which could handle the
> capturing of the counter data and do the rest of the operation here,
> is this approach better?

Right; have your register_nvdimm_pmu() set pmu->{add,del,read} to
nd_pmu->{add,del,read} directly, don't bother with these intermediates.
Also you can WARN_ON_ONCE() if any of them are NULL and fail
registration at that point.
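
That is, roughly the below (a sketch only; it assumes the driver callbacks
already use the normal struct pmu signatures and that struct nvdimm_pmu
embeds the struct pmu being registered):

int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu)
{
	/* no intermediate wrappers; refuse a driver that cannot count */
	if (WARN_ON_ONCE(!nd_pmu->pmu.event_init || !nd_pmu->pmu.add ||
			 !nd_pmu->pmu.del || !nd_pmu->pmu.read))
		return -EINVAL;

	return perf_pmu_register(&nd_pmu->pmu, nd_pmu->pmu.name, -1);
}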


Re: [RFC 1/4] drivers/nvdimm: Add perf interface to expose nvdimm performance stats

2021-05-12 Thread Peter Zijlstra
On Wed, May 12, 2021 at 10:08:21PM +0530, Kajol Jain wrote:
> +static void nvdimm_pmu_read(struct perf_event *event)
> +{
> + struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
> +
> + /* jump to arch/platform specific callbacks if any */
> + if (nd_pmu && nd_pmu->read)
> + nd_pmu->read(event, nd_pmu->dev);
> +}
> +
> +static void nvdimm_pmu_del(struct perf_event *event, int flags)
> +{
> + struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
> +
> + /* jump to arch/platform specific callbacks if any */
> + if (nd_pmu && nd_pmu->del)
> + nd_pmu->del(event, flags, nd_pmu->dev);
> +}
> +
> +static int nvdimm_pmu_add(struct perf_event *event, int flags)
> +{
> + struct nvdimm_pmu *nd_pmu = to_nvdimm_pmu(event->pmu);
> +
> + if (flags & PERF_EF_START)
> + /* jump to arch/platform specific callbacks if any */
> + if (nd_pmu && nd_pmu->add)
> + return nd_pmu->add(event, flags, nd_pmu->dev);
> + return 0;
> +}

What's the value add here? Why can't you directly set driver pointers? I
also don't really believe ->{add,del,read} can be optional and still
have a sane driver.


Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock

2021-05-07 Thread Peter Zijlstra
On Fri, May 07, 2021 at 03:03:51PM -0500, Christopher M. Riedl wrote:
> On Thu May 6, 2021 at 5:51 AM CDT, Peter Zijlstra wrote:
> > On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> > > Powerpc allows for multiple CPUs to patch concurrently. When patching
> > > with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> > > CPUs for the few times that patching occurs. Use a spinlock to protect
> > > the patching_mm from concurrent use.
> > > 
> > > Modify patch_instruction() to acquire the lock, perform the patch op,
> > > and then release the lock.
> > > 
> > > Also introduce {lock,unlock}_patching() along with
> > > patch_instruction_unlocked() to avoid per-iteration lock overhead when
> > > patch_instruction() is called in a loop. A follow-up patch converts some
> > > uses of patch_instruction() to use patch_instruction_unlocked() instead.
> >
> > x86 uses text_mutex for all this, why not do the same?
> 
> I wasn't entirely sure if there is a problem with potentially going to
> sleep in some of the places where patch_instruction() is called - the
> spinlock avoids that (hypothetical) problem.

So I'm not saying you like have to do this; but I did wonder if there's
a reason not to, and given you didn't mention it, I had to ask.

> I just tried switching to text_mutex and at least on a P9 machine the
> series boots w/ the Hash and Radix MMUs (with some lockdep errors). I
> can rework this in the next version to use text_mutex if I don't find
> any new problems with more extensive testing. It does mean more changes
to use patch_instruction_unlocked() in kprobe/optprobe/ftrace in
> arch/powerpc since iirc those are called with text_mutex already held.

The x86 text_poke() has a lockdep_assert_held(&text_mutex) in it, to make
sure nobody 'forgets' :-)
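
i.e. something like the below on the powerpc side (sketch only;
patch_instruction_unlocked() is the helper proposed in this series and the
argument types are purely illustrative):

static int do_patch_instruction(u32 *addr, u32 instr)
{
	/* mirror x86: all patching must be serialized by text_mutex */
	lockdep_assert_held(&text_mutex);

	return patch_instruction_unlocked(addr, instr);
}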


Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock

2021-05-06 Thread Peter Zijlstra
On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> Powerpc allows for multiple CPUs to patch concurrently. When patching
> with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> CPUs for the few times that patching occurs. Use a spinlock to protect
> the patching_mm from concurrent use.
> 
> Modify patch_instruction() to acquire the lock, perform the patch op,
> and then release the lock.
> 
> Also introduce {lock,unlock}_patching() along with
> patch_instruction_unlocked() to avoid per-iteration lock overhead when
> patch_instruction() is called in a loop. A follow-up patch converts some
> uses of patch_instruction() to use patch_instruction_unlocked() instead.

x86 uses text_mutex for all this, why not do the same?


Re: [OpenRISC] [PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-04-07 Thread Peter Zijlstra
On Wed, Apr 07, 2021 at 08:52:08AM +0900, Stafford Horne wrote:
> Why doesn't RISC-V add the xchg16 emulation code similar to OpenRISC?  For
> OpenRISC we added xchg16 and xchg8 emulation code to enable qspinlocks.  So
> one thought is with CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y, can we remove
> our xchg16/xchg8 emulation code?

CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32 is guaranteed crap.

All the architectures that have wanted it are RISC style LL/SC archs,
and for them a cmpxchg loop is a daft thing to do, since it reduces the
chance of it behaving sanely.

Why would we provide something that's known to be suboptimal? If an
architecture chooses to not care about determinism and/or forward progress,
then that's their choice. But not one, I feel, we should encourage.



Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT

2021-03-25 Thread Peter Zijlstra
On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan escreveu:
> > 
> > On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > > Performance Monitoring Unit (PMU) registers in powerpc provides
> > > information on cycles elapsed between different stages in the
> > > pipeline. This can be used for application tuning. On ISA v3.1
> > > platform, this information is exposed by sampling registers.
> > > Patch adds kernel support to capture two of the cycle counters
> > > as part of perf sample using the sample type:
> > > PERF_SAMPLE_WEIGHT_STRUCT.
> > > 
> > > The power PMU function 'get_mem_weight' currently uses 64 bit weight
> > > field of perf_sample_data to capture memory latency. But following the
> > > introduction of PERF_SAMPLE_WEIGHT_TYPE, weight field could contain
> > > 64-bit or 32-bit value depending on the architecture support for
> > > PERF_SAMPLE_WEIGHT_STRUCT. Patch uses WEIGHT_STRUCT to expose the
> > > pipeline stage cycles info. Hence update the ppmu functions to work for
> > > 64-bit and 32-bit weight values.
> > > 
> > > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > > if the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > > latency is stored in the low 32bits of perf_sample_weight structure.
> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter information in
> > > two 16 bit fields of perf_sample_weight structure.
> > 
> > Changes looks fine to me.
> > 
> > Reviewed-by: Madhavan Srinivasan 
> 
> So who will process the kernel bits? I'm merging the tooling parts,

I was sorta expecting these to go through the powerpc tree. Let me know
if you want them in tip/perf/core instead.


Re: [PATCH V2] powerpc/perf: Fix handling of privilege level checks in perf interrupt context

2021-02-26 Thread Peter Zijlstra
On Thu, Feb 25, 2021 at 05:10:39AM -0500, Athira Rajeev wrote:
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 4b4319d8..c8be44c 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -222,7 +222,7 @@ static inline void perf_get_data_addr(struct perf_event *event, struct pt_regs *
>   if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
>   *addrp = mfspr(SPRN_SDAR);
>  
> - if (is_kernel_addr(mfspr(SPRN_SDAR)) && perf_allow_kernel(&event->attr) != 0)
> + if (is_kernel_addr(mfspr(SPRN_SDAR)) && event->attr.exclude_kernel)
>   *addrp = 0;
>  }
>  
> @@ -507,7 +507,7 @@ static void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *
>    * addresses, hence include a check before filtering code
>    */
>   if (!(ppmu->flags & PPMU_ARCH_31) &&
> -     is_kernel_addr(addr) && perf_allow_kernel(&event->attr) != 0)
> +     is_kernel_addr(addr) && event->attr.exclude_kernel)
>       continue;
>  
>   /* Branches are read most recent first (ie. mfbhrb 0 is

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH] powerpc/perf: Fix handling of privilege level checks in perf interrupt context

2021-02-23 Thread Peter Zijlstra
On Tue, Feb 23, 2021 at 01:31:49AM -0500, Athira Rajeev wrote:
> Running "perf mem record" in powerpc platforms with selinux enabled
> resulted in soft lockup's. Below call-trace was seen in the logs:
> 
> CPU: 58 PID: 3751 Comm: sssd_nss Not tainted 5.11.0-rc7+ #2
> NIP:  c0dff3d4 LR: c0dff3d0 CTR: 
> REGS: c07fffab7d60 TRAP: 0100   Not tainted  (5.11.0-rc7+)
> <<>>
> NIP [c0dff3d4] _raw_spin_lock_irqsave+0x94/0x120
> LR [c0dff3d0] _raw_spin_lock_irqsave+0x90/0x120
> Call Trace:
> [cfd471a0] [cfd47260] 0xcfd47260 (unreliable)
> [cfd471e0] [c0b5fbbc] skb_queue_tail+0x3c/0x90
> [cfd47220] [c0296edc] audit_log_end+0x6c/0x180
> [cfd47260] [c06a3f20] common_lsm_audit+0xb0/0xe0
> [cfd472a0] [c066c664] slow_avc_audit+0xa4/0x110
> [cfd47320] [c066cff4] avc_has_perm+0x1c4/0x260
> [cfd47430] [c066e064] selinux_perf_event_open+0x74/0xd0
> [cfd47450] [c0669888] security_perf_event_open+0x68/0xc0
> [cfd47490] [c013d788] record_and_restart+0x6e8/0x7f0
> [cfd476c0] [c013dabc] perf_event_interrupt+0x22c/0x560
> [cfd477d0] [c002d0fc] performance_monitor_exception+0x4c/0x60
> [cfd477f0] [c000b378] performance_monitor_common_virt+0x1c8/0x1d0
> interrupt: f00 at _raw_spin_lock_irqsave+0x38/0x120
> NIP:  c0dff378 LR: c0b5fbbc CTR: c07d47f0
> REGS: cfd47860 TRAP: 0f00   Not tainted  (5.11.0-rc7+)
> <<>>
> NIP [c0dff378] _raw_spin_lock_irqsave+0x38/0x120
> LR [c0b5fbbc] skb_queue_tail+0x3c/0x90
> interrupt: f00
> [cfd47b00] [0038] 0x38 (unreliable)
> [cfd47b40] [caae6200] 0xcaae6200
> [cfd47b80] [c0296edc] audit_log_end+0x6c/0x180
> [cfd47bc0] [c029f494] audit_log_exit+0x344/0xf80
> [cfd47d10] [c02a2b00] __audit_syscall_exit+0x2c0/0x320
> [cfd47d60] [c0032878] do_syscall_trace_leave+0x148/0x200
> [cfd47da0] [c003d5b4] syscall_exit_prepare+0x324/0x390
> [cfd47e10] [c000d76c] system_call_common+0xfc/0x27c
> 
> The above trace shows that while the CPU was handling a performance
> monitor exception, there was a call to "security_perf_event_open"
> function. In powerpc core-book3s, this function is called from
> 'perf_allow_kernel' check during recording of data address in the sample
> via perf_get_data_addr().
> 
> Commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks")
> introduced security enhancements to perf. As part of this commit, the new
> security hook for perf_event_open was added in all places where perf
> paranoid check was previously used. The powerpc core-book3s code originally
> had paranoid checks in 'perf_get_data_addr' and 'power_pmu_bhrb_read'. So
> 'perf_paranoid_kernel' checks were replaced with 'perf_allow_kernel' in
> these pmu helper functions as well.
> 
> The intention of paranoid checks in core-book3s is to verify privilege
> access before capturing some of the sample data. Along with paranoid
> checks, 'perf_allow_kernel' also does a 'security_perf_event_open'. Since
> these functions are accessed while recording a sample, we end up calling
> selinux_perf_event_open in PMI context. Some of the security functions
> use spinlocks, like sidtab_sid2str_put(). If a perf interrupt hits under
> a spin lock and we end up calling selinux hook functions in the PMI
> handler, this could cause a deadlock.
> 
> Since the purpose of this security hook is to control access to
> perf_event_open, it is not right to call this in interrupt context.
> But in case of powerpc PMU, we need the privilege checks for specific
> samples from branch history ring buffer and sampling register values.

I'm confused... why would you need those checks at event time? Either
the event has perf_event_attr::exclude_kernel and it then isn't allowed
to expose kernel addresses, or it doesn't and it is.

There should never be an event-time question of permission like this. If
you allow creation of an event, you're allowing the data it generates.


Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-02 Thread Peter Zijlstra
On Tue, Feb 02, 2021 at 09:54:36AM +, Nadav Amit wrote:
> > On Feb 2, 2021, at 1:31 AM, Peter Zijlstra  wrote:
> > 
> > On Tue, Feb 02, 2021 at 07:20:55AM +, Nadav Amit wrote:
> >> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> >> each VMA. I suspect it is not intentional.
> > 
> > ARM is one of those that look at the VM_EXEC bit to explicitly flush
> > ITLB IIRC, so it has to.
> 
> Hmm… I don’t think Arm is doing that. At least arm64 does not use the
> default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
> this path):
> 

ARM != ARM64. ARM certainly does, but you're right, I don't think ARM64
does this.




Re: [RFC PATCH 3/6] mm/mremap: Use pmd/pud_poplulate to update page table entries

2021-02-02 Thread Peter Zijlstra
On Tue, Feb 02, 2021 at 02:41:13PM +0530, Aneesh Kumar K.V wrote:
> pmd/pud_populate is the right interface to be used to set the respective
> page table entries. Some architectures do assume that set_pmd/pud_at
> can only be used to set a hugepage PTE. Since we are not setting up a hugepage
> PTE here, use the pmd/pud_populate interface.

Since you did the audit, it might be nice to record which architectures
that are. Also, how much work to fix them?


Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-02 Thread Peter Zijlstra
On Tue, Feb 02, 2021 at 07:20:55AM +, Nadav Amit wrote:
> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
> each VMA. I suspect it is not intentional.

ARM is one of those that look at the VM_EXEC bit to explicitly flush
ITLB IIRC, so it has to.


Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-02-01 Thread Peter Zijlstra
On Sun, Jan 31, 2021 at 07:57:01AM +, Nadav Amit wrote:
> > On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:

> > I'll go through the patches a bit more closely when they all come 
> > through. Sparc and powerpc of course need the arch lazy mode to get 
> > per-page/pte information for operations that are not freeing pages, 
> > which is what mmu gather is designed for.
> 
> IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
> where no previous PTE was set, right?

These are the HASH architectures. Their hardware doesn't walk the
page-tables, but it consults a hash-table to resolve page translations.

They _MUST_ flush the entries under the PTL to avoid ever seeing
conflicting information, which will make them really unhappy. They can
do this because they have TLBI broadcast.

There's a few more details I'm sure, but those seem to have slipped from
my mind.


Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-01 Thread Peter Zijlstra
On Sat, Jan 30, 2021 at 04:11:23PM -0800, Nadav Amit wrote:

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 427bfcc6cdec..b97136b7010b 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  
>  #ifdef CONFIG_MMU_GATHER_NO_RANGE
>  
> -#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
> +#if defined(tlb_flush)
> +#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
>  #endif
>  
>  /*
> @@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
>  
>  #ifndef tlb_flush
>  
> -#if defined(tlb_start_vma) || defined(tlb_end_vma)
> -#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
> -#endif

#ifdef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
#error 
#endif

goes here...


>  static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
>  {
>   if (tlb->fullmm)
>   return;
>  
> + if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
> + return;

Also, can you please stick to the CONFIG_MMU_GATHER_* namespace?

I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
How about:

CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH

?




Re: [PATCH 0/3] sched: Task priority related cleanups

2021-01-29 Thread Peter Zijlstra
On Thu, Jan 28, 2021 at 02:10:37PM +0100, Dietmar Eggemann wrote:

> Dietmar Eggemann (3):
>   sched: Remove MAX_USER_RT_PRIO
>   sched: Remove USER_PRIO, TASK_USER_PRIO and MAX_USER_PRIO
>   sched/core: Update task_prio() function header

Thanks!


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2021-01-05 Thread Peter Zijlstra
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> > Interestingly, the architecture recently added a control bit to remove
> > this synchronisation from exception return, so if we set that then we'd
> > have a problem with SYNC_CORE and adding an ISB would be necessary (and
> > we could probable then make kernel->kernel returns cheaper, but I
> > suspect we're relying on this implicit synchronisation in other places
> > too).
> > 
> 
> Is ISB just a context synchronization event or does it do more?

IIRC it just the instruction sync (like power ISYNC).

> On x86, it’s very hard to tell that MFENCE does any more than LOCK,
> but it’s much slower.  And we have LFENCE, which, as documented,
> doesn’t appear to have any semantics at all.  (Or at least it didn’t
> before Spectre.)

AFAIU MFENCE is a completion barrier, while LOCK prefix is not. A bit
like ARM's DSB vs DMB.

It is for this reason that mb() is still MFENCE, while our smp_mb() is a
LOCK prefixed NO-OP.
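
(For reference, on x86-64 that ends up roughly as the below, modulo the
exact asm constraints in arch/x86/include/asm/barrier.h:)

#define mb()		asm volatile("mfence" ::: "memory")
#define __smp_mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")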

And yes, LFENCE used to be poorly defined and it was sometimes
understood to be a completion barrier relative to prior LOADs, while it
is now a completion barrier for any prior instruction, and really should
be renamed to IFENCE.




Re: [RFC][PATCH 1/2] libnvdimm: Introduce ND_CMD_GET_STAT to retrieve nvdimm statistics

2020-12-08 Thread Peter Zijlstra
On Mon, Dec 07, 2020 at 04:54:21PM -0800, Dan Williams wrote:
> [ add perf maintainers ]
> 
> On Sun, Nov 8, 2020 at 1:16 PM Vaibhav Jain  wrote:
> >
> > Implement support for exposing generic nvdimm statistics via newly
> > introduced dimm-command ND_CMD_GET_STAT that can be handled by nvdimm
> > command handler function and provide values for these statistics back
> > to libnvdimm. Following generic nvdimm statistics are defined as an
> > enumeration in 'uapi/ndctl.h':
> >
> > * "media_reads" : Number of media reads that have occurred since reboot.
> > * "media_writes" : Number of media writes that have occurred since reboot.
> > * "read_requests" : Number of read requests that have occurred since reboot.
> > * "write_requests" : Number of write requests that have occurred since 
> > reboot.
> 
> Perhaps document these as "since device reset"? As I can imagine some
> devices might have a mechanism to reset the count outside of "reboot"
> which is a bit ambiguous.
> 
> > * "total_media_reads" : Total number of media reads that have occurred.
> > * "total_media_writes" : Total number of media writes that have occurred.
> > * "total_read_requests" : Total number of read requests that have occurred.
> > * "total_write_requests" : Total number of write requests that have 
> > occurred.
> >
> > Apart from ND_CMD_GET_STAT ioctl these nvdimm statistics are also
> > exposed via sysfs '/stats' directory for easy user-space
> > access like below:
> >
> > /sys/class/nd/ndctl0/device/nmem0/stats # tail -n +1 *
> > ==> media_reads <==
> > 252197707602
> > ==> media_writes <==
> > 20684685172
> > ==> read_requests <==
> > 658810924962
> > ==> write_requests <==
> > 404464081574
> 
> Hmm, I haven't looked but how hard would it be to plumb these to be
> perf counter-events. So someone could combine these with other perf
> counters?
> 
> > In case a specific nvdimm-statistic is not supported, the nvdimm
> > command handler function can simply return an error (e.g -ENOENT) for
> > request to read that nvdimm-statistic.
> 
> Makes sense, but I expect the perf route also has a way to enumerate
> which statistics / counters are supported. I'm not opposed to also
> having them in sysfs, but I think perf support should be a first class
> citizen.

arch/x86/events/msr.c might be a good starting point for a software pmu
delivering pure counters.
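
The guts of such a driver would be little more than a free-running counter
update in ->read(); a sketch in the style of msr.c's event update, where
nvdimm_read_stat() is a made-up backend that would wrap the stat query:

static void nvdimm_stat_event_update(struct perf_event *event)
{
	u64 prev, now;

	do {
		prev = local64_read(&event->hw.prev_count);
		now  = nvdimm_read_stat(event->attr.config); /* hypothetical backend */
	} while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);

	/* counters are free running, only the delta is accumulated */
	local64_add(now - prev, &event->count);
}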


Re: [PATCH AUTOSEL 5.9 27/39] sched/idle: Fix arch_cpu_idle() vs tracing

2020-12-03 Thread Peter Zijlstra
On Thu, Dec 03, 2020 at 03:54:42PM +0100, Heiko Carstens wrote:
> On Thu, Dec 03, 2020 at 08:28:21AM -0500, Sasha Levin wrote:
> > From: Peter Zijlstra 
> > 
> > [ Upstream commit 58c644ba512cfbc2e39b758dd979edd1d6d00e27 ]
> > 
> > We call arch_cpu_idle() with RCU disabled, but then use
> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
> > 
> > Switch all arch_cpu_idle() implementations to use
> > raw_local_irq_{en,dis}able() and carefully manage the
> > lockdep,rcu,tracing state like we do in entry.
> > 
> > (XXX: we really should change arch_cpu_idle() to not return with
> > interrupts enabled)
> > 
> > Reported-by: Sven Schnelle 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Reviewed-by: Mark Rutland 
> > Tested-by: Mark Rutland 
> > Link: https://lkml.kernel.org/r/20201120114925.594122...@infradead.org
> > Signed-off-by: Sasha Levin 
> 
> This patch broke s390 irq state tracing. A patch to fix this is
> scheduled to be merged upstream today (hopefully).
> Therefore I think this patch should not yet go into 5.9 stable.

Agreed.


Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Peter Zijlstra
On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:

> power: same as ARM, except that the loop may be rather larger since
> the systems are bigger.  But I imagine it's still faster than Nick's
> approach -- a cmpxchg to a remote cacheline should still be faster than
> an IPI shootdown. 

While a single atomic might be cheaper than an IPI, the comparison
doesn't work out nicely. You do the xchg() on every unlazy, while the
IPI would be once per process exit.

So over the life of the process, it might do very many unlazies, adding
up to a total cost far in excess of what the single IPI would've been.


And while I appreciate all the work to get rid of the active_mm
accounting; the worry I have with pushing this all into arch code is
that it will be so very easy to get this subtly wrong.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Peter Zijlstra
On Wed, Dec 02, 2020 at 06:38:12AM -0800, Andy Lutomirski wrote:
> 
> > On Dec 2, 2020, at 6:20 AM, Peter Zijlstra  wrote:
> > 
> > On Sun, Nov 29, 2020 at 02:01:39AM +1000, Nicholas Piggin wrote:
> >> + * - A delayed freeing and RCU-like quiescing sequence based on
> >> + *   mm switching to avoid IPIs completely.
> > 
> > That one's interesting too. so basically you want to count switch_mm()
> > invocations on each CPU. Then, periodically snapshot the counter on each
> > CPU, and when they've all changed, increment a global counter.
> > 
> > Then, you snapshot the global counter and wait for it to increment
> > (twice I think, the first increment might already be in progress).
> > 
> > The only question here is what should drive this machinery.. the tick
> > probably.
> > 
> > This shouldn't be too hard to do I think.
> > 
> > Something a little like so perhaps?
> 
> I don’t think this will work.  A CPU can go idle with lazy mm and nohz
> forever.  This could lead to unbounded memory use on a lightly loaded
> system.

Hurm.. quite so indeed. Fixing that seems to end up with requiring that
other proposal, such that we can tell which CPU has what active_mm
stuck.

Also, more complicated... :/


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Peter Zijlstra
On Sun, Nov 29, 2020 at 02:01:39AM +1000, Nicholas Piggin wrote:
> +  * - A delayed freeing and RCU-like quiescing sequence based on
> +  *   mm switching to avoid IPIs completely.

That one's interesting too. so basically you want to count switch_mm()
invocations on each CPU. Then, periodically snapshot the counter on each
CPU, and when they've all changed, increment a global counter.

Then, you snapshot the global counter and wait for it to increment
(twice I think, the first increment might already be in progress).

The only question here is what should drive this machinery.. the tick
probably.

This shouldn't be too hard to do I think.

Something a little like so perhaps?


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41404afb7f4c..27b64a60a468 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4525,6 +4525,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 * finish_task_switch()'s mmdrop().
 */
switch_mm_irqs_off(prev->active_mm, next->mm, next);
+   rq->nr_mm_switches++;
 
if (!prev->mm) {// from kernel
/* will mmdrop() in finish_task_switch(). */
@@ -4739,6 +4740,80 @@ unsigned long long task_sched_runtime(struct task_struct *p)
return ns;
 }
 
+static DEFINE_PER_CPU(unsigned long[2], mm_switches);
+
+static struct {
+   unsigned long __percpu *switches[2];
+   unsigned long generation;
+   atomic_t complete;
+   struct wait_queue_head wait;
+} mm_foo = {
+   .switches = &mm_switches,
+   .generation = 0,
+   .complete = -1, // XXX bootstrap, hotplug
+   .wait = __WAIT_QUEUE_HEAD_INITIALIZER(mm_foo.wait),
+};
+
+static bool mm_gen_tick(int cpu, struct rq *rq)
+{
+   unsigned long prev, curr, switches = rq->nr_mm_switches;
+   int idx = READ_ONCE(mm_foo.generation) & 1;
+
+   /* DATA-DEP on mm_foo.generation */
+
+   prev = __this_cpu_read(mm_foo.switches[idx^1]);
+   curr = __this_cpu_read(mm_foo.switches[idx]);
+
+   /* we haven't switched since the last generation */
+   if (prev == switches)
+   return false;
+
+   __this_cpu_write(mm_foo.switches[idx], switches);
+
+   /*
+* If @curr is less than @prev, this is the first update of
+* this generation, per the above, switches has also increased since,
+* so mark our CPU complete.
+*/
+   if ((long)(curr - prev) < 0 && atomic_dec_and_test(&mm_foo.complete)) {
+   /*
+* All CPUs are complete, IOW they all switched at least once
+* since the last generation. Reset the completion counter and
+* increment the generation.
+*/
+   atomic_set(&mm_foo.complete, num_online_cpus());
+   /*
+* Matches the address dependency above:
+*
+*   idx = gen & 1  complete = nr_cpus
+*
+*   curr = sw[idx] generation++;
+*   prev = sw[idx^1]
+*   if (curr < prev)
+* complete--
+*
+* If we don't observe the new generation; we'll not decrement. If we
+* do see the new generation, we must also see the new completion count.
+*/
+   smp_wmb();
+   mm_foo.generation++;
+   return true;
+   }
+
+   return false;
+}
+
+static void mm_gen_wake(void)
+{
+   wake_up_all(&mm_foo.wait);
+}
+
+static void mm_gen_wait(void)
+{
+   unsigned int gen = READ_ONCE(mm_foo.generation);
+   wait_event(mm_foo.wait, READ_ONCE(mm_foo.generation) - gen > 1);
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -4750,6 +4825,7 @@ void scheduler_tick(void)
struct task_struct *curr = rq->curr;
struct rq_flags rf;
unsigned long thermal_pressure;
+   bool wake_mm_gen;
 
arch_scale_freq_tick();
sched_clock_tick();
@@ -4763,8 +4839,13 @@ void scheduler_tick(void)
calc_global_load_tick(rq);
psi_task_tick(rq);
 
+   wake_mm_gen = mm_gen_tick(cpu, rq);
+
rq_unlock(rq, &rf);
 
+   if (wake_mm_gen)
+   mm_gen_wake();
+
perf_event_task_tick();
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf9d8da7d35e..62fb685db8d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -927,6 +927,7 @@ struct rq {
unsigned intttwu_pending;
 #endif
u64 nr_switches;
+   u64 nr_mm_switches;
 
 #ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values based on CPU's RUNNABLE tasks */


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Peter Zijlstra
On Wed, Dec 02, 2020 at 12:17:31PM +0100, Peter Zijlstra wrote:

> So the obvious 'improvement' here would be something like:
> 
>   for_each_online_cpu(cpu) {
>   p = rcu_dereference(cpu_rq(cpu)->curr);
>   if (p->active_mm != mm)
>   continue;
>   __cpumask_set_cpu(cpu, tmpmask);
>   }
>   on_each_cpu_mask(tmpmask, ...);
> 
> The remote CPU will never switch _to_ @mm, on account of it being quite
> dead, but it is quite prone to false negatives.
> 
> Consider that __schedule() sets rq->curr *before* context_switch(), this
> means we'll see next->active_mm, even though prev->active_mm might still
> be our @mm.
> 
> Now, because we'll be removing the atomic ops from context_switch()'s
> active_mm swizzling, I think we can change this to something like the
> below. The hope being that the cost of the new barrier can be offset by
> the loss of the atomics.
> 
> Hmm ?
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 41404afb7f4c..2597c5c0ccb0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4509,7 +4509,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
>   if (!next->mm) {// to kernel
>   enter_lazy_tlb(prev->active_mm, next);
>  
> - next->active_mm = prev->active_mm;
>   if (prev->mm)   // from user
>   mmgrab(prev->active_mm);
>   else
> @@ -4524,6 +4523,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>* case 'prev->active_mm == next->mm' through
>* finish_task_switch()'s mmdrop().
>*/
> + next->active_mm = next->mm;
>   switch_mm_irqs_off(prev->active_mm, next->mm, next);

I think that next->active_mm store should be after switch_mm(),
otherwise we still race.

>  
>   if (!prev->mm) {// from kernel
> @@ -5713,11 +5713,9 @@ static void __sched notrace __schedule(bool preempt)
>  
>   if (likely(prev != next)) {
>   rq->nr_switches++;
> - /*
> -  * RCU users of rcu_dereference(rq->curr) may not see
> -  * changes to task_struct made by pick_next_task().
> -  */
> - RCU_INIT_POINTER(rq->curr, next);
> +
> + next->active_mm = prev->active_mm;
> + rcu_assign_pointer(rq->curr, next);
>   /*
>* The membarrier system call requires each architecture
>* to have a full memory barrier after updating


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Peter Zijlstra
On Sun, Nov 29, 2020 at 02:01:39AM +1000, Nicholas Piggin wrote:
> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> + /*
> +  * IPI overheads have not found to be expensive, but they could
> +  * be reduced in a number of possible ways, for example (in
> +  * roughly increasing order of complexity):
> +  * - A batch of mms requiring IPIs could be gathered and freed
> +  *   at once.
> +  * - CPUs could store their active mm somewhere that can be
> +  *   remotely checked without a lock, to filter out
> +  *   false-positives in the cpumask.
> +  * - After mm_users or mm_count reaches zero, switching away
> +  *   from the mm could clear mm_cpumask to reduce some IPIs
> +  *   (some batching or delaying would help).
> +  * - A delayed freeing and RCU-like quiescing sequence based on
> +  *   mm switching to avoid IPIs completely.
> +  */
> + on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> + if (IS_ENABLED(CONFIG_DEBUG_VM))
> + on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);

So the obvious 'improvement' here would be something like:

for_each_online_cpu(cpu) {
p = rcu_dereference(cpu_rq(cpu)->curr);
if (p->active_mm != mm)
continue;
__cpumask_set_cpu(cpu, tmpmask);
}
on_each_cpu_mask(tmpmask, ...);

The remote CPU will never switch _to_ @mm, on account of it being quite
dead, but it is quite prone to false negatives.

Consider that __schedule() sets rq->curr *before* context_switch(), this
means we'll see next->active_mm, even though prev->active_mm might still
be our @mm.

Now, because we'll be removing the atomic ops from context_switch()'s
active_mm swizzling, I think we can change this to something like the
below. The hope being that the cost of the new barrier can be offset by
the loss of the atomics.

Hmm ?

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41404afb7f4c..2597c5c0ccb0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4509,7 +4509,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
if (!next->mm) {// to kernel
enter_lazy_tlb(prev->active_mm, next);
 
-   next->active_mm = prev->active_mm;
if (prev->mm)   // from user
mmgrab(prev->active_mm);
else
@@ -4524,6 +4523,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 * case 'prev->active_mm == next->mm' through
 * finish_task_switch()'s mmdrop().
 */
+   next->active_mm = next->mm;
switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
if (!prev->mm) {// from kernel
@@ -5713,11 +5713,9 @@ static void __sched notrace __schedule(bool preempt)
 
if (likely(prev != next)) {
rq->nr_switches++;
-   /*
-* RCU users of rcu_dereference(rq->curr) may not see
-* changes to task_struct made by pick_next_task().
-*/
-   RCU_INIT_POINTER(rq->curr, next);
+
+   next->active_mm = prev->active_mm;
+   rcu_assign_pointer(rq->curr, next);
/*
 * The membarrier system call requires each architecture
 * to have a full memory barrier after updating


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-30 Thread Peter Zijlstra
On Mon, Nov 30, 2020 at 10:30:00AM +0100, Peter Zijlstra wrote:
> On Sat, Nov 28, 2020 at 07:54:57PM -0800, Andy Lutomirski wrote:
> > This means that mm_cpumask operations won't need to be full barriers
> > forever, and we might not want to take the implied full barriers in
> > set_bit() and clear_bit() for granted.
> 
> There is no implied full barrier for those ops.

Neither ARM nor Power implies any ordering on those ops. But Power has
some of the worst atomic performance in the world despite that.

IIRC they don't do their LL/SC in L1.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-30 Thread Peter Zijlstra
On Sat, Nov 28, 2020 at 07:54:57PM -0800, Andy Lutomirski wrote:
> This means that mm_cpumask operations won't need to be full barriers
> forever, and we might not want to take the implied full barriers in
> set_bit() and clear_bit() for granted.

There is no implied full barrier for those ops.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-30 Thread Peter Zijlstra
On Sat, Nov 28, 2020 at 07:54:57PM -0800, Andy Lutomirski wrote:
> Version (b) seems fairly straightforward to implement -- add RCU
> protection and a atomic_t special_ref_cleared (initially 0) to struct
> mm_struct itself.  After anyone clears a bit to mm_cpumask (which is
> already a barrier),

No it isn't. clear_bit() implies no barrier what so ever. That's x86
you're thinking about.

> they read mm_users.  If it's zero, then they scan
> mm_cpumask and see if it's empty.  If it is, they atomically swap
> special_ref_cleared to 1.  If it was zero before the swap, they do
> mmdrop().  I can imagine some tweaks that could make this a big
> faster, at least in the limit of a huge number of CPUs.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-30 Thread Peter Zijlstra
On Sun, Nov 29, 2020 at 12:16:26PM -0800, Andy Lutomirski wrote:
> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> >
> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> > >
> > > On big systems, the mm refcount can become highly contented when doing
> > > a lot of context switching with threaded applications (particularly
> > > switching between the idle thread and an application thread).
> > >
> > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > user->idle->user cases, so so instead implement a non-refcounted scheme
> > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > any remaining lazy ones.
> > >
> > > Shootdown IPIs are some concern, but they have not been observed to be
> > > a big problem with this scheme (the powerpc implementation generated
> > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > There are a number of strategies that could be employed to reduce IPIs
> > > if they turn out to be a problem for some workload.
> >
> > I'm still wondering whether we can do even better.
> >
> 
> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> the TLB.  On x86, this will shoot down all lazies as long as even a
> single pagetable was freed.  (Or at least it will if we don't have a
> serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> sets tlb->freed_tables, which will trigger the IPI.)  So, on
> architectures like x86, the shootdown approach should be free.  The
> only way it ought to have any excess IPIs is if we have CPUs in
> mm_cpumask() that don't need IPI to free pagetables, which could
> happen on paravirt.
> 
> Can you try to figure out why you saw any increase in IPIs?  It would
> be nice if we can make the new code unconditional.

Power doesn't do IPI based TLBI.


Re: [PATCH v2 3/6] perf/core: Fix arch_perf_get_page_size()

2020-11-26 Thread Peter Zijlstra
On Thu, Nov 26, 2020 at 12:56:06PM +, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 01:42:07PM +0100, Peter Zijlstra wrote:
> > +   pgdp = pgd_offset(mm, addr);
> > +   pgd = READ_ONCE(*pgdp);
> 
> I forget how x86-32-PAE maps to Linux's PGD/P4D/PUD/PMD scheme, but
> according to volume 3, section 4.4.2, PAE paging uses a 64-bit PDE, so
> whether a PDE is a PGD or a PMD, we're only reading it with READ_ONCE
> rather than the lockless-retry method used by ptep_get_lockless().
> So it's potentially racy?  Do we need a pmdp_get_lockless() or
> pgdp_get_lockless()?

Oh gawd... this isn't new here though, right? Current gup_fast also gets
that wrong, if it is indeed wrong.

I suppose it's a race far more likely today, with THP and all, than it
ever was back then.


Re: [PATCH v2 1/6] mm/gup: Provide gup_get_pte() more generic

2020-11-26 Thread Peter Zijlstra
On Thu, Nov 26, 2020 at 12:43:00PM +, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 01:01:15PM +0100, Peter Zijlstra wrote:
> > +#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > +/*
> > + * WARNING: only to be used in the get_user_pages_fast() implementation.
> > + * With get_user_pages_fast(), we walk down the pagetables without taking any
> > + * locks.  For this we would like to load the pointers atomically, but sometimes
> > + * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> > + * we do have is the guarantee that a PTE will only either go from not present
> > + * to present, or present to not present or both -- it will not switch to a
> > + * completely different present page without a TLB flush in between; something
> > + * that we are blocking by holding interrupts off.
> 
> I feel like this comment needs some love.  How about:
> 
>  * For walking the pagetables without holding any locks.  Some architectures
>  * (eg x86-32 PAE) cannot load the entries atomically without using
>  * expensive instructions.  We are guaranteed that a PTE will only either go
>  * from not present to present, or present to not present -- it will not
>  * switch to a completely different present page without a TLB flush
>  * inbetween; which we are blocking by holding interrupts off.
> 
> And it would be nice to have an assertion that interrupts are disabled
> in the code.  Because comments are nice, but nobody reads them.

Quite agreed, I'll stick a separate patch on with the updated comment
and a lockdep_assert_irqs_disabled() in. I'm afraid the latter will make
for header soup though :/

We'll see, let the robots have it.
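
i.e. the PAE variant would end up looking something like this (sketch):

static inline pte_t ptep_get_lockless(pte_t *ptep)
{
	pte_t pte;

	/* the IRQs-off guarantee discussed above is what makes this safe */
	lockdep_assert_irqs_disabled();

	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}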


Re: [PATCH v2 4/6] arm64/mm: Implement pXX_leaf_size() support

2020-11-26 Thread Peter Zijlstra


Now with pmd_cont() defined...

---
Subject: arm64/mm: Implement pXX_leaf_size() support
From: Peter Zijlstra 
Date: Fri Nov 13 11:46:06 CET 2020

ARM64 has non-pagetable aligned large page support with PTE_CONT, when
this bit is set the page is part of a super-page. Match the hugetlb
code and support these super pages for PTE and PMD levels.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate
pagetable leaf sizes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm64/include/asm/pgtable.h |4 
 1 file changed, 4 insertions(+)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -407,6 +407,7 @@ static inline int pmd_trans_huge(pmd_t p
 #define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
 #define pmd_valid(pmd) pte_valid(pmd_pte(pmd))
+#define pmd_cont(pmd)  pte_cont(pmd_pte(pmd))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
@@ -503,6 +504,9 @@ extern pgprot_t phys_mem_access_prot(str
 PMD_TYPE_SECT)
 #define pmd_leaf(pmd)  pmd_sect(pmd)
 
+#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
+#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }


Re: [PATCH v2 3/6] perf/core: Fix arch_perf_get_page_size()

2020-11-26 Thread Peter Zijlstra
On Thu, Nov 26, 2020 at 12:34:58PM +, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 01:01:17PM +0100, Peter Zijlstra wrote:
> > The (new) page-table walker in arch_perf_get_page_size() is broken in
> > various ways. Specifically while it is used in a lockless manner, it
> > doesn't depend on CONFIG_HAVE_FAST_GUP nor uses the proper _lockless
> > offset methods, nor is careful to only read each entry only once.
> > 
> > Also the hugetlb support is broken due to calling pte_page() without
> > first checking pte_special().
> > 
> > Rewrite the whole thing to be a proper lockless page-table walker and
> > employ the new pXX_leaf_size() pgtable functions to determine the
> > pagetable size without looking at the page-frames.
> > 
> > Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
> > Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Tested-by: Kan Liang 
> > ---
> >  arch/arm64/include/asm/pgtable.h|3 +
> >  arch/sparc/include/asm/pgtable_64.h |   13 
> >  arch/sparc/mm/hugetlbpage.c |   19 --
> >  include/linux/pgtable.h |   16 +
> >  kernel/events/core.c|  102 +---
> >  5 files changed, 82 insertions(+), 71 deletions(-)
> 
> This diffstat doesn't match the patch in this email ...

Urgh, no idea how I did that... I must've edited the diff and not done a
quilt-refresh. Updated below.

---
Subject: perf/core: Fix arch_perf_get_page_size()
From: Peter Zijlstra 
Date: Wed, 11 Nov 2020 13:43:57 +0100

The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically while it is used in a lockless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP nor uses the proper _lockless
offset methods, nor is careful to only read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the
pagetable size without looking at the page-frames.

Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Kan Liang 
---
 kernel/events/core.c |  105 ++-
 1 file changed, 39 insertions(+), 66 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -7001,90 +7002,62 @@ static u64 perf_virt_to_phys(u64 virt)
return phys_addr;
 }
 
-#ifdef CONFIG_MMU
-
 /*
- * Return the MMU page size of a given virtual address.
- *
- * This generic implementation handles page-table aligned huge pages, as well
- * as non-page-table aligned hugetlbfs compound pages.
- *
- * If an architecture supports and uses non-page-table aligned pages in their
- * kernel mapping it will need to provide it's own implementation of this
- * function.
+ * Return the pagetable size of a given virtual address.
  */
-__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 {
-   struct page *page;
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-
-   pgd = pgd_offset(mm, addr);
-   if (pgd_none(*pgd))
-   return 0;
+   u64 size = 0;
 
-   p4d = p4d_offset(pgd, addr);
-   if (!p4d_present(*p4d))
+#ifdef CONFIG_HAVE_FAST_GUP
+   pgd_t *pgdp, pgd;
+   p4d_t *p4dp, p4d;
+   pud_t *pudp, pud;
+   pmd_t *pmdp, pmd;
+   pte_t *ptep, pte;
+
+   pgdp = pgd_offset(mm, addr);
+   pgd = READ_ONCE(*pgdp);
+   if (pgd_none(pgd))
return 0;
 
-   if (p4d_leaf(*p4d))
-   return 1ULL << P4D_SHIFT;
+   if (pgd_leaf(pgd))
+   return pgd_leaf_size(pgd);
 
-   pud = pud_offset(p4d, addr);
-   if (!pud_present(*pud))
+   p4dp = p4d_offset_lockless(pgdp, pgd, addr);
+   p4d = READ_ONCE(*p4dp);
+   if (!p4d_present(p4d))
return 0;
 
-   if (pud_leaf(*pud)) {
-#ifdef pud_page
-   page = pud_page(*pud);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PUD_SHIFT;
-   }
+   if (p4d_leaf(p4d))
+   return p4d_leaf_size(p4d);
 
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_present(*pmd))
+   pudp = pud_offset_lockless(p4dp, p4d, addr);
+   pud = READ_ONCE(*pudp);
+ 

[PATCH v2 1/6] mm/gup: Provide gup_get_pte() more generic

2020-11-26 Thread Peter Zijlstra
In order to write another lockless page-table walker, we need
gup_get_pte() exposed. While doing that, rename it to
ptep_get_lockless() to match the existing ptep_get() naming.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/pgtable.h |   55 +
 mm/gup.c|   58 
 2 files changed, 56 insertions(+), 57 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,61 @@ static inline pte_t ptep_get(pte_t *ptep
 }
 #endif
 
+#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
+/*
+ * WARNING: only to be used in the get_user_pages_fast() implementation.
+ *
+ * With get_user_pages_fast(), we walk down the pagetables without taking any
+ * locks.  For this we would like to load the pointers atomically, but sometimes
+ * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
+ * we do have is the guarantee that a PTE will only either go from not present
+ * to present, or present to not present or both -- it will not switch to a
+ * completely different present page without a TLB flush in between; something
+ * that we are blocking by holding interrupts off.
+ *
+ * Setting ptes from not present to present goes:
+ *
+ *   ptep->pte_high = h;
+ *   smp_wmb();
+ *   ptep->pte_low = l;
+ *
+ * And present to not present goes:
+ *
+ *   ptep->pte_low = 0;
+ *   smp_wmb();
+ *   ptep->pte_high = 0;
+ *
+ * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
+ * We load pte_high *after* loading pte_low, which ensures we don't see an older
+ * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
+ * picked up a changed pte high. We might have gotten rubbish values from
+ * pte_low and pte_high, but we are guaranteed that pte_low will not have the
+ * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
+ * operates on present ptes we're safe.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+   pte_t pte;
+
+   do {
+   pte.pte_low = ptep->pte_low;
+   smp_rmb();
+   pte.pte_high = ptep->pte_high;
+   smp_rmb();
+   } while (unlikely(pte.pte_low != ptep->pte_low));
+
+   return pte;
+}
+#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+/*
+ * We require that the PTE can be read atomically.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+   return ptep_get(ptep);
+}
+#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,62 +2079,6 @@ static void put_compound_head(struct pag
put_page(page);
 }
 
-#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
-
-/*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
- *
- * Setting ptes from not present to present goes:
- *
- *   ptep->pte_high = h;
- *   smp_wmb();
- *   ptep->pte_low = l;
- *
- * And present to not present goes:
- *
- *   ptep->pte_low = 0;
- *   smp_wmb();
- *   ptep->pte_high = 0;
- *
- * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
- * We load pte_high *after* loading pte_low, which ensures we don't see an older
- * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
- * picked up a changed pte high. We might have gotten rubbish values from
- * pte_low and pte_high, but we are guaranteed that pte_low will not have the
- * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
- * operates on present ptes we're safe.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-   pte_t pte;
-
-   do {
-   pte.pte_low = ptep->pte_low;
-   smp_rmb();
-   pte.pte_high = ptep->pte_high;
-   smp_rmb();
-   } while (unlikely(pte.pte_low != ptep->pte_low));
-
-   return pte;
-}
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-/*
- * We require that the PTE can be read atomically.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-   return ptep_get(ptep);
-}
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
unsigned int flags,
struct page

[PATCH v2 0/6] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-26 Thread Peter Zijlstra
Hi,

These patches provide generic infrastructure to determine TLB page size from
page table entries alone. Perf will use this (for either data or code address)
to aid in profiling TLB issues.
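
As a rough illustration of the consumer side (not part of this series), a
profiler would request the new sample field roughly like below, assuming a
linux/perf_event.h that already carries the PERF_SAMPLE_DATA_PAGE_SIZE bit
added by the earlier perf/core patches:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int open_data_page_size_counter(void)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;
        attr.sample_period = 1000;
        /* also report the page-table mapping size of the sampled address */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                           PERF_SAMPLE_DATA_PAGE_SIZE;

        /* current thread, any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}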

While most architectures only have page table aligned large pages, some
(notably ARM64, Sparc64 and Power) provide non page table aligned large pages
and need to provide their own implementation of these functions.

I've provided (completely untested) implementations for ARM64, Sparc64 and
Power/8xxx (it looks like I'm still missing Power/Book3s64/hash support).

Changes since -v1:

 - Changed wording to reflect these are page-table sizes; actual TLB sizes
   might vary.
 - added Power/8xx

Barring any objections I'll queue these in tip/perf/core, as these patches fix
the code that's currently in there.




[PATCH v2 4/6] arm64/mm: Implement pXX_leaf_size() support

2020-11-26 Thread Peter Zijlstra
ARM64 has non-pagetable aligned large page support with PTE_CONT, when
this bit is set the page is part of a super-page. Match the hugetlb
code and support these super pages for PTE and PMD levels.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate
pagetable leaf sizes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm64/include/asm/pgtable.h |3 +++
 1 file changed, 3 insertions(+)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -503,6 +503,9 @@ extern pgprot_t phys_mem_access_prot(str
 PMD_TYPE_SECT)
 #define pmd_leaf(pmd)  pmd_sect(pmd)
 
+#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
+#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }




[PATCH v2 3/6] perf/core: Fix arch_perf_get_page_size()

2020-11-26 Thread Peter Zijlstra
The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically, while it is used in a lockless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP, nor uses the proper _lockless
offset methods, nor is it careful to read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the
pagetable size without looking at the page-frames.

Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Kan Liang 
---
 arch/arm64/include/asm/pgtable.h|3 +
 arch/sparc/include/asm/pgtable_64.h |   13 
 arch/sparc/mm/hugetlbpage.c |   19 --
 include/linux/pgtable.h |   16 +
 kernel/events/core.c|  102 +---
 5 files changed, 82 insertions(+), 71 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include <linux/pgtable.h>
 
 #include "internal.h"
 
@@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt)
return phys_addr;
 }
 
-#ifdef CONFIG_MMU
-
 /*
- * Return the MMU page size of a given virtual address.
- *
- * This generic implementation handles page-table aligned huge pages, as well
- * as non-page-table aligned hugetlbfs compound pages.
- *
- * If an architecture supports and uses non-page-table aligned pages in their
- * kernel mapping it will need to provide it's own implementation of this
- * function.
+ * Return the pagetable size of a given virtual address.
  */
-__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 {
-   struct page *page;
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
+   u64 size = 0;
 
-   pgd = pgd_offset(mm, addr);
-   if (pgd_none(*pgd))
-   return 0;
+#ifdef CONFIG_HAVE_FAST_GUP
+   pgd_t *pgdp, pgd;
+   p4d_t *p4dp, p4d;
+   pud_t *pudp, pud;
+   pmd_t *pmdp, pmd;
+   pte_t *ptep, pte;
 
-   p4d = p4d_offset(pgd, addr);
-   if (!p4d_present(*p4d))
+   pgdp = pgd_offset(mm, addr);
+   pgd = READ_ONCE(*pgdp);
+   if (pgd_none(pgd))
return 0;
 
-   if (p4d_leaf(*p4d))
-   return 1ULL << P4D_SHIFT;
+   if (pgd_leaf(pgd))
+   return pgd_leaf_size(pgd);
 
-   pud = pud_offset(p4d, addr);
-   if (!pud_present(*pud))
+   p4dp = p4d_offset_lockless(pgdp, pgd, addr);
+   p4d = READ_ONCE(*p4dp);
+   if (!p4d_present(p4d))
return 0;
 
-   if (pud_leaf(*pud)) {
-#ifdef pud_page
-   page = pud_page(*pud);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PUD_SHIFT;
-   }
+   if (p4d_leaf(p4d))
+   return p4d_leaf_size(p4d);
 
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_present(*pmd))
+   pudp = pud_offset_lockless(p4dp, p4d, addr);
+   pud = READ_ONCE(*pudp);
+   if (!pud_present(pud))
return 0;
 
-   if (pmd_leaf(*pmd)) {
-#ifdef pmd_page
-   page = pmd_page(*pmd);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PMD_SHIFT;
-   }
+   if (pud_leaf(pud))
+   return pud_leaf_size(pud);
 
-   pte = pte_offset_map(pmd, addr);
-   if (!pte_present(*pte)) {
-   pte_unmap(pte);
+   pmdp = pmd_offset_lockless(pudp, pud, addr);
+   pmd = READ_ONCE(*pmdp);
+   if (!pmd_present(pmd))
return 0;
-   }
 
-   page = pte_page(*pte);
-   if (PageHuge(page)) {
-   u64 size = page_size(compound_head(page));
-   pte_unmap(pte);
-   return size;
-   }
-
-   pte_unmap(pte);
-   return PAGE_SIZE;
-}
+   if (pmd_leaf(pmd))
+   return pmd_leaf_size(pmd);
 
-#else
+   ptep = pte_offset_map(&pmd, addr);
+   pte = ptep_get_lockless(ptep);
+   if (pte_present(pte))
+   size = pte_leaf_size(pte);
+   pte_unmap(ptep);
+#endif /* CONFIG_HAVE_FAST_GUP */
 
-static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
-{
-   return 0;
+   return size;
 }
 
-#endif
-
 static u64 perf_get_page_size(unsigned long addr)
 {
struct mm_struct *mm;
@@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l
mm = &init_mm;
}
 
-   size = arch_perf_get_page_size(mm, addr);
+   size = perf_get_pgtable_size(mm, addr);
 
local_irq_restore(flags);
 




[PATCH v2 5/6] sparc64/mm: Implement pXX_leaf_size() support

2020-11-26 Thread Peter Zijlstra
Sparc64 has non-pagetable aligned large page support; wire up the
pXX_leaf_size() functions to report the correct pagetable page size.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate
pagetable leaf sizes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/sparc/include/asm/pgtable_64.h |   13 +
 arch/sparc/mm/hugetlbpage.c |   19 +--
 2 files changed, 26 insertions(+), 6 deletions(-)

--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1121,6 +1121,19 @@ extern unsigned long cmdline_memory_size
 
 asmlinkage void do_sparc64_fault(struct pt_regs *regs);
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+#define pud_leaf_size pud_leaf_size
+extern unsigned long pud_leaf_size(pud_t pud);
+
+#define pmd_leaf_size pmd_leaf_size
+extern unsigned long pmd_leaf_size(pmd_t pmd);
+
+#define pte_leaf_size pte_leaf_size
+extern unsigned long pte_leaf_size(pte_t pte);
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #endif /* !(__ASSEMBLY__) */
 
 #endif /* !(_SPARC64_PGTABLE_H) */
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -247,14 +247,17 @@ static unsigned int sun4u_huge_tte_to_sh
return shift;
 }
 
-static unsigned int huge_tte_to_shift(pte_t entry)
+static unsigned long tte_to_shift(pte_t entry)
 {
-   unsigned long shift;
-
if (tlb_type == hypervisor)
-   shift = sun4v_huge_tte_to_shift(entry);
-   else
-   shift = sun4u_huge_tte_to_shift(entry);
+   return sun4v_huge_tte_to_shift(entry);
+
+   return sun4u_huge_tte_to_shift(entry);
+}
+
+static unsigned int huge_tte_to_shift(pte_t entry)
+{
+   unsigned long shift = tte_to_shift(entry);
 
if (shift == PAGE_SHIFT)
WARN_ONCE(1, "tto_to_shift: invalid hugepage tte=0x%lx\n",
@@ -272,6 +275,10 @@ static unsigned long huge_tte_to_size(pt
return size;
 }
 
+unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift((pte_t)pud); }
+unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift((pte_t)pmd); }
+unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift((pte_t)pte); }
+
 pte_t *huge_pte_alloc(struct mm_struct *mm,
unsigned long addr, unsigned long sz)
 {




[PATCH v2 6/6] powerpc/8xx: Implement pXX_leaf_size() support

2020-11-26 Thread Peter Zijlstra
Christophe Leroy wrote:

> I can help with powerpc 8xx. It is a 32 bits powerpc. The PGD has 1024
> entries, that means each entry maps 4M.
>
> Page sizes are 4k, 16k, 512k and 8M.
>
> For the 8M pages we use hugepd with a single entry. The two related PGD
> entries point to the same hugepd.
>
> For the other sizes, they are in standard page tables. 16k pages appear
> 4 times in the page table. 512k entries appear 128 times in the page
> table.
>
> When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> hugepd which holds the single 8M entry.
>
> In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
>
> _PAGE_HUGE means it is a 512k page
> _PAGE_SPS means it is not a 4k page
>
> The kernel can be built either with 4k pages as standard page size, or
> 16k pages. It doesn't change the page table layout though.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   23 +++
 1 file changed, 23 insertions(+)

--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -135,6 +135,29 @@ static inline pte_t pte_mkhuge(pte_t pte
 }
 
 #define pte_mkhuge pte_mkhuge
+
+static inline unsigned long pgd_leaf_size(pgd_t pgd)
+{
+   if (pgd_val(pgd) & _PMD_PAGE_8M)
+   return SZ_8M;
+   return SZ_4M;
+}
+
+#define pgd_leaf_size pgd_leaf_size
+
+static inline unsigned long pte_leaf_size(pte_t pte)
+{
+   pte_basic_t val = pte_val(pte);
+
+   if (val & _PAGE_HUGE)
+   return SZ_512K;
+   if (val & _PAGE_SPS)
+   return SZ_16K;
+   return SZ_4K;
+}
+
+#define pte_leaf_size pte_leaf_size
+
 #endif
 
 #endif /* __KERNEL__ */




[PATCH v2 2/6] mm: Introduce pXX_leaf_size()

2020-11-26 Thread Peter Zijlstra
A number of architectures have non-pagetable aligned huge/large pages.
For such architectures a leaf can actually be part of a larger entry.

Provide generic helpers to determine the size of a page-table leaf.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/pgtable.h |   16 
 1 file changed, 16 insertions(+)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1536,4 +1536,20 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf(x)0
 #endif
 
+#ifndef pgd_leaf_size
+#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
+#endif
+#ifndef p4d_leaf_size
+#define p4d_leaf_size(x) P4D_SIZE
+#endif
+#ifndef pud_leaf_size
+#define pud_leaf_size(x) PUD_SIZE
+#endif
+#ifndef pmd_leaf_size
+#define pmd_leaf_size(x) PMD_SIZE
+#endif
+#ifndef pte_leaf_size
+#define pte_leaf_size(x) PAGE_SIZE
+#endif
+
 #endif /* _LINUX_PGTABLE_H */
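
For an architecture with non-pagetable-aligned large pages the idea is to
define the macro to itself and provide a real implementation. A minimal
sketch of that idiom (pmd_is_cont() and ARCH_CONT_PMD_SIZE are made-up
names here; the arm64 and sparc64 patches in this series are the real
examples):

#define pmd_leaf_size pmd_leaf_size
static inline unsigned long pmd_leaf_size(pmd_t pmd)
{
        /* a leaf entry may be one slot of a larger contiguous mapping */
        return pmd_is_cont(pmd) ? ARCH_CONT_PMD_SIZE : PMD_SIZE;
}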




Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-26 Thread Peter Zijlstra
On Fri, Nov 20, 2020 at 01:20:04PM +0100, Peter Zijlstra wrote:

> > > I can help with powerpc 8xx. It is a 32 bits powerpc. The PGD has 1024
> > > entries, that means each entry maps 4M.
> > > 
> > > Page sizes are 4k, 16k, 512k and 8M.
> > > 
> > > For the 8M pages we use hugepd with a single entry. The two related PGD
> > > entries point to the same hugepd.
> > > 
> > > For the other sizes, they are in standard page tables. 16k pages appear
> > > 4 times in the page table. 512k entries appear 128 times in the page
> > > table.
> > > 
> > > When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> > > hugepd which holds the single 8M entry.
> > > 
> > > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > > 
> > > _PAGE_HUGE means it is a 512k page
> > > _PAGE_SPS means it is not a 4k page
> > > 
> > > The kernel can be built either with 4k pages as standard page size, or
> > > 16k pages. It doesn't change the page table layout though.
> > > 
> > > Hope this is clear. Now I don't really know how to wire that up to your 
> > > series.

Does the below accurately reflect things?

Let me go find a suitable cross-compiler ..

diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 1581204467e1..fcc48d590d88 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -135,6 +135,29 @@ static inline pte_t pte_mkhuge(pte_t pte)
 }
 
 #define pte_mkhuge pte_mkhuge
+
+static inline unsigned long pgd_leaf_size(pgd_t pgd)
+{
+   if (pgd_val(pgd) & _PMD_PAGE_8M)
+   return SZ_8M;
+   return SZ_4M;
+}
+
+#define pgd_leaf_size pgd_leaf_size
+
+static inline unsigned long pte_leaf_size(pte_t pte)
+{
+   pte_basic_t val = pte_val(pte);
+
+   if (val & _PAGE_HUGE)
+   return SZ_512K;
+   if (val & _PAGE_SPS)
+   return SZ_16K;
+   return SZ_4K;
+}
+
+#define pte_leaf_size pte_leaf_size
+
 #endif
 
 #endif /* __KERNEL__ */


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-20 Thread Peter Zijlstra
On Fri, Nov 20, 2020 at 12:18:22PM +0100, Christophe Leroy wrote:
> Hi Peter,
> 
> Le 13/11/2020 à 14:44, Christophe Leroy a écrit :
> > Hi
> > 
> > Le 13/11/2020 à 12:19, Peter Zijlstra a écrit :
> > > Hi,
> > > 
> > > These patches provide generic infrastructure to determine TLB page size 
> > > from
> > > page table entries alone. Perf will use this (for either data or code 
> > > address)
> > > to aid in profiling TLB issues.
> > > 
> > > While most architectures only have page table aligned large pages, some
> > > (notably ARM64, Sparc64 and Power) provide non page table aligned large 
> > > pages
> > > and need to provide their own implementation of these functions.
> > > 
> > > I've provided (completely untested) implementations for ARM64 and 
> > > Sparc64, but
> > > failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can 
> > > help
> > > me out there.
> > > 
> > 
> > I can help with powerpc 8xx. It is a 32 bits powerpc. The PGD has 1024
> > entries, that means each entry maps 4M.
> > 
> > Page sizes are 4k, 16k, 512k and 8M.
> > 
> > For the 8M pages we use hugepd with a single entry. The two related PGD
> > entries point to the same hugepd.
> > 
> > For the other sizes, they are in standard page tables. 16k pages appear
> > 4 times in the page table. 512k entries appear 128 times in the page
> > table.
> > 
> > When the PGD entry has _PMD_PAGE_8M bits, the PMD entry points to a
> > hugepd which holds the single 8M entry.
> > 
> > In the PTE, we have two bits: _PAGE_SPS and _PAGE_HUGE
> > 
> > _PAGE_HUGE means it is a 512k page
> > _PAGE_SPS means it is not a 4k page
> > 
> > The kernel can be built either with 4k pages as standard page size, or
> > 16k pages. It doesn't change the page table layout though.
> > 
> > Hope this is clear. Now I don't really know how to wire that up to your series.
> 
> Does my description make sense ? Is there anything I can help with ?

It did, and I had vague memories from when we fixed that pgd_t issue.
I've just not had time to dig through the powerpc code yet to find the
right mmu header to stick it in.

I was meaning to get another version of these patches posted this week,
but time keeps slipping away, I'll try.


Re: [PATCH 1/2] kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling

2020-11-20 Thread Peter Zijlstra
On Fri, Nov 20, 2020 at 12:57:56PM +1000, Nicholas Piggin wrote:
> powerpc keeps a counter in the mm which counts bits set in mm_cpumask as
> well as other things. This means it can't use generic code to clear bits
> out of the mask and doesn't adjust the arch specific counter.
> 
> Add an arch override allowing powerpc to use clear_tasks_mm_cpumask().
> 
> Signed-off-by: Nicholas Piggin 

Seems reasonable enough..

Acked-by: Peter Zijlstra (Intel) 

> ---
>  kernel/cpu.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 6ff2578ecf17..2b8d7a5db383 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -815,6 +815,10 @@ void __init cpuhp_threads_init(void)
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> +#ifndef arch_clear_mm_cpumask_cpu
> +#define arch_clear_mm_cpumask_cpu(cpu, mm) cpumask_clear_cpu(cpu, 
> mm_cpumask(mm))
> +#endif
> +
>  /**
>   * clear_tasks_mm_cpumask - Safely clear tasks' mm_cpumask for a CPU
>   * @cpu: a CPU id
> @@ -850,7 +854,7 @@ void clear_tasks_mm_cpumask(int cpu)
>   t = find_lock_task_mm(p);
>   if (!t)
>   continue;
> - cpumask_clear_cpu(cpu, mm_cpumask(t->mm));
> + arch_clear_mm_cpumask_cpu(cpu, t->mm);
>   task_unlock(t);
>   }
>   rcu_read_unlock();
> -- 
> 2.23.0
> 
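
For illustration, the powerpc override could then look something like the
below; 'context.active_cpus' is a stand-in for whatever counter powerpc
actually keeps in sync with mm_cpumask():

/* sketch only: the 'context.active_cpus' field name is an assumption */
#define arch_clear_mm_cpumask_cpu(cpu, mm)                              \
        do {                                                            \
                if (cpumask_test_cpu(cpu, mm_cpumask(mm))) {            \
                        atomic_dec(&(mm)->context.active_cpus);         \
                        cpumask_clear_cpu(cpu, mm_cpumask(mm));         \
                }                                                       \
        } while (0)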


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-16 Thread Peter Zijlstra
On Mon, Nov 16, 2020 at 08:36:36AM -0800, Dave Hansen wrote:
> On 11/16/20 8:32 AM, Matthew Wilcox wrote:
> >>
> >> That's really the best we can do from software without digging into
> >> microarchitecture-specific events.
> > I mean this is perf.  Digging into microarch specific events is what it
> > does ;-)
> 
> Yeah, totally.

Sure, but the automatic promotion/demotion of TLB sizes is not visible
if you don't know what you started out with.

> But, if we see a bunch of 4k TLB hit events, it's still handy to know
> that those 4k TLB hits originated from a 2M page table entry.  This
> series just makes sure that perf has the data about the page table
> mapping sizes regardless of what the microarchitecture does with it.

This. 


Re: [PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-16 Thread Peter Zijlstra
On Mon, Nov 16, 2020 at 08:28:23AM -0800, Dave Hansen wrote:
> On 11/16/20 7:54 AM, Matthew Wilcox wrote:
> > It gets even more complicated with CPUs with multiple levels of TLB
> > which support different TLB entry sizes.  My CPU reports:
> > 
> > TLB info
> >  Instruction TLB: 2M/4M pages, fully associative, 8 entries
> >  Instruction TLB: 4K pages, 8-way associative, 64 entries
> >  Data TLB: 1GB pages, 4-way set associative, 4 entries
> >  Data TLB: 4KB pages, 4-way associative, 64 entries
> >  Shared L2 TLB: 4KB/2MB pages, 6-way associative, 1536 entries
> 
> It's even "worse" on recent AMD systems.  Those will coalesce multiple
> adjacent PTEs into a single TLB entry.  I think Alphas did something
> like this back in the day with an opt-in.
> 
> Anyway, the changelog should probably replace:

ARM64 does too.

> > This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
> > page sizes.
> 
> with something more like:
> 
> This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate page
> table mapping sizes.

Sure.


Re: [PATCH 2/5] mm: Introduce pXX_leaf_size()

2020-11-13 Thread Peter Zijlstra
On Fri, Nov 13, 2020 at 12:19:03PM +0100, Peter Zijlstra wrote:
> A number of architectures have non-pagetable aligned huge/large pages.
> For such architectures a leaf can actually be part of a larger TLB
> entry.
> 
> Provide generic helpers to determine the TLB size of a page-table
> leaf.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/pgtable.h |   16 
>  1 file changed, 16 insertions(+)
> 
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1536,4 +1536,20 @@ typedef unsigned int pgtbl_mod_mask;
>  #define pmd_leaf(x)  0
>  #endif
>  
> +#ifndef pgd_leaf_size
> +#define pgd_leaf_size(x) PGD_SIZE

Argh, I lost a refresh, that should've been:

+#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)


> +#endif
> +#ifndef p4d_leaf_size
> +#define p4d_leaf_size(x) P4D_SIZE
> +#endif
> +#ifndef pud_leaf_size
> +#define pud_leaf_size(x) PUD_SIZE
> +#endif
> +#ifndef pmd_leaf_size
> +#define pmd_leaf_size(x) PMD_SIZE
> +#endif
> +#ifndef pte_leaf_size
> +#define pte_leaf_size(x) PAGE_SIZE
> +#endif
> +
>  #endif /* _LINUX_PGTABLE_H */
> 
> 


[PATCH 5/5] sparc64/mm: Implement pXX_leaf_size() support

2020-11-13 Thread Peter Zijlstra
Sparc64 has non-pagetable aligned large page support; wire up the
pXX_leaf_size() functions to report the correct TLB page size.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
page sizes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/sparc/include/asm/pgtable_64.h |   13 +
 arch/sparc/mm/hugetlbpage.c |   19 +--
 2 files changed, 26 insertions(+), 6 deletions(-)

--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1121,6 +1121,19 @@ extern unsigned long cmdline_memory_size
 
 asmlinkage void do_sparc64_fault(struct pt_regs *regs);
 
+#ifdef CONFIG_HUGETLB_PAGE
+
+#define pud_leaf_size pud_leaf_size
+extern unsigned long pud_leaf_size(pud_t pud);
+
+#define pmd_leaf_size pmd_leaf_size
+extern unsigned long pmd_leaf_size(pmd_t pmd);
+
+#define pte_leaf_size pte_leaf_size
+extern unsigned long pte_leaf_size(pte_t pte);
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #endif /* !(__ASSEMBLY__) */
 
 #endif /* !(_SPARC64_PGTABLE_H) */
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -247,14 +247,17 @@ static unsigned int sun4u_huge_tte_to_sh
return shift;
 }
 
-static unsigned int huge_tte_to_shift(pte_t entry)
+static unsigned long tte_to_shift(pte_t entry)
 {
-   unsigned long shift;
-
if (tlb_type == hypervisor)
-   shift = sun4v_huge_tte_to_shift(entry);
-   else
-   shift = sun4u_huge_tte_to_shift(entry);
+   return sun4v_huge_tte_to_shift(entry);
+
+   return sun4u_huge_tte_to_shift(entry);
+}
+
+static unsigned int huge_tte_to_shift(pte_t entry)
+{
+   unsigned long shift = tte_to_shift(entry);
 
if (shift == PAGE_SHIFT)
WARN_ONCE(1, "tto_to_shift: invalid hugepage tte=0x%lx\n",
@@ -272,6 +275,10 @@ static unsigned long huge_tte_to_size(pt
return size;
 }
 
+unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift((pte_t)pud); }
+unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift((pte_t)pmd); }
+unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift((pte_t)pte); }
+
 pte_t *huge_pte_alloc(struct mm_struct *mm,
unsigned long addr, unsigned long sz)
 {




[PATCH 0/5] perf/mm: Fix PERF_SAMPLE_*_PAGE_SIZE

2020-11-13 Thread Peter Zijlstra
Hi,

These patches provide generic infrastructure to determine TLB page size from
page table entries alone. Perf will use this (for either data or code address)
to aid in profiling TLB issues.

While most architectures only have page table aligned large pages, some
(notably ARM64, Sparc64 and Power) provide non page table aligned large pages
and need to provide their own implementation of these functions.

I've provided (completely untested) implementations for ARM64 and Sparc64, but
failed to penetrate the _many_ Power MMUs. I'm hoping Nick or Aneesh can help
me out there.



[PATCH 3/5] perf/core: Fix arch_perf_get_page_size()

2020-11-13 Thread Peter Zijlstra
The (new) page-table walker in arch_perf_get_page_size() is broken in
various ways. Specifically, while it is used in a lockless manner, it
doesn't depend on CONFIG_HAVE_FAST_GUP, nor uses the proper _lockless
offset methods, nor is it careful to read each entry only once.

Also the hugetlb support is broken due to calling pte_page() without
first checking pte_special().

Rewrite the whole thing to be a proper lockless page-table walker and
employ the new pXX_leaf_size() pgtable functions to determine the TLB
size without looking at the page-frames.

Fixes: 51b646b2d9f8 ("perf,mm: Handle non-page-table-aligned hugetlbfs")
Fixes: 8d97e71811aa ("perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE")
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm64/include/asm/pgtable.h|3 +
 arch/sparc/include/asm/pgtable_64.h |   13 
 arch/sparc/mm/hugetlbpage.c |   19 --
 include/linux/pgtable.h |   16 +
 kernel/events/core.c|  102 +---
 5 files changed, 82 insertions(+), 71 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7001,90 +7001,62 @@ static u64 perf_virt_to_phys(u64 virt)
return phys_addr;
 }
 
-#ifdef CONFIG_MMU
-
 /*
- * Return the MMU page size of a given virtual address.
- *
- * This generic implementation handles page-table aligned huge pages, as well
- * as non-page-table aligned hugetlbfs compound pages.
- *
- * If an architecture supports and uses non-page-table aligned pages in their
- * kernel mapping it will need to provide it's own implementation of this
- * function.
+ * Return the MMU/TLB page size of a given virtual address.
  */
-__weak u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
+static u64 perf_get_tlb_page_size(struct mm_struct *mm, unsigned long addr)
 {
-   struct page *page;
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
+   u64 size = 0;
 
-   pgd = pgd_offset(mm, addr);
-   if (pgd_none(*pgd))
-   return 0;
+#ifdef CONFIG_HAVE_FAST_GUP
+   pgd_t *pgdp, pgd;
+   p4d_t *p4dp, p4d;
+   pud_t *pudp, pud;
+   pmd_t *pmdp, pmd;
+   pte_t *ptep, pte;
 
-   p4d = p4d_offset(pgd, addr);
-   if (!p4d_present(*p4d))
+   pgdp = pgd_offset(mm, addr);
+   pgd = READ_ONCE(*pgdp);
+   if (pgd_none(pgd))
return 0;
 
-   if (p4d_leaf(*p4d))
-   return 1ULL << P4D_SHIFT;
+   if (pgd_leaf(pgd))
+   return pgd_leaf_size(pgd);
 
-   pud = pud_offset(p4d, addr);
-   if (!pud_present(*pud))
+   p4dp = p4d_offset_lockless(pgdp, pgd, addr);
+   p4d = READ_ONCE(*p4dp);
+   if (!p4d_present(p4d))
return 0;
 
-   if (pud_leaf(*pud)) {
-#ifdef pud_page
-   page = pud_page(*pud);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PUD_SHIFT;
-   }
+   if (p4d_leaf(p4d))
+   return p4d_leaf_size(p4d);
 
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_present(*pmd))
+   pudp = pud_offset_lockless(p4dp, p4d, addr);
+   pud = READ_ONCE(*pudp);
+   if (!pud_present(pud))
return 0;
 
-   if (pmd_leaf(*pmd)) {
-#ifdef pmd_page
-   page = pmd_page(*pmd);
-   if (PageHuge(page))
-   return page_size(compound_head(page));
-#endif
-   return 1ULL << PMD_SHIFT;
-   }
+   if (pud_leaf(pud))
+   return pud_leaf_size(pud);
 
-   pte = pte_offset_map(pmd, addr);
-   if (!pte_present(*pte)) {
-   pte_unmap(pte);
+   pmdp = pmd_offset_lockless(pudp, pud, addr);
+   pmd = READ_ONCE(*pmdp);
+   if (!pmd_present(pmd))
return 0;
-   }
 
-   page = pte_page(*pte);
-   if (PageHuge(page)) {
-   u64 size = page_size(compound_head(page));
-   pte_unmap(pte);
-   return size;
-   }
-
-   pte_unmap(pte);
-   return PAGE_SIZE;
-}
+   if (pmd_leaf(pmd))
+   return pmd_leaf_size(pmd);
 
-#else
+   ptep = pte_offset_map(&pmd, addr);
+   pte = ptep_get_lockless(ptep);
+   if (pte_present(pte))
+   size = pte_leaf_size(pte);
+   pte_unmap(ptep);
+#endif /* CONFIG_HAVE_FAST_GUP */
 
-static u64 arch_perf_get_page_size(struct mm_struct *mm, unsigned long addr)
-{
-   return 0;
+   return size;
 }
 
-#endif
-
 static u64 perf_get_page_size(unsigned long addr)
 {
struct mm_struct *mm;
@@ -7109,7 +7081,7 @@ static u64 perf_get_page_size(unsigned l
mm = &init_mm;
}
 
-   size = arch_perf_get_page_size(mm, addr);
+   size = perf_get_tlb_page_size(mm, addr);
 
local_irq_restore(flags);
 




[PATCH 1/5] mm/gup: Provide gup_get_pte() more generic

2020-11-13 Thread Peter Zijlstra
In order to write another lockless page-table walker, we need
gup_get_pte() exposed. While doing that, rename it to
ptep_get_lockless() to match the existing ptep_get() naming.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/pgtable.h |   55 +
 mm/gup.c|   58 
 2 files changed, 56 insertions(+), 57 deletions(-)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -258,6 +258,61 @@ static inline pte_t ptep_get(pte_t *ptep
 }
 #endif
 
+#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
+/*
+ * WARNING: only to be used in the get_user_pages_fast() implementation.
+ *
+ * With get_user_pages_fast(), we walk down the pagetables without taking any
+ * locks.  For this we would like to load the pointers atomically, but sometimes
+ * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
+ * we do have is the guarantee that a PTE will only either go from not present
+ * to present, or present to not present or both -- it will not switch to a
+ * completely different present page without a TLB flush in between; something
+ * that we are blocking by holding interrupts off.
+ *
+ * Setting ptes from not present to present goes:
+ *
+ *   ptep->pte_high = h;
+ *   smp_wmb();
+ *   ptep->pte_low = l;
+ *
+ * And present to not present goes:
+ *
+ *   ptep->pte_low = 0;
+ *   smp_wmb();
+ *   ptep->pte_high = 0;
+ *
+ * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
+ * We load pte_high *after* loading pte_low, which ensures we don't see an older
+ * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
+ * picked up a changed pte high. We might have gotten rubbish values from
+ * pte_low and pte_high, but we are guaranteed that pte_low will not have the
+ * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
+ * operates on present ptes we're safe.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+   pte_t pte;
+
+   do {
+   pte.pte_low = ptep->pte_low;
+   smp_rmb();
+   pte.pte_high = ptep->pte_high;
+   smp_rmb();
+   } while (unlikely(pte.pte_low != ptep->pte_low));
+
+   return pte;
+}
+#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+/*
+ * We require that the PTE can be read atomically.
+ */
+static inline pte_t ptep_get_lockless(pte_t *ptep)
+{
+   return ptep_get(ptep);
+}
+#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,62 +2079,6 @@ static void put_compound_head(struct pag
put_page(page);
 }
 
-#ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
-
-/*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
- *
- * Setting ptes from not present to present goes:
- *
- *   ptep->pte_high = h;
- *   smp_wmb();
- *   ptep->pte_low = l;
- *
- * And present to not present goes:
- *
- *   ptep->pte_low = 0;
- *   smp_wmb();
- *   ptep->pte_high = 0;
- *
- * We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
- * We load pte_high *after* loading pte_low, which ensures we don't see an older
- * value of pte_high.  *Then* we recheck pte_low, which ensures that we haven't
- * picked up a changed pte high. We might have gotten rubbish values from
- * pte_low and pte_high, but we are guaranteed that pte_low will not have the
- * present bit set *unless* it is 'l'. Because get_user_pages_fast() only
- * operates on present ptes we're safe.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-   pte_t pte;
-
-   do {
-   pte.pte_low = ptep->pte_low;
-   smp_rmb();
-   pte.pte_high = ptep->pte_high;
-   smp_rmb();
-   } while (unlikely(pte.pte_low != ptep->pte_low));
-
-   return pte;
-}
-#else /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-/*
- * We require that the PTE can be read atomically.
- */
-static inline pte_t gup_get_pte(pte_t *ptep)
-{
-   return ptep_get(ptep);
-}
-#endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
-
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
unsigned int flags,
struct page

[PATCH 4/5] arm64/mm: Implement pXX_leaf_size() support

2020-11-13 Thread Peter Zijlstra
ARM64 has non-pagetable aligned large page support with PTE_CONT, when
this bit is set the page is part of a super-page. Match the hugetlb
code and support these super pages for PTE and PMD levels.

This enables PERF_SAMPLE_{DATA,CODE}_PAGE_SIZE to report accurate TLB
page sizes.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm64/include/asm/pgtable.h |3 +++
 1 file changed, 3 insertions(+)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -503,6 +503,9 @@ extern pgprot_t phys_mem_access_prot(str
 PMD_TYPE_SECT)
 #define pmd_leaf(pmd)  pmd_sect(pmd)
 
+#define pmd_leaf_size(pmd) (pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
+#define pte_leaf_size(pte) (pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }




[PATCH 2/5] mm: Introduce pXX_leaf_size()

2020-11-13 Thread Peter Zijlstra
A number of architectures have non-pagetable aligned huge/large pages.
For such architectures a leaf can actually be part of a larger TLB
entry.

Provide generic helpers to determine the TLB size of a page-table
leaf.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/pgtable.h |   16 
 1 file changed, 16 insertions(+)

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1536,4 +1536,20 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf(x)0
 #endif
 
+#ifndef pgd_leaf_size
+#define pgd_leaf_size(x) PGD_SIZE
+#endif
+#ifndef p4d_leaf_size
+#define p4d_leaf_size(x) P4D_SIZE
+#endif
+#ifndef pud_leaf_size
+#define pud_leaf_size(x) PUD_SIZE
+#endif
+#ifndef pmd_leaf_size
+#define pmd_leaf_size(x) PMD_SIZE
+#endif
+#ifndef pte_leaf_size
+#define pte_leaf_size(x) PAGE_SIZE
+#endif
+
 #endif /* _LINUX_PGTABLE_H */




Re: [PATCH 1/3] asm-generic/atomic64: Add support for ARCH_ATOMIC

2020-11-11 Thread Peter Zijlstra
On Wed, Nov 11, 2020 at 02:39:01PM +0100, Christophe Leroy wrote:
> Hello,
> 
> Le 11/11/2020 à 12:07, Nicholas Piggin a écrit :
> > This passes atomic64 selftest on ppc32 on qemu (uniprocessor only)
> > both before and after powerpc is converted to use ARCH_ATOMIC.
> 
> Can you explain what this change does and why it is needed ?

That certainly should've been in the Changelog. This enables atomic
instrumentation, see asm-generic/atomic-instrumented.h. IOW, it makes
atomic ops visible to K*SAN.
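
For reference, the generated wrappers in that header have roughly this
shape (a simplified sketch; the exact instrument_*() hook names vary by
kernel version):

static __always_inline void atomic_inc(atomic_t *v)
{
        instrument_atomic_read_write(v, sizeof(*v));
        arch_atomic_inc(v);
}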


Re: [PATCH 3/8] powerpc: Mark functions called inside uaccess blocks w/ 'notrace'

2020-10-16 Thread Peter Zijlstra
On Fri, Oct 16, 2020 at 07:56:16AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 15, 2020 at 10:01:54AM -0500, Christopher M. Riedl wrote:
> > Functions called between user_*_access_begin() and user_*_access_end()
> > should be either inlined or marked 'notrace' to prevent leaving
> > userspace access exposed. Mark any such functions relevant to signal
> > handling so that subsequent patches can call them inside uaccess blocks.
> 
> I don't think running this much code with uaccess enabled is a good
> idea.  Please refactor the code to reduce the criticial sections with
> uaccess enabled.
> 
> Btw, does powerpc already have the objtool validation that we don't
> accidentally jump out of unsafe uaccess critical sections?

It does not, there was some effort on that a while ago, but I suspect
they're waiting for the ARM64 effort to land and build on that.
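
For Christoph's first point, the shape being asked for is roughly this (a
minimal sketch, not the actual signal code):

static int store_one_word(unsigned long __user *uptr, unsigned long val)
{
        if (!user_access_begin(uptr, sizeof(*uptr)))
                return -EFAULT;
        /* nothing but unsafe_*_user() calls in this window */
        unsafe_put_user(val, uptr, Efault);
        user_access_end();
        return 0;

Efault:
        user_access_end();
        return -EFAULT;
}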


Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends

2020-09-24 Thread Peter Zijlstra
On Thu, Sep 24, 2020 at 09:51:38AM -0400, Steven Rostedt wrote:

> > It turns out, that getting selected for pull-balance is exactly that
> > condition, and clearly a migrate_disable() task cannot be pulled, but we
> > can use that signal to try and pull away the running task that's in the
> > way.
> 
> Unless of course that running task is in a migrate disable section
> itself ;-)

See my ramblings here:

  
https://lkml.kernel.org/r/20200924082717.ga1362...@hirez.programming.kicks-ass.net

My plan was to have the migrate_enable() of the running task trigger the
migration in that case.



Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends

2020-09-24 Thread Peter Zijlstra
On Thu, Sep 24, 2020 at 08:32:41AM -0400, Steven Rostedt wrote:
> Anyway, instead of blocking. What about having a counter of number of
> migrate disabled tasks per cpu, and when taking a migrate_disable(), and 
> there's
> already another task with migrate_disabled() set, and the current task has
> an affinity greater than 1, it tries to migrate to another CPU?

That doesn't solve the problem. On wakeup we should already prefer an
idle CPU over one running a (RT) task, but you can always wake more
tasks than there's CPUs around and you'll _have_ to stack at some point.

The trick is how to unstack them correctly. We need to detect when a
migrate_disable() task _should_ start running again, and migrate away
whoever is in the way at that point.

It turns out, that getting selected for pull-balance is exactly that
condition, and clearly a migrate_disable() task cannot be pulled, but we
can use that signal to try and pull away the running task that's in the
way.


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-25 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 3a0db7b0b46e..35060be09073 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>  #define powerpc_local_irq_pmu_save(flags)\
>do {   \
>   raw_local_irq_pmu_save(flags);  \
> - trace_hardirqs_off();   \
> + if (!raw_irqs_disabled_flags(flags))\
> + trace_hardirqs_off();   \
>   } while(0)

So one problem with the above is something like this:

raw_local_irq_save();

  powerpc_local_irq_pmu_save();

that would now no longer call into tracing/lockdep at all. As a
consequence, lockdep and tracing would show the NMI ran with IRQs
enabled, which is exceptionally weird..

Similar problem with:

raw_local_irq_disable();
local_irq_save()

Now, most architectures today seem to do what x86 also did:


  trace_hardirqs_off()
  ...
  if (irqs_unmasked(regs))
trace_hardirqs_on()


Which is 'funny' when it interleaves like:

local_irq_disable();
...
local_irq_enable()
  trace_hardirqs_on();
  
  raw_local_irq_enable();

Because then it will undo the trace_hardirqs_on() we just did. With the
result that both tracing and lockdep will see a hardirqs-disable without
a matching enable, while the hardware state is enabled.

Which is exactly the state Alexey seems to have ran into.

Now, x86, and at least arm64 call nmi_enter() before
trace_hardirqs_off(), but AFAICT Power never did that, and that's part
of the problem. nmi_enter() does lockdep_off() and that _used_ to also
kill IRQ tracking.

Now, my patch changed that, it makes IRQ tracking not respect
lockdep_off(). And that exposed x86 (and everybody else :/) to the same
problem you have.

And this is why I made x86 look at software state in NMIs. Because then
it all works out. For bonus points, trace_hardirqs_*() also has some
do-it-once logic for tracing.



Anyway, it's Saturday evening, time for a beer. I'll stare at this more
later.


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-25 Thread Peter Zijlstra
On Fri, Jul 24, 2020 at 03:10:59PM -0400, Waiman Long wrote:
> On 7/24/20 4:16 AM, Will Deacon wrote:
> > On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:
> > > On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
> > > > BTW, do you have any comment on my v2 lock holder cpu info qspinlock 
> > > > patch?
> > > > I will have to update the patch to fix the reported 0-day test problem, 
> > > > but
> > > > I want to collect other feedback before sending out v3.
> > > I want to say I hate it all, it adds instructions to a path we spend an
> > > aweful lot of time optimizing without really getting anything back for
> > > it.
> > > 
> > > Will, how do you feel about it?
> > I can see it potentially being useful for debugging, but I hate the
> > limitation to 256 CPUs. Even arm64 is hitting that now.
> 
> After thinking more about that, I think we can use all the remaining bits in
> the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending,
> there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit
> locked_pending), we can put all possible cpu numbers into the lock. We can
> also just use smp_processor_id() without additional percpu data.

That sounds horrific, wouldn't that destroy the whole point of using a
byte for pending?

> > Also, you're talking ~1% gains here. I think our collective time would
> > be better spent off reviewing the CNA series and trying to make it more
> > deterministic.
> 
> I thought you guys are not interested in CNA. I do want to get CNA merged,
> if possible. Let review the current version again and see if there are ways
> we can further improve it.

It's not a lack of interest. We were struggling with the fairness
issues and the complexity of the thing. I forgot the current state of
matters, but at one point UNLOCK was O(n) in waiters, which is, of
course, 'unfortunate'.

I'll have to look up whatever notes remain, but the basic idea of
keeping remote nodes on a secondary list is obviously breaking all sorts
of fairness. After that they pile on a bunch of hacks to fix the worst
of them, but it feels exactly like that, a bunch of hacks.

One of the things I suppose we ought to do is see if some of the ideas
of phase-fair locks can be applied to this.

That coupled with a chronic lack of time for anything :-(


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 11:11:03PM +1000, Nicholas Piggin wrote:
> Excerpts from Peter Zijlstra's message of July 23, 2020 9:40 pm:
> > On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:
> > 
> >> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> >> b/arch/powerpc/include/asm/hw_irq.h
> >> index 3a0db7b0b46e..35060be09073 100644
> >> --- a/arch/powerpc/include/asm/hw_irq.h
> >> +++ b/arch/powerpc/include/asm/hw_irq.h
> >> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
> >>  #define powerpc_local_irq_pmu_save(flags) \
> >> do {   \
> >>raw_local_irq_pmu_save(flags);  \
> >> -  trace_hardirqs_off();   \
> >> +  if (!raw_irqs_disabled_flags(flags))\
> >> +  trace_hardirqs_off();   \
> >>} while(0)
> >>  #define powerpc_local_irq_pmu_restore(flags)  \
> >>do {\
> >> -  if (raw_irqs_disabled_flags(flags)) {   \
> >> -  raw_local_irq_pmu_restore(flags);   \
> >> -  trace_hardirqs_off();   \
> >> -  } else {\
> >> +  if (!raw_irqs_disabled_flags(flags))\
> >>trace_hardirqs_on();\
> >> -  raw_local_irq_pmu_restore(flags);   \
> >> -  }   \
> >> +  raw_local_irq_pmu_restore(flags);   \
> >>} while(0)
> > 
> > You shouldn't be calling lockdep from NMI context!
> 
> After this patch it doesn't.

You sure, trace_hardirqs_{on,off}() calls into lockdep. (FWIW they're
also broken vs entry ordering, but that's another story).

> trace_hardirqs_on/off implementation appears to expect to be called in NMI 
> context though, for some reason.

Hurpm, not sure.. I'll have to go grep arch code now :/ The generic NMI
code didn't touch that stuff.

Argh, yes, there might be breakage there... damn! I'll go frob around.


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 12:06:13PM -0400, Waiman Long wrote:
> We don't really need to do a pv_spinlocks_init() if pv_kick() isn't
> supported.

Waiman, if you cannot explain how not having kick is a sane thing, what
are you saying here?


Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-07-23 Thread Peter Zijlstra
On Thu, Jul 23, 2020 at 08:56:14PM +1000, Nicholas Piggin wrote:

> diff --git a/arch/powerpc/include/asm/hw_irq.h 
> b/arch/powerpc/include/asm/hw_irq.h
> index 3a0db7b0b46e..35060be09073 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -200,17 +200,14 @@ static inline bool arch_irqs_disabled(void)
>  #define powerpc_local_irq_pmu_save(flags)\
>do {   \
>   raw_local_irq_pmu_save(flags);  \
> - trace_hardirqs_off();   \
> + if (!raw_irqs_disabled_flags(flags))\
> + trace_hardirqs_off();   \
>   } while(0)
>  #define powerpc_local_irq_pmu_restore(flags) \
>   do {\
> - if (raw_irqs_disabled_flags(flags)) {   \
> - raw_local_irq_pmu_restore(flags);   \
> - trace_hardirqs_off();   \
> - } else {\
> + if (!raw_irqs_disabled_flags(flags))\
>   trace_hardirqs_on();\
> - raw_local_irq_pmu_restore(flags);   \
> - }   \
> + raw_local_irq_pmu_restore(flags);   \
>   } while(0)

You shouldn't be calling lockdep from NMI context! That is, I recently
added suport for that on x86:

  https://lkml.kernel.org/r/20200623083721.155449...@infradead.org
  https://lkml.kernel.org/r/20200623083721.216740...@infradead.org

But you need to be very careful on how you order things, as you can see
the above relies on preempt_count() already having been incremented with
NMI_MASK.


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Peter Zijlstra
On Wed, Jul 22, 2020 at 01:48:22PM +0530, Srikar Dronamraju wrote:
> * pet...@infradead.org  [2020-07-22 09:46:24]:
> 
> > On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > > same as SMT domain. However we could generalize this domain such that it
> > > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > What's the whole smallcore vs bigcore thing?
> > 
> > Would it make sense to have an actual overview of the various topology
> > layers somewhere? Because I'm getting lost and can't really make sense
> > of the code due to that.
> 
> To quote with an example: using (Power 9)
> 
> Power 9 is an SMT 8 core by design. However this 8 thread core can run as 2
> independent cores with threads 0,2,4 and 6 acting like a core, while threads
> 1,3,5,7 act as another core.
> 
> The firmware can decide to showcase them as 2 independent small cores or as
> one big core. However the LLC (i.e last level of cache) is shared between
> all the threads of the SMT 8 core. Future power chips, LLC might change, it
> may be expanded to share with another SMT 8 core or it could be made
> specific to SMT 4 core.
> 
> The smt 8 core is also known as fused core/ Big core.
> The smaller smt 4 core is known as small core.
> 
> So on a Power9 Baremetal, the firmware would show up as SMT4 core.
> and we have a CACHE level at SMT 8. CACHE level would be very very similar
> to MC domain in X86.
> 
> Hope this is clear.

Ooh, that thing. I thought P9 was in actual fact an SMT4 hardware part,
but to be compatible with P8 it has this fused option where two cores
act like a single SMT8 part.

The interleaving enumeration is done due to P7 asymmetric cores,
resulting in the scheduler preferring to use the lower threads.

Combined that results in a P9-fused configuration using two independent
full cores when there's only 2 runnable threads.

Which is a subtly different story from yours.

But reading your explanation, it looks like the Linux topology setup
could actually unravel the fused-faux-SMT8 into two independent SMT4
parts, negating that firmware option.

Anyway, a few comments just around there might be helpful.


Thanks!


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Peter Zijlstra
On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> > 
> >> That being said, the x86 sync core gap that I imagined could be fixed
> >> by changing to rq->curr == rq->idle test does not actually exist because
> >> the global membarrier does not have a sync core option. So fixing the
> >> exit_lazy_tlb points that this series does *should* fix that. So
> >> PF_KTHREAD may be less problematic than I thought from implementation
> >> point of view, only semantics.
> > 
> > So I've been trying to figure out where that PF_KTHREAD comes from,
> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> > 
> > So the first version:
> > 
> >  
> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
> > 
> > appears to unconditionally send the IPI and checks p->mm in the IPI
> > context, but then v2:
> > 
> >  
> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
> > 
> > has the current code. But I've been unable to find the reason the
> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
> 
> Looking back at my inbox, it seems like you are the one who proposed to
> skip all kthreads: 
> 
> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net

I had a feeling it might've been me ;-) I just couldn't find the email.

> > The comment doesn't really help either; sure we have the whole lazy mm
> > thing, but that's ->active_mm, not ->mm.
> > 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
> > make the remote p->mm test work? Or were we over-eager with the !p->mm
> > doesn't imply kthread 'cleanups' at the time?
> 
> The nice thing about adding back kthreads to the threads considered for 
> membarrier
> IPI is that it has no observable effect on the user-space ABI. No 
> pre-existing kthread
> rely on this, and we just provide an additional guarantee for future kthread
> implementations.
> 
> > Also, I just realized, I still have a fix for use_mm() now
> > kthread_use_mm() that seems to have been lost.
> 
> I suspect we need to at least document the memory barriers in kthread_use_mm 
> and
> kthread_unuse_mm to state that they are required by membarrier if we want to
> ipi kthreads as well.

Right, so going by that email you found it was mostly a case of being
lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
any other bits that might be needed, covering kthreads should be
possible.

No objections from me for making it so.


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Peter Zijlstra
On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:

> > CPU0 CPU1
> > 1. user stuff
> > a. membarrier()  2. enter kernel
> > b. read rq->curr 3. rq->curr switched to kthread
> > c. is kthread, skip IPI  4. switch_to kthread
> > d. return to user5. rq->curr switched to user thread
> > 6. switch_to user thread
> > 7. exit kernel
> > 8. more user stuff

> I find it hard to believe that this is x86 only. Why would thread
> switch imply core sync on any architecture?  Is x86 unique in having a
> stupid expensive core sync that is heavier than smp_mb()?

smp_mb() is nowhere near the most expensive barrier we have in Linux;
mb() might qualify, since it has completion requirements and needs to
serialize against external actors.

On x86_64 things are rather murky, we have:

LOCK prefix -- which implies smp_mb() before and after RmW
LFENCE -- which used to be rmb like, until Spectre, and now it
  is ISYNC like. Since ISYNC ensures an empty pipeline,
  it also implies all loads are retired (and therefore
  complete), so it implies rmb.
MFENCE -- which is a memop completion barrier like, it makes
  sure all previously issued memops are complete.

if you read that carefully, you'll note you'll have to use LFENCE +
MFENCE to order against non-memops instructions.

But none of them imply dumping the instruction decoder caches; that only
happens on core serializing instructions like CR3 writes, IRET, CPUID
and a few others. I think we recently got a SERIALIZE instruction to add
to this list.


On ARM64 there's a whole different set of barriers, and again
smp_mb() is nowhere near the top of the list. They have roughly 3
classes:

ISB -- instruction sync barrier
DMB(x) -- memory ordering in domain x
DSB(x) -- memory completion in domain x

And they have at least 3 domains (IIRC), system, outer, inner.

The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to
have a SYNC, but since PowerPC is rare in having only one ridiculously
heavy serializing instruction, we got to re-use the smp_mb() early in
__schedule() instead, but ARM64 can't do that.


So rather than say that x86 is special here, I'd say that PowerPC is
special here.

> But I’m wondering if all this deferred sync stuff is wrong. In the
> brave new world of io_uring and such, perhaps kernel access matter
> too.  Heck, even:

IIRC the membarrier SYNC_CORE use-case is about user-space
self-modifying code.

Userspace re-uses a text address and needs to SYNC_CORE before it can be
sure the old text is forgotten. Nothing the kernel does matters there.

I suppose the manpage could be more clear there.
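
For the curious, a minimal user-space sketch of that use-case (a JIT
re-using a text address); the jit_* helper names are made up, only the
membarrier commands are real:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* Once, at JIT start-up. */
static int jit_init(void)
{
	return membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/*
 * After overwriting previously executed text: make sure every thread's
 * instruction stream has forgotten the old code before any thread may
 * jump to the new code at that address.
 */
static int jit_publish(void)
{
	return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}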



Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 07:31:03PM +0300, Jarkko Sakkinen wrote:
> On Tue, Jul 14, 2020 at 03:01:09PM +0200, Peter Zijlstra wrote:
> > to help with text_alloc() usage in generic code, but I think
> > fundamentally, there's only these two options.
> 
> There is one arch (nios2), which uses a regular kzalloc(). This means
> that both text_alloc() and text_memfree() need to be weaks symbols and
> nios2 needs to have overriding text.c to do its thing.

IMO nios2 is just broken.


Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 03:19:24PM +0300, Ard Biesheuvel wrote:
> So perhaps the answer is to have text_alloc() not with a 'where'
> argument but with a 'why' argument. Or more simply, just have separate
> alloc/free APIs for each case, with generic versions that can be
> overridden by the architecture.

Well, there only seem to be 2 cases here, either the pointer needs to
fit in some immediate displacement, or not.

On x86 we seem to have the advantage of a fairly large immediate
displacement as compared to many other architectures (due to our
variable sized instructions). And thus have been fairly liberal with our
usage of it (also our indirect jmps/calls suck, doubly so with
RETCH-POLINE).

Still, the indirect jump, as mentioned by Russell, should work for
arbitrarily placed code for us too.


So I'm thinking that something like:

enum ptr_type {
	immediate_displacement,
	absolute,
};

void *text_alloc(unsigned long size, enum ptr_type type)
{
	unsigned long vstart = VMALLOC_START;
	unsigned long vend   = VMALLOC_END;

	if (type == immediate_displacement) {
		vstart = MODULES_VADDR;
		vend   = MODULES_END;
	}

	return __vmalloc_node_range(size, TEXT_ALIGN, vstart, vend,
				    GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
				    NUMA_NO_NODE, _RET_IP_);
}

void text_free(void *ptr)
{
	vfree(ptr);
}

Should work for all cases. Yes, we might then want something like a per
arch:

{BPF,FTRACE,KPROBE}_TEXT_TYPE

to help with text_alloc() usage in generic code, but I think
fundamentally, there's only these two options.
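
As a usage sketch against the interface proposed above (the BPF_TEXT_TYPE
define is one of those hypothetical per-arch knobs), BPF's allocation
hooks could then become:

#ifndef BPF_TEXT_TYPE
#define BPF_TEXT_TYPE	immediate_displacement	/* x86: stay in branch range */
#endif

void *bpf_jit_alloc_exec(unsigned long size)
{
	return text_alloc(size, BPF_TEXT_TYPE);
}

void bpf_jit_free_exec(void *addr)
{
	text_free(addr);
}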


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote:
> x86 has this exact problem. At least no more than 64*8 CPUs share the
> cache line :)

I've seen patches for a 'sparse' bitmap to solve related problems.

It's basically the same code, except it multiplies everything (size,
bit-nr) by a constant to reduce the number of active bits per line.

This sadly doesn't take topology into account, but reducing contention
is still good of course.
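
Roughly, the idea looks like this (plain, non-atomic sketch; the SPREAD
factor of 64 is an assumption, not taken from any existing patch):

#include <limits.h>

#define SPREAD		64	/* real bits consumed per logical bit */
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* A map of nbits logical bits needs nbits * SPREAD real bits. */
#define SPARSE_LONGS(nbits) \
	(((nbits) * SPREAD + BITS_PER_LONG - 1) / BITS_PER_LONG)

static inline void sparse_set_bit(unsigned long *map, unsigned int nr)
{
	unsigned long b = (unsigned long)nr * SPREAD;

	map[b / BITS_PER_LONG] |= 1UL << (b % BITS_PER_LONG);
}

static inline int sparse_test_bit(const unsigned long *map, unsigned int nr)
{
	unsigned long b = (unsigned long)nr * SPREAD;

	return (map[b / BITS_PER_LONG] >> (b % BITS_PER_LONG)) & 1;
}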


Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:33:33AM +0100, Russell King - ARM Linux admin wrote:
> For 32-bit ARM, our bpf code uses "blx/bx" (or equivalent code
> sequences) rather than encoding a "bl" or "b", so BPF there doesn't
> care where the executable memory is mapped, and doesn't need any
> PLTs.  Given that, should bpf always allocate from the vmalloc()
> region to preserve the module space for modules?

Ah, okay, then I suspect arm64 does something similar there. Thanks!

> I'm more concerned about ftrace though, but only because I don't
> have the understanding of that subsystem to really say whether there
> are any side effects from having the allocations potentially be out
> of range of a "bl" or "b" instruction.
> 
> If ftrace jumps both to and from the allocated page using a "load
> address to register, branch to register" approach like BPF does, then
> ftrace should be safe - and again, raises the issue that maybe it
> should always come from vmalloc() space.

I think the problem with ftrace is patching multiple instructions,
because it sounds like you'd need something to load the absolute address
in a register and then jump to that. And where it's relatively easy to
replace a single instruction, replacing multiple instructions gets real
tricky real quick.

Which then leads to you being stuck with that 26bit displacement, IIRC.
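
For concreteness, a sketch of the limit being referred to: a 32-bit ARM
B/BL carries a signed 24-bit word offset, i.e. a 26-bit byte
displacement, so the target has to land within roughly +/-32M of the
branch (the helper name is made up):

#include <stdbool.h>
#include <stdint.h>

#define BRANCH_RANGE	(32 * 1024 * 1024LL)	/* 24-bit word imm << 2 */

static bool branch_can_reach(uintptr_t from, uintptr_t to)
{
	int64_t delta = (int64_t)to - (int64_t)from;

	return delta >= -BRANCH_RANGE && delta < BRANCH_RANGE;
}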



Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 11:28:27AM +0100, Will Deacon wrote:

> As Ard says, module_alloc() _is_ special, in the sense that the virtual
> memory it allocates wants to be close to the kernel text, whereas the
> concept of allocating executable memory is broader and doesn't have these
> restrictions. So, while I'm in favour of having a text_alloc() interface
> that can be used by callers which only require an executable mapping, I'd
> much prefer for the module_alloc() code to remain for, err, modules.

So on x86 all those things (kprobes, bpf, ftrace) require that same
closeness.

An interface like the late vmalloc_exec() will simply not work for us.

We recently talked about arm64 kprobes and how you're not doing any of
the optimizations there, relying fully on the exception return. And I see
you're one of the few archs that has bpf_jit_alloc_exec() (also,
shouldn't you be using PAGE_KERNEL_EXEC there?). But the BPF core seems
to use module_alloc() as a default means of allocating text.


So what should this look like? Have a text_alloc() with an argument that
indicates where? But then I suppose we also need a means to manage PLT
entries. Otherwise I don't exactly see how you're going to call BPF
code, or how that BPF stuff is going to call back into its helpers.



