Re: [RFC v2 2/2] [MOCKUP] sched/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of December 4, 2020 3:26 pm:
> This is a mockup.  It's designed to illustrate the algorithm and how the
> code might be structured.  There are several things blatantly wrong with
> it:
> 
> The coding style is not up to kernel standards.  I have prototypes in the
> wrong places and other hacks.
> 
> There's a problem with mm_cpumask() not being reliable.

Interesting, this might be a way to reduce those IPIs with fairly 
minimal fast path cost. Would be interesting to see how much performance 
advantage it has over my dumb simple shoot-lazies.

For powerpc I don't think we'd be inclined to go that way, so don't feel 
the need to add this complexity for us alone -- we'd be more inclined to 
move the exit lazy to the final TLB shootdown path, which we're slowly 
getting more infrastructure in place to do.

(The powerpc hash MMU code, which we're slowly moving away from, might
never get that capability because it's complex there and hard to do with
that virtualisation model, so current big systems (and the radix MMU until
we finish the TLB flushing stuff) want something here, but for those the
shoot-lazies approach could quite likely be sufficient.)

But if core code was moved over to something like this for the benefit of
other archs we'd probably just as happily do that.

There are a few nits but I don't think I can see a fundamental problem
yet.

Thanks,
Nick


Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-03 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of December 4, 2020 3:26 pm:
> The core scheduler isn't a great place for
> membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't
> actually know whether we are lazy.  With the old code, if a CPU is
> running a membarrier-registered task, goes idle, gets unlazied via a TLB
> shootdown IPI, and switches back to the membarrier-registered task, it
> will do an unnecessary core sync.
> 
> Conveniently, x86 is the only architecture that does anything in this
> hook, so we can just move the code.

This should go on top of my series that adds the exit_lazy_mm call
and switches x86 over, at least.

> XXX: there are some comments in switch_mm_irqs_off() that seem to be
> trying to document what barriers are expected, and it's not clear to me
> that these barriers are actually present in all paths through the
> code.  So I think this change makes the code more comprehensible and
> has no effect on the code's correctness, but I'm not at all convinced
> that the code is correct.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/mm/tlb.c   | 17 -
>  kernel/sched/core.c | 14 +++---
>  2 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 3338a1feccf9..23df035b80e8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>* from one thread in a process to another thread in the same
>* process. No TLB flush required.
>*/
> +
> + // XXX: why is this okay wrt membarrier?
>   if (!was_lazy)
>   return;
>  
> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>   smp_mb();
>   next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>   if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> - next_tlb_gen)
> + next_tlb_gen) {
> + /*
> +  * We're reactivating an mm, and membarrier might
> +  * need to serialize.  Tell membarrier.
> +  */
> +
> + // XXX: I can't understand the logic in
> + // membarrier_mm_sync_core_before_usermode().  What's
> + // the mm check for?

Writing CR3 is serializing, apparently. Another x86ism that gets 
commented and moved into arch/x86 with my patch.
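
For reference, the helper being questioned looks roughly like this in
include/linux/sched/mm.h of that era (a sketch from memory, so double-check
the exact guards against the tree):

static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
	if (current->mm != mm)
		return;
	if (likely(!(atomic_read(&mm->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
		return;
	sync_core_before_usermode();
}

i.e. the mm check just skips the sync_core unless the CPU is about to return
to userspace of that particular mm.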


> + membarrier_mm_sync_core_before_usermode(next);
>   return;
> + }
>  
>   /*
>* TLB contents went out of date while we were in lazy
>* mode. Fall through to the TLB switching code below.
> +  * No need for an explicit membarrier invocation -- the CR3
> +  * write will serialize.
>*/
>   new_asid = prev_asid;
>   need_flush = true;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d95dc3f4644..6c4b76147166 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3619,22 +3619,22 @@ static struct rq *finish_task_switch(struct 
> task_struct *prev)
>   kcov_finish_switch(current);
>  
>   fire_sched_in_preempt_notifiers(current);
> +
>   /*
>* When switching through a kernel thread, the loop in
>* membarrier_{private,global}_expedited() may have observed that
>* kernel thread and not issued an IPI. It is therefore possible to
>* schedule between user->kernel->user threads without passing though
>* switch_mm(). Membarrier requires a barrier after storing to
> -  * rq->curr, before returning to userspace, so provide them here:
> +  * rq->curr, before returning to userspace, and mmdrop() provides
> +  * this barrier.
>*
> -  * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -  *   provided by mmdrop(),
> -  * - a sync_core for SYNC_CORE.
> +  * XXX: I don't think mmdrop() actually does this.  There's no
> +  * smp_mb__before/after_atomic() in there.

mmdrop definitely does provide a full barrier.
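
(Roughly, from include/linux/sched/mm.h -- atomic_dec_and_test() is a
value-returning atomic RMW op, which the kernel memory model requires to be
fully ordered, see Documentation/atomic_t.txt:)

static inline void mmdrop(struct mm_struct *mm)
{
	/*
	 * Value-returning atomic RMW ops like atomic_dec_and_test() imply
	 * a full memory barrier, so this path is what membarrier relies on.
	 */
	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);
}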

>*/
> - if (mm) {
> - membarrier_mm_sync_core_before_usermode(mm);
> + if (mm)
>   mmdrop(mm);
> - }
> +
>   if (unlikely(prev_state == TASK_DEAD)) {
>   if (prev->sched_class->task_dead)
>   prev->sched_class->task_dead(prev);
> -- 
> 2.28.0
> 
> 


Re: [PATCH] powerpc/mm: Don't see NULL pointer dereference as a KUAP fault

2020-12-03 Thread Christophe Leroy




Le 03/12/2020 à 12:55, Michael Ellerman a écrit :

Christophe Leroy  writes:

Sometimes, NULL pointer dereferences are expected. Even when they
are accidental they are unlikely an exploit attempt because the
first page is never mapped.


The first page can be mapped if mmap_min_addr is 0.

Blocking all faults to the first page would potentially break any
program that does that.

Also if there is something mapped at 0 it's a good chance it is an
exploit attempt :)


Ok, I see.

In fact, we hit this warning because we don't provide 
copy_from_kernel_nofault_allowed()

I'll cook a patch for that.

Christophe


Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Christophe Leroy



Quoting Qian Cai :


On Thu, 2020-12-03 at 12:17 -0500, Qian Cai wrote:

[]
> +static inline bool
> +bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
> +{
> +	return WARN(mmu_has_feature(MMU_FTR_RADIX_KUAP) &&
> +		(regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : AMR_KUAP_BLOCK_READ)),
> +		"Bug: %s fault blocked by AMR!", is_write ? "Write" : "Read");
> +}

A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
endlessly on POWER9 NV.


I have just realized the patch just moved this warning around, so the
issue was pre-existent. Since I have not tested sysrq-t regularly, I am
not sure when it started to break. So far, I have reverted some of those
for testing which did not help, i.e., the sysrq-t issue remains.

16852975f0f  Revert "powerpc/64s: Use early_mmu_has_feature() in set_kuap()"
129e240ead32 Revert "powerpc: Implement user_access_save() and  
user_access_restore()"
edb0046c842c Revert "powerpc/64s/kuap: Add missing isync to KUAP  
restore paths"
2d46ee87ce44 Revert "powerpc/64/kuap: Conditionally restore AMR in  
interrupt exit"
c1e0e805fc57 Revert "powerpc/64s/kuap: Conditionally restore AMR in  
kuap_restore_amr asm"
7f30b7aaf23a Revert "selftests/powerpc: rfi_flush: disable entry  
flush if present"

bc9b9967a100 Revert "powerpc/64s: flush L1D on kernel entry"
b77e7b54f5eb Revert "powerpc/64s: flush L1D after user accesses"
22dddf532c64 Revert "powerpc: Only include kup-radix.h for 64-bit Book3S"
2679d155c46a Revert "selftests/powerpc: entry flush test"
87954b9b4243 Revert "selftests/powerpc: refactor entry and rfi_flush tests"
342d82bd4c5d Revert "powerpc/64s: rename pnv|pseries_setup_rfi_flush  
to _setup_security_mitigations"


I also hit that WARNING in the same way earlier this week.

I think it has been broken by commit c33165253492 ("powerpc: use  
non-set_fs based maccess routines")


IIUC we should provide copy_from_kernel_nofault_allowed() to avoid that.
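
Something along these lines should do it -- a minimal sketch only, assuming
the __weak generic hook in mm/maccess.c and powerpc's is_kernel_addr()
helper; the actual patch may end up looking different:

/* e.g. a new arch/powerpc/mm/maccess.c (placement is illustrative) */
#include <linux/uaccess.h>
#include <linux/mm.h>

#include <asm/page.h>

bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
{
	/*
	 * Refuse user/NULL addresses so copy_from_kernel_nofault() bails
	 * out before probing them with KUAP open, instead of tripping the
	 * "fault blocked by AMR" warning.
	 */
	return is_kernel_addr((unsigned long)unsafe_src);
}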

Christophe


[RFC v2 2/2] [MOCKUP] sched/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Andy Lutomirski
This is a mockup.  It's designed to illustrate the algorithm and how the
code might be structured.  There are several things blatantly wrong with
it:

The coding style is not up to kernel standards.  I have prototypes in the
wrong places and other hacks.

There's a problem with mm_cpumask() not being reliable.

Signed-off-by: Andy Lutomirski 
---
 kernel/fork.c|   4 ++
 kernel/sched/core.c  | 128 +--
 kernel/sched/sched.h |  11 +++-
 3 files changed, 126 insertions(+), 17 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..0887a33cf84f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,6 +1066,8 @@ struct mm_struct *mm_alloc(void)
return mm_init(mm, current, current_user_ns());
 }
 
+extern void mm_fixup_lazy_refs(struct mm_struct *mm);
+
 static inline void __mmput(struct mm_struct *mm)
 {
	VM_BUG_ON(atomic_read(&mm->mm_users));
@@ -1084,6 +1086,8 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+
+   mm_fixup_lazy_refs(mm);
mmdrop(mm);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6c4b76147166..69dfdfe0e5b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3555,6 +3555,75 @@ prepare_task_switch(struct rq *rq, struct task_struct 
*prev,
prepare_arch_switch(next);
 }
 
+static void drop_extra_mm_refs(struct rq *rq)
+{
+   unsigned long old_mm;
+
+   if (likely(!atomic_long_read(&rq->mm_to_mmdrop)))
+   return;
+
+   /*
+* Slow path.  This only happens when we recently stopped using
+* an mm that is exiting.
+*/
+   old_mm = atomic_long_xchg_relaxed(&rq->mm_to_mmdrop, 0);
+   if (old_mm)
+   mmdrop((struct mm_struct *)old_mm);
+}
+
+/*
+ * This ensures that all lazy_mm refs to mm are converted to mm_count
+ * refcounts.  Our caller holds an mm_count reference, so we don't need
+ * to worry about mm being freed out from under us.
+ */
+void mm_fixup_lazy_refs(struct mm_struct *mm)
+{
+   int cpu;
+
+   /*
+* mm_users is zero, so no new lazy refs will be taken.
+*/
+   WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
+
+   /*
+* XXX: this is wrong on arm64 and possibly on other architectures.
+* Maybe we need a config option for this?  Or a
+* for_each_possible_lazy_cpu(cpu, mm) helper?
+*/
+   for_each_cpu(cpu, mm_cpumask(mm)) {
+   struct rq *rq = cpu_rq(cpu);
+   unsigned long old;
+
+   if (READ_ONCE(rq->lazy_mm) != mm)
+   continue;
+
+   // XXX: we could optimize this by doing a big addition to
+   // mm_count up front instead of incrementing it separately
+   // for each CPU.
+   mmgrab(mm);
+
+   // XXX: could this be relaxed instead?
+   old = atomic_long_xchg(&rq->mm_to_mmdrop, (unsigned long)mm);
+
+   // At this point, mm can be mmdrop()ed at any time, probably
+   // by the target cpu.
+
+   if (!old)
+   continue;  // All done!
+
+   WARN_ON_ONCE(old == (unsigned long)mm);
+
+   // Uh oh!  We just stole an mm reference from the target CPU.
+   // Fortunately, we just observed the target's lazy_mm pointing
+   // to something other than old, and we observed this after
+   // bringing mm_users down to 0.  This means that the remote
+   // cpu is definitely done with old.  So we can drop it on the
+   // remote CPU's behalf.
+
+   mmdrop((struct mm_struct *)old);
+   }
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -3578,7 +3647,6 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
long prev_state;
 
/*
@@ -3597,8 +3665,6 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
-
/*
 * A task struct has one reference for the use as "current".
 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
@@ -3629,11 +3695,28 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
 * rq->curr, before returning to userspace, and mmdrop() provides
 * this barrier.
 *
-* XXX: I don't think mmdrop() actually does this.  There's no
-* smp_mb__before/after_atomic() in there.
+* XXX: I don't think mmdrop() actually did this.  There's no
+* smp_mb__before/after_atomic() in there.  But mmdrop()
+* certainly doesn't do 

[RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-03 Thread Andy Lutomirski
The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't
actually know whether we are lazy.  With the old code, if a CPU is
running a membarrier-registered task, goes idle, gets unlazied via a TLB
shootdown IPI, and switches back to the membarrier-registered task, it
will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
hook, so we can just move the code.

XXX: there are some comments in switch_mm_irqs_off() that seem to be
trying to document what barriers are expected, and it's not clear to me
that these barriers are actually present in all paths through the
code.  So I think this change makes the code more comprehensible and
has no effect on the code's correctness, but I'm not at all convinced
that the code is correct.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/tlb.c   | 17 -
 kernel/sched/core.c | 14 +++---
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3338a1feccf9..23df035b80e8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * from one thread in a process to another thread in the same
 * process. No TLB flush required.
 */
+
+   // XXX: why is this okay wrt membarrier?
if (!was_lazy)
return;
 
@@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
smp_mb();
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-   next_tlb_gen)
+   next_tlb_gen) {
+   /*
+* We're reactivating an mm, and membarrier might
+* need to serialize.  Tell membarrier.
+*/
+
+   // XXX: I can't understand the logic in
+   // membarrier_mm_sync_core_before_usermode().  What's
+   // the mm check for?
+   membarrier_mm_sync_core_before_usermode(next);
return;
+   }
 
/*
 * TLB contents went out of date while we were in lazy
 * mode. Fall through to the TLB switching code below.
+* No need for an explicit membarrier invocation -- the CR3
+* write will serialize.
 */
new_asid = prev_asid;
need_flush = true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..6c4b76147166 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3619,22 +3619,22 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
kcov_finish_switch(current);
 
fire_sched_in_preempt_notifiers(current);
+
/*
 * When switching through a kernel thread, the loop in
 * membarrier_{private,global}_expedited() may have observed that
 * kernel thread and not issued an IPI. It is therefore possible to
 * schedule between user->kernel->user threads without passing though
 * switch_mm(). Membarrier requires a barrier after storing to
-* rq->curr, before returning to userspace, so provide them here:
+* rq->curr, before returning to userspace, and mmdrop() provides
+* this barrier.
 *
-* - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-*   provided by mmdrop(),
-* - a sync_core for SYNC_CORE.
+* XXX: I don't think mmdrop() actually does this.  There's no
+* smp_mb__before/after_atomic() in there.
 */
-   if (mm) {
-   membarrier_mm_sync_core_before_usermode(mm);
+   if (mm)
mmdrop(mm);
-   }
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
-- 
2.28.0



[RFC v2 0/2] lazy mm refcounting

2020-12-03 Thread Andy Lutomirski
This is part of a larger series here, but the beginning bit is irrelevant
to the current discussion:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm=203d39d11562575fd8bd6a094d97a3a332d8b265

This is IMO a lot better than v1.  It's now almost entirely in generic
code.  (It looks like it's 100% generic, but that's a lie -- the
generic code currently that all possible lazy mm refs are in
mm_cpumask(), and that's not true on all arches.  So, if we take my
approach, we'll need to have a little arch hook to control this.)
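
(To make that arch hook concrete, one purely hypothetical shape -- the name
is borrowed from the XXX in patch 2/2, and the fallback here is invented for
illustration:

/*
 * Iterate over every CPU that may still hold a lazy reference to @mm.
 * A conservative generic fallback could just scan all online CPUs; an
 * arch whose mm_cpumask() is a superset of lazy users, like x86, could
 * narrow it to that mask.
 */
#ifndef for_each_possible_lazy_cpu
#define for_each_possible_lazy_cpu(cpu, mm) \
	for_each_online_cpu(cpu)
#endif

The mockup in patch 2/2 currently just walks mm_cpumask() directly.)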

Here's how I think it fits with various arches:

x86: On bare metal (i.e. paravirt flush unavailable), the loop won't do
much.  The existing TLB shootdown when user tables are freed will
empty mm_cpumask of everything but the calling CPU.  So x86 ends up
pretty close to as good as we can get short of reworking mm_cpumask() itself.

arm64: It needs the fixup above for correctness, but I think performance
should be pretty good.  Compared to current kernels, we lose an mmgrab()
and mmdrop() on each lazy transition, and we add a reasonably fast loop
over all cpus on process exit.  Someone (probably me) needs to make
sure we don't need some extra barriers.

power: Similar to x86.

s390x: Should be essentially the same as arm64.

Other arches: I don't know.  Further research is required.

What do you all think?

Andy Lutomirski (2):
  [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch
code
  [MOCKUP] sched/mm: Lightweight lazy mm refcounting

 arch/x86/mm/tlb.c|  17 +-
 kernel/fork.c|   4 ++
 kernel/sched/core.c  | 134 +--
 kernel/sched/sched.h |  11 +++-
 4 files changed, 145 insertions(+), 21 deletions(-)

-- 
2.28.0



[PATCH 1/3] powerpc/smp: Parse ibm, thread-groups with multiple properties

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" is indicative of
   the threads in the group sharing L1 cache, translation cache and
   Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.
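
For illustration, the flattened array can be walked property by property
like this (a sketch only; the real parsing lives in parse_thread_groups()
below):

/*
 * "ibm,thread-groups" is a sequence of entries, each laid out as:
 *   [property, nr_groups, threads_per_group,
 *    <nr_groups * threads_per_group interrupt-server numbers>]
 */
static void walk_thread_groups(const unsigned int *tg, unsigned int len)
{
	unsigned int i = 0;

	while (i + 3 <= len) {
		/* tg[i] is the property (1 = L1/translation, 2 = L2) */
		unsigned int nr_ids = tg[i + 1] * tg[i + 2];

		if (i + 3 + nr_ids > len)
			break;	/* truncated or malformed entry */

		/* tg[i + 3] .. tg[i + 2 + nr_ids] list the group members */
		i += 3 + nr_ids;
	}
}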

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 146 +-
 1 file changed, 92 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..6a242a3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share 
*/
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+   unsigned int nr_properties;
+   struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,94 @@ static void or_cpumasks_related(int i, int j, struct 
cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *  property for the CPU device node @dn and stores
- *  the parsed output in the thread_groups
- *  structure @tg if the ibm,thread-groups[0]
- *  matches @property.
+ *  the parsed output in the thread_groups_list
+ *  structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *  output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
  *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
+ * property ibm,thread-groups[i]
  *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
  * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
  *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
+ * ibm,thread-groups[i+3..i+k+2] (is the list of threads identified by
  * "ibm,ppc-interrupt-server#s" arranged as per their membership in
  * the grouping.
  *
- * Example: If ibm,thread-groups = [1,2,4,5,6,7,8,9,10,11,12] it
- * implies that there are 2 groups of 4 threads each, where each group
- * of threads share L1, translation cache.
+ * Example:
+ * If ibm,thread-groups = 

[PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.

However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff

With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/cacheinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..1cc8f37 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -651,15 +651,22 @@ static unsigned int index_dir_to_cpu(struct 
cache_index_dir *index)
return dev->id;
 }
 
+extern bool thread_group_shares_l2;
 /*
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
  */
 static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache 
*cache)
 {
if (cache->level == 1)
return cpu_smallcore_mask(cpu);
+   if (cache->level == 2 && thread_group_shares_l2)
+   return cpu_l2_cache_mask(cpu);
 
	return &cache->shared_cpu_map;
 }
-- 
1.9.4



[PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this and, when present,
populating the cpu_l2_cache_mask of every CPU with the core-siblings
which share the L2-cache with that CPU, as specified by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 66 +--
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a242a3..a116d2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
 struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE   8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
unsigned int property;
unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share 
*/
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct 
cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, translation cache.
+ * that the threads in the same group share L1, translation cache. If
+ * the value is 2, it implies that the threads in the same group share
+ * the same L2 cache.
  *
  * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
  * property ibm,thread-groups[i]
@@ -745,10 +756,10 @@ static void or_cpumasks_related(int i, int j, struct 
cpumask *(*srcmask)(int),
  * 12}.
  *
  * b) there are 2 groups of 4 threads each, where each group of
- *threads share some property indicated by the first value 2. The
- *"ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11}
- *and the "ibm,ppc-interrupt-server#s" of the second group is
- *{6,8,10,12} structure
+ *threads share some property indicated by the first value 2 (L2
+ *cache). The "ibm,ppc-interrupt-server#s" of the first group is
+ *{5,7,9,11} and the "ibm,ppc-interrupt-server#s" of the second
+ *group is {6,8,10,12} structure
  *
  * Returns 0 on 

[PATCH 0/3] Extend Parsing "ibm, thread-groups" for Shared-L2 information

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" is indicative of
   the threads in the group sharing L1 cache, translation cache and
   Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is 
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

then, the sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.


This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular we can now detect if the groups of threads within a core
share the L2-cache. On such platforms, we populate the
cpu_l2_cache_mask of every CPU with the core-siblings which share the
L2-cache with that CPU, as specified by the "ibm,thread-groups"
property array.

With the patchset, the sched-domain hierarchy is correctly
reported. For example, for CPUs 0 and 1, with the patchset:

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Finally, this patchset reports the correct shared_cpu_map/list in the
sysfs for L2 cache on such platforms. With the patchset, for CPUs 0 and 1
we see the correct shared_cpu_map/list for the L2 cache:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa

The patchset has been tested on older platforms which encode only the
L1 sharing information via "ibm,thread-groups", and no regression was
found.


Gautham R. Shenoy (3):
  powerpc/smp: Parse ibm,thread-groups with multiple properties
  powerpc/smp: Add support 

Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Andy Lutomirski


> On Dec 3, 2020, at 2:13 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
>>> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
>>> 
>>> power: same as ARM, except that the loop may be rather larger since
>>> the systems are bigger.  But I imagine it's still faster than Nick's
>>> approach -- a cmpxchg to a remote cacheline should still be faster than
>>> an IPI shootdown. 
>> 
>> While a single atomic might be cheaper than an IPI, the comparison
>> doesn't work out nicely. You do the xchg() on every unlazy, while the
>> IPI would be once per process exit.
>> 
>> So over the life of the process, it might do very many unlazies, adding
>> up to a total cost far in excess of what the single IPI would've been.
> 
> Yeah this is the concern, I looked at things that add cost to the
> idle switch code and it gets hard to justify the scalability improvement
> when you slow these fundamental things down even a bit.

v2 fixes this and is generally much nicer. I’ll send it out in a couple hours.

> 
> I still think working on the assumption that IPIs = scary expensive 
> might not be correct. An IPI itself is, but you only issue them when 
> you've left a lazy mm on another CPU which just isn't that often.
> 
> Thanks,
> Nick


Re: [PATCH kernel v2] vfio/pci/nvlink2: Do not attempt NPU2 setup on POWER8NVL NPU

2020-12-03 Thread Alex Williamson
On Sun, 22 Nov 2020 18:39:50 +1100
Alexey Kardashevskiy  wrote:

> We execute certain NPU2 setup code (such as mapping an LPID to a device
> in NPU2) unconditionally if an Nvlink bridge is detected. However this
> cannot succeed on POWER8NVL machines as the init helpers return an error
> other than ENODEV, which means the device is there and setup failed, so
> vfio_pci_enable() fails and pass through is not possible.
> 
> This changes the two NPU2 related init helpers to return -ENODEV if
> there is no "memory-region" device tree property as this is
> the distinction between NPU and NPU2.
> 
> Tested on
> - POWER9 pvr=004e1201, Ubuntu 19.04 host, Ubuntu 18.04 vm,
>   NVIDIA GV100 10de:1db1 driver 418.39
> - POWER8 pvr=004c0100, RHEL 7.6 host, Ubuntu 16.10 vm,
>   NVIDIA P100 10de:15f9 driver 396.47
> 
> Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] 
> subdriver")
> Cc: sta...@vger.kernel.org # 5.0
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v2:
> * updated commit log with tested configs and replaced P8+ with POWER8NVL for 
> clarity
> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Thanks, applies to vfio next branch for v5.11.

Alex


> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 65c61710c0e9..9adcf6a8f888 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -231,7 +231,7 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct 
> vfio_pci_device *vdev)
>   return -EINVAL;
>  
>   if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
> - return -EINVAL;
> + return -ENODEV;
>  
>   mem_node = of_find_node_by_phandle(mem_phandle);
>   if (!mem_node)
> @@ -393,7 +393,7 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>   int ret;
>   struct vfio_pci_npu2_data *data;
>   struct device_node *nvlink_dn;
> - u32 nvlink_index = 0;
> + u32 nvlink_index = 0, mem_phandle = 0;
>   struct pci_dev *npdev = vdev->pdev;
>   struct device_node *npu_node = pci_device_to_OF_node(npdev);
>   struct pci_controller *hose = pci_bus_to_host(npdev->bus);
> @@ -408,6 +408,9 @@ int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
>   if (!pnv_pci_get_gpu_dev(vdev->pdev))
>   return -ENODEV;
>  
> + if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
> + return -ENODEV;
> +
>   /*
>* NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
>* so we can allocate one register per link, using nvlink index as



Re: [PATCH 0/6] Add documentation for Documentation/features at the built docs

2020-12-03 Thread Jonathan Corbet
On Mon, 30 Nov 2020 16:36:29 +0100
Mauro Carvalho Chehab  wrote:

> This series got already submitted last year:
> 
>
> https://lore.kernel.org/lkml/cover.1561222784.git.mchehab+sams...@kernel.org/
> 
> Yet, on that time, there were too many other patches related to ReST
> conversion floating around. So, at the end, I guess this one got missed.
> 
> So, I did a rebase on the top of upstream, and added a few new changes.

OK, I've gone ahead and applied these; it gains me a new trivial conflict
with x86, but so be it...

That said, I think that the RST table formatting could be *way* improved.
The current tables are all white space and hard to make sense of.  What if
we condensed the information?  Just looking at the first entry in
Documentation/admin-guide/features.html, perhaps it could look like:

  FEATURE       KCONFIG/DESCRIPTION       STATUS

  cBPF-JIT      HAVE_CBPF_JIT             TODO: alpha, arc, arm...
                                          ok: mips, powerpc, ...
                arch supports cBPF JIT
                optimizations

The result would be far more compact and easy to read, IMO.  I may get
around to giving this a try if (hint :) nobody else gets there first.

Thanks,

jon


Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
> 
>> power: same as ARM, except that the loop may be rather larger since
>> the systems are bigger.  But I imagine it's still faster than Nick's
>> approach -- a cmpxchg to a remote cacheline should still be faster than
>> an IPI shootdown. 
> 
> While a single atomic might be cheaper than an IPI, the comparison
> doesn't work out nicely. You do the xchg() on every unlazy, while the
> IPI would be once per process exit.
> 
> So over the life of the process, it might do very many unlazies, adding
> up to a total cost far in excess of what the single IPI would've been.

Yeah this is the concern, I looked at things that add cost to the
idle switch code and it gets hard to justify the scalability improvement
when you slow these fundamental things down even a bit.

I still think working on the assumption that IPIs = scary expensive 
might not be correct. An IPI itself is, but you only issue them when 
you've left a lazy mm on another CPU which just isn't that often.

Thanks,
Nick


Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Thu, 2020-12-03 at 12:17 -0500, Qian Cai wrote:
> []
> > +static inline bool
> > +bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
> > +{
> > +   return WARN(mmu_has_feature(MMU_FTR_RADIX_KUAP) &&
> > +   (regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : 
> > AMR_KUAP_BLOCK_READ)),
> > +   "Bug: %s fault blocked by AMR!", is_write ? "Write" : 
> > "Read");
> > +}
> 
> A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
> endlessly on POWER9 NV.

I have just realized the patch just moved this warning around, so the issue was
pre-existent. Since I have not tested sysrq-t regularly, I am not sure when it
started to break. So far, I have reverted some of those for testing which did
not help, i.e., the sysrq-t issue remains.

16852975f0f  Revert "powerpc/64s: Use early_mmu_has_feature() in set_kuap()"
129e240ead32 Revert "powerpc: Implement user_access_save() and 
user_access_restore()"
edb0046c842c Revert "powerpc/64s/kuap: Add missing isync to KUAP restore paths"
2d46ee87ce44 Revert "powerpc/64/kuap: Conditionally restore AMR in interrupt 
exit"
c1e0e805fc57 Revert "powerpc/64s/kuap: Conditionally restore AMR in 
kuap_restore_amr asm"
7f30b7aaf23a Revert "selftests/powerpc: rfi_flush: disable entry flush if 
present"
bc9b9967a100 Revert "powerpc/64s: flush L1D on kernel entry"
b77e7b54f5eb Revert "powerpc/64s: flush L1D after user accesses"
22dddf532c64 Revert "powerpc: Only include kup-radix.h for 64-bit Book3S"
2679d155c46a Revert "selftests/powerpc: entry flush test"
87954b9b4243 Revert "selftests/powerpc: refactor entry and rfi_flush tests"
342d82bd4c5d Revert "powerpc/64s: rename pnv|pseries_setup_rfi_flush to 
_setup_security_mitigations"



Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Thu, 2020-12-03 at 12:17 -0500, Qian Cai wrote:
> A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
> endlessly on Power8 NV.

Correction -- POWER9 NV.



Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Fri, 2020-11-20 at 10:13 +1100, Daniel Axtens wrote:
> From: Nicholas Piggin 
> 
> IBM Power9 processors can speculatively operate on data in the L1 cache
> before it has been completely validated, via a way-prediction mechanism. It
> is not possible for an attacker to determine the contents of impermissible
> memory using this method, since these systems implement a combination of
> hardware and software security measures to prevent scenarios where
> protected data could be leaked.
> 
> However these measures don't address the scenario where an attacker induces
> the operating system to speculatively execute instructions using data that
> the attacker controls. This can be used for example to speculatively bypass
> "kernel user access prevention" techniques, as discovered by Anthony
> Steinhauser of Google's Safeside Project. This is not an attack by itself,
> but there is a possibility it could be used in conjunction with
> side-channels or other weaknesses in the privileged code to construct an
> attack.
> 
> This issue can be mitigated by flushing the L1 cache between privilege
> boundaries of concern. This patch flushes the L1 cache after user accesses.
> 
> This is part of the fix for CVE-2020-4788.
> 
> Signed-off-by: Nicholas Piggin 
> Signed-off-by: Daniel Axtens 
> ---
>  .../admin-guide/kernel-parameters.txt |  4 +
>  .../powerpc/include/asm/book3s/64/kup-radix.h | 66 --
>  arch/powerpc/include/asm/exception-64s.h  |  3 +
>  arch/powerpc/include/asm/feature-fixups.h |  9 ++
>  arch/powerpc/include/asm/kup.h| 19 +++--
>  arch/powerpc/include/asm/security_features.h  |  3 +
>  arch/powerpc/include/asm/setup.h  |  1 +
>  arch/powerpc/kernel/exceptions-64s.S  | 85 ++-
>  arch/powerpc/kernel/setup_64.c| 62 ++
>  arch/powerpc/kernel/vmlinux.lds.S |  7 ++
>  arch/powerpc/lib/feature-fixups.c | 50 +++
>  arch/powerpc/platforms/powernv/setup.c| 10 ++-
>  arch/powerpc/platforms/pseries/setup.c|  4 +
>  13 files changed, 233 insertions(+), 90 deletions(-)
[]
> diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
> b/arch/powerpc/include/asm/book3s/64/kup-radix.h
> index 3ee1ec60be84..97c2394e7dea 100644
> --- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
[]
> +static inline bool
> +bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
> +{
> + return WARN(mmu_has_feature(MMU_FTR_RADIX_KUAP) &&
> + (regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : 
> AMR_KUAP_BLOCK_READ)),
> + "Bug: %s fault blocked by AMR!", is_write ? "Write" : 
> "Read");
> +}

A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
endlessly on Power8 NV.

.config: 
https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config

[  391.734028][ T1986] Bug: Read fault blocked by AMR!
[  391.734032][ T1986] WARNING: CPU: 80 PID: 1986 at 
arch/powerpc/include/asm/book3s/64/kup-radix.h:145 do_page_fault+0x8fc/0xb70
[  391.734232][ T1986] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod 
ahci libahci tg3 libata firmware_class libphy dm_mirror dm_region_hash dm_log 
dm_mod
[  391.734425][ T1986] CPU: 80 PID: 1986 Comm: bash Tainted: GW 
5.10.0-rc6-next-20201203+ #3
[  391.734535][ T1986] NIP:  c004dd1c LR: c004dd18 CTR: 

[  391.734648][ T1986] REGS: c00020003a0bf3a0 TRAP: 0700   Tainted: GW  
(5.10.0-rc6-next-20201203+)
[  391.734768][ T1986] MSR:  90021033   CR: 
4884  XER: 
[  391.734906][ T1986] CFAR: c00bb05c IRQMASK: 1 
[  391.734906][ T1986] GPR00: c004dd18 c00020003a0bf630 
c7fe0d00 001f 
[  391.734906][ T1986] GPR04: c0f1cc58 0003 
0027 c000201cc6207218 
[  391.734906][ T1986] GPR08: 0023 0002 
c00020004753bd80 c7f1cee8 
[  391.734906][ T1986] GPR12: 2000 c000201fff7f8380 
4000 000110929798 
[  391.734906][ T1986] GPR16: 000110929724 0001108c6988 
00011085f290 00011092d568 
[  391.734906][ T1986] GPR20: 0001229f1f80 0001 
0001 c0aa8dc8 
[  391.734906][ T1986] GPR24: c0ab4a00 c00020001cc8c880 
  
[  391.734906][ T1986] GPR28: c801aa18 0160 
c00020003a0bf760 0300 
[  391.735865][ T1986] NIP [c004dd1c] do_page_fault+0x8fc/0xb70
[  391.735947][ T1986] LR [c004dd18] do_page_fault+0x8f8/0xb70
[  391.736033][ T1986] Call Trace:
[  391.736072][ T1986] [c00020

Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Alexander Gordeev
On Thu, Dec 03, 2020 at 09:14:22AM -0800, Andy Lutomirski wrote:
> 
> 
> > On Dec 3, 2020, at 9:09 AM, Alexander Gordeev  
> > wrote:
> > 
> > On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> >> other arch folk: there's some background here:
> 
> > 
> >> 
> >> power: Ridiculously complicated, seems to vary by system and kernel config.
> >> 
> >> So, Nick, your unconditional IPI scheme is apparently a big
> >> improvement for power, and it should be an improvement and have low
> >> cost for x86.  On arm64 and s390x it will add more IPIs on process
> >> exit but reduce contention on context switching depending on how lazy
> > 
> > s390 does not invalidate TLBs per-CPU explicitly - we have special
> > instructions for that. Those in turn initiate signalling to other
> > CPUs, completely transparent to OS.
> 
> Just to make sure I understand: this means that you broadcast flushes to all 
> CPUs, not just a subset?

Correct.
If mm has one CPU attached we flush TLB only for that CPU.
If mm has more than one CPUs attached we flush all CPUs' TLBs.

In fact, details are bit more complicated, since the hardware
is able to flush subsets of TLB entries depending on provided
parameters (e.g page tables used to create that entries).
But we can not select a CPU subset.

> > Apart from mm_count, I am struggling to realize how the suggested
> > scheme could change the the contention on s390 in connection with
> > TLB. Could you clarify a bit here, please?
> 
> I’m just talking about mm_count. Maintaining mm_count is quite expensive on 
> some workloads.
> 
> > 
> >> TLB works.  I suppose we could try it for all architectures without
> >> any further optimizations.  Or we could try one of the perhaps
> >> excessively clever improvements I linked above.  arm64, s390x people,
> >> what do you think?
> > 
> > I do not immediately see anything in the series that would harm
> > performance on s390.
> > 
> > We however use mm_cpumask to distinguish between local and global TLB
> > flushes. With this series it looks like mm_cpumask is *required* to
> > be consistent with lazy users. And that is something quite difficult
> > for us to adhere to (at least in the foreseeable future).
> 
> You don’t actually need to maintain mm_cpumask — we could scan all CPUs 
> instead.
> 
> > 
> > But actually keeping track of lazy users in a cpumask is something
> > the generic code would rather do AFAICT.
> 
> The problem is that arches don’t agree on what the contents of mm_cpumask 
> should be.  Tracking a mask of exactly what the arch wants in generic code is 
> a nontrivial operation.

It could be yet another cpumask or the CPU scan you mentioned.
Just wanted to make sure there is no new requirement for an arch
to maintain mm_cpumask ;)

Thanks, Andy!


Re: [PATCH] powerpc/hotplug: assign hot added LMB to the right node

2020-12-03 Thread Greg KH
On Thu, Dec 03, 2020 at 11:15:14AM +0100, Laurent Dufour wrote:
> This patch applies to 5.9 and earlier kernels only.
> 
> Since 5.10, this has been fortunately fixed by the commit
> e5e179aa3a39 ("pseries/drmem: don't cache node id in drmem_lmb struct").

Why can't we just backport that patch instead?  It's almost always
better to do that than to have a one-off patch, as almost always those
have bugs in them.

thanks,

greg k-h


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Andy Lutomirski



> On Dec 3, 2020, at 9:09 AM, Alexander Gordeev  wrote:
> 
> On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
>> other arch folk: there's some background here:

> 
>> 
>> power: Ridiculously complicated, seems to vary by system and kernel config.
>> 
>> So, Nick, your unconditional IPI scheme is apparently a big
>> improvement for power, and it should be an improvement and have low
>> cost for x86.  On arm64 and s390x it will add more IPIs on process
>> exit but reduce contention on context switching depending on how lazy
> 
> s390 does not invalidate TLBs per-CPU explicitly - we have special
> instructions for that. Those in turn initiate signalling to other
> CPUs, completely transparent to OS.

Just to make sure I understand: this means that you broadcast flushes to all 
CPUs, not just a subset?

> 
> Apart from mm_count, I am struggling to realize how the suggested
> scheme could change the contention on s390 in connection with
> TLB. Could you clarify a bit here, please?

I’m just talking about mm_count. Maintaining mm_count is quite expensive on 
some workloads.

> 
>> TLB works.  I suppose we could try it for all architectures without
>> any further optimizations.  Or we could try one of the perhaps
>> excessively clever improvements I linked above.  arm64, s390x people,
>> what do you think?
> 
> I do not immediately see anything in the series that would harm
> performance on s390.
> 
> We however use mm_cpumask to distinguish between local and global TLB
> flushes. With this series it looks like mm_cpumask is *required* to
> be consistent with lazy users. And that is something quite difficult
> for us to adhere to (at least in the foreseeable future).

You don’t actually need to maintain mm_cpumask — we could scan all CPUs instead.

> 
> But actually keeping track of lazy users in a cpumask is something
> the generic code would rather do AFAICT.

The problem is that arches don’t agree on what the contents of mm_cpumask 
should be.  Tracking a mask of exactly what the arch wants in generic code is a 
nontrivial operation.


Re: [PATCH AUTOSEL 5.9 27/39] sched/idle: Fix arch_cpu_idle() vs tracing

2020-12-03 Thread Peter Zijlstra
On Thu, Dec 03, 2020 at 03:54:42PM +0100, Heiko Carstens wrote:
> On Thu, Dec 03, 2020 at 08:28:21AM -0500, Sasha Levin wrote:
> > From: Peter Zijlstra 
> > 
> > [ Upstream commit 58c644ba512cfbc2e39b758dd979edd1d6d00e27 ]
> > 
> > We call arch_cpu_idle() with RCU disabled, but then use
> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
> > 
> > Switch all arch_cpu_idle() implementations to use
> > raw_local_irq_{en,dis}able() and carefully manage the
> > lockdep,rcu,tracing state like we do in entry.
> > 
> > (XXX: we really should change arch_cpu_idle() to not return with
> > interrupts enabled)
> > 
> > Reported-by: Sven Schnelle 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Reviewed-by: Mark Rutland 
> > Tested-by: Mark Rutland 
> > Link: https://lkml.kernel.org/r/20201120114925.594122...@infradead.org
> > Signed-off-by: Sasha Levin 
> 
> This patch broke s390 irq state tracing. A patch to fix this is
> scheduled to be merged upstream today (hopefully).
> Therefore I think this patch should not yet go into 5.9 stable.

Agreed.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Alexander Gordeev
On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> other arch folk: there's some background here:
> 
> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> 
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> >
> > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> > > >
> > > > On big systems, the mm refcount can become highly contended when doing
> > > > a lot of context switching with threaded applications (particularly
> > > > switching between the idle thread and an application thread).
> > > >
> > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > user->idle->user cases, so so instead implement a non-refcounted scheme
> > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > > any remaining lazy ones.
> > > >
> > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > a big problem with this scheme (the powerpc implementation generated
> > > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > > There are a number of strategies that could be employed to reduce IPIs
> > > > if they turn out to be a problem for some workload.
> > >
> > > I'm still wondering whether we can do even better.
> > >
> >
> > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > the TLB.  On x86, this will shoot down all lazies as long as even a
> > single pagetable was freed.  (Or at least it will if we don't have a
> > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > architectures like x86, the shootdown approach should be free.  The
> > only way it ought to have any excess IPIs is if we have CPUs in
> > mm_cpumask() that don't need IPI to free pagetables, which could
> > happen on paravirt.
> 
> Indeed, on x86, we do this:
> 
> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> [   11.561068]  exit_mmap+0xc8/0x1a0
> [   11.561932]  mmput+0x29/0xd0
> [   11.562688]  do_exit+0x316/0xa90
> [   11.563588]  do_group_exit+0x34/0xb0
> [   11.564476]  __x64_sys_exit_group+0xf/0x10
> [   11.565512]  do_syscall_64+0x34/0x50
> 
> and we have info->freed_tables set.
> 
> What are the architectures that have large systems like?
> 
> x86: we already zap lazies, so it should cost basically nothing to do
> a little loop at the end of __mmput() to make sure that no lazies are
> left.  If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.
> 
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> the x86 paravirt case.  Are there large enough arm64 systems that any
> of this matters?
> 
> s390x: The code has too many acronyms for me to understand it fully,
> but I think it's more or less the same situation as arm64.  How big do
> s390x systems come?
> 
> power: Ridiculously complicated, seems to vary by system and kernel config.
> 
> So, Nick, your unconditional IPI scheme is apparently a big
> improvement for power, and it should be an improvement and have low
> cost for x86.  On arm64 and s390x it will add more IPIs on process
> exit but reduce contention on context switching depending on how lazy

s390 does not invalidate TLBs per-CPU explicitly - we have special
instructions for that. Those in turn initiate signalling to other
CPUs, completely transparently to the OS.

Apart from mm_count, I am struggling to see how the suggested
scheme could change the contention on s390 in connection with the
TLB. Could you clarify a bit here, please?

> TLB works.  I suppose we could try it for all architectures without
> any further optimizations.  Or we could try one of the perhaps
> excessively clever improvements I linked above.  arm64, s390x people,
> what do you think?

I do not immediately see anything in the series that would harm
performance on s390.

We however use mm_cpumask to distinguish between local and global TLB
flushes. With this series it looks like mm_cpumask is *required* to
be consistent with lazy users. And that is something quite difficult
for us to adhere to (at least in the foreseeable future).

But actually keeping track of lazy users in a cpumask is something
the generic code would rather do AFAICT.

Thanks!
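
For reference, the shoot-lazies scheme quoted above boils down to roughly the following simplified sketch (not the actual patch; the function names are made up for illustration). It also shows why mm_cpumask() would then need to cover lazy users.

#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>

/* IPI handler: if this CPU is still using mm lazily, switch it away. */
static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);	/* only lazy users expected */
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

/* Called once at mm teardown, after the last real user has gone away. */
static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	/*
	 * Any CPU still in mm_cpumask() can only be a lazy user at this
	 * point, so kick it off the mm before the mm_struct is freed.
	 */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, true);
}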


Re: [PATCH AUTOSEL 5.9 27/39] sched/idle: Fix arch_cpu_idle() vs tracing

2020-12-03 Thread Heiko Carstens
On Thu, Dec 03, 2020 at 08:28:21AM -0500, Sasha Levin wrote:
> From: Peter Zijlstra 
> 
> [ Upstream commit 58c644ba512cfbc2e39b758dd979edd1d6d00e27 ]
> 
> We call arch_cpu_idle() with RCU disabled, but then use
> local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.
> 
> Switch all arch_cpu_idle() implementations to use
> raw_local_irq_{en,dis}able() and carefully manage the
> lockdep,rcu,tracing state like we do in entry.
> 
> (XXX: we really should change arch_cpu_idle() to not return with
> interrupts enabled)
> 
> Reported-by: Sven Schnelle 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Mark Rutland 
> Tested-by: Mark Rutland 
> Link: https://lkml.kernel.org/r/20201120114925.594122...@infradead.org
> Signed-off-by: Sasha Levin 

This patch broke s390 irq state tracing. A patch to fix this is
scheduled to be merged upstream today (hopefully).
Therefore I think this patch should not yet go into 5.9 stable.


Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Rik van Riel
On Thu, 2020-12-03 at 12:31 +, Matthew Wilcox wrote:

> And this just makes me think RCU freeing of mm_struct.  I'm sure it's
> more complicated than that (then, or now), but if an anonymous
> process
> is borrowing a freed mm, and the mm is freed by RCU then it will not
> go
> away until the task context switches.  When we context switch back to
> the anon task, it'll borrow some other task's MM and won't even
> notice
> that the MM it was using has gone away.

One major complication here is that most of the
active_mm borrowing is done by the idle task,
but RCU does not wait for idle tasks to context
switch.

That means RCU, as it is today, is not a
mechanism that mm_struct freeing could just
piggyback off.

-- 
All Rights Reversed.




Re: powerpc 5.10-rcN boot failures with RCU_SCALE_TEST=m

2020-12-03 Thread Uladzislau Rezki
On Thu, Dec 03, 2020 at 05:22:20PM +1100, Michael Ellerman wrote:
> Uladzislau Rezki  writes:
> > On Thu, Dec 03, 2020 at 01:03:32AM +1100, Michael Ellerman wrote:
> ...
> >> 
> >> The SMP bringup stalls because _cpu_up() is blocked trying to take
> >> cpu_hotplug_lock for writing:
> >> 
> >> [  401.403132][T0] task:swapper/0   state:D stack:12512 pid:1 
> >> ppid: 0 flags:0x0800
> >> [  401.403502][T0] Call Trace:
> >> [  401.403907][T0] [c62c37d0] [c62c3830] 
> >> 0xc62c3830 (unreliable)
> >> [  401.404068][T0] [c62c39b0] [c0019d70] 
> >> __switch_to+0x2e0/0x4a0
> >> [  401.404189][T0] [c62c3a10] [c0b87228] 
> >> __schedule+0x288/0x9b0
> >> [  401.404257][T0] [c62c3ad0] [c0b879b8] 
> >> schedule+0x68/0x120
> >> [  401.404324][T0] [c62c3b00] [c0184ad4] 
> >> percpu_down_write+0x164/0x170
> >> [  401.404390][T0] [c62c3b50] [c0116b68] 
> >> _cpu_up+0x68/0x280
> >> [  401.404475][T0] [c62c3bb0] [c0116e70] 
> >> cpu_up+0xf0/0x140
> >> [  401.404546][T0] [c62c3c30] [c011776c] 
> >> bringup_nonboot_cpus+0xac/0xf0
> >> [  401.404643][T0] [c62c3c80] [c0eea1b8] 
> >> smp_init+0x40/0xcc
> >> [  401.404727][T0] [c62c3ce0] [c0ec43dc] 
> >> kernel_init_freeable+0x1e0/0x3a0
> >> [  401.404799][T0] [c62c3db0] [c0011ec4] 
> >> kernel_init+0x24/0x150
> >> [  401.404958][T0] [c62c3e20] [c000daf0] 
> >> ret_from_kernel_thread+0x5c/0x6c
> >> 
> >> It can't get it because kprobe_optimizer() has taken it for read and is now
> >> blocked waiting for synchronize_rcu_tasks():
> >> 
> >> [  401.418808][T0] task:kworker/0:1 state:D stack:13392 pid:   12 
> >> ppid: 2 flags:0x0800
> >> [  401.418951][T0] Workqueue: events kprobe_optimizer
> >> [  401.419078][T0] Call Trace:
> >> [  401.419121][T0] [c62ef650] [c62ef710] 
> >> 0xc62ef710 (unreliable)
> >> [  401.419213][T0] [c62ef830] [c0019d70] 
> >> __switch_to+0x2e0/0x4a0
> >> [  401.419281][T0] [c62ef890] [c0b87228] 
> >> __schedule+0x288/0x9b0
> >> [  401.419347][T0] [c62ef950] [c0b879b8] 
> >> schedule+0x68/0x120
> >> [  401.419415][T0] [c62ef980] [c0b8e664] 
> >> schedule_timeout+0x2a4/0x340
> >> [  401.419484][T0] [c62efa80] [c0b894ec] 
> >> wait_for_completion+0x9c/0x170
> >> [  401.419552][T0] [c62efae0] [c01ac85c] 
> >> __wait_rcu_gp+0x19c/0x210
> >> [  401.419619][T0] [c62efb40] [c01ac90c] 
> >> synchronize_rcu_tasks_generic+0x3c/0x70
> >> [  401.419690][T0] [c62efbe0] [c022a3dc] 
> >> kprobe_optimizer+0x1dc/0x470
> >> [  401.419757][T0] [c62efc60] [c0136684] 
> >> process_one_work+0x2f4/0x530
> >> [  401.419823][T0] [c62efd20] [c0138d28] 
> >> worker_thread+0x78/0x570
> >> [  401.419891][T0] [c62efdb0] [c0142424] 
> >> kthread+0x194/0x1a0
> >> [  401.419976][T0] [c62efe20] [c000daf0] 
> >> ret_from_kernel_thread+0x5c/0x6c
> >> 
> >> But why is the synchronize_rcu_tasks() not completing?
> >> 
> > I think that it is because RCU is not fully initialized by that time.
> 
> Yeah that would explain it :)
> 
> > The 36dadef23fcc ("kprobes: Init kprobes in early_initcall") patch
> > switches to early_initcall() that has a higher priority sequence than
> > core_initcall() that is used to complete an RCU setup in the 
> > rcu_set_runtime_mode().
> 
> I was looking at debug_lockdep_rcu_enabled(), which is:
> 
> noinstr int notrace debug_lockdep_rcu_enabled(void)
> {
>   return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
>  current->lockdep_recursion == 0;
> }
> 
> That is not firing any warnings for me because rcu_scheduler_active is:
> 
> (gdb) p/x rcu_scheduler_active
> $1 = 0x1
> 
> Which is:
> 
> #define RCU_SCHEDULER_INIT1
> 

Agree with that.

> But that's different to RCU_SCHEDULER_RUNNING, which is set in
> rcu_set_runtime_mode() as you mentioned:
> 
> static int __init rcu_set_runtime_mode(void)
> {
>   rcu_test_sync_prims();
>   rcu_scheduler_active = RCU_SCHEDULER_RUNNING;
>   kfree_rcu_scheduler_running();
>   rcu_test_sync_prims();
>   return 0;
> }
> 
BTW, since you can reproduce it and have a test setup, could you please
check that:


-core_initcall(rcu_set_runtime_mode);
+early_initcall(rcu_set_runtime_mode);


Just in case. The synchronize_rcu_tasks() gets blocked here:


void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
if (crcu_array[j] == crcu_array[i])
break;
if (j == i) {
wait_for_completion(&rs_array[i].completion); <--- here

[PATCH AUTOSEL 4.19 10/14] soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)

2020-12-03 Thread Sasha Levin
From: Hao Si 

[ Upstream commit 2663b3388551230cbc4606a40fabf3331ceb59e4 ]

The local variable 'cpumask_t mask' is in the stack memory, and its address
is assigned to 'desc->affinity' in 'irq_set_affinity_hint()'.
But the memory area where this variable is located is at risk of being
modified.

During LTP testing, the following error was generated:

Unable to handle kernel paging request at virtual address 12e9b790
Mem abort info:
  ESR = 0x9607
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0007
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp = 75ac5e07
[12e9b790] pgd=0027dbffe003, pud=0027dbffd003,
pmd=0027b6d61003, pte=
Internal error: Oops: 9607 [#1] PREEMPT SMP
Modules linked in: xt_conntrack
Process read_all (pid: 20171, stack limit = 0x44ea4095)
CPU: 14 PID: 20171 Comm: read_all Tainted: GB   W
Hardware name: NXP Layerscape LX2160ARDB (DT)
pstate: 8085 (Nzcv daIf -PAN -UAO)
pc : irq_affinity_hint_proc_show+0x54/0xb0
lr : irq_affinity_hint_proc_show+0x4c/0xb0
sp : 1138bc10
x29: 1138bc10 x28: d131d1e0
x27: 007000c0 x26: 8025b9480dc0
x25: 8025b9480da8 x24: 03ff
x23: 8027334f8300 x22: 80272e97d000
x21: 80272e97d0b0 x20: 8025b9480d80
x19: 09a49000 x18: 
x17:  x16: 
x15:  x14: 
x13:  x12: 0040
x11:  x10: 802735b79b88
x9 :  x8 : 
x7 : 09a49848 x6 : 0003
x5 :  x4 : 08157d6c
x3 : 1138bc10 x2 : 12e9b790
x1 :  x0 : 
Call trace:
 irq_affinity_hint_proc_show+0x54/0xb0
 seq_read+0x1b0/0x440
 proc_reg_read+0x80/0xd8
 __vfs_read+0x60/0x178
 vfs_read+0x94/0x150
 ksys_read+0x74/0xf0
 __arm64_sys_read+0x24/0x30
 el0_svc_common.constprop.0+0xd8/0x1a0
 el0_svc_handler+0x34/0x88
 el0_svc+0x10/0x14
Code: f9001bbf 943e0732 f94066c2 b462 (f9400041)
---[ end trace b495bdcb0b3b732b ]---
Kernel panic - not syncing: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0,2-4,6,8,11,13-15
Kernel Offset: disabled
CPU features: 0x0,21006008
Memory Limit: none
---[ end Kernel panic - not syncing: Fatal exception ]---

Fix it by using 'cpumask_of(cpu)' to get the cpumask.

Signed-off-by: Hao Si 
Signed-off-by: Lin Chen 
Signed-off-by: Yi Wang 
Signed-off-by: Li Yang 
Signed-off-by: Sasha Levin 
---
 drivers/soc/fsl/dpio/dpio-driver.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-driver.c 
b/drivers/soc/fsl/dpio/dpio-driver.c
index b60b77bfaffae..ea6f8904c01b5 100644
--- a/drivers/soc/fsl/dpio/dpio-driver.c
+++ b/drivers/soc/fsl/dpio/dpio-driver.c
@@ -53,7 +53,6 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
struct dpio_priv *priv;
int error;
struct fsl_mc_device_irq *irq;
-   cpumask_t mask;
 
	priv = dev_get_drvdata(&dpio_dev->dev);
 
@@ -72,9 +71,7 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
}
 
/* set the affinity hint */
-	cpumask_clear(&mask);
-	cpumask_set_cpu(cpu, &mask);
-	if (irq_set_affinity_hint(irq->msi_desc->irq, &mask))
+	if (irq_set_affinity_hint(irq->msi_desc->irq, cpumask_of(cpu)))
 		dev_err(&dpio_dev->dev,
"irq_set_affinity failed irq %d cpu %d\n",
irq->msi_desc->irq, cpu);
-- 
2.27.0



[PATCH AUTOSEL 4.19 04/14] powerpc: Drop -me200 addition to build flags

2020-12-03 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e02152ba2810f7c88cb54e71cda096268dfa9241 ]

Currently a build with CONFIG_E200=y will fail with:

  Error: invalid switch -me200
  Error: unrecognized option -me200

Upstream binutils has never supported an -me200 option. Presumably it
was supported at some point by either a fork or Freescale internal
binutils.

We can't support code that we can't even build test, so drop the
addition of -me200 to the build flags, so we can at least build with
CONFIG_E200=y.

Reported-by: Németh Márton 
Reported-by: kernel test robot 
Signed-off-by: Michael Ellerman 
Reviewed-by: Nick Desaulniers 
Acked-by: Scott Wood 
Link: https://lore.kernel.org/r/20201116120913.165317-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 8954108df4570..f51e21ea53492 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -251,7 +251,6 @@ endif
 
 cpu-as-$(CONFIG_4xx)   += -Wa,-m405
 cpu-as-$(CONFIG_ALTIVEC)   += $(call as-option,-Wa$(comma)-maltivec)
-cpu-as-$(CONFIG_E200)  += -Wa,-me200
 cpu-as-$(CONFIG_E500)  += -Wa,-me500
 
 # When using '-many -mpower4' gas will first try and find a matching power4
-- 
2.27.0



[PATCH AUTOSEL 5.4 15/23] soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)

2020-12-03 Thread Sasha Levin
From: Hao Si 

[ Upstream commit 2663b3388551230cbc4606a40fabf3331ceb59e4 ]

The local variable 'cpumask_t mask' is in the stack memory, and its address
is assigned to 'desc->affinity' in 'irq_set_affinity_hint()'.
But the memory area where this variable is located is at risk of being
modified.

During LTP testing, the following error was generated:

Unable to handle kernel paging request at virtual address 12e9b790
Mem abort info:
  ESR = 0x9607
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0007
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp = 75ac5e07
[12e9b790] pgd=0027dbffe003, pud=0027dbffd003,
pmd=0027b6d61003, pte=
Internal error: Oops: 9607 [#1] PREEMPT SMP
Modules linked in: xt_conntrack
Process read_all (pid: 20171, stack limit = 0x44ea4095)
CPU: 14 PID: 20171 Comm: read_all Tainted: GB   W
Hardware name: NXP Layerscape LX2160ARDB (DT)
pstate: 8085 (Nzcv daIf -PAN -UAO)
pc : irq_affinity_hint_proc_show+0x54/0xb0
lr : irq_affinity_hint_proc_show+0x4c/0xb0
sp : 1138bc10
x29: 1138bc10 x28: d131d1e0
x27: 007000c0 x26: 8025b9480dc0
x25: 8025b9480da8 x24: 03ff
x23: 8027334f8300 x22: 80272e97d000
x21: 80272e97d0b0 x20: 8025b9480d80
x19: 09a49000 x18: 
x17:  x16: 
x15:  x14: 
x13:  x12: 0040
x11:  x10: 802735b79b88
x9 :  x8 : 
x7 : 09a49848 x6 : 0003
x5 :  x4 : 08157d6c
x3 : 1138bc10 x2 : 12e9b790
x1 :  x0 : 
Call trace:
 irq_affinity_hint_proc_show+0x54/0xb0
 seq_read+0x1b0/0x440
 proc_reg_read+0x80/0xd8
 __vfs_read+0x60/0x178
 vfs_read+0x94/0x150
 ksys_read+0x74/0xf0
 __arm64_sys_read+0x24/0x30
 el0_svc_common.constprop.0+0xd8/0x1a0
 el0_svc_handler+0x34/0x88
 el0_svc+0x10/0x14
Code: f9001bbf 943e0732 f94066c2 b462 (f9400041)
---[ end trace b495bdcb0b3b732b ]---
Kernel panic - not syncing: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0,2-4,6,8,11,13-15
Kernel Offset: disabled
CPU features: 0x0,21006008
Memory Limit: none
---[ end Kernel panic - not syncing: Fatal exception ]---

Fix it by using 'cpumask_of(cpu)' to get the cpumask.

Signed-off-by: Hao Si 
Signed-off-by: Lin Chen 
Signed-off-by: Yi Wang 
Signed-off-by: Li Yang 
Signed-off-by: Sasha Levin 
---
 drivers/soc/fsl/dpio/dpio-driver.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-driver.c 
b/drivers/soc/fsl/dpio/dpio-driver.c
index 7b642c330977f..7f397b4ad878d 100644
--- a/drivers/soc/fsl/dpio/dpio-driver.c
+++ b/drivers/soc/fsl/dpio/dpio-driver.c
@@ -95,7 +95,6 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
 {
int error;
struct fsl_mc_device_irq *irq;
-   cpumask_t mask;
 
irq = dpio_dev->irqs[0];
	error = devm_request_irq(&dpio_dev->dev,
@@ -112,9 +111,7 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
}
 
/* set the affinity hint */
-	cpumask_clear(&mask);
-	cpumask_set_cpu(cpu, &mask);
-	if (irq_set_affinity_hint(irq->msi_desc->irq, &mask))
+	if (irq_set_affinity_hint(irq->msi_desc->irq, cpumask_of(cpu)))
 		dev_err(&dpio_dev->dev,
"irq_set_affinity failed irq %d cpu %d\n",
irq->msi_desc->irq, cpu);
-- 
2.27.0



[PATCH AUTOSEL 5.4 12/23] ibmvnic: skip tx timeout reset while in resetting

2020-12-03 Thread Sasha Levin
From: Lijun Pan 

[ Upstream commit 855a631a4c11458a9cef1ab79c1530436aa95fae ]

Sometimes it takes longer than 5 seconds (watchdog timeout) to complete
failover, migration, and other resets. Instead of scheduling another
timeout reset, we wait for the current one to complete.

Suggested-by: Brian King 
Signed-off-by: Lijun Pan 
Reviewed-by: Dany Madden 
Signed-off-by: Jakub Kicinski 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index e53994ca3142c..af1d9d2150616 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2266,6 +2266,12 @@ static void ibmvnic_tx_timeout(struct net_device *dev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(dev);
 
+	if (test_bit(0, &adapter->resetting)) {
+   netdev_err(adapter->netdev,
+  "Adapter is resetting, skip timeout reset\n");
+   return;
+   }
+
ibmvnic_reset(adapter, VNIC_RESET_TIMEOUT);
 }
 
-- 
2.27.0



[PATCH AUTOSEL 5.4 05/23] powerpc: Drop -me200 addition to build flags

2020-12-03 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e02152ba2810f7c88cb54e71cda096268dfa9241 ]

Currently a build with CONFIG_E200=y will fail with:

  Error: invalid switch -me200
  Error: unrecognized option -me200

Upstream binutils has never supported an -me200 option. Presumably it
was supported at some point by either a fork or Freescale internal
binutils.

We can't support code that we can't even build test, so drop the
addition of -me200 to the build flags, so we can at least build with
CONFIG_E200=y.

Reported-by: Németh Márton 
Reported-by: kernel test robot 
Signed-off-by: Michael Ellerman 
Reviewed-by: Nick Desaulniers 
Acked-by: Scott Wood 
Link: https://lore.kernel.org/r/20201116120913.165317-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 37ac731a556b8..9f73fb6b1cc91 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -250,7 +250,6 @@ KBUILD_CFLAGS   += $(call cc-option,-mno-string)
 
 cpu-as-$(CONFIG_4xx)   += -Wa,-m405
 cpu-as-$(CONFIG_ALTIVEC)   += $(call as-option,-Wa$(comma)-maltivec)
-cpu-as-$(CONFIG_E200)  += -Wa,-me200
 cpu-as-$(CONFIG_E500)  += -Wa,-me500
 
 # When using '-many -mpower4' gas will first try and find a matching power4
-- 
2.27.0



[PATCH AUTOSEL 5.9 27/39] sched/idle: Fix arch_cpu_idle() vs tracing

2020-12-03 Thread Sasha Levin
From: Peter Zijlstra 

[ Upstream commit 58c644ba512cfbc2e39b758dd979edd1d6d00e27 ]

We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.

(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)

Reported-by: Sven Schnelle 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland 
Link: https://lkml.kernel.org/r/20201120114925.594122...@infradead.org
Signed-off-by: Sasha Levin 
---
 arch/alpha/kernel/process.c  |  2 +-
 arch/arm/kernel/process.c|  2 +-
 arch/arm64/kernel/process.c  |  2 +-
 arch/csky/kernel/process.c   |  2 +-
 arch/h8300/kernel/process.c  |  2 +-
 arch/hexagon/kernel/process.c|  2 +-
 arch/ia64/kernel/process.c   |  2 +-
 arch/microblaze/kernel/process.c |  2 +-
 arch/mips/kernel/idle.c  | 12 ++--
 arch/nios2/kernel/process.c  |  2 +-
 arch/openrisc/kernel/process.c   |  2 +-
 arch/parisc/kernel/process.c |  2 +-
 arch/powerpc/kernel/idle.c   |  4 ++--
 arch/riscv/kernel/process.c  |  2 +-
 arch/s390/kernel/idle.c  |  6 +++---
 arch/sh/kernel/idle.c|  2 +-
 arch/sparc/kernel/leon_pmc.c |  4 ++--
 arch/sparc/kernel/process_32.c   |  2 +-
 arch/sparc/kernel/process_64.c   |  4 ++--
 arch/um/kernel/process.c |  2 +-
 arch/x86/include/asm/mwait.h |  2 --
 arch/x86/kernel/process.c| 12 +++-
 kernel/sched/idle.c  | 28 +++-
 23 files changed, 64 insertions(+), 38 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 7462a79110024..4c7b0414a3ff3 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -57,7 +57,7 @@ EXPORT_SYMBOL(pm_power_off);
 void arch_cpu_idle(void)
 {
wtint(0);
-   local_irq_enable();
+   raw_local_irq_enable();
 }
 
 void arch_cpu_idle_dead(void)
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 8e6ace03e960b..9f199b1e83839 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -71,7 +71,7 @@ void arch_cpu_idle(void)
arm_pm_idle();
else
cpu_do_idle();
-   local_irq_enable();
+   raw_local_irq_enable();
 }
 
 void arch_cpu_idle_prepare(void)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 2da5f3f9d345f..f7c42a7d09b66 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -124,7 +124,7 @@ void arch_cpu_idle(void)
 * tricks
 */
cpu_do_idle();
-   local_irq_enable();
+   raw_local_irq_enable();
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/csky/kernel/process.c b/arch/csky/kernel/process.c
index f730869e21eed..69af6bc87e647 100644
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -102,6 +102,6 @@ void arch_cpu_idle(void)
 #ifdef CONFIG_CPU_PM_STOP
asm volatile("stop\n");
 #endif
-   local_irq_enable();
+   raw_local_irq_enable();
 }
 #endif
diff --git a/arch/h8300/kernel/process.c b/arch/h8300/kernel/process.c
index 83ce3caf73139..a2961c7b2332c 100644
--- a/arch/h8300/kernel/process.c
+++ b/arch/h8300/kernel/process.c
@@ -57,7 +57,7 @@ asmlinkage void ret_from_kernel_thread(void);
  */
 void arch_cpu_idle(void)
 {
-   local_irq_enable();
+   raw_local_irq_enable();
__asm__("sleep");
 }
 
diff --git a/arch/hexagon/kernel/process.c b/arch/hexagon/kernel/process.c
index dfd322c5ce83a..20962601a1b47 100644
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -44,7 +44,7 @@ void arch_cpu_idle(void)
 {
__vmwait();
/*  interrupts wake us up, but irqs are still disabled */
-   local_irq_enable();
+   raw_local_irq_enable();
 }
 
 /*
diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c
index f19cb97c00987..1b2769260688d 100644
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -252,7 +252,7 @@ void arch_cpu_idle(void)
if (mark_idle)
(*mark_idle)(1);
 
-   safe_halt();
+   raw_safe_halt();
 
if (mark_idle)
(*mark_idle)(0);
diff --git a/arch/microblaze/kernel/process.c b/arch/microblaze/kernel/process.c
index a9e46e525cd0a..f99860771ff48 100644
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -149,5 +149,5 @@ int dump_fpu(struct pt_regs *regs, elf_fpregset_t *fpregs)
 
 void arch_cpu_idle(void)
 {
-   local_irq_enable();
+   raw_local_irq_enable();
 }
diff --git a/arch/mips/kernel/idle.c b/arch/mips/kernel/idle.c
index 5bc3b04693c7d..18e69ebf5691d 100644
--- a/arch/mips/kernel/idle.c
+++ b/arch/mips/kernel/idle.c
@@ -33,19 +33,19 @@ static void __cpuidle r3081_wait(void)
 {

[PATCH AUTOSEL 5.9 26/39] soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)

2020-12-03 Thread Sasha Levin
From: Hao Si 

[ Upstream commit 2663b3388551230cbc4606a40fabf3331ceb59e4 ]

The local variable 'cpumask_t mask' is in the stack memory, and its address
is assigned to 'desc->affinity' in 'irq_set_affinity_hint()'.
But the memory area where this variable is located is at risk of being
modified.

During LTP testing, the following error was generated:

Unable to handle kernel paging request at virtual address 12e9b790
Mem abort info:
  ESR = 0x9607
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0007
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp = 75ac5e07
[12e9b790] pgd=0027dbffe003, pud=0027dbffd003,
pmd=0027b6d61003, pte=
Internal error: Oops: 9607 [#1] PREEMPT SMP
Modules linked in: xt_conntrack
Process read_all (pid: 20171, stack limit = 0x44ea4095)
CPU: 14 PID: 20171 Comm: read_all Tainted: GB   W
Hardware name: NXP Layerscape LX2160ARDB (DT)
pstate: 8085 (Nzcv daIf -PAN -UAO)
pc : irq_affinity_hint_proc_show+0x54/0xb0
lr : irq_affinity_hint_proc_show+0x4c/0xb0
sp : 1138bc10
x29: 1138bc10 x28: d131d1e0
x27: 007000c0 x26: 8025b9480dc0
x25: 8025b9480da8 x24: 03ff
x23: 8027334f8300 x22: 80272e97d000
x21: 80272e97d0b0 x20: 8025b9480d80
x19: 09a49000 x18: 
x17:  x16: 
x15:  x14: 
x13:  x12: 0040
x11:  x10: 802735b79b88
x9 :  x8 : 
x7 : 09a49848 x6 : 0003
x5 :  x4 : 08157d6c
x3 : 1138bc10 x2 : 12e9b790
x1 :  x0 : 
Call trace:
 irq_affinity_hint_proc_show+0x54/0xb0
 seq_read+0x1b0/0x440
 proc_reg_read+0x80/0xd8
 __vfs_read+0x60/0x178
 vfs_read+0x94/0x150
 ksys_read+0x74/0xf0
 __arm64_sys_read+0x24/0x30
 el0_svc_common.constprop.0+0xd8/0x1a0
 el0_svc_handler+0x34/0x88
 el0_svc+0x10/0x14
Code: f9001bbf 943e0732 f94066c2 b462 (f9400041)
---[ end trace b495bdcb0b3b732b ]---
Kernel panic - not syncing: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0,2-4,6,8,11,13-15
Kernel Offset: disabled
CPU features: 0x0,21006008
Memory Limit: none
---[ end Kernel panic - not syncing: Fatal exception ]---

Fix it by using 'cpumask_of(cpu)' to get the cpumask.

Signed-off-by: Hao Si 
Signed-off-by: Lin Chen 
Signed-off-by: Yi Wang 
Signed-off-by: Li Yang 
Signed-off-by: Sasha Levin 
---
 drivers/soc/fsl/dpio/dpio-driver.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-driver.c 
b/drivers/soc/fsl/dpio/dpio-driver.c
index 7b642c330977f..7f397b4ad878d 100644
--- a/drivers/soc/fsl/dpio/dpio-driver.c
+++ b/drivers/soc/fsl/dpio/dpio-driver.c
@@ -95,7 +95,6 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
 {
int error;
struct fsl_mc_device_irq *irq;
-   cpumask_t mask;
 
irq = dpio_dev->irqs[0];
	error = devm_request_irq(&dpio_dev->dev,
@@ -112,9 +111,7 @@ static int register_dpio_irq_handlers(struct fsl_mc_device 
*dpio_dev, int cpu)
}
 
/* set the affinity hint */
-	cpumask_clear(&mask);
-	cpumask_set_cpu(cpu, &mask);
-	if (irq_set_affinity_hint(irq->msi_desc->irq, &mask))
+	if (irq_set_affinity_hint(irq->msi_desc->irq, cpumask_of(cpu)))
 		dev_err(&dpio_dev->dev,
"irq_set_affinity failed irq %d cpu %d\n",
irq->msi_desc->irq, cpu);
-- 
2.27.0



[PATCH AUTOSEL 5.9 18/39] ibmvnic: skip tx timeout reset while in resetting

2020-12-03 Thread Sasha Levin
From: Lijun Pan 

[ Upstream commit 855a631a4c11458a9cef1ab79c1530436aa95fae ]

Sometimes it takes longer than 5 seconds (watchdog timeout) to complete
failover, migration, and other resets. Instead of scheduling another
timeout reset, we wait for the current one to complete.

Suggested-by: Brian King 
Signed-off-by: Lijun Pan 
Reviewed-by: Dany Madden 
Signed-off-by: Jakub Kicinski 
Signed-off-by: Sasha Levin 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 81ec233926acb..37012aa594f41 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2368,6 +2368,12 @@ static void ibmvnic_tx_timeout(struct net_device *dev, 
unsigned int txqueue)
 {
struct ibmvnic_adapter *adapter = netdev_priv(dev);
 
+	if (test_bit(0, &adapter->resetting)) {
+   netdev_err(adapter->netdev,
+  "Adapter is resetting, skip timeout reset\n");
+   return;
+   }
+
ibmvnic_reset(adapter, VNIC_RESET_TIMEOUT);
 }
 
-- 
2.27.0



[PATCH AUTOSEL 5.9 09/39] powerpc: Drop -me200 addition to build flags

2020-12-03 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit e02152ba2810f7c88cb54e71cda096268dfa9241 ]

Currently a build with CONFIG_E200=y will fail with:

  Error: invalid switch -me200
  Error: unrecognized option -me200

Upstream binutils has never supported an -me200 option. Presumably it
was supported at some point by either a fork or Freescale internal
binutils.

We can't support code that we can't even build test, so drop the
addition of -me200 to the build flags, so we can at least build with
CONFIG_E200=y.

Reported-by: Németh Márton 
Reported-by: kernel test robot 
Signed-off-by: Michael Ellerman 
Reviewed-by: Nick Desaulniers 
Acked-by: Scott Wood 
Link: https://lore.kernel.org/r/20201116120913.165317-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 3e8da9cf2eb9d..e6643d5699fef 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -249,7 +249,6 @@ KBUILD_CFLAGS   += $(call cc-option,-mno-string)
 cpu-as-$(CONFIG_40x)   += -Wa,-m405
 cpu-as-$(CONFIG_44x)   += -Wa,-m440
 cpu-as-$(CONFIG_ALTIVEC)   += $(call as-option,-Wa$(comma)-maltivec)
-cpu-as-$(CONFIG_E200)  += -Wa,-me200
 cpu-as-$(CONFIG_E500)  += -Wa,-me500
 
 # When using '-many -mpower4' gas will first try and find a matching power4
-- 
2.27.0



Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Matthew Wilcox
On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
> This code compiles, but I haven't even tried to boot it.  The earlier
> part of the series isn't terribly interesting -- it's a handful of
> cleanups that remove all reads of ->active_mm from arch/x86.  I've
> been meaning to do that for a while, and now I did it.  But, with
> that done, I think we can move to a totally different lazy mm refcounting
> model.

I went back and read Documentation/vm/active_mm.rst recently.

I think it's useful to think about how this would have been handled if
we'd had RCU at the time.  Particularly:

Linus wrote:
> To support all that, the "struct mm_struct" now has two counters: a
> "mm_users" counter that is how many "real address space users" there are,
> and a "mm_count" counter that is the number of "lazy" users (ie anonymous
> users) plus one if there are any real users.

And this just makes me think RCU freeing of mm_struct.  I'm sure it's
more complicated than that (then, or now), but if an anonymous process
is borrowing a freed mm, and the mm is freed by RCU then it will not go
away until the task context switches.  When we context switch back to
the anon task, it'll borrow some other task's MM and won't even notice
that the MM it was using has gone away.
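
For concreteness, a minimal sketch of the RCU-freeing direction described above, assuming a struct rcu_head (called mm_rcu here, which does not exist upstream) were added to mm_struct. This is not a proposal from the thread and it deliberately ignores the idle-task problem raised elsewhere (RCU treats idle as quiescent, so a lazy borrow by the idle task is not covered by a read-side section).

#include <linux/mm_types.h>
#include <linux/rcupdate.h>
#include <linux/sched/mm.h>

/* Hypothetical: mm_struct gains a struct rcu_head named mm_rcu. */
static void __mmdrop_rcu_callback(struct rcu_head *rhp)
{
	struct mm_struct *mm = container_of(rhp, struct mm_struct, mm_rcu);

	/*
	 * A grace period has elapsed, so no lazy user that was borrowing
	 * mm inside an RCU read-side section can still dereference it.
	 */
	__mmdrop(mm);
}

static void mmdrop_rcu(struct mm_struct *mm)
{
	if (atomic_dec_and_test(&mm->mm_count))
		call_rcu(&mm->mm_rcu, __mmdrop_rcu_callback);
}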


Re: [PATCH] powerpc/mm: Don't see NULL pointer dereference as a KUAP fault

2020-12-03 Thread Michael Ellerman
Christophe Leroy  writes:
> Sometimes, NULL pointer dereferences are expected. Even when they
> are accidental they are unlikely to be an exploit attempt because the
> first page is never mapped.

The first page can be mapped if mmap_min_addr is 0.

Blocking all faults to the first page would potentially break any
program that does that.

Also, if there is something mapped at 0, there's a good chance it is an
exploit attempt :)

cheers


> The example below shows what we get when invoking the "show task"
> sysrq handler, by writing 't' in /proc/sysrq-trigger
>
> [ 1117.202054] [ cut here ]
> [ 1117.202102] Bug: fault blocked by AP register !
> [ 1117.202261] WARNING: CPU: 0 PID: 377 at 
> arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 do_page_fault+0x4a8/0x5ec
> [ 1117.202310] Modules linked in:
> [ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: GW 
> 5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty #4175
> [ 1117.202499] NIP:  c0012048 LR: c0012048 CTR: 
> [ 1117.202573] REGS: cacdbb88 TRAP: 0700   Tainted: GW  
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.202625] MSR:  00021032   CR: 2408  XER: 2000
> [ 1117.202899]
> [ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 0023 c092e554 0001 
> c09865e8 c092e640
> [ 1117.202899] GPR08: 1032   00014efc 28082224 100d166a 
> 100a0920 
> [ 1117.202899] GPR16: 100cac0c 100b 1080c3fc 1080d685 100d 100d 
>  100a0900
> [ 1117.202899] GPR24: 100d c07892ec  c0921510 c21f4440 005c 
> c000 cacdbc80
> [ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
> [ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
> [ 1117.204509] Call Trace:
> [ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
> [ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
> [ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
> [ 1117.204979] NIP:  c010dbec LR: c010dbac CTR: 0001
> [ 1117.205053] REGS: cacdbc80 TRAP: 0301   Tainted: GW  
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.205104] MSR:  9032   CR: 28082224  XER: 
> [ 1117.205416] DAR: 005c DSISR: c000
> [ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 0001 0017 0017 
> 0027 000f
> [ 1117.205416] GPR08: c09926ec   3000 24082224
> [ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
> [ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
> [ 1117.206258] --- interrupt: 301
> [ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
> [ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
> [ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
> [ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
> [ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
> [ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
> [ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
> [ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
> [ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
> [ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
> [ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
> [ 1117.207938] --- interrupt: c01 at 0xfd4e784
> [ 1117.208008] NIP:  0fd4e784 LR: 0fe0f244 CTR: 10048d38
> [ 1117.208083] REGS: cacdbf48 TRAP: 0c01   Tainted: GW  
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.208134] MSR:  d032   CR: 4400  XER: 
> [ 1117.208470]
> [ 1117.208470] GPR00: 0004 7fc34090 77bfb4e0 0001 1080fa40 0002 
> 740f fefefeff
> [ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 
> [ 1117.209104] NIP [0fd4e784] 0xfd4e784
> [ 1117.209180] LR [0fe0f244] 0xfe0f244
> [ 1117.209236] --- interrupt: c01
> [ 1117.209274] Instruction dump:
> [ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 
> 4082ff60
> [ 1117.209727] 0fe0 3c60c082 386399f4 48013b65 <0fe0> 80010034 
> 386b 7c0803a6
> [ 1117.210102] ---[ end trace 1927c0323393af3e ]---
>
> So, avoid the big KUAP warning by bailing out of bad_kernel_fault()
> before calling bad_kuap_fault() when address references the first
> page.
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 0add963a849b..be2b4318206f 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -198,6 +198,10 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>  {
>   int is_exec = TRAP(regs) == 0x400;
>  
> + // Kernel fault on first page is likely a NULL pointer dereference
> + if (address < PAGE_SIZE)
> + return true;
> +
>   /* NX faults set DSISR_PROTFAULT on the 8xx, 

Re: [PATCH] EDAC, mv64x60: Fix error return code in mv64x60_pci_err_probe()

2020-12-03 Thread Borislav Petkov
On Thu, Dec 03, 2020 at 10:27:25PM +1100, Michael Ellerman wrote:
> It's dead code, so drop it.
> 
> I can send a patch if no one else wants to.

Yes please.

I love patches removing code! :-)

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] EDAC, mv64x60: Fix error return code in mv64x60_pci_err_probe()

2020-12-03 Thread Michael Ellerman
Borislav Petkov  writes:
> On Tue, Nov 24, 2020 at 02:30:09PM +0800, Wang ShaoBo wrote:
>> Fix to return -ENODEV error code when edac_pci_add_device() failed instaed
>> of 0 in mv64x60_pci_err_probe(), as done elsewhere in this function.
>> 
>> Fixes: 4f4aeeabc061 ("drivers-edac: add marvell mv64x60 driver")
>> Signed-off-by: Wang ShaoBo 
>> ---
>>  drivers/edac/mv64x60_edac.c | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/drivers/edac/mv64x60_edac.c b/drivers/edac/mv64x60_edac.c
>> index 3c68bb525d5d..456b9ca1fe8d 100644
>> --- a/drivers/edac/mv64x60_edac.c
>> +++ b/drivers/edac/mv64x60_edac.c
>> @@ -168,6 +168,7 @@ static int mv64x60_pci_err_probe(struct platform_device 
>> *pdev)
>>  
>>  if (edac_pci_add_device(pci, pdata->edac_idx) > 0) {
>>  edac_dbg(3, "failed edac_pci_add_device()\n");
>> +res = -ENODEV;
>>  goto err;
>>  }
>
> That driver depends on MV64X60 and I don't see anything in the tree
> enabling it and I can't select it AFAICT:
>
> config MV64X60
> bool
> select PPC_INDIRECT_PCI
> select CHECK_CACHE_COHERENCY

It was selected by PPC_C2K, but that was dropped in:

  92c8c16f3457 ("powerpc/embedded6xx: Remove C2K board support")

> PPC folks, what do we do here?
>
> If not used anymore, I'd love to have one less EDAC driver.

It's dead code, so drop it.

I can send a patch if no one else wants to.

cheers


[PATCH] powerpc/hotplug: assign hot added LMB to the right node

2020-12-03 Thread Laurent Dufour
This patch applies to 5.9 and earlier kernels only.

Since 5.10, this has been fortunately fixed by the commit
e5e179aa3a39 ("pseries/drmem: don't cache node id in drmem_lmb struct").

When LMBs are added to a running system, the node id assigned to the LMB is
fetched from the temporary DT node provided by the hypervisor.

However, hot added LMBs are always assigned to the first online node. This
happens because hot_add_drconf_scn_to_nid(), called by lmb_set_nid(), checks
for the LMB flag DRCONF_MEM_ASSIGNED, which is only set later in
dlpar_add_lmb().

To fix this issue, simply set that flag earlier in dlpar_add_lmb().

Note, this code has been rewritten in 5.10 and thus this fix is not needed
from that version onwards.

Signed-off-by: Laurent Dufour 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Nathan Lynch 
Cc: Scott Cheloha 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sta...@vger.kernel.org
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e54dcbd04b2f..92d83915c629 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -663,12 +663,14 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
return rc;
}
 
+   lmb->flags |= DRCONF_MEM_ASSIGNED;
lmb_set_nid(lmb);
block_sz = memory_block_size_bytes();
 
/* Add the memory */
rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
if (rc) {
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
invalidate_lmb_associativity_index(lmb);
return rc;
}
@@ -676,10 +678,9 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
rc = dlpar_online_lmb(lmb);
if (rc) {
__remove_memory(lmb->nid, lmb->base_addr, block_sz);
+   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   } else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
}
 
return rc;
-- 
2.29.2



Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Peter Zijlstra
On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:

> power: same as ARM, except that the loop may be rather larger since
> the systems are bigger.  But I imagine it's still faster than Nick's
> approach -- a cmpxchg to a remote cacheline should still be faster than
> an IPI shootdown. 

While a single atomic might be cheaper than an IPI, the comparison
doesn't work out nicely. You do the xchg() on every unlazy, while the
IPI would be once per process exit.

So over the life of the process, it might do very many unlazies, adding
up to a total cost far in excess of what the single IPI would've been.


And while I appreciate all the work to get rid of the active_mm
accounting; the worry I have with pushing this all into arch code is
that it will be so very easy to get this subtly wrong.


Re: [PATCH kernel v2] powerpc/kuap: Restore AMR after replaying soft interrupts

2020-12-03 Thread Alexey Kardashevskiy




On 03/12/2020 19:03, Christophe Leroy wrote:



On 03/12/2020 at 06:47, Alexey Kardashevskiy wrote:

When interrupted in raw_copy_from_user()/... after user memory access
is enabled, a nested handler may also access user memory (perf is
one example) and when it does so, it calls prevent_read_from_user()
which prevents the upper handler from accessing user memory.

This saves/restores AMR when replaying interrupts.

get_kuap/set_kuap have stubs for disabled KUAP on RADIX but there are
none for hash-only configs (BOOK3E) so this adds stubs and moves
AMR_KUAP_BLOCK_xxx.

Found by syzkaller. More likely to break with enabled
CONFIG_DEBUG_ATOMIC_SLEEP, the call chain is
timer_interrupt -> ktime_get -> read_seqcount_begin -> local_irq_restore.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* fixed compile on hash
* moved get/set to arch_local_irq_restore
* block KUAP before replaying


---

This is an example:

[ cut here ]
Bug: Read fault blocked by AMR!
WARNING: CPU: 0 PID: 1603 at 
/home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 
__do_page_fau


Modules linked in:
CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
NIP:  c009ece8 LR: c009ece4 CTR: 
REGS: cdc63560 TRAP: 0700   Not tainted  
(5.10.0-rc6_v5.10-rc6_a+fstn1)

MSR:  80021033   CR: 28002888  XER: 2004
CFAR: c01fa928 IRQMASK: 1
GPR00: c009ece4 cdc637f0 c2397600 
001f
GPR04: c20eb318  cdc63494 
0027
GPR08: c0007fe4de68 cdfe9180  
0001
GPR12: 2000 c30a  

GPR16:    
bfff
GPR20:  c000134a4020 c19c2218 
0fe0
GPR24:   cd106200 
4000
GPR28:  0300 cdc63910 
c1946730

NIP [c009ece8] __do_page_fault+0xb38/0xde0
LR [c009ece4] __do_page_fault+0xb34/0xde0
Call Trace:
[cdc637f0] [c009ece4] __do_page_fault+0xb34/0xde0 
(unreliable)

[cdc638a0] [c000c968] handle_page_fault+0x10/0x2c
--- interrupt: 300 at strncpy_from_user+0x290/0x440
 LR = strncpy_from_user+0x284/0x440
[cdc63ba0] [c0c3dcb0] strncpy_from_user+0x2f0/0x440 
(unreliable)

[cdc63c30] [c068b888] getname_flags+0x88/0x2c0
[cdc63c90] [c0662a44] do_sys_openat2+0x2d4/0x5f0
[cdc63d30] [c066560c] do_sys_open+0xcc/0x140
[cdc63dc0] [c0045e10] system_call_exception+0x160/0x240
[cdc63e20] [c000da60] system_call_common+0xf0/0x27c
Instruction dump:
409c0048 3fe2ff5b 3bfff128 fac10060 fae10068 482f7a85 6000 3c62ff5b
7fe4fb78 3863f250 4815bbd9 6000 <0fe0> 3c62ff5b 3863f2b8 4815c8b5
irq event stamp: 254
hardirqs last  enabled at (253): [] 
arch_local_irq_restore+0xa0/0x150
hardirqs last disabled at (254): [] 
data_access_common_virt+0x1b0/0x1d0
softirqs last  enabled at (0): [] 
copy_process+0x78c/0x2120

softirqs last disabled at (0): [<>] 0x0
---[ end trace ba98aec5151f3aeb ]---
---
  arch/powerpc/include/asm/book3s/64/kup-radix.h |  3 ---
  arch/powerpc/include/asm/kup.h | 10 ++
  arch/powerpc/kernel/irq.c  |  6 ++
  3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
b/arch/powerpc/include/asm/book3s/64/kup-radix.h

index a39e2d193fdc..4ad607461b75 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -5,9 +5,6 @@
  #include 
  #include 
-#define AMR_KUAP_BLOCK_READ    UL(0x4000000000000000)
-#define AMR_KUAP_BLOCK_WRITE    UL(0x8000000000000000)
-#define AMR_KUAP_BLOCKED    (AMR_KUAP_BLOCK_READ | AMR_KUAP_BLOCK_WRITE)
  #define AMR_KUAP_SHIFT    62
  #ifdef __ASSEMBLY__
diff --git a/arch/powerpc/include/asm/kup.h 
b/arch/powerpc/include/asm/kup.h

index 0d93331d0fab..e63930767a96 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -14,6 +14,10 @@
  #define KUAP_CURRENT_WRITE    8
  #define KUAP_CURRENT    (KUAP_CURRENT_READ | KUAP_CURRENT_WRITE)
+#define AMR_KUAP_BLOCK_READ    UL(0x4000000000000000)
+#define AMR_KUAP_BLOCK_WRITE    UL(0x8000000000000000)
+#define AMR_KUAP_BLOCKED    (AMR_KUAP_BLOCK_READ | AMR_KUAP_BLOCK_WRITE)
+


Those macros are specific to BOOK3S_64, they have nothing to do in
asm/kup.h and must stay in the file included just below.



  #ifdef CONFIG_PPC_BOOK3S_64
  #include 
  #endif
@@ -64,6 +68,12 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long 
address, bool is_write)

  }
  static inline void kuap_check_amr(void) { }
+static inline unsigned long get_kuap(void)
+{
+    return AMR_KUAP_BLOCKED;
+}
+


The 

Re: [PATCH kernel v2] powerpc/kuap: Restore AMR after replaying soft interrupts

2020-12-03 Thread Alexey Kardashevskiy




On 03/12/2020 17:38, Aneesh Kumar K.V wrote:

Alexey Kardashevskiy  writes:


When interrupted in raw_copy_from_user()/... after user memory access
is enabled, a nested handler may also access user memory (perf is
one example) and when it does so, it calls prevent_read_from_user()
which prevents the upper handler from accessing user memory.

This saves/restores AMR when replaying interrupts.

get_kuap/set_kuap have stubs for disabled KUAP on RADIX but there are
none for hash-only configs (BOOK3E) so this adds stubs and moves
AMR_KUAP_BLOCK_xxx.

Found by syzkaller. More likely to break with enabled
CONFIG_DEBUG_ATOMIC_SLEEP, the call chain is
timer_interrupt -> ktime_get -> read_seqcount_begin -> local_irq_restore.


Can you test this with 
https://github.com/kvaneesh/linux/commits/hash-kuap-reworked-2


Yup, broken:

[8.813115] [ cut here ] 

[8.813499] Bug: Read fault blocked by AMR! 

[8.813532] WARNING: CPU: 0 PID: 1113 at 
/home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup.h:369 
__do_page_fault+0xd
34/0xdf0 

[8.814248] Modules linked in: 

[8.814459] CPU: 0 PID: 1113 Comm: amr Not tainted 
5.10.0-rc4_aneesh--hash-kuap-reworked-2_a+fstn1 #61
[8.815075] NIP:  c00a4674 LR: c00a4670 CTR: 

[8.815632] REGS: cd44f530 TRAP: 0700   Not tainted 
(5.10.0-rc4_aneesh--hash-kuap-reworked-2_a+fstn1)





We do save restore AMR on interrupt entry and exit.


This is a nested interrupt: we get interrupted while copying to/from user. 
The first handler has AMR off, then it goes to strncpy_from_user and 
enables user access; a page fault happens and another interrupt arrives - 
and if the new interrupt also wants strncpy_from_user/co, it will enable 
user access again (which is ok), do the copy and then disable it again; 
and then we go back and resume the first strncpy_from_user, which thinks 
user access is still enabled while it is not. I can see strncpy_from_user 
being called while another strncpy_from_user is already running. Nick 
understands the mechanics better.
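
To make the intended fix easier to picture, here is a rough sketch of the idea only (the helper name is made up; per the v2 changelog the real change lives in arch_local_irq_restore()): save the interrupted context's AMR, block user access while soft-masked interrupts are replayed, then put the saved value back so the interrupted copy keeps its open user-access window.

#include <asm/kup.h>

/* Sketch, not the actual hunk from this patch. */
static void replay_soft_interrupts_kuap_safe(void)
{
	unsigned long amr = get_kuap();

	/* Replayed handlers start with user access blocked... */
	set_kuap(AMR_KUAP_BLOCKED);
	replay_soft_interrupts();
	/* ...and the outer context gets its own AMR state back. */
	set_kuap(amr);
}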


I can give you an initramdisk which crashes in a millisecond, ping me in 
slack.



--
Alexey


Re: [PATCH kernel v2] powerpc/kuap: Restore AMR after replaying soft interrupts

2020-12-03 Thread Christophe Leroy




On 03/12/2020 at 06:47, Alexey Kardashevskiy wrote:

When interrupted in raw_copy_from_user()/... after user memory access
is enabled, a nested handler may also access user memory (perf is
one example) and when it does so, it calls prevent_read_from_user()
which prevents the upper handler from accessing user memory.

This saves/restores AMR when replaying interrupts.

get_kuap/set_kuap have stubs for disabled KUAP on RADIX but there are
none for hash-only configs (BOOK3E) so this adds stubs and moves
AMR_KUAP_BLOCK_xxx.

Found by syzkaller. More likely to break with enabled
CONFIG_DEBUG_ATOMIC_SLEEP, the call chain is
timer_interrupt -> ktime_get -> read_seqcount_begin -> local_irq_restore.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* fixed compile on hash
* moved get/set to arch_local_irq_restore
* block KUAP before replaying


---

This is an example:

[ cut here ]
Bug: Read fault blocked by AMR!
WARNING: CPU: 0 PID: 1603 at 
/home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 
__do_page_fau

Modules linked in:
CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
NIP:  c009ece8 LR: c009ece4 CTR: 
REGS: cdc63560 TRAP: 0700   Not tainted  (5.10.0-rc6_v5.10-rc6_a+fstn1)
MSR:  80021033   CR: 28002888  XER: 2004
CFAR: c01fa928 IRQMASK: 1
GPR00: c009ece4 cdc637f0 c2397600 001f
GPR04: c20eb318  cdc63494 0027
GPR08: c0007fe4de68 cdfe9180  0001
GPR12: 2000 c30a  
GPR16:    bfff
GPR20:  c000134a4020 c19c2218 0fe0
GPR24:   cd106200 4000
GPR28:  0300 cdc63910 c1946730
NIP [c009ece8] __do_page_fault+0xb38/0xde0
LR [c009ece4] __do_page_fault+0xb34/0xde0
Call Trace:
[cdc637f0] [c009ece4] __do_page_fault+0xb34/0xde0 (unreliable)
[cdc638a0] [c000c968] handle_page_fault+0x10/0x2c
--- interrupt: 300 at strncpy_from_user+0x290/0x440
 LR = strncpy_from_user+0x284/0x440
[cdc63ba0] [c0c3dcb0] strncpy_from_user+0x2f0/0x440 (unreliable)
[cdc63c30] [c068b888] getname_flags+0x88/0x2c0
[cdc63c90] [c0662a44] do_sys_openat2+0x2d4/0x5f0
[cdc63d30] [c066560c] do_sys_open+0xcc/0x140
[cdc63dc0] [c0045e10] system_call_exception+0x160/0x240
[cdc63e20] [c000da60] system_call_common+0xf0/0x27c
Instruction dump:
409c0048 3fe2ff5b 3bfff128 fac10060 fae10068 482f7a85 6000 3c62ff5b
7fe4fb78 3863f250 4815bbd9 6000 <0fe0> 3c62ff5b 3863f2b8 4815c8b5
irq event stamp: 254
hardirqs last  enabled at (253): [] 
arch_local_irq_restore+0xa0/0x150
hardirqs last disabled at (254): [] 
data_access_common_virt+0x1b0/0x1d0
softirqs last  enabled at (0): [] copy_process+0x78c/0x2120
softirqs last disabled at (0): [<>] 0x0
---[ end trace ba98aec5151f3aeb ]---
---
  arch/powerpc/include/asm/book3s/64/kup-radix.h |  3 ---
  arch/powerpc/include/asm/kup.h | 10 ++
  arch/powerpc/kernel/irq.c  |  6 ++
  3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index a39e2d193fdc..4ad607461b75 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -5,9 +5,6 @@
  #include 
  #include 
  
-#define AMR_KUAP_BLOCK_READ	UL(0x4000000000000000)
-#define AMR_KUAP_BLOCK_WRITE	UL(0x8000000000000000)
-#define AMR_KUAP_BLOCKED   (AMR_KUAP_BLOCK_READ | AMR_KUAP_BLOCK_WRITE)
  #define AMR_KUAP_SHIFT62
  
  #ifdef __ASSEMBLY__

diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 0d93331d0fab..e63930767a96 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -14,6 +14,10 @@
  #define KUAP_CURRENT_WRITE8
  #define KUAP_CURRENT  (KUAP_CURRENT_READ | KUAP_CURRENT_WRITE)
  
+#define AMR_KUAP_BLOCK_READ	UL(0x4000000000000000)
+#define AMR_KUAP_BLOCK_WRITE	UL(0x8000000000000000)
+#define AMR_KUAP_BLOCKED   (AMR_KUAP_BLOCK_READ | AMR_KUAP_BLOCK_WRITE)
+


Those macros are specific to BOOK3S_64, they have nothing to do in asm/kup.h and must stay in the file
included just below.



  #ifdef CONFIG_PPC_BOOK3S_64
  #include 
  #endif
@@ -64,6 +68,12 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long address, 
bool is_write)
  }
  
  static inline void kuap_check_amr(void) { }

+static inline unsigned long get_kuap(void)
+{
+   return AMR_KUAP_BLOCKED;
+}
+


The above is not generic, it is specific to