Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-13 Thread David Gibson
On Fri, Sep 11, 2020 at 01:16:07AM -0300, Fabiano Rosas wrote:
> The current nested KVM code does not support HPT guests. This is
> informed/enforced in some ways:
> 
> - Hosts < P9 will not be able to enable the nested HV feature;
> 
> - The nested hypervisor MMU capabilities will not contain
>   KVM_CAP_PPC_MMU_HASH_V3;
> 
> - QEMU reflects the MMU capabilities in the
>   'ibm,arch-vec-5-platform-support' device-tree property;
> 
> - The nested guest, at 'prom_parse_mmu_model' ignores the
>   'disable_radix' kernel command line option if HPT is not supported;
> 
> - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> 
> There is, however, still a way to start a HPT guest by using
> max-compat-cpu=power8 at the QEMU machine options. This leads to the
> guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> ioctl.
> 
> With the guest set to hash, the nested hypervisor goes through the
> entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> crashes when it tries to execute an hypervisor-privileged (mtspr
> HDEC) instruction at __kvmppc_vcore_entry:
> 
> root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> 
> 
> [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> c01e5ec0
> [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> [  538.543470] MSR:  82843033   CR: 
> 22422882  XER: 2004
> [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
>GPR00: c008075397a0 c013e91e3640 c0080755e600 
> 8000
>GPR04:  c013eab19800 c01394de 
> 0043a054db72
>GPR08: 003b1652   
> c008075502e0
>GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> 0008
>GPR16:  c0139676c6c0 c1d23948 
> c013e91e38b8
>GPR20: 0053  0001 
> 
>GPR24: 0001 0001  
> 0001
>GPR28: 0001 0053 c013eab19800 
> 0001
> [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
> [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
> [  538.544173] Call Trace:
> [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> (unreliable)
> [  538.544260] [c013e91e3820] [c008075397a0] 
> kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> [  538.544325] [c013e91e39e0] [c0080753d99c] 
> kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> kvmppc_vcpu_run+0x34/0x48 [kvm]
> [  538.544472] [c013e91e3af0] [c008072d61b8] 
> kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> [  538.544539] [c013e91e3b80] [c008072c7450] 
> kvm_vcpu_ioctl+0x298/0x778 [kvm]
> [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> [  538.544662] [c013e91e3dc0] [c002f9a4] 
> system_call_exception+0xe4/0x1c0
> [  538.544726] [c013e91e3e20] [c000d140] 
> system_call_common+0xf0/0x27c
> [  538.544787] Instruction dump:
> [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 e9264140 
> 75290002
> [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> f90d10a0 480104fd
> [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> 
> This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> the nested hypervisor, causing QEMU to abort.
> 
> Reported-by: Satheesh Rajendran 
> Signed-off-by: Fabiano Rosas 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/kvm/book3s_hv.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 4ba06a2a306c..764b6239ef72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
>   case KVM_PPC_ALLOCATE_HTAB: {
>   u32 htab_order;
>  
> + /* If we're a nested hypervisor, we currently only support 
> radix */
> + if (kvmhv_on_pseries()) {
> + r = -EOPNOTSUPP;
> + break;
> + }
> +
>   r = -EFAULT;
>   if (get_user(htab_order, (u32 __user *)argp))
>   break;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH -next] macintosh: windfarm: use for_each_child_of_node() macro

2020-09-13 Thread Qinglang Miao
Use for_each_child_of_node() macro instead of open coding it.

Signed-off-by: Qinglang Miao 
---
 drivers/macintosh/windfarm_smu_sat.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/macintosh/windfarm_smu_sat.c 
b/drivers/macintosh/windfarm_smu_sat.c
index cb75dc035..e46e1153a 100644
--- a/drivers/macintosh/windfarm_smu_sat.c
+++ b/drivers/macintosh/windfarm_smu_sat.c
@@ -216,8 +216,7 @@ static int wf_sat_probe(struct i2c_client *client,
 
vsens[0] = vsens[1] = -1;
isens[0] = isens[1] = -1;
-   child = NULL;
-   while ((child = of_get_next_child(dev, child)) != NULL) {
+   for_each_child_of_node(dev, child) {
reg = of_get_property(child, "reg", NULL);
loc = of_get_property(child, "location", NULL);
if (reg == NULL || loc == NULL)
-- 
2.23.0



Re: [PATCH 13/15] selftests/seccomp: powerpc: Set syscall return during ptrace syscall exit

2020-09-13 Thread Michael Ellerman
Kees Cook  writes:
> Some archs (like ppc) only support changing the return code during
> syscall exit when ptrace is used. As the syscall number might not
> be available anymore during syscall exit, it needs to be saved
> during syscall enter. Adjust the ptrace tests to do this.

I'm not that across all the fixture stuff, but if I'm reading it right
you're now calling change_syscall() on both entry and exit for all
arches.

That should work, but it no longer tests changing the return code on
entry on the arches that support it, which seems like a backward step?

cheers


> Reported-by: Thadeu Lima de Souza Cascardo 
> Suggested-by: Thadeu Lima de Souza Cascardo 
> Link: 
> https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-casca...@canonical.com/
> Fixes: 58d0a862f573 ("seccomp: add tests for ptrace hole")
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 34 +++
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index bbab2420d708..26c712c6a575 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1949,12 +1949,19 @@ void tracer_seccomp(struct __test_metadata 
> *_metadata, pid_t tracee,
>  
>  }
>  
> +FIXTURE(TRACE_syscall) {
> + struct sock_fprog prog;
> + pid_t tracer, mytid, mypid, parent;
> + long syscall_nr;
> +};
> +
>  void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
>  int status, void *args)
>  {
> - int ret, nr;
> + int ret;
>   unsigned long msg;
>   static bool entry;
> + FIXTURE_DATA(TRACE_syscall) *self = args;
>  
>   /*
>* The traditional way to tell PTRACE_SYSCALL entry/exit
> @@ -1968,24 +1975,23 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> pid_t tracee,
>   EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>   : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>  
> - if (!entry)
> - return;
> -
> - nr = get_syscall(_metadata, tracee);
> + /*
> +  * Some architectures only support setting return values during
> +  * syscall exit under ptrace, and on exit the syscall number may
> +  * no longer be available. Therefore, save it here, and call
> +  * "change syscall and set return values" on both entry and exit.
> +  */
> + if (entry)
> + self->syscall_nr = get_syscall(_metadata, tracee);
>  
> - if (nr == __NR_getpid)
> + if (self->syscall_nr == __NR_getpid)
>   change_syscall(_metadata, tracee, __NR_getppid, 0);
> - if (nr == __NR_gettid)
> + if (self->syscall_nr == __NR_gettid)
>   change_syscall(_metadata, tracee, -1, 45000);
> - if (nr == __NR_openat)
> + if (self->syscall_nr == __NR_openat)
>   change_syscall(_metadata, tracee, -1, -ESRCH);
>  }
>  
> -FIXTURE(TRACE_syscall) {
> - struct sock_fprog prog;
> - pid_t tracer, mytid, mypid, parent;
> -};
> -
>  FIXTURE_VARIANT(TRACE_syscall) {
>   /*
>* All of the SECCOMP_RET_TRACE behaviors can be tested with either
> @@ -2044,7 +2050,7 @@ FIXTURE_SETUP(TRACE_syscall)
>   self->tracer = setup_trace_fixture(_metadata,
>  variant->use_ptrace ? tracer_ptrace
>  : tracer_seccomp,
> -NULL, variant->use_ptrace);
> +self, variant->use_ptrace);
>  
>   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>   ASSERT_EQ(0, ret);
> -- 
> 2.25.1


[PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-09-13 Thread Nicholas Piggin
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.

This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.

Reviewed-by: Michael Ellerman 
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of 
single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/tlb.h   | 13 -
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-   WARN_ON(atomic_read(&mm->context.copros) > 0);
-   /*
-* It's possible for mm_access to take a reference on mm_users to
-* access the remote mm from another thread, but it's not allowed
-* to set mm_cpumask, so mm_users may be > 1 here.
-*/
-   WARN_ON(current->mm != mm);
-   atomic_set(&mm->context.active_cpus, 1);
-   cpumask_clear(mm_cpumask(mm));
-   cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..143b4fd396f0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;
 
+   /*
+* A kthread could have done a mmget_not_zero() after the flushing CPU
+* checked mm_is_singlethreaded, and be in the process of
+* kthread_use_mm when interrupted here. In that case, current->mm will
+* be set to mm, because kthread_use_mm() setting ->mm and switching to
+* the mm is done with interrupts off.
+*/
if (current->mm == mm)
-   return; /* Local CPU */
+   goto out_flush;
 
if (current->active_mm == mm) {
-   /*
-* Must be a kernel thread because sender is single-threaded.
-*/
-   BUG_ON(current->mm);
+   WARN_ON_ONCE(current->mm != NULL);
+   /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
-   switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+   switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+   atomic_dec(&mm->context.active_cpus);
+   cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 */
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
-   mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0



[PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-09-13 Thread Nicholas Piggin
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to aoperate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.

arch/sparc/kernel/smp_64.c:

if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}

vs fs/io_uring.c

if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
 !mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Signed-off-by: Nicholas Piggin 
---
 arch/sparc/kernel/smp_64.c | 65 --
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *space has (potentially) executed on, this is the heuristic
- *we use to avoid doing cross calls.
- *
- *Also, for flushing from kswapd and also for clones, we
- *use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *in the system, this allows us to play several games to avoid
- *cross calls.
- *
- *One invariant is that when a cpu switches to a process, and
- *that processes tsk->active_mm->cpu_vm_mask does not have the
- *current cpu's bit set, that tlb context is flushed locally.
- *
- *If the address space is non-shared (ie. mm->count == 1) we avoid
- *cross calls when we want to flush the currently running process's
- *tlb state.  This is done by clearing all cpu bits except the current
- *processor's in current->mm->cpu_vm_mask and performing the
- *flush locally only.  This will force any subsequent cpus which run
- *this task to flush the context from the local tlb if the process
- *migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *bullet for most cases and perform the cross call (but only to
- *the cpus listed in cpu_vm_mask).
- *
- *The performance gain from "optimizing" away the cross call for threads is
- *questionable (in theory the big win for threads is the massive sharing of
- *address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
u32 ctx = CTX_HWBITS(mm->context);
-   int cpu = get_cpu();
 
-   if (atomic_read(&mm->mm_users) == 1) {
-   cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-   goto local_flush_and_out;
-   }
+   get_cpu();
 
smp_cross_call_masked(&xcall_flush_tlb_mm,
  ctx, 0, 0,
  mm_cpumask(mm));
 
-local_flush_and_out:
__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
 
put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(s

[PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM

2020-09-13 Thread Nicholas Piggin
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad..587ba8352d01 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
+   select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index a3a12a8341b2..b42813359f49 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct 
mm_struct *next,
 #define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-   switch_mm(prev, next, current);
+   switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0



[PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-09-13 Thread Nicholas Piggin
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
kernel_execve()
  old_mm = current->mm;
  active_mm = current->active_mm;
  *** preempt *** >  schedule()
   prev->active_mm = NULL;
   mmdrop(prev active_mm);
 ...
  <  schedule()
  current->mm = mm;
  current->active_mm = mm;
  if (!old_mm)
  mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  7 +++
 fs/exec.c| 17 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+   bool
+   help
+ Temporary select until all architectures can be converted to have
+ irqs disabled over activate_mm. Architectures that do IPI based TLB
+ shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
}
 
task_lock(tsk);
-   active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
-   tsk->mm = mm;
+
+   local_irq_disable();
+   active_mm = tsk->active_mm;
tsk->active_mm = mm;
+   tsk->mm = mm;
+   /*
+* This prevents preemption while active_mm is being loaded and
+* it and mm are being updated, which could cause problems for
+* lazy tlb mm refcounting when these are updated by context
+* switches. Not all architectures can handle irqs off over
+* activate_mm yet.
+*/
+   if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
activate_mm(active_mm, mm);
+   if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
-- 
2.23.0



[PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes

2020-09-13 Thread Nicholas Piggin
This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.

Testing so far indicates this has fixed a mm refcounting bug that
powerpc was running into (via distro report and backport). I haven't
had any real feedback on this series outside powerpc (and it doesn't
really affect other archs), so I propose patches 1,2,4 go via the
powerpc tree.

There is no dependency between them and patch 3, I put it there only
because it follows the history of the code (powerpc code was written
using the sparc64 logic), but I guess they have to go via different arch
trees. Dave, I'll leave patch 3 with you.

Thanks,
Nick

Since v1:
- Updates from Michael Ellerman's review comments.

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig   |  7 +++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h | 13 --
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++---
 arch/sparc/kernel/smp_64.c | 65 ++
 fs/exec.c  | 17 ++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0



[PATCH -next] soc/qman: convert to use be32_add_cpu()

2020-09-13 Thread Liu Shixin
Signed-off-by: Liu Shixin 
drivers/soc/fsl/qbman/qman_test_api.c---
 drivers/soc/fsl/qbman/qman_test_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/soc/fsl/qbman/qman_test_api.c 
b/drivers/soc/fsl/qbman/qman_test_api.c
index 2895d062cf51..7066b2f1467c 100644
--- a/drivers/soc/fsl/qbman/qman_test_api.c
+++ b/drivers/soc/fsl/qbman/qman_test_api.c
@@ -86,7 +86,7 @@ static void fd_inc(struct qm_fd *fd)
len--;
qm_fd_set_param(fd, fmt, off, len);
 
-   fd->cmd = cpu_to_be32(be32_to_cpu(fd->cmd) + 1);
+   be32_add_cpu(&fd->cmd, 1);
 }
 
 /* The only part of the 'fd' we can't memcmp() is the ppid */
-- 
2.25.1



Re: [PATCH 12/15] selftests/seccomp: powerpc: Fix seccomp return value testing

2020-09-13 Thread Michael Ellerman
Kees Cook  writes:
> On powerpc, the errno is not inverted, and depends on ccr.so being
> set. Add this to a powerpc definition of SYSCALL_RET_SET().
>
> Co-developed-by: Thadeu Lima de Souza Cascardo 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Link: 
> https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-casca...@canonical.com/
> Fixes: 5d83c2b37d43 ("selftests/seccomp: Add powerpc support")
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 15 +++
>  1 file changed, 15 insertions(+)

This looks right to me, and matches what strace does AFAICS.

Reviewed-by: Michael Ellerman 

cheers

> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 623953a53032..bbab2420d708 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1750,6 +1750,21 @@ TEST_F(TRACE_poke, getpid_runs_normally)
>  # define ARCH_REGS   struct pt_regs
>  # define SYSCALL_NUM(_regs)  (_regs).gpr[0]
>  # define SYSCALL_RET(_regs)  (_regs).gpr[3]
> +# define SYSCALL_RET_SET(_regs, _val)\
> + do {\
> + typeof(_val) _result = (_val);  \
> + /*  \
> +  * A syscall error is signaled by CR0 SO bit\
> +  * and the code is stored as a positive value.  \
> +  */ \
> + if (_result < 0) {  \
> + SYSCALL_RET(_regs) = -result;   \
> + (_regs).ccr |= 0x1000;  \
> + } else {\
> + SYSCALL_RET(_regs) = result;\
> + (_regs).ccr &= ~0x1000; \
> + }   \
> + } while (0)
>  #elif defined(__s390__)
>  # define ARCH_REGS   s390_regs
>  # define SYSCALL_NUM(_regs)  (_regs).gprs[2]
> -- 
> 2.25.1


Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events

2020-09-13 Thread Nicholas Piggin
Excerpts from Chris Packham's message of September 14, 2020 8:03 am:
> Hi All,
> 
> On 4/09/20 12:28 pm, Chris Packham wrote:
>> The SPIE register contains counts for the TX FIFO so any time the irq
>> handler was invoked we would attempt to process the RX/TX fifos. Use the
>> SPIM value to mask the events so that we only process interrupts that
>> were expected.
>>
>> This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
>> Implement soft interrupt replay in C").
>>
>> Signed-off-by: Chris Packham 
>> Cc: sta...@vger.kernel.org
>> ---
> ping?

I don't know the code/hardware but thanks for tracking this down.

Was there anything more to be done with Jocke's observations, or would 
that be a follow-up patch if anything?

If this patch fixes your problem it should probably go in, unless there 
are any objections.

Thanks,
Nick

>>
>> Notes:
>>  I've tested this on a T2080RDB and a custom board using the T2081 SoC. 
>> With
>>  this change I don't see any spurious instances of the "Transfer done but
>>  SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" 
>> messages
>>  and the updates to spi flash are successful.
>>  
>>  I think this should go into the stable trees that contain 3282a3da25bd 
>> but I
>>  haven't added a Fixes: tag because I think 3282a3da25bd exposed the 
>> issue as
>>  opposed to causing it.
>>
>>   drivers/spi/spi-fsl-espi.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>> index 7e7c92cafdbb..cb120b68c0e2 100644
>> --- a/drivers/spi/spi-fsl-espi.c
>> +++ b/drivers/spi/spi-fsl-espi.c
>> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, 
>> u32 events)
>>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>>   {
>>  struct fsl_espi *espi = context_data;
>> -u32 events;
>> +u32 events, mask;
>>   
>>  spin_lock(&espi->lock);
>>   
>>  /* Get interrupt events(tx/rx) */
>>  events = fsl_espi_read_reg(espi, ESPI_SPIE);
>> -if (!events) {
>> +mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>> +if (!(events & mask)) {
>>  spin_unlock(&espi->lock);
>>  return IRQ_NONE;
>>  }


Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification

2020-09-13 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 9, 2020 4:20 pm:
> 
> 
> Le 09/09/2020 à 08:04, Aneesh Kumar K.V a écrit :
>> Christophe Leroy  writes:
>> 
>>> search_exception_tables() is an heavy operation, we have to avoid it.
>>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>>> Otherwise, it behaves just as if the address was already in the TLBs
>>> and no fault was generated.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   arch/powerpc/mm/fault.c | 20 +---
>>>   1 file changed, 5 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index 525e0c2b5406..edde169ba3a6 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
>>> unsigned long error_code,
>>> if (address >= TASK_SIZE)
>>> return true;
>>>   
>>> -   if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>>> -   !search_exception_tables(regs->nip)) {
>>> +   // Read/write fault blocked by KUAP is bad, it can never succeed.
>>> +   if (bad_kuap_fault(regs, address, is_write)) {
>>> pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
>>> - exploit attempt? (uid: %d)\n",
>>> -   address,
>>> -   from_kuid(&init_user_ns, current_uid()));
>>> -   }
>>> -
>>> -   // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
>>> bad
>>> -   if (!search_exception_tables(regs->nip))
>>> -   return true;
>> 
>> We still need to keep this ? Without that we detect the lack of
>> exception tables pretty late.
> 
> Is that a problem at all to detect the lack of exception tables late ?
> That case is very unlikely and will lead to failure anyway. So, is it 
> worth impacting performance of the likely case which will always have an 
> exception table and where we expect the exception to run as fast as 
> possible ?
> 
> The other architectures I have looked at (arm64 and x86) only have the 
> exception table search together with the down_read_trylock(&mm->mmap_sem).

Yeah I don't see how it'd be a problem. User could arrange for page 
table to already be at this address and avoid the fault so it's not the 
right way to stop an attacker, KUAP is.

Thanks,
Nick


Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events

2020-09-13 Thread Chris Packham
Hi All,

On 4/09/20 12:28 pm, Chris Packham wrote:
> The SPIE register contains counts for the TX FIFO so any time the irq
> handler was invoked we would attempt to process the RX/TX fifos. Use the
> SPIM value to mask the events so that we only process interrupts that
> were expected.
>
> This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
> Implement soft interrupt replay in C").
>
> Signed-off-by: Chris Packham 
> Cc: sta...@vger.kernel.org
> ---
ping?
>
> Notes:
>  I've tested this on a T2080RDB and a custom board using the T2081 SoC. 
> With
>  this change I don't see any spurious instances of the "Transfer done but
>  SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" 
> messages
>  and the updates to spi flash are successful.
>  
>  I think this should go into the stable trees that contain 3282a3da25bd 
> but I
>  haven't added a Fixes: tag because I think 3282a3da25bd exposed the 
> issue as
>  opposed to causing it.
>
>   drivers/spi/spi-fsl-espi.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
> index 7e7c92cafdbb..cb120b68c0e2 100644
> --- a/drivers/spi/spi-fsl-espi.c
> +++ b/drivers/spi/spi-fsl-espi.c
> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, u32 
> events)
>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>   {
>   struct fsl_espi *espi = context_data;
> - u32 events;
> + u32 events, mask;
>   
>   spin_lock(&espi->lock);
>   
>   /* Get interrupt events(tx/rx) */
>   events = fsl_espi_read_reg(espi, ESPI_SPIE);
> - if (!events) {
> + mask = fsl_espi_read_reg(espi, ESPI_SPIM);
> + if (!(events & mask)) {
>   spin_unlock(&espi->lock);
>   return IRQ_NONE;
>   }

[PATCH] powerpc/papr_scm: Support dynamic enable/disable of performance statistics

2020-09-13 Thread Vaibhav Jain
Collection of performance statistics of an NVDIMM can be dynamically
enabled/disabled from the Hypervisor Management Console even when the
guest lpar is running. The current implementation however will check if
the performance statistics collection is supported during NVDIMM probe
and if yes will assume that to be the case afterwards.

Hence we update papr_scm to remove this assumption from the code by
eliminating the 'stat_buffer_len' member from 'struct papr_scm_priv'
that was used to cache the max buffer size needed to fetch NVDIMM
performance stats from PHYP. With that struct member gone, various
functions that depended on it are updated. Specifically
perf_stats_show() is updated to query the PHYP first for the size of
buffer needed to hold all performance statistics instead of relying on
'stat_buffer_len'

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 53 +++
 1 file changed, 25 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 27268370dee00..6697e1c3b9ebe 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -112,9 +112,6 @@ struct papr_scm_priv {
 
/* Health information for the dimm */
u64 health_bitmap;
-
-   /* length of the stat buffer as expected by phyp */
-   size_t stat_buffer_len;
 };
 
 static LIST_HEAD(papr_nd_regions);
@@ -230,14 +227,15 @@ static int drc_pmem_query_n_bind(struct papr_scm_priv *p)
  * - If buff_stats == NULL the return value is the size in byes of the buffer
  * needed to hold all supported performance-statistics.
  * - If buff_stats != NULL and num_stats == 0 then we copy all known
- * performance-statistics to 'buff_stat' and expect to be large enough to
- * hold them.
+ * performance-statistics to 'buff_stat' and expect it to be large enough to
+ * hold them. The 'buff_size' args contains the size of the 'buff_stats'
  * - if buff_stats != NULL and num_stats > 0 then copy the requested
  * performance-statistics to buff_stats.
  */
 static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
struct papr_scm_perf_stats *buff_stats,
-   unsigned int num_stats)
+   unsigned int num_stats,
+   size_t buff_size)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
size_t size;
@@ -261,12 +259,18 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
size = sizeof(struct papr_scm_perf_stats) +
num_stats * sizeof(struct papr_scm_perf_stat);
else
-   size = p->stat_buffer_len;
+   size = buff_size;
} else {
/* In case of no out buffer ignore the size */
size = 0;
}
 
+   /* verify that the buffer size needed is sufficient */
+   if (size > buff_size) {
+   __WARN();
+   return -EINVAL;
+   }
+
/* Do the HCALL asking PHYP for info */
rc = plpar_hcall(H_SCM_PERFORMANCE_STATS, ret, p->drc_index,
 buff_stats ? virt_to_phys(buff_stats) : 0,
@@ -277,6 +281,10 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv 
*p,
dev_err(&p->pdev->dev,
"Unknown performance stats, Err:0x%016lX\n", ret[0]);
return -ENOENT;
+   } else if (rc == H_AUTHORITY) {
+   dev_dbg(&p->pdev->dev,
+   "Performance stats in-accessible\n");
+   return -EPERM;
} else if (rc != H_SUCCESS) {
dev_err(&p->pdev->dev,
"Failed to query performance stats, Err:%lld\n", rc);
@@ -526,10 +534,6 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
struct papr_scm_perf_stat *stat;
struct papr_scm_perf_stats *stats;
 
-   /* Silently fail if fetching performance metrics isn't  supported */
-   if (!p->stat_buffer_len)
-   return 0;
-
/* Allocate request buffer enough to hold single performance stat */
size = sizeof(struct papr_scm_perf_stats) +
sizeof(struct papr_scm_perf_stat);
@@ -543,9 +547,11 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
stat->stat_val = 0;
 
/* Fetch the fuel gauge and populate it in payload */
-   rc = drc_pmem_query_stats(p, stats, 1);
+   rc = drc_pmem_query_stats(p, stats, 1, size);
if (rc < 0) {
dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   /* Silently fail if unable to fetch performance metric */
+   rc = 0;
goto free_stats;
}
 
@@ -786,23 +792,25 @@ static ssize_t perf_stats_show(struct device *dev,
   struct device_attribute *a

[PATCH] powerpc/papr_scm: Add PAPR command family to pass-through command-set

2020-09-13 Thread Vaibhav Jain
Add NVDIMM_FAMILY_PAPR to the list of valid 'dimm_family_mask'
acceptable by papr_scm. This is needed as since commit
92fe2aa859f5 ("libnvdimm: Validate command family indices") libnvdimm
performs a validation of 'nd_cmd_pkg.nd_family' received as part of
ND_CMD_CALL processing to ensure only known command families can use
the general ND_CMD_CALL pass-through functionality.

Without this change the ND_CMD_CALL pass-through targeting
NVDIMM_FAMILY_PAPR error out with -EINVAL.

Fixes: 92fe2aa859f5 ("libnvdimm: Validate command family indices")
Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 5493bc847bd08..27268370dee00 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -898,6 +898,9 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
p->bus_desc.of_node = p->pdev->dev.of_node;
p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
 
+   /* Set the dimm command family mask to accept PDSMs */
+   set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
+
if (!p->bus_desc.provider_name)
return -ENOMEM;
 
-- 
2.26.2



Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-09-13 Thread Satheesh Rajendran
On Fri, Sep 11, 2020 at 09:55:23PM +1000, Michael Ellerman wrote:
> Srikar Dronamraju  writes:
> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> > always be a superset of cpu_sibling_mask.
> >
> > Lets stop that assumption. cpu_l2_cache_mask is a superset of
> > cpu_sibling_mask if and only if shared_caches is set.
> 
> I'm seeing oopses with this:
> 
> [0.117392][T1] smp: Bringing up secondary CPUs ...
> [0.156515][T1] smp: Brought up 2 nodes, 2 CPUs
> [0.158265][T1] numa: Node 0 CPUs: 0
> [0.158520][T1] numa: Node 1 CPUs: 1
> [0.167453][T1] BUG: Unable to handle kernel data access on read at 
> 0x800041228298
> [0.167992][T1] Faulting instruction address: 0xc018c128
> [0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [0.169417][T1] Modules linked in:
> [0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.9.0-rc2-00095-g7430ad5aa700 #209
> [0.170305][T1] NIP:  c018c128 LR: c018c0cc CTR: 
> c004dce0
> [0.170498][T1] REGS: c0007e343880 TRAP: 0380   Not tainted  
> (5.9.0-rc2-00095-g7430ad5aa700)
> [0.170602][T1] MSR:  82009033   CR: 
> 4400  XER: 
> [0.170985][T1] CFAR: c018c288 IRQMASK: 0
> [0.170985][T1] GPR00:  c0007e343b10 
> c173e400 4000
> [0.170985][T1] GPR04:  0800 
> 0800 
> [0.170985][T1] GPR08:  c122c298 
> c0003fffc000 c0007fd05ce8
> [0.170985][T1] GPR12: c0007e0119f8 c193 
> 8ade 
> [0.170985][T1] GPR16: c0007e3c0640 0917 
> c0007e3c0658 0008
> [0.170985][T1] GPR20: c15d0bb8 8ade 
> c0f57400 c1817c28
> [0.170985][T1] GPR24: c176dc80 c0007e3c0890 
> c0007e3cfe00 
> [0.170985][T1] GPR28: c1772310 c0007e011900 
> c0007e3c0800 0001
> [0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0
> [0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0
> [0.173186][T1] Call Trace:
> [0.173484][T1] [c0007e343b10] [c018bfe8] 
> build_sched_domains+0x6c8/0x14b0 (unreliable)
> [0.173821][T1] [c0007e343c50] [c018dcdc] 
> sched_init_domains+0xec/0x130
> [0.174037][T1] [c0007e343ca0] [c10d59d8] 
> sched_init_smp+0x50/0xc4
> [0.174207][T1] [c0007e343cd0] [c10b45c4] 
> kernel_init_freeable+0x1b4/0x378
> [0.174378][T1] [c0007e343db0] [c00129fc] 
> kernel_init+0x24/0x158
> [0.174740][T1] [c0007e343e20] [c000d9d0] 
> ret_from_kernel_thread+0x5c/0x6c
> [0.175050][T1] Instruction dump:
> [0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e 
> 913e002c 41820034
> [0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> 
> f93e0080 7d404828 314a0001
> [0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]---
> [0.178512][T1]
> [1.180458][T1] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x000b
> 
> That's qemu:
> 
> qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
>   -kernel build~/vmlinux \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1 \
>   -smp 2,sockets=2,maxcpus=2  \

PowerKVM guest vCPUs does not yet have L2 and L3 cache elements
I had this bug raised some time ago, probably related?
https://bugs.launchpad.net/qemu/+bug/1774605


Regards,
-Satheesh.
> 
> 
> On mambo I get:
> 
> [0.005069][T1] smp: Bringing up secondary CPUs ...
> [0.011656][T1] smp: Brought up 2 nodes, 8 CPUs
> [0.011682][T1] numa: Node 0 CPUs: 0-3
> [0.011709][T1] numa: Node 1 CPUs: 4-7
> [0.012015][T1] BUG: arch topology borken
> [0.012040][T1]  the SMT domain not a subset of the CACHE domain
> [0.012107][T1] BUG: Unable to handle kernel data access on read at 
> 0x8001012e7398
> [0.012142][T1] Faulting instruction address: 0xc01aa4f0
> [0.012174][T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.012206][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> [0.012236][T1] Modules linked in:
> [0.012264][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.9.0-rc2-00095-g7430ad5aa700 #1
> [0.012304][T1] NIP:  c01aa4f0 LR: c01aa498 CTR: 
> 
> [0.012341][T1] REGS: c000ef583880 TRAP: 0380   Not tainted  
> (5.9.

RE: [PATCHv7 12/12] misc: pci_endpoint_test: Add driver data for Layerscape PCIe controllers

2020-09-13 Thread Z.q. Hou
Hi Rob,

Thanks a lot for your review and ack!

Regards,
Zhiqiang

> -Original Message-
> From: Rob Herring 
> Sent: 2020年9月11日 2:18
> To: Z.q. Hou 
> Cc: bhelg...@google.com; shawn...@kernel.org; M.h. Lian
> ; Leo Li ;
> linuxppc-dev@lists.ozlabs.org; robh...@kernel.org;
> linux-arm-ker...@lists.infradead.org; Roy Zang ;
> andrew.mur...@arm.com; linux-...@vger.kernel.org;
> lorenzo.pieral...@arm.com; gustavo.pimen...@synopsys.com; Mingkai Hu
> ; linux-ker...@vger.kernel.org;
> jingooh...@gmail.com; kis...@ti.com; devicet...@vger.kernel.org
> Subject: Re: [PATCHv7 12/12] misc: pci_endpoint_test: Add driver data for
> Layerscape PCIe controllers
> 
> On Tue, 11 Aug 2020 17:54:41 +0800, Zhiqiang Hou wrote:
> > From: Hou Zhiqiang 
> >
> > The commit 0a121f9bc3f5 ("misc: pci_endpoint_test: Use streaming DMA
> > APIs for buffer allocation") changed to use streaming DMA APIs,
> > however,
> > dma_map_single() might not return a 4KB aligned address, so add the
> > default_data as driver data for Layerscape PCIe controllers to make it
> > 4KB aligned.
> >
> > Signed-off-by: Hou Zhiqiang 
> > ---
> > V7:
> >  - New patch.
> >
> >  drivers/misc/pci_endpoint_test.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> 
> Acked-by: Rob Herring 


RE: [PATCHv7 04/12] PCI: designware-ep: Modify MSI and MSIX CAP way of finding

2020-09-13 Thread Z.q. Hou
Hi Rob,

Thanks a lot for your comments!

> -Original Message-
> From: Rob Herring 
> Sent: 2020年9月11日 2:10
> To: Z.q. Hou 
> Cc: linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> linuxppc-dev@lists.ozlabs.org; bhelg...@google.com;
> lorenzo.pieral...@arm.com; shawn...@kernel.org; Leo Li
> ; kis...@ti.com; gustavo.pimen...@synopsys.com;
> Roy Zang ; jingooh...@gmail.com;
> andrew.mur...@arm.com; Mingkai Hu ; M.h. Lian
> ; Xiaowei Bao 
> Subject: Re: [PATCHv7 04/12] PCI: designware-ep: Modify MSI and MSIX CAP
> way of finding
> 
> On Tue, Aug 11, 2020 at 05:54:33PM +0800, Zhiqiang Hou wrote:
> > From: Xiaowei Bao 
> >
> > Each PF of EP device should have its own MSI or MSIX capabitily
> > struct, so create a dw_pcie_ep_func struct and move the msi_cap and
> > msix_cap to this struct from dw_pcie_ep, and manage the PFs via a
> > list.
> >
> > Signed-off-by: Xiaowei Bao 
> > Signed-off-by: Hou Zhiqiang 
> > ---
> > V7:
> >  - Rebase the patch without functionality change.
> >
> >  .../pci/controller/dwc/pcie-designware-ep.c   | 139
> +++---
> >  drivers/pci/controller/dwc/pcie-designware.h  |  18 ++-
> >  2 files changed, 136 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index 56bd1cd71f16..4680a51c49c0 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -28,6 +28,19 @@ void dw_pcie_ep_init_notify(struct dw_pcie_ep
> *ep)
> > }  EXPORT_SYMBOL_GPL(dw_pcie_ep_init_notify);
> >
> > +struct dw_pcie_ep_func *
> > +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no) {
> > +   struct dw_pcie_ep_func *ep_func;
> > +
> > +   list_for_each_entry(ep_func, &ep->func_list, list) {
> > +   if (ep_func->func_no == func_no)
> > +   return ep_func;
> > +   }
> > +
> > +   return NULL;
> > +}
> > +
> >  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8
> > func_no)  {
> > unsigned int func_offset = 0;
> > @@ -68,6 +81,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci,
> enum pci_barno bar)
> > __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);  }
> >
> > +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8
> func_no,
> > +   u8 cap_ptr, u8 cap)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   unsigned int func_offset = 0;
> > +   u8 cap_id, next_cap_ptr;
> > +   u16 reg;
> > +
> > +   if (!cap_ptr)
> > +   return 0;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +   reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> > +   cap_id = (reg & 0x00ff);
> > +
> > +   if (cap_id > PCI_CAP_ID_MAX)
> > +   return 0;
> > +
> > +   if (cap_id == cap)
> > +   return cap_ptr;
> > +
> > +   next_cap_ptr = (reg & 0xff00) >> 8;
> > +   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> > +
> > +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8
> > +func_no, u8 cap) {
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   unsigned int func_offset = 0;
> > +   u8 next_cap_ptr;
> > +   u16 reg;
> > +
> > +   func_offset = dw_pcie_ep_func_select(ep, func_no);
> > +
> > +   reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> > +   next_cap_ptr = (reg & 0x00ff);
> > +
> > +   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap); }
> 
> These are almost the same as __dw_pcie_find_next_cap and
> dw_pcie_find_capability. Please modify them to take a function offset and
> work for both host and endpoints.

Yes, will rework in next version.

> 
> > +
> >  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
> >struct pci_epf_header *hdr)
> >  {
> > @@ -257,13 +311,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc
> *epc, u8 func_no)
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > u32 val, reg;
> > unsigned int func_offset = 0;
> > +   struct dw_pcie_ep_func *ep_func;
> >
> > -   if (!ep->msi_cap)
> > +   ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> > +   if (!ep_func)
> > +   return -EINVAL;
> > +
> > +   if (!ep_func->msi_cap)
> > return -EINVAL;
> >
> > func_offset = dw_pcie_ep_func_select(ep, func_no);
> >
> > -   reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> > +   reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
> > val = dw_pcie_readw_dbi(pci, reg);
> > if (!(val & PCI_MSI_FLAGS_ENABLE))
> > return -EINVAL;
> > @@ -279,13 +338,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc
> *epc, u8 func_no, u8 interrupts)
> > struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > u32 val, reg;
> > unsigned int func_offset = 0;
> > +   struct dw_pcie_ep_func *ep_func;
> > +
> > +   ep_func = dw_pcie_ep_get_func_from_ep(e

Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-09-13 Thread Srikar Dronamraju
Fix to make it work where CPUs dont have a l2-cache element.

>8-8<-

>From b25d47b01b7195b1df19083a4043fa6a87a901a3 Mon Sep 17 00:00:00 2001
From: Srikar Dronamraju 
Date: Thu, 9 Jul 2020 13:33:38 +0530
Subject: [PATCH v5.2 05/10] powerpc/smp: Dont assume l2-cache to be superset of
 sibling

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Anton Blanchard 
Cc: Oliver O'Halloran 
Cc: Nathan Lynch 
Cc: Michael Neuling 
Cc: Gautham R Shenoy 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Jordan Niethe 
Cc: Vaidyanathan Srinivasan 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 -> v2:
Set cpumask after verifying l2-cache. (Gautham)

Changelog v5 -> v5.2:
If cpu has no l2-cache set cpumask as per its
 sibling mask. (Michael Ellerman)

 arch/powerpc/kernel/smp.c | 43 +--
 1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9f4333d..168532e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1186,9 +1186,23 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
*(*mask_fn)(int))
int i;
 
l2_cache = cpu_to_l2cache(cpu);
-   if (!l2_cache)
+   if (!l2_cache) {
+   struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+
+   /*
+* If no l2cache for this CPU, assume all siblings to share
+* cache with this CPU.
+*/
+   if (has_big_cores)
+   sibling_mask = cpu_smallcore_mask;
+
+   for_each_cpu(i, sibling_mask(cpu))
+   set_cpus_related(cpu, i, cpu_l2_cache_mask);
+
return false;
+   }
 
+   cpumask_set_cpu(cpu, mask_fn(cpu));
for_each_cpu(i, cpu_online_mask) {
/*
 * when updating the marks the current CPU has not been marked
@@ -1271,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
 * add it to it's own thread sibling mask.
 */
cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+   cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++)
if (cpu_online(i))
set_cpus_related(i, cpu, cpu_sibling_mask);
 
add_cpu_to_smallcore_masks(cpu);
-   /*
-* Copy the thread sibling mask into the cache sibling mask
-* and mark any CPUs that share an L2 with this CPU.
-*/
-   for_each_cpu(i, cpu_sibling_mask(cpu))
-   set_cpus_related(cpu, i, cpu_l2_cache_mask);
update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-   /*
-* Copy the cache sibling mask into core sibling mask and mark
-* any CPUs on the same chip as this CPU.
-*/
-   for_each_cpu(i, cpu_l2_cache_mask(cpu))
-   set_cpus_related(cpu, i, cpu_core_mask);
+   if (pkg_id == -1) {
+   struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+   /*
+* Copy the sibling mask into core sibling mask and
+* mark any CPUs on the same chip as this CPU.
+*/
+   if (shared_caches)
+   mask = cpu_l2_cache_mask;
+
+   for_each_cpu(i, mask(cpu))
+   set_cpus_related(cpu, i, cpu_core_mask);
 
-   if (pkg_id == -1)
return;
+   }
 
for_each_cpu(i, cpu_online_mask)
if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1



Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-09-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-09-13 11:46:41]:

> Srikar Dronamraju  writes:
> > * Michael Ellerman  [2020-09-11 21:55:23]:
> >
> >> Srikar Dronamraju  writes:
> >> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> >> > always be a superset of cpu_sibling_mask.
> >> >
> >> > Lets stop that assumption. cpu_l2_cache_mask is a superset of
> >> > cpu_sibling_mask if and only if shared_caches is set.
> >> 
> >> I'm seeing oopses with this:
> >> 
> 
> The patch fixes qemu, and I don't see the crash on mambo, but I still
> see:
>   [0.010536] smp: Bringing up secondary CPUs ...
>   [0.019189] smp: Brought up 2 nodes, 8 CPUs
>   [0.019210] numa: Node 0 CPUs: 0-3
>   [0.019235] numa: Node 1 CPUs: 4-7
>   [0.02]  the CACHE domain not a subset of the MC domain
>   [0.024505] BUG: arch topology borken
>   [0.024527]  the SMT domain not a subset of the CACHE domain
>   [0.024563] BUG: arch topology borken
>   [0.024584]  the CACHE domain not a subset of the MC domain
>   [0.024645] BUG: arch topology borken
>   [0.024666]  the SMT domain not a subset of the CACHE domain
>   [0.024702] BUG: arch topology borken
>   [0.024723]  the CACHE domain not a subset of the MC domain
> 
> That's the p9 mambo model, using skiboot.tcl from skiboot, with CPUS=2,
> THREADS=4 and MAMBO_NUMA=1.
> 

I was able to reproduce with
 qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
   -kernel build~/vmlinux \
   -m 2G,slots=2,maxmem=4G \
   -object memory-backend-ram,size=1G,id=m0 \
   -object memory-backend-ram,size=1G,id=m1 \
   -numa node,nodeid=0,memdev=m0 \
   -numa node,nodeid=1,memdev=m1 \
   -smp 8,threads=4,sockets=2,maxcpus=8  \


If the CPU doesn't have a l2-cache element, then CPU not only has to set
itself in the cpu_l2_cache but also the siblings. Otherwise it will so
happen that the Siblings will have 4 Cpus set, and the Cache domain will
have just one cpu set, leading to this BUG message.

Patch follows this mail.

> Node layout is:
> 
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x-0x]
> [0.00]   node   1: [mem 0x2000-0x2000]
> [0.00] Initmem setup node 0 [mem 
> 0x-0x]
> [0.00] On node 0 totalpages: 65536
> [0.00] Initmem setup node 1 [mem 
> 0x2000-0x2000]
> [0.00] On node 1 totalpages: 65536
> 
> 
> There aren't any l2-cache properties in the device-tree under cpus.
> 
> I'll try and have a closer look tonight.
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju


RE: [PATCHv7 02/12] PCI: designware-ep: Add the doorbell mode of MSI-X in EP mode

2020-09-13 Thread Z.q. Hou
Hi Rob,

Thanks a lot for your comments!

> -Original Message-
> From: Rob Herring 
> Sent: 2020年9月11日 1:58
> To: Z.q. Hou 
> Cc: linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> linuxppc-dev@lists.ozlabs.org; bhelg...@google.com;
> lorenzo.pieral...@arm.com; shawn...@kernel.org; Leo Li
> ; kis...@ti.com; gustavo.pimen...@synopsys.com;
> Roy Zang ; jingooh...@gmail.com;
> andrew.mur...@arm.com; Mingkai Hu ; M.h. Lian
> ; Xiaowei Bao 
> Subject: Re: [PATCHv7 02/12] PCI: designware-ep: Add the doorbell mode of
> MSI-X in EP mode
> 
> On Tue, Aug 11, 2020 at 05:54:31PM +0800, Zhiqiang Hou wrote:
> > From: Xiaowei Bao 
> >
> > Add the doorbell mode of MSI-X in DWC EP driver.
> >
> > Signed-off-by: Xiaowei Bao 
> > Reviewed-by: Andrew Murray 
> > Signed-off-by: Hou Zhiqiang 
> > ---
> > V7:
> >  - Rebase the patch without functionality change.
> >
> >  drivers/pci/controller/dwc/pcie-designware-ep.c | 14 ++
> >  drivers/pci/controller/dwc/pcie-designware.h| 12 
> >  2 files changed, 26 insertions(+)
> >
> > diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > index e5bd3a5ef380..e76b504ed465 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> > +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> > @@ -471,6 +471,20 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep
> *ep, u8 func_no,
> > return 0;
> >  }
> >
> > +int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8
> > +func_no,
> 
> return void. It never has an error.
> 
> It could also just be an inline function.

Yes, make sense and will change in next version.

Thanks,
Zhiqiang

> 
> > +  u16 interrupt_num)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> > +   u32 msg_data;
> > +
> > +   msg_data = (func_no << PCIE_MSIX_DOORBELL_PF_SHIFT) |
> > +  (interrupt_num - 1);
> > +
> > +   dw_pcie_writel_dbi(pci, PCIE_MSIX_DOORBELL, msg_data);
> > +
> > +   return 0;
> > +}
> > +
> >  int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
> >   u16 interrupt_num)
> >  {
> > diff --git a/drivers/pci/controller/dwc/pcie-designware.h
> > b/drivers/pci/controller/dwc/pcie-designware.h
> > index 89f8271ec5ee..745b4938225a 100644
> > --- a/drivers/pci/controller/dwc/pcie-designware.h
> > +++ b/drivers/pci/controller/dwc/pcie-designware.h
> > @@ -97,6 +97,9 @@
> >  #define PCIE_MISC_CONTROL_1_OFF0x8BC
> >  #define PCIE_DBI_RO_WR_EN  BIT(0)
> >
> > +#define PCIE_MSIX_DOORBELL 0x948
> > +#define PCIE_MSIX_DOORBELL_PF_SHIFT24
> > +
> >  #define PCIE_PL_CHK_REG_CONTROL_STATUS 0xB20
> >  #define PCIE_PL_CHK_REG_CHK_REG_START  BIT(0)
> >  #define PCIE_PL_CHK_REG_CHK_REG_CONTINUOUS BIT(1)
> > @@ -434,6 +437,8 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep
> *ep, u8 func_no,
> >  u8 interrupt_num);
> >  int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
> >  u16 interrupt_num);
> > +int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8
> func_no,
> > +  u16 interrupt_num);
> >  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
> > #else  static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep) @@
> > -475,6 +480,13 @@ static inline int dw_pcie_ep_raise_msix_irq(struct
> dw_pcie_ep *ep, u8 func_no,
> > return 0;
> >  }
> >
> > +static inline int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep
> *ep,
> > +u8 func_no,
> > +u16 interrupt_num)
> > +{
> > +   return 0;
> > +}
> > +
> >  static inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum
> > pci_barno bar)  {  }
> > --
> > 2.17.1
> >


RE: [PATCHv7 10/12] arm64: dts: layerscape: Add PCIe EP node for ls1088a

2020-09-13 Thread Z.q. Hou
Hi Rob,

Thanks a lot for your comments!

> -Original Message-
> From: Rob Herring 
> Sent: 2020年9月11日 0:48
> To: Z.q. Hou 
> Cc: linux-...@vger.kernel.org; devicet...@vger.kernel.org;
> linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> linuxppc-dev@lists.ozlabs.org; bhelg...@google.com;
> lorenzo.pieral...@arm.com; shawn...@kernel.org; Leo Li
> ; kis...@ti.com; gustavo.pimen...@synopsys.com;
> Roy Zang ; jingooh...@gmail.com;
> andrew.mur...@arm.com; Mingkai Hu ; M.h. Lian
> ; Xiaowei Bao 
> Subject: Re: [PATCHv7 10/12] arm64: dts: layerscape: Add PCIe EP node for
> ls1088a
> 
> On Tue, Aug 11, 2020 at 05:54:39PM +0800, Zhiqiang Hou wrote:
> > From: Xiaowei Bao 
> >
> > Add PCIe EP node for ls1088a to support EP mode.
> >
> > Signed-off-by: Xiaowei Bao 
> > Reviewed-by: Andrew Murray 
> > Signed-off-by: Hou Zhiqiang 
> > ---
> > V7:
> >  - Rebase the patch without functionality change.
> >
> >  .../arm64/boot/dts/freescale/fsl-ls1088a.dtsi | 31
> +++
> >  1 file changed, 31 insertions(+)
> >
> > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > index 169f4742ae3b..915592141f1b 100644
> > --- a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
> > @@ -499,6 +499,17 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@340 {
> 
> pci-ep@...

The DT node name must be "xxx-xx" style? If yes, the LS1046A EP node also has 
the name "pcie_ep", it also need to be fixed.

Thanks,
Zhiqiang

> 
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> > +   reg = <0x00 0x0340 0x0 0x0010
> > +  0x20 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <24>;
> > +   num-ob-windows = <128>;
> > +   max-functions = /bits/ 8 <2>;
> > +   status = "disabled";
> > +   };
> > +
> > pcie@350 {
> > compatible = "fsl,ls1088a-pcie";
> > reg = <0x00 0x0350 0x0 0x0010   /* controller
> registers */
> > @@ -525,6 +536,16 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@350 {
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> > +   reg = <0x00 0x0350 0x0 0x0010
> > +  0x28 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <6>;
> > +   num-ob-windows = <8>;
> > +   status = "disabled";
> > +   };
> > +
> > pcie@360 {
> > compatible = "fsl,ls1088a-pcie";
> > reg = <0x00 0x0360 0x0 0x0010   /* controller
> registers */
> > @@ -551,6 +572,16 @@
> > status = "disabled";
> > };
> >
> > +   pcie_ep@360 {
> > +   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
> > +   reg = <0x00 0x0360 0x0 0x0010
> > +  0x30 0x 0x8 0x>;
> > +   reg-names = "regs", "addr_space";
> > +   num-ib-windows = <6>;
> > +   num-ob-windows = <8>;
> > +   status = "disabled";
> > +   };
> > +
> > smmu: iommu@500 {
> > compatible = "arm,mmu-500";
> > reg = <0 0x500 0 0x80>;
> > --
> > 2.17.1
> >


Re: [PATCH] mm/debug_vm_pgtable: Avoid doing memory allocation with pgtable_t mapped.

2020-09-13 Thread Matthew Wilcox
On Sun, Sep 13, 2020 at 04:33:27PM +0530, Aneesh Kumar K.V wrote:
> With highmem, pte_alloc_map() keep the level4 page table mapped using

.noitcerid etisoppo eht ni selbat egap eht srebmun ygolonimret xuniL



Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-09-13 Thread Michael Ellerman
Thadeu Lima de Souza Cascardo  writes:
> On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote:
>> On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo 
>> wrote:
...
>> > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata 
>> > *_metadata, pid_t tracee,
>> >EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>> >: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>> >  
>> > -  if (!entry)
>> > +  if (!entry && !syscall_nr)
>> >return;
>> >  
>> > -  nr = get_syscall(_metadata, tracee);
>> > +  if (entry)
>> > +  nr = get_syscall(_metadata, tracee);
>> > +  else
>> > +  nr = *syscall_nr;
>> 
>> This is weird? Shouldn't get_syscall() be modified to do the right thing
>> here instead of depending on the extra arg?
>> 
>
> R0 might be clobered. Same documentation mentions it as volatile. So, during
> syscall exit, we can't tell for sure that R0 will have the original syscall
> number. So, we need to grab it during syscall enter, save it somewhere and
> reuse it. I used the test context/args for that.

The user r0 (in regs->gpr[0]) shouldn't be clobbered.

But it is modified if the tracer skips the syscall, by setting the
syscall number to -1. Or if the tracer changes the syscall number.

So if you need the original syscall number in the exit path then I think
you do need to save it at entry.

cheers


[PATCH] mm/debug_vm_pgtable: Avoid doing memory allocation with pgtable_t mapped.

2020-09-13 Thread Aneesh Kumar K.V
With highmem, pte_alloc_map() keep the level4 page table mapped using
kmap_atomic(). Avoid doing new memory allocation with page table
mapped like above.

[9.409233] BUG: sleeping function called from invalid context at 
mm/page_alloc.c:4822
[9.410557] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: 
swapper
[9.411932] no locks held by swapper/1.
[9.412595] CPU: 0 PID: 1 Comm: swapper Not tainted 
5.9.0-rc3-00323-gc50eb1ed654b5 #2
[9.413824] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.12.0-1 04/01/2014
[9.415207] Call Trace:
[9.415651]  ? ___might_sleep.cold+0xa7/0xcc
[9.416367]  ? __alloc_pages_nodemask+0x14c/0x5b0
[9.417055]  ? swap_migration_tests+0x50/0x293
[9.417704]  ? debug_vm_pgtable+0x4bc/0x708
[9.418287]  ? swap_migration_tests+0x293/0x293
[9.418911]  ? do_one_initcall+0x82/0x3cb
[9.419465]  ? parse_args+0x1bd/0x280
[9.419983]  ? rcu_read_lock_sched_held+0x36/0x60
[9.420673]  ? trace_initcall_level+0x1f/0xf3
[9.421279]  ? trace_initcall_level+0xbd/0xf3
[9.421881]  ? do_basic_setup+0x9d/0xdd
[9.422410]  ? do_basic_setup+0xc3/0xdd
[9.422938]  ? kernel_init_freeable+0x72/0xa3
[9.423539]  ? rest_init+0x134/0x134
[9.424055]  ? kernel_init+0x5/0x12c
[9.424574]  ? ret_from_fork+0x19/0x30

Reported-by: kernel test robot 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index d12bde82ae95..612c665a1136 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -994,7 +994,13 @@ static int __init debug_vm_pgtable(void)
p4dp = p4d_alloc(mm, pgdp, vaddr);
pudp = pud_alloc(mm, p4dp, vaddr);
pmdp = pmd_alloc(mm, pudp, vaddr);
-   ptep = pte_alloc_map(mm, pmdp, vaddr);
+   /*
+* Allocate pgtable_t
+*/
+   if (pte_alloc(mm, pmdp)) {
+   pr_err("pgtable allocation failed\n");
+   return 1;
+   }
 
/*
 * Save all the page table page addresses as the page table
@@ -1048,8 +1054,7 @@ static int __init debug_vm_pgtable(void)
 * proper page table lock.
 */
 
-   ptl = pte_lockptr(mm, pmdp);
-   spin_lock(ptl);
+   ptep = pte_offset_map_lock(mm, pmdp, vaddr, &ptl);
pte_clear_tests(mm, ptep, pte_aligned, vaddr, prot);
pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
pte_unmap_unlock(ptep, ptl);
-- 
2.26.2



Re: [PATCH v2] selftests/seccomp: fix ptrace tests on powerpc

2020-09-13 Thread Michael Ellerman
Kees Cook  writes:
> On Fri, Sep 11, 2020 at 03:10:12PM -0300, Thadeu Lima de Souza Cascardo wrote:
...
>> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
>> b/tools/testing/selftests/seccomp/seccomp_bpf.c
>> index 7a6d40286a42..0ddc0846e9c0 100644
>> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
>> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
>> @@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata 
>> *_metadata, pid_t tracee,
>>  EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>>  : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>>  
>> -if (!entry)
>> +if (!entry && !variant)
>>  return;
>>  
>> -nr = get_syscall(_metadata, tracee);
>> +if (entry)
>> +nr = get_syscall(_metadata, tracee);
>> +else if (variant)
>> +nr = variant->syscall_nr;
>> +if (variant)
>> +variant->syscall_nr = nr;
>
> So, to be clear this is _only_ an issue for the ptrace side of things,
> yes? i.e. seccomp's setting of the return value will correct stick?

Yes. There's a comment which (hopefully) explains the difference here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/ptrace/ptrace.c?commit=ab29a807a7ddaa7c84d2f4cb8d29e74e33759072#n239

Which says:

static int do_seccomp(struct pt_regs *regs)
{
if (!test_thread_flag(TIF_SECCOMP))
return 0;

/*
 * The ABI we present to seccomp tracers is that r3 contains
 * the syscall return value and orig_gpr3 contains the first
 * syscall parameter. This is different to the ptrace ABI where
 * both r3 and orig_gpr3 contain the first syscall parameter.
 */
regs->gpr[3] = -ENOSYS;


cheers