Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-09-11 Thread Srikar Dronamraju
* Michael Ellerman  [2020-09-11 21:55:23]:

> Srikar Dronamraju  writes:
> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> > always be a superset of cpu_sibling_mask.
> >
> > Lets stop that assumption. cpu_l2_cache_mask is a superset of
> > cpu_sibling_mask if and only if shared_caches is set.
> 
> I'm seeing oopses with this:
> 
> [0.117392][T1] smp: Bringing up secondary CPUs ...
> [0.156515][T1] smp: Brought up 2 nodes, 2 CPUs
> [0.158265][T1] numa: Node 0 CPUs: 0
> [0.158520][T1] numa: Node 1 CPUs: 1
> [0.167453][T1] BUG: Unable to handle kernel data access on read at 
> 0x800041228298
> [0.167992][T1] Faulting instruction address: 0xc018c128
> [0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [0.169417][T1] Modules linked in:
> [0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.9.0-rc2-00095-g7430ad5aa700 #209
> [0.170305][T1] NIP:  c018c128 LR: c018c0cc CTR: 
> c004dce0
> [0.170498][T1] REGS: c0007e343880 TRAP: 0380   Not tainted  
> (5.9.0-rc2-00095-g7430ad5aa700)
> [0.170602][T1] MSR:  82009033   CR: 
> 4400  XER: 
> [0.170985][T1] CFAR: c018c288 IRQMASK: 0
> [0.170985][T1] GPR00:  c0007e343b10 
> c173e400 4000
> [0.170985][T1] GPR04:  0800 
> 0800 
> [0.170985][T1] GPR08:  c122c298 
> c0003fffc000 c0007fd05ce8
> [0.170985][T1] GPR12: c0007e0119f8 c193 
> 8ade 
> [0.170985][T1] GPR16: c0007e3c0640 0917 
> c0007e3c0658 0008
> [0.170985][T1] GPR20: c15d0bb8 8ade 
> c0f57400 c1817c28
> [0.170985][T1] GPR24: c176dc80 c0007e3c0890 
> c0007e3cfe00 
> [0.170985][T1] GPR28: c1772310 c0007e011900 
> c0007e3c0800 0001
> [0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0
> [0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0
> [0.173186][T1] Call Trace:
> [0.173484][T1] [c0007e343b10] [c018bfe8] 
> build_sched_domains+0x6c8/0x14b0 (unreliable)
> [0.173821][T1] [c0007e343c50] [c018dcdc] 
> sched_init_domains+0xec/0x130
> [0.174037][T1] [c0007e343ca0] [c10d59d8] 
> sched_init_smp+0x50/0xc4
> [0.174207][T1] [c0007e343cd0] [c10b45c4] 
> kernel_init_freeable+0x1b4/0x378
> [0.174378][T1] [c0007e343db0] [c00129fc] 
> kernel_init+0x24/0x158
> [0.174740][T1] [c0007e343e20] [c000d9d0] 
> ret_from_kernel_thread+0x5c/0x6c
> [0.175050][T1] Instruction dump:
> [0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e 
> 913e002c 41820034
> [0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> 
> f93e0080 7d404828 314a0001
> [0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]---
> [0.178512][T1]
> [1.180458][T1] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x000b
> 
> That's qemu:
> 
> qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
>   -kernel build~/vmlinux \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1 \
>   -smp 2,sockets=2,maxcpus=2  \
> 

Thanks Michael for the report, for identifying the patch, and for
giving an easy reproducer. That made my task easy. (My only problem was that
all my PowerKVM hosts had an old compiler that refused to compile newer kernels.)

So in this setup, the CPU doesn't have an l2-cache. And in that scenario, we
miss updating the l2-cache domain. Actually, the initial patch had this
exact code; dropping it was my mistake. I should have reassessed it before
making the changes suggested by Gautham.

Patch below. Do let me know if you want me to send the patch separately.
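
Since the patch body got clipped from this digest, here is a minimal sketch
of the idea, with names as in arch/powerpc/kernel/smp.c (the fix as actually
posted may differ in detail):

	/* In update_mask_by_l2(): don't leave cpu_l2_cache_mask empty. */
	l2_cache = cpu_to_l2cache(cpu);
	if (!l2_cache) {
		/*
		 * No l2-cache node for this CPU: assume only the core
		 * siblings share a cache, so the CACHE domain is never
		 * smaller than the SMT domain.
		 */
		for_each_cpu(i, cpu_sibling_mask(cpu))
			set_cpus_related(cpu, i, cpu_l2_cache_mask);
		return false;
	}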

> 
> On mambo I get:
> 
> [0.005069][T1] smp: Bringing up secondary CPUs ...
> [0.011656][T1] smp: Brought up 2 nodes, 8 CPUs
> [0.011682][T1] numa: Node 0 CPUs: 0-3
> [0.011709][T1] numa: Node 1 CPUs: 4-7
> [0.012015][T1] BUG: arch topology borken
> [0.012040][T1]  the SMT domain not a subset of the CACHE domain
> [0.012107][T1] BUG: Unable to handle kernel data access on read at 
> 0x8001012e7398
> [0.012142][T1] Faulting instruction address: 0xc01aa4f0
> [0.012174][T1] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.012206][T1] LE PAGE_SIZE=64K MMU=Hash SMP 

[PATCH] serial: ucc_uart: make qe_uart_set_mctrl() static

2020-09-11 Thread Jason Yan
This eliminates the following sparse warning:

drivers/tty/serial/ucc_uart.c:286:6: warning: symbol 'qe_uart_set_mctrl'
was not declared. Should it be static?

Reported-by: Hulk Robot 
Signed-off-by: Jason Yan 
---
 drivers/tty/serial/ucc_uart.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/serial/ucc_uart.c b/drivers/tty/serial/ucc_uart.c
index 3c8c662c69e2..d6a8604157ab 100644
--- a/drivers/tty/serial/ucc_uart.c
+++ b/drivers/tty/serial/ucc_uart.c
@@ -283,7 +283,7 @@ static unsigned int qe_uart_tx_empty(struct uart_port *port)
  * don't need that support. This function must exist, however, otherwise
  * the kernel will panic.
  */
-void qe_uart_set_mctrl(struct uart_port *port, unsigned int mctrl)
+static void qe_uart_set_mctrl(struct uart_port *port, unsigned int mctrl)
 {
 }
 
-- 
2.25.4



[PATCH v2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Vasily Gorbik
Currently, to make sure that every page table entry is read just once,
gup_fast walks perform READ_ONCE and pass the pXd value down to the next
gup_pXd_range function by value, e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
...
pudp = pud_offset(&p4d, addr);

This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be iterated.

On s390, due to the fact that each task might have a different 5-, 4- or
3-level address translation, and hence different levels folded, the logic
is more complex, and a non-iterable pointer to a local copy leads to
severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteration reading "random" stack value
pud_t pud = READ_ONCE(*pudp);

// next = 0x1008000, due to PUD_SIZE/MASK != 
PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over 
stack

return 1;
}

This happens since s390 moved to common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing pXd_offset primitives to always calculate top level page table
offset in pgd_offset and just return the value passed when pXd_offset
has to act as folded.

What is crucial for gup_fast and what has been overlooked is
that PxD_SIZE/MASK and thus pXd_addr_end should also change
correspondingly. And the latter is not possible with dynamic folding.

To fix the issue in addition to pXd values pass original
pXdp pointers down to gup_pXd_range functions. And introduce
pXd_offset_lockless helpers, which take an additional pXd
entry value parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc:  # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer 
Reviewed-by: Alexander Gordeev 
Signed-off-by: Vasily Gorbik 
---
v2: added brackets: &pgd -> &(pgd)

 arch/s390/include/asm/pgtable.h | 42 +++--
 include/linux/pgtable.h | 10 
 mm/gup.c| 18 +++---
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, 
unsigned long address)
 
 #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
 
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long 
address)
 {
-   if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-   return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-   return (p4d_t *) pgd;
+   if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+   return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+   return (p4d_t *) pgdp;
 }
+#define p4d_offset_lockless p4d_offset_lockless
 
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
 {
-   if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-   return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-   return (pud_t *) p4d;
+   return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long 
address)
+{
+   if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+   return (pud_t *) p4d_deref(p4d) + pud_index(address);
+   return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
+{
+   return pud_offset_lockless(p4dp, *p4dp, address);
 }
 #define pud_offset pud_offset
 
-static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, 
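
The rest of the diff (including the mm/gup.c hunks) was clipped from this
digest. Based on the description above, the walker side of the change has
roughly this shape (a sketch, not the verbatim hunk):

static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
			 unsigned long end, unsigned int flags,
			 struct page **pages, int *nr)
{
	unsigned long next;
	pud_t *pudp;

	/*
	 * On a folded level this now returns p4dp itself, which points
	 * into the real page table and is therefore safe to iterate.
	 */
	pudp = pud_offset_lockless(p4dp, p4d, addr);
	do {
		pud_t pud = READ_ONCE(*pudp);

		next = pud_addr_end(addr, end);
		...
	} while (pudp++, addr = next, addr != end);

	return 1;
}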

Re: [PATCH v2] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Kees Cook
On Fri, Sep 11, 2020 at 03:10:12PM -0300, Thadeu Lima de Souza Cascardo wrote:
> As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
> allow or require the return code to be set on syscall entry when
> skipping the syscall. It will always return ENOSYS and the return code
> must be set on syscall exit.
> 
> This code does that, behaving more similarly to strace. It still sets
> the return code on entry, which is overridden on powerpc, and it will
> always repeat the same on exit. Also, on powerpc, the errno is not
> inverted, and depends on ccr.so being set.
> 
> This has been tested on powerpc and amd64.

This looks like two fixes in one, so this should be split. :)

>  tools/testing/selftests/seccomp/seccomp_bpf.c | 81 ---
>  1 file changed, 53 insertions(+), 28 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 7a6d40286a42..0ddc0846e9c0 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1837,15 +1837,24 @@ void change_syscall(struct __test_metadata *_metadata,
>  #endif
>  
>   /* If syscall is skipped, change return value. */
> - if (syscall == -1)
> + if (syscall == -1) {
>  #ifdef SYSCALL_NUM_RET_SHARE_REG
>   TH_LOG("Can't modify syscall return on this architecture");
> -
>  #elif defined(__xtensa__)
>   regs.SYSCALL_RET(regs) = result;
> +#elif defined(__powerpc__)
> + /* Error is signaled by CR0 SO bit and error code is positive. 
> */
> + if (result < 0) {
> + regs.SYSCALL_RET = -result;
> + regs.ccr |= 0x1000;
> + } else {
> + regs.SYSCALL_RET = result;
> + regs.ccr &= ~0x1000;
> + }
>  #else
>   regs.SYSCALL_RET = result;
>  #endif
> + }

I'll send a series soon that will include this bit, since I don't want
to collect these kinds of arch-specific things in the functions. (And
the xtensa one went in without my review!)

> +FIXTURE(TRACE_syscall) {
> + struct sock_fprog prog;
> + pid_t tracer, mytid, mypid, parent;
> +};
> +
> +FIXTURE_VARIANT(TRACE_syscall) {
> + /*
> +  * All of the SECCOMP_RET_TRACE behaviors can be tested with either
> +  * SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
> +  * This indicates if we should use SECCOMP_RET_TRACE (false), or
> +  * ptrace (true).
> +  */
> + bool use_ptrace;
> +
> + /*
> +  * Some archs (like ppc) only support changing the return code during
> +  * syscall exit when ptrace is used.  As the syscall number might not
> +  * be available anymore during syscall exit, it needs to be saved
> +  * during syscall enter.
> +  */
> + int syscall_nr;

This should be part of the fixture struct, not the variant. 

> +};
> +
> +FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
> + .use_ptrace = true,
> +};
> +
> +FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
> + .use_ptrace = false,
> +};

i.e. if a member isn't initialized in FIXTURE_VARIANT_ADD, it shouldn't
be defined in FIXTURE_VARIANT. :)
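
i.e. roughly this shape (a sketch of the suggestion, not the code as
merged):

FIXTURE(TRACE_syscall) {
	struct sock_fprog prog;
	pid_t tracer, mytid, mypid, parent;
	/* saved at syscall entry, consumed at syscall exit */
	int syscall_nr;
};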

> +
>  void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
>  int status, void *args)
>  {
>   int ret, nr;
>   unsigned long msg;
>   static bool entry;
> + FIXTURE_VARIANT(TRACE_syscall) * variant = args;
>  
>   /*
>* The traditional way to tell PTRACE_SYSCALL entry/exit
> @@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> pid_t tracee,
>   EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>   : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>  
> - if (!entry)
> + if (!entry && !variant)
>   return;
>  
> - nr = get_syscall(_metadata, tracee);
> + if (entry)
> + nr = get_syscall(_metadata, tracee);
> + else if (variant)
> + nr = variant->syscall_nr;
> + if (variant)
> + variant->syscall_nr = nr;

So, to be clear this is _only_ an issue for the ptrace side of things,
yes? i.e. seccomp's setting of the return value will correctly stick?

-- 
Kees Cook


Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-11 Thread Ira Weiny
On Sat, Sep 12, 2020 at 01:36:46AM +0530, Vaibhav Jain wrote:
> Thanks for reviewing this patch Ira,
> 
> 
> Ira Weiny  writes:
> 
> > On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote:
> >> A warning is reported by the kernel in case perf_stats_show() returns
> >> an error code. The warning is of the form below:
> >> 
> >>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
> >>  Failed to query performance stats, Err:-10
> >>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
> >>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
> >> 
> >> On investigation it looks like the compiler is silently truncating the
> >> return value of drc_pmem_query_stats() from 'long' to 'int', since the
> >> variable used to store the return code 'rc' is an 'int'. This
> >> truncated value is then returned as a 'ssize_t' from
> >> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
> >> unsigned number and triggers this warning.
> >> 
> >> To fix this we update the type of variable 'rc' from 'int' to
> >> 'ssize_t', which prevents the compiler from truncating the return value
> >> of drc_pmem_query_stats() and returning the correct signed value back from
> >> perf_stats_show().
> >> 
> >> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
> >>stats from PHYP')
> >> Signed-off-by: Vaibhav Jain 
> >> ---
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> >> b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index a88a707a608aa..9f00b61676ab9 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
> >> *nd_desc,
> >>  static ssize_t perf_stats_show(struct device *dev,
> >>   struct device_attribute *attr, char *buf)
> >>  {
> >> -  int index, rc;
> >> +  int index;
> >> +  ssize_t rc;
> >
> > I'm not sure this is really fixing everything here.
> 
> The issue is with the statement in perf_stats_show():
> 
> 'return rc ? rc : seq_buf_used();'
> 
> The function seq_buf_used() returns an 'unsigned int', and with 'rc'
> typed as 'int' this forces a promotion of the expression to 'unsigned
> int', which causes a loss of signedness of 'rc'; the compiler silently
> assigns this unsigned value to the function return typed as 'signed
> long'.
> 
> Making 'rc' a 'signed long' forces a promotion of the expression to
> 'signed long', which preserves the signedness of 'rc' and is also
> compatible with the function return type.

Ok, ok, I read this all wrong.

FWIW I would also cast seq_buf_used() to ssize_t to show you know what you are
doing there.
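
Something like (sketch):

	return rc ? rc : (ssize_t)seq_buf_used();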

> 
> >
> > drc_pmem_query_stats() can return negative errno's.  Why are those not 
> > checked
> > somewhere in perf_stats_show()?
> >
> For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only
> expect return value 'rc <=0' with '0' indicating a successful fetch of
> nvdimm performance stats from the hypervisor. Hence there are no explicit
> checks for negative error codes in the functions as all return values
> !=0 indicate an error.
> 
> 
> > It seems like all this fix is handling is a > 0 return value: 'ret[0]' from
> > line 289 in papr_scm.c...  Or something?
> No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the
> variable 'size' is assigned a non-zero value, hence the specific branch
> you mentioned is never taken. Instead, in case of success,
> drc_pmem_query_stats() returns '0', and in case of an error a negative
> error code is returned.
> 
> >
> > Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed 
> > value.
> > Therefore, it should not be returning -errno.  I'm surprised the static
> > checkers did not catch that.
Didn't quite get the assertion here. The function is marked to return
> 'ssize_t' because we can return both +ve for success and -ve values to
> indicate errors.

Sorry I was reading this as size_t and meant to say unsigned...  I was looking
at this too quickly.

> 
> >
> > I believe I caught similar errors with a patch series before which did not 
> > pay
> > attention to variable types.
> >
> > Please audit this code for these types of errors and ensure you are really
> > doing the correct thing when using the sysfs interface.  I'm pretty sure bad
> > things will eventually happen (if they are not already) if you return some
> > really big number to the sysfs core from *_show().
> I think this problem is different compared to what you had previously pointed
> to. The values returned from drc_pmem_query_stats() can be stored in an
> 'int' variable too; however, it was the silent promotion of a signed type
> to an unsigned type that caused this specific issue.

Ok this makes more sense now.  Sorry about not looking more carefully.

But I still think 

[PATCH] ibmvfc: Avoid link down on FS9100 canister reboot

2020-09-11 Thread Brian King
When a canister on a FS9100, or similar storage, running in NPIV mode,
is rebooted, its WWPNs will fail over to another canister. When this
occurs, we see a WWPN going away from the fabric at one N-Port ID,
and, a short time later, the same WWPN appearing at a different N-Port ID.
When the canister is fully operational again, the WWPNs fail back to
the original canister. If there is any I/O outstanding to the target
when this occurs, it will cause the implicit logout that the ibmvfc driver
issues before removing the rport to fail. When the WWPN then shows up at a
different N-Port ID, and we issue a PLOGI to it, the VIOS will
see that it still has a login for this WWPN at the old N-Port ID,
which results in the VIOS simulating a link down / link up sequence
to the client, in order to get the VIOS and client LPAR in sync.

The patch below improves the way we handle this scenario so as to
avoid the link bounce, which affects all targets under the virtual
host adapter. The change is to utilize the Move Login MAD, which
will work even when I/O is outstanding to the target. The change
only alters the target state machine for the case where the implicit
logout fails prior to deleting the rport. If this implicit logout fails,
we defer deleting the ibmvfc_target object until after calling
fc_remote_port_delete. This enables us to later retry the implicit logout
after terminate_rport_io occurs, or to issue the Move Login request if
a WWPN shows up at a new N-Port ID prior to this occurring.

This has been tested by IBM's storage interoperability team
on a FS9100, forcing the failover to occur. With debug tracing enabled
in the ibmvfc driver, we confirmed the move login was sent
in this scenario and confirmed the link bounce no longer occurred.

Signed-off-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 211 ++---
 drivers/scsi/ibmvscsi/ibmvfc.h |  38 +++-
 2 files changed, 232 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 175a165..322bb30 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -134,6 +134,7 @@
 static void ibmvfc_tgt_query_target(struct ibmvfc_target *);
 static void ibmvfc_npiv_logout(struct ibmvfc_host *);
 static void ibmvfc_tgt_implicit_logout_and_del(struct ibmvfc_target *);
+static void ibmvfc_tgt_move_login(struct ibmvfc_target *);
 
 static const char *unknown_error = "unknown error";
 
@@ -431,7 +432,20 @@ static int ibmvfc_set_tgt_action(struct ibmvfc_target *tgt,
}
break;
case IBMVFC_TGT_ACTION_LOGOUT_RPORT_WAIT:
-   if (action == IBMVFC_TGT_ACTION_DEL_RPORT) {
+   if (action == IBMVFC_TGT_ACTION_DEL_RPORT ||
+   action == IBMVFC_TGT_ACTION_DEL_AND_LOGOUT_RPORT) {
+   tgt->action = action;
+   rc = 0;
+   }
+   break;
+   case IBMVFC_TGT_ACTION_LOGOUT_DELETED_RPORT:
+   if (action == IBMVFC_TGT_ACTION_LOGOUT_RPORT) {
+   tgt->action = action;
+   rc = 0;
+   }
+   break;
+   case IBMVFC_TGT_ACTION_DEL_AND_LOGOUT_RPORT:
+   if (action == IBMVFC_TGT_ACTION_LOGOUT_DELETED_RPORT) {
tgt->action = action;
rc = 0;
}
@@ -441,16 +455,18 @@ static int ibmvfc_set_tgt_action(struct ibmvfc_target 
*tgt,
tgt->action = action;
rc = 0;
}
+   break;
case IBMVFC_TGT_ACTION_DELETED_RPORT:
break;
default:
-   if (action >= IBMVFC_TGT_ACTION_LOGOUT_RPORT)
-   tgt->add_rport = 0;
tgt->action = action;
rc = 0;
break;
}
 
+   if (action >= IBMVFC_TGT_ACTION_LOGOUT_RPORT)
+   tgt->add_rport = 0;
+
return rc;
 }
 
@@ -548,7 +564,8 @@ static void ibmvfc_set_host_action(struct ibmvfc_host 
*vhost,
  **/
 static void ibmvfc_reinit_host(struct ibmvfc_host *vhost)
 {
-   if (vhost->action == IBMVFC_HOST_ACTION_NONE) {
+   if (vhost->action == IBMVFC_HOST_ACTION_NONE &&
+   vhost->state == IBMVFC_ACTIVE) {
if (!ibmvfc_set_host_state(vhost, IBMVFC_INITIALIZING)) {
scsi_block_requests(vhost->host);
ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
@@ -2574,7 +2591,9 @@ static void ibmvfc_terminate_rport_io(struct fc_rport 
*rport)
struct ibmvfc_host *vhost = shost_priv(shost);
struct fc_rport *dev_rport;
struct scsi_device *sdev;
-   unsigned long rc;
+   struct ibmvfc_target *tgt;
+   unsigned long rc, flags;
+   unsigned int found;
 
ENTER;
shost_for_each_device(sdev, shost) {
@@ -2588,6 +2607,25 @@ static void 

Re: [PATCH] powerpc/boot/dts: Fix dtc "pciex" warnings

2020-09-11 Thread Christian Lamparter

Hello,

On 2020-09-08 20:52, Christian Lamparter wrote:

On Tue, Sep 8, 2020 at 9:12 AM Michael Ellerman  wrote:

Christian Lamparter  writes:

On 2020-06-23 15:03, Michael Ellerman wrote:

With CONFIG_OF_ALL_DTBS=y, as set by eg. allmodconfig, we see lots of
warnings about our dts files, such as:

arch/powerpc/boot/dts/glacier.dts:492.26-532.5:
Warning (pci_bridge): /plb/pciex@d: node name is not "pci"
or "pcie"




Unfortunately yes. This patch will break uboot code in Meraki MX60(W) / MX60.

https://github.com/riptidewave93/meraki-uboot/blob/mx60w-20180413/board/amcc/bluestone/bluestone.c#L1178

| if (!pci_available()) {
| fdt_find_and_setprop(blob, "/plb/pciex@d", "status",
|   "disabled", sizeof("disabled"), 1);
| }


Backstory: There are two versions of the Meraki MX60: the MX60
and the MX60W. The difference is that the MX60W has a populated
mini-pcie slot on the PCB for a wireless card.

I'm happy to revert that hunk if you think anyone is actually booting
mainline on those.


The MX60(W) or APM82181 in general?

The APM82181 devices run really well on kernel 5.8. The APM82181
had some bitrot and missing pieces. But I started at around 4.4 with
upstreaming various bits and stuff. For proof, I can post a bootlog from
my MyBook Live dev station running my mbl-debian this weekend.


Here is the MyBook Live Single's bootlog for 5.9-rc4.

root@mbl-debian:~# dmesg
[0.00] printk: bootconsole [udbg0] enabled
[0.00] Linux version 5.9.0-rc4+ (root@debian64) (powerpc-linux-gnu-gcc 
(Debian 10.2.0-3) 10.2.0, GNU ld (GNU Binutils for Debian) 2.35) #1 Fri Sep 11 
22:13:07 CEST 2020
[0.00] Using PowerPC 44x Platform machine description
[0.00] -
[0.00] phys_mem_size = 0x1000
[0.00] dcache_bsize  = 0x20
[0.00] icache_bsize  = 0x20
[0.00] cpu_features  = 0x0120
[0.00]   possible= 0x4120
[0.00]   always  = 0x0020
[0.00] cpu_user_features = 0x8c008000 0x
[0.00] mmu_features  = 0x0008
[0.00] -
[0.00] Top of RAM: 0x1000, Total RAM: 0x1000
[0.00] Memory hole size: 0MB
[0.00] Zone ranges:
[0.00]   Normal   [mem 0x-0x0fff]
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x-0x0fff]
[0.00] Initmem setup node 0 [mem 0x-0x0fff]
[0.00] On node 0 totalpages: 16384
[0.00]   Normal zone: 40 pages used for memmap
[0.00]   Normal zone: 0 pages reserved
[0.00]   Normal zone: 16384 pages, LIFO batch:3
[0.00] MMU: Allocated 1088 bytes of context maps for 255 contexts
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 16344
[0.00] Kernel command line: 
root=PARTUUID=25a9cfcc-6b0d-4ae4-abbf-e5e2e3889a40 rw console=ttyS0,115200
[0.00] Dentry cache hash table entries: 32768 (order: 3, 131072 bytes, 
linear)
[0.00] Inode-cache hash table entries: 16384 (order: 2, 65536 bytes, 
linear)
[0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
[0.00] Memory: 252528K/262144K available (5840K kernel code, 448K 
rwdata, 1824K rodata, 224K init, 312K bss, 9616K reserved, 0K cma-reserved)
[0.00] Kernel virtual memory layout:
[0.00]   * 0xffbdc000..0xc000  : fixmap
[0.00]   * 0xd100..0xffbdc000  : vmalloc & ioremap
[0.00] random: get_random_u32 called from 
cache_random_seq_create+0x68/0x128 with crng_init=0
[0.00] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
[0.00] UIC0 (32 IRQ sources) at DCR 0xc0
[0.00] UIC1 (32 IRQ sources) at DCR 0xd0
[0.00] UIC2 (32 IRQ sources) at DCR 0xe0
[0.00] UIC3 (32 IRQ sources) at DCR 0xf0
[0.00] time_init: decrementer frequency = 800.08 MHz
[0.00] time_init: processor frequency   = 800.08 MHz
[0.16] clocksource: timebase: mask: 0x max_cycles: 
0xb881274fa3, max_idle_ns: 440795210636 ns
[0.008990] clocksource: timebase mult[140] shift[24] registered
[0.014011] clockevent: decrementer mult[ccef] shift[32] cpu[0]
[0.019133] Console: colour dummy device 80x25
[0.022213] pid_max: default: 32768 minimum: 301
[0.025920] Mount-cache hash table entries: 4096 (order: 0, 16384 bytes, 
linear)
[0.031945] Mountpoint-cache hash table entries: 4096 (order: 0, 16384 
bytes, linear)
[0.040909] devtmpfs: initialized
[

Re: [PATCH] powerpc/papr_scm: Fix warning triggered by perf_stats_show()

2020-09-11 Thread Vaibhav Jain
Thanks for reviewing this patch Ira,


Ira Weiny  writes:

> On Thu, Sep 10, 2020 at 02:52:12PM +0530, Vaibhav Jain wrote:
>> A warning is reported by the kernel in case perf_stats_show() returns
>> an error code. The warning is of the form below:
>> 
>>  papr_scm ibm,persistent-memory:ibm,pmemory@4411:
>>Failed to query performance stats, Err:-10
>>  dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
>>  fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
>> 
>> On investigation it looks like the compiler is silently truncating the
>> return value of drc_pmem_query_stats() from 'long' to 'int', since the
>> variable used to store the return code 'rc' is an 'int'. This
>> truncated value is then returned as a 'ssize_t' from
>> perf_stats_show() to 'dev_attr_show()' which thinks of it as a large
>> unsigned number and triggers this warning.
>> 
>> To fix this we update the type of variable 'rc' from 'int' to
>> 'ssize_t', which prevents the compiler from truncating the return value
>> of drc_pmem_query_stats() and returning the correct signed value back from
>> perf_stats_show().
>> 
>> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance
>>stats from PHYP')
>> Signed-off-by: Vaibhav Jain 
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index a88a707a608aa..9f00b61676ab9 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -785,7 +785,8 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
>> *nd_desc,
>>  static ssize_t perf_stats_show(struct device *dev,
>> struct device_attribute *attr, char *buf)
>>  {
>> -int index, rc;
>> +int index;
>> +ssize_t rc;
>
> I'm not sure this is really fixing everything here.

The issue is with the statement in perf_stats_show():

'return rc ? rc : seq_buf_used();'

The function seq_buf_used() returns an 'unsigned int', and with 'rc'
typed as 'int' this forces a promotion of the expression to 'unsigned
int', which causes a loss of signedness of 'rc'; the compiler silently
assigns this unsigned value to the function return typed as 'signed
long'.

Making 'rc' a 'signed long' forces a promotion of the expression to
'signed long', which preserves the signedness of 'rc' and is also
compatible with the function return type.
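
A tiny standalone demo of that promotion on an LP64 system (my
illustration, not kernel code):

#include <stdio.h>

/* stand-in for seq_buf_used(), which returns unsigned int */
static unsigned int used(void) { return 0; }

int main(void)
{
	int rc_int = -10;
	long rc_long = -10;

	/* int ? int : unsigned int -> unsigned int: -10 becomes 4294967286 */
	long bad = rc_int ? rc_int : used();

	/* long ? long : unsigned int -> long: -10 is preserved */
	long good = rc_long ? rc_long : used();

	printf("%ld %ld\n", bad, good);	/* prints: 4294967286 -10 */
	return 0;
}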

>
> drc_pmem_query_stats() can return negative errno's.  Why are those not checked
> somewhere in perf_stats_show()?
>
For the specific invocation 'drc_pmem_query_stats(p, stats, 0)' we only
expect return value 'rc <=0' with '0' indicating a successful fetch of
nvdimm performance stats from the hypervisor. Hence there are no explicit
checks for negative error codes in the functions as all return values
!=0 indicate an error.


> It seems like all this fix is handling is a > 0 return value: 'ret[0]' from
> line 289 in papr_scm.c...  Or something?
No, in case the arg 'num_stats' is '0' and 'buff_stats != NULL' the
variable 'size' is assigned a non-zero value, hence the specific branch
you mentioned is never taken. Instead, in case of success,
drc_pmem_query_stats() returns '0', and in case of an error a negative
error code is returned.

>
> Worse yet drc_pmem_query_stats() is returning ssize_t which is a signed value.
> Therefore, it should not be returning -errno.  I'm surprised the static
> checkers did not catch that.
Didn't quite get the assertion here. The function is marked to return
'ssize_t' because we can return both +ve for success and -ve values to
indicate errors.

>
> I believe I caught similar errors with a patch series before which did not pay
> attention to variable types.
>
> Please audit this code for these types of errors and ensure you are really
> doing the correct thing when using the sysfs interface.  I'm pretty sure bad
> things will eventually happen (if they are not already) if you return some
> really big number to the sysfs core from *_show().
I think this problem is different compared to what you had previously pointed
to. The values returned from drc_pmem_query_stats() can be stored in an
'int' variable too; however, it was the silent promotion of a signed type
to an unsigned type that caused this specific issue.

-- 
Cheers
~ Vaibhav


Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 04:40:00PM -0300, Jason Gunthorpe wrote:
> These would probably be better as static inlines though, as only s390
> compiles will type check pudp like this.

Never mind, it must be a macro - still need brackets though

Jason


Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 09:03:06PM +0200, Vasily Gorbik wrote:
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..e899d3506671 100644
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&pgd, address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&p4d, address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&pud, address)

Needs brackets: &(pgd) 

These would probably be better as static inlines though, as only s390
compiles will type check pudp like this.

Jason


Re: [PATCH] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Linus Torvalds
On Fri, Sep 11, 2020 at 12:04 PM Vasily Gorbik  wrote:
>
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
[ ... ]

Ack, this looks sane to me.

I was going to ask how horrible it would be to convert all the other
users, but a quick grep convinced me that yeah, it's only GUP that is
this special, and we don't want to make this interface be the real one
for everything else too..

Linus


Re: [PATCH v2] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Shuah Khan

On 9/11/20 12:10 PM, Thadeu Lima de Souza Cascardo wrote:

As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
allow or require the return code to be set on syscall entry when
skipping the syscall. It will always return ENOSYS and the return code
must be set on syscall exit.

This code does that, behaving more similarly to strace. It still sets
the return code on entry, which is overridden on powerpc, and it will
always repeat the same on exit. Also, on powerpc, the errno is not
inverted, and depends on ccr.so being set.

This has been tested on powerpc and amd64.

Cc: Michael Ellerman 
Cc: Kees Cook 
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
  tools/testing/selftests/seccomp/seccomp_bpf.c | 81 ---
  1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 7a6d40286a42..0ddc0846e9c0 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -1837,15 +1837,24 @@ void change_syscall(struct __test_metadata *_metadata,
  #endif
  
  	/* If syscall is skipped, change return value. */

-   if (syscall == -1)
+   if (syscall == -1) {
  #ifdef SYSCALL_NUM_RET_SHARE_REG
TH_LOG("Can't modify syscall return on this architecture");
-
  #elif defined(__xtensa__)
regs.SYSCALL_RET(regs) = result;
+#elif defined(__powerpc__)
+   /* Error is signaled by CR0 SO bit and error code is positive. 
*/
+   if (result < 0) {
+   regs.SYSCALL_RET = -result;
+   regs.ccr |= 0x1000;
+   } else {
+   regs.SYSCALL_RET = result;
+   regs.ccr &= ~0x1000;
+   }
  #else
regs.SYSCALL_RET = result;
  #endif
+   }
  
  #ifdef HAVE_GETREGS

ret = ptrace(PTRACE_SETREGS, tracee, 0, &regs);
@@ -1897,12 +1906,44 @@ void tracer_seccomp(struct __test_metadata *_metadata, 
pid_t tracee,
  
  }
  
+FIXTURE(TRACE_syscall) {

+   struct sock_fprog prog;
+   pid_t tracer, mytid, mypid, parent;
+};
+
+FIXTURE_VARIANT(TRACE_syscall) {
+   /*
+* All of the SECCOMP_RET_TRACE behaviors can be tested with either
+* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
+* This indicates if we should use SECCOMP_RET_TRACE (false), or
+* ptrace (true).
+*/
+   bool use_ptrace;
+
+   /*
+* Some archs (like ppc) only support changing the return code during
+* syscall exit when ptrace is used.  As the syscall number might not
+* be available anymore during syscall exit, it needs to be saved
+* during syscall enter.
+*/
+   int syscall_nr;
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
+   .use_ptrace = true,
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
+   .use_ptrace = false,
+};
+
  void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
   int status, void *args)
  {
int ret, nr;
unsigned long msg;
static bool entry;
+   FIXTURE_VARIANT(TRACE_syscall) * variant = args;
  
  	/*

 * The traditional way to tell PTRACE_SYSCALL entry/exit
@@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
  
-	if (!entry)

+   if (!entry && !variant)
return;
  
-	nr = get_syscall(_metadata, tracee);

+   if (entry)
+   nr = get_syscall(_metadata, tracee);
+   else if (variant)
+   nr = variant->syscall_nr;
+   if (variant)
+   variant->syscall_nr = nr;
  
  	if (nr == __NR_getpid)

change_syscall(_metadata, tracee, __NR_getppid, 0);
@@ -1929,29 +1975,6 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
change_syscall(_metadata, tracee, -1, -ESRCH);
  }
  
-FIXTURE(TRACE_syscall) {

-   struct sock_fprog prog;
-   pid_t tracer, mytid, mypid, parent;
-};
-
-FIXTURE_VARIANT(TRACE_syscall) {
-   /*
-* All of the SECCOMP_RET_TRACE behaviors can be tested with either
-* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
-* This indicates if we should use SECCOMP_RET_TRACE (false), or
-* ptrace (true).
-*/
-   bool use_ptrace;
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
-   .use_ptrace = true,
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
-   .use_ptrace = false,
-};
-
  FIXTURE_SETUP(TRACE_syscall)
  {
struct sock_filter filter[] = {
@@ -1992,7 +2015,9 @@ FIXTURE_SETUP(TRACE_syscall)
self->tracer = setup_trace_fixture(_metadata,
   variant->use_ptrace ? tracer_ptrace

[PATCH v2] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Thadeu Lima de Souza Cascardo
As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
allow or require the return code to be set on syscall entry when
skipping the syscall. It will always return ENOSYS and the return code
must be set on syscall exit.

This code does that, behaving more similarly to strace. It still sets
the return code on entry, which is overridden on powerpc, and it will
always repeat the same on exit. Also, on powerpc, the errno is not
inverted, and depends on ccr.so being set.

This has been tested on powerpc and amd64.

Cc: Michael Ellerman 
Cc: Kees Cook 
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 81 ---
 1 file changed, 53 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 7a6d40286a42..0ddc0846e9c0 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -1837,15 +1837,24 @@ void change_syscall(struct __test_metadata *_metadata,
 #endif
 
/* If syscall is skipped, change return value. */
-   if (syscall == -1)
+   if (syscall == -1) {
 #ifdef SYSCALL_NUM_RET_SHARE_REG
TH_LOG("Can't modify syscall return on this architecture");
-
 #elif defined(__xtensa__)
regs.SYSCALL_RET(regs) = result;
+#elif defined(__powerpc__)
+   /* Error is signaled by CR0 SO bit and error code is positive. 
*/
+   if (result < 0) {
+   regs.SYSCALL_RET = -result;
+   regs.ccr |= 0x1000;
+   } else {
+   regs.SYSCALL_RET = result;
+   regs.ccr &= ~0x1000;
+   }
 #else
regs.SYSCALL_RET = result;
 #endif
+   }
 
 #ifdef HAVE_GETREGS
ret = ptrace(PTRACE_SETREGS, tracee, 0, &regs);
@@ -1897,12 +1906,44 @@ void tracer_seccomp(struct __test_metadata *_metadata, 
pid_t tracee,
 
 }
 
+FIXTURE(TRACE_syscall) {
+   struct sock_fprog prog;
+   pid_t tracer, mytid, mypid, parent;
+};
+
+FIXTURE_VARIANT(TRACE_syscall) {
+   /*
+* All of the SECCOMP_RET_TRACE behaviors can be tested with either
+* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
+* This indicates if we should use SECCOMP_RET_TRACE (false), or
+* ptrace (true).
+*/
+   bool use_ptrace;
+
+   /*
+* Some archs (like ppc) only support changing the return code during
+* syscall exit when ptrace is used.  As the syscall number might not
+* be available anymore during syscall exit, it needs to be saved
+* during syscall enter.
+*/
+   int syscall_nr;
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
+   .use_ptrace = true,
+};
+
+FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
+   .use_ptrace = false,
+};
+
 void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
   int status, void *args)
 {
int ret, nr;
unsigned long msg;
static bool entry;
+   FIXTURE_VARIANT(TRACE_syscall) * variant = args;
 
/*
 * The traditional way to tell PTRACE_SYSCALL entry/exit
@@ -1916,10 +1957,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
: PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
 
-   if (!entry)
+   if (!entry && !variant)
return;
 
-   nr = get_syscall(_metadata, tracee);
+   if (entry)
+   nr = get_syscall(_metadata, tracee);
+   else if (variant)
+   nr = variant->syscall_nr;
+   if (variant)
+   variant->syscall_nr = nr;
 
if (nr == __NR_getpid)
change_syscall(_metadata, tracee, __NR_getppid, 0);
@@ -1929,29 +1975,6 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
pid_t tracee,
change_syscall(_metadata, tracee, -1, -ESRCH);
 }
 
-FIXTURE(TRACE_syscall) {
-   struct sock_fprog prog;
-   pid_t tracer, mytid, mypid, parent;
-};
-
-FIXTURE_VARIANT(TRACE_syscall) {
-   /*
-* All of the SECCOMP_RET_TRACE behaviors can be tested with either
-* SECCOMP_RET_TRACE+PTRACE_CONT or plain ptrace()+PTRACE_SYSCALL.
-* This indicates if we should use SECCOMP_RET_TRACE (false), or
-* ptrace (true).
-*/
-   bool use_ptrace;
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, ptrace) {
-   .use_ptrace = true,
-};
-
-FIXTURE_VARIANT_ADD(TRACE_syscall, seccomp) {
-   .use_ptrace = false,
-};
-
 FIXTURE_SETUP(TRACE_syscall)
 {
struct sock_filter filter[] = {
@@ -1992,7 +2015,9 @@ FIXTURE_SETUP(TRACE_syscall)
self->tracer = setup_trace_fixture(_metadata,
   variant->use_ptrace ? tracer_ptrace
   : 

Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-09-11 Thread Thadeu Lima de Souza Cascardo
On Tue, Sep 08, 2020 at 04:18:17PM -0700, Kees Cook wrote:
> On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
> > allow or require the return code to be set on syscall entry when
> > skipping the syscall. It will always return ENOSYS and the return code
> > must be set on syscall exit.
> > 
> > This code does that, behaving more similarly to strace. It still sets
> > the return code on entry, which is overridden on powerpc, and it will
> > always repeat the same on exit. Also, on powerpc, the errno is not
> > inverted, and depends on ccr.so being set.
> > 
> > This has been tested on powerpc and amd64.
> > 
> > Cc: Michael Ellerman 
> > Cc: Kees Cook 
> > Signed-off-by: Thadeu Lima de Souza Cascardo 
> 
> Yikes, I missed this from a while ago. I apologize for responding so
> late!
> 
> This appears still unfixed; is that correct?
> 

Yes. I will send a v2 on top of recent changes to the test.

> > ---
> >  tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> > b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > index 252140a52553..b90a9190ba88 100644
> > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> > @@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata 
> > *_metadata,
> > TH_LOG("Can't modify syscall return on this architecture");
> >  #else
> > regs.SYSCALL_RET = result;
> > +# if defined(__powerpc__)
> > +   if (result < 0) {
> > +   regs.SYSCALL_RET = -result;
> > +   regs.ccr |= 0x1000;
> > +   } else {
> > +   regs.ccr &= ~0x1000;
> > +   }
> > +# endif
> >  #endif
> 
> Just so I understand correctly: for ppc to "see" this result, it needs
> to be both negative AND have this specific register set?
> 

Yes. According to Documentation/powerpc/syscall64-abi.rst:

"
- For the sc instruction, both a value and an error condition are returned.
  cr0.SO is the error condition, and r3 is the return value. When cr0.SO is
  clear, the syscall succeeded and r3 is the return value. When cr0.SO is set,
  the syscall failed and r3 is the error value (that normally corresponds to
  errno).
"

So, while some other arches will return -EINVAL, ppc returns EINVAL. And that
is distinguished from, say, read(2) returning 22 bytes read, by using CR.SO.
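
In tracer terms, decoding a ppc syscall return therefore looks roughly
like this (a sketch; register names as used in the patch):

	/* after fetching regs on syscall exit */
	long ret;

	if (regs.ccr & 0x10000000)	/* cr0.SO set: r3 holds +errno */
		ret = -(long)regs.SYSCALL_RET;
	else				/* cr0.SO clear: r3 is the result */
		ret = regs.SYSCALL_RET;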

> >  
> >  #ifdef HAVE_GETREGS
> > @@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> > pid_t tracee,
> > int ret, nr;
> > unsigned long msg;
> > static bool entry;
> > +   int *syscall_nr = args;
> >  
> > /*
> >  * The traditional way to tell PTRACE_SYSCALL entry/exit
> > @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata 
> > *_metadata, pid_t tracee,
> > EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
> > : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
> >  
> > -   if (!entry)
> > +   if (!entry && !syscall_nr)
> > return;
> >  
> > -   nr = get_syscall(_metadata, tracee);
> > +   if (entry)
> > +   nr = get_syscall(_metadata, tracee);
> > +   else
> > +   nr = *syscall_nr;
> 
> This is weird? Shouldn't get_syscall() be modified to do the right thing
> here instead of depending on the extra arg?
> 

R0 might be clobbered. The same documentation mentions it as volatile. So, during
syscall exit, we can't tell for sure that R0 will have the original syscall
number. So, we need to grab it during syscall enter, save it somewhere and
reuse it. I used the test context/args for that. That's the main change I had
to deal with after recent changes to the test. I used the variant struct now.

I only saw the need to do this under tracer_ptrace, as that was the only one
changing syscall return values using ptrace. And that can only be done during
syscall exit on ppc (ptrace ABI we can't break). So, changing get_syscall did
not seem necessary.

Thanks.
Cascardo.

> > +   if (syscall_nr)
> > +   *syscall_nr = nr;
> >  
> > if (nr == __NR_getpid)
> > change_syscall(_metadata, tracee, __NR_getppid, 0);
> > @@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected)
> >  
> >  TEST_F(TRACE_syscall, ptrace_syscall_errno)
> >  {
> > +   int syscall_nr = -1;
> > /* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */
> > teardown_trace_fixture(_metadata, self->tracer);
> > -   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL,
> > +   self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, 
> > &syscall_nr,
> >true);
> >  
> > /* Tracer should skip the open syscall, resulting in ESRCH. */
> > @@ -1900,9 +1915,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_errno)
> >  

[PATCH v2 14/14] powerpc/pseries/iommu: Rename "direct window" to "dma window"

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

A previous change introduced the usage of DDW as a bigger indirect DMA
mapping when the DDW available size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename all names and debug/info
messages to reflect that it can be used for both kinds of mapping.

Also, defines DEFAULT_DMA_WIN as "ibm,dma-window" to document that
it's the name of the default DMA window.

Those changes are not supposed to change how the code works in any
way, just adjust naming.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 102 +
 1 file changed, 53 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index c4de23080d1b..56638b7f07fc 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -349,7 +349,7 @@ struct dynamic_dma_window_prop {
__be32  window_shift;   /* ilog2(tce_window_size) */
 };
 
-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -369,13 +369,14 @@ struct ddw_create_response {
u32 addr_lo;
 };
 
-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
 /* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
 /* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
+#define DEFAULT_DMA_WIN "ibm,dma-window"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -706,15 +707,18 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
 dn);
 
-   /* Find nearest ibm,dma-window, walking up the device tree */
+   /*
+* Find nearest ibm,dma-window (default DMA window), walking up the
+* device tree
+*/
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
-   dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
+   dma_window = of_get_property(pdn, DEFAULT_DMA_WIN, NULL);
if (dma_window != NULL)
break;
}
 
if (dma_window == NULL) {
-   pr_debug("  no ibm,dma-window property !\n");
+   pr_debug("  no %s property !\n", DEFAULT_DMA_WIN);
return;
}
 
@@ -810,11 +814,11 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
 
ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas 
returned "
+   pr_debug("%pOF: successfully removed dma window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
@@ -842,7 +846,7 @@ static int remove_ddw(struct device_node *np, bool 
remove_prop, const char *win_
 
ret = of_remove_property(np, win);
if (ret)
-   pr_warn("%pOF: failed to remove direct window property: %d\n",
+   pr_warn("%pOF: failed to remove dma window property: %d\n",
np, ret);
return 0;
 }
@@ -886,34 +890,34 @@ static phys_addr_t ddw_memory_hotplug_max(void)
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool 
*direct_mapping)
 {
-   struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
+   struct dma_win *window;
+   const struct dynamic_dma_window_prop *dma64;
unsigned long window_size;
bool found = false;
 
-   spin_lock(&direct_window_list_lock);
+   spin_lock(&dma_win_list_lock);
	/* check if we already created a window and dupe that config if so */
-   list_for_each_entry(window, &direct_window_list, list) {
+   list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
-   direct64 = window->prop;
-   *dma_addr = be64_to_cpu(direct64->dma_base);
+   dma64 = window->prop;
+   

[PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

The default DMA window uses 4k pages instead of 64k pages, and since
the number of pages (TCEs) may stay the same (in the pHyp case), making
use of DDW instead of the default DMA window for indirect mapping can
expand the amount of memory that can be DMA-mapped by up to 16x.
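
As a rough check, assuming a fixed number of preallocated TCEs N:

  mapped(default window) = N * 4k
  mapped(DDW)            = N * 64k = 16 * mapped(default window)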

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window in order to use DDW with indirect
mapping is only allowed if there is no IOMMU memory currently allocated
in the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Note: ddw_memory_hotplug_max() was moved up so it can be used in
find_existing_ddw().

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 160 -
 1 file changed, 103 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9b7c03652e72..c4de23080d1b 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -846,10 +847,48 @@ static int remove_ddw(struct device_node *np, bool 
remove_prop, const char *win_
return 0;
 }
 
-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)
+{
+   phys_addr_t max_addr = memory_hotplug_max();
+   struct device_node *memory;
+
+   /*
+* The "ibm,pmemory" can appear anywhere in the address space.
+* Assuming it is still backed by page structs, set the upper limit
+* for the huge DMA window as MAX_PHYSMEM_BITS.
+*/
+   if (of_find_node_by_type(NULL, "ibm,pmemory"))
+   return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+   (phys_addr_t)-1 : (1ULL << MAX_PHYSMEM_BITS);
+
+   for_each_node_by_type(memory, "memory") {
+   unsigned long start, size;
+   int n_mem_addr_cells, n_mem_size_cells, len;
+   const __be32 *memcell_buf;
+
+   memcell_buf = of_get_property(memory, "reg", );
+   if (!memcell_buf || len <= 0)
+   continue;
+
+   n_mem_addr_cells = of_n_addr_cells(memory);
+   n_mem_size_cells = of_n_size_cells(memory);
+
+   start = of_read_number(memcell_buf, n_mem_addr_cells);
+   memcell_buf += n_mem_addr_cells;
+   size = of_read_number(memcell_buf, n_mem_size_cells);
+   memcell_buf += n_mem_size_cells;
+
+   max_addr = max_t(phys_addr_t, max_addr, start + size);
+   }
+
+   return max_addr;
+}
+
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
+   unsigned long window_size;
bool found = false;
 
	spin_lock(&direct_window_list_lock);
@@ -858,6 +897,10 @@ static bool find_existing_ddw(struct device_node *pdn, u64 
*dma_addr)
if (window->device == pdn) {
direct64 = window->prop;
*dma_addr = be64_to_cpu(direct64->dma_base);
+
+   window_size = (1UL << 
be32_to_cpu(direct64->window_shift));
+   *direct_mapping = (window_size >= 
ddw_memory_hotplug_max());
+
found = true;
break;
}
@@ -912,6 +955,7 @@ static int find_existing_ddw_windows(void)
return 0;
 

[PATCH v2 12/14] powerpc/pseries/iommu: Find existing DDW with given property name

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and call it with the current property name.

This will allow more property names to be checked in
find_existing_ddw_windows(), enabling the creation of new property names,
like the one that will be used for indirect mapping.
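
As a sketch of the intended follow-up call site (DMA64_PROPNAME is the
property name introduced later in this series for the indirect-mapping
window; this is not code from this patch):

static int find_existing_ddw_windows(void)
{
	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return 0;

	find_existing_ddw_windows_named(DIRECT64_PROPNAME);
	find_existing_ddw_windows_named(DMA64_PROPNAME);

	return 0;
}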

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index f6a65ecd1db5..9b7c03652e72 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -882,24 +882,21 @@ static struct direct_window *ddw_list_new_entry(struct 
device_node *pdn,
return window;
 }
 
-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
 {
int len;
struct device_node *pdn;
struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
-
-   if (!firmware_has_feature(FW_FEATURE_LPAR))
-   return 0;
+   const struct dynamic_dma_window_prop *dma64;
 
-   for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-   direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true, DIRECT64_PROPNAME);
+   for_each_node_with_property(pdn, name) {
+   dma64 = of_get_property(pdn, name, &len);
+   if (!dma64 || len < sizeof(*dma64)) {
+   remove_ddw(pdn, true, name);
continue;
}
 
-   window = ddw_list_new_entry(pdn, direct64);
+   window = ddw_list_new_entry(pdn, dma64);
if (!window)
break;
 
@@ -907,6 +904,14 @@ static int find_existing_ddw_windows(void)
list_add(>list, _window_list);
spin_unlock(_window_list_lock);
}
+}
+
+static int find_existing_ddw_windows(void)
+{
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return 0;
+
+   find_existing_ddw_windows_named(DIRECT64_PROPNAME);
 
return 0;
 }
-- 
2.25.4



[PATCH v2 11/14] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index abd36b257725..f6a65ecd1db5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -818,31 +818,32 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
 {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
 
+   win = of_find_property(np, win_name, NULL);
+   if (!win)
+   return -EINVAL;
+
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
	 &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
-
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win)
-   return;
+   return 0;
 
if (win->length >= sizeof(struct dynamic_dma_window_prop))
remove_dma_window(np, ddw_avail, win);
 
if (!remove_prop)
-   return;
+   return 0;
 
ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
@@ -894,7 +895,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
	direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
 
@@ -1325,7 +1326,7 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
win64 = NULL;
 
 out_win_del:
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
 
 out_failed:
if (default_win_removed)
@@ -1480,7 +1481,7 @@ static int iommu_reconfig_notifier(struct notifier_block 
*nb, unsigned long acti
 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
-- 
2.25.4



[PATCH v2 10/14] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outsite iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before used, move iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().

The tce_exchange_pseries() also had to be moved up, since it's used in
iommu_table_lpar_multi_ops.xchg_no_kill.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 149 -
 1 file changed, 72 insertions(+), 77 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 510ccb0521af..abd36b257725 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -495,12 +495,62 @@ static int tce_setrange_multi_pSeriesLP(unsigned long 
start_pfn,
return rc;
 }
 
+#ifdef CONFIG_IOMMU_API
+static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned
+   long *tce, enum dma_data_direction *direction,
+   bool realmode)
+{
+   long rc;
+   unsigned long ioba = (unsigned long)index << tbl->it_page_shift;
+   unsigned long flags, oldtce = 0;
+   u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+   unsigned long newtce = *tce | proto_tce;
+
+   spin_lock_irqsave(>large_pool.lock, flags);
+
+   rc = plpar_tce_get((u64)tbl->it_index, ioba, &oldtce);
+   if (!rc)
+   rc = plpar_tce_put((u64)tbl->it_index, ioba, newtce);
+
+   if (!rc) {
+   *direction = iommu_tce_direction(oldtce);
+   *tce = oldtce & ~(TCE_PCI_READ | TCE_PCI_WRITE);
+   }
+
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
+
+   return rc;
+}
+#endif
+
 static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn,
unsigned long num_pfn, void *arg)
 {
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
+static inline void _iommu_table_setparms(struct iommu_table *tbl, unsigned long busno,
+					 unsigned long liobn, unsigned long win_addr,
+					 unsigned long window_size, unsigned long page_shift,
+					 unsigned long base, struct iommu_table_ops *table_ops)
+{
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_size = window_size >> page_shift;
+   tbl->it_page_shift = page_shift;
+   tbl->it_base = base;
+   tbl->it_blocksize = 16;
+   tbl->it_type = TCE_PCI;
+   tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops = {
+   .set = tce_build_pSeries,
+   .clear = tce_free_pSeries,
+   .get = tce_get_pseries
+};
+
 static void iommu_table_setparms(struct pci_controller *phb,
 struct device_node *dn,
 struct iommu_table *tbl)
@@ -509,8 +559,13 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
const unsigned long *basep;
const u32 *sizep;
 
-   node = phb->dn;
+   /* Test if we are going over 2GB of DMA space */
+   if (phb->dma_window_base_cur + phb->dma_window_size > 0x8000ul) {
+   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
+   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+   }
 
+   node = phb->dn;
basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +574,25 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
return;
}
 
-   tbl->it_base = (unsigned long)__va(*basep);
+   _iommu_table_setparms(tbl, phb->bus->number, 0, phb->dma_window_base_cur,
+			 phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+			 (unsigned long)__va(*basep), &iommu_table_pseries_ops);
 
if (!is_kdump_kernel())
memset((void *)tbl->it_base, 0, *sizep);
 
-   tbl->it_busno = phb->bus->number;
-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
-   /* Units of tce entries */
-   tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
-   /* Test if we are going over 2GB of DMA space */
-   if (phb->dma_window_base_cur + phb->dma_window_size > 0x8000ul) {
-   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
-   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-   }
-
phb->dma_window_base_cur += 

[PATCH v2 09/14] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Code used to create a ddw property, previously scattered in
enable_ddw(), is now gathered in ddw_property_create(), which deals with
allocating and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
over all memory, which looks like the most expensive operation, only if
everything else succeeds.
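
A condensed sketch of the reordered tail of enable_ddw() (labels and
exact arguments are illustrative, not the full function body):

	win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn,
				    win_addr, page_shift, len);
	if (!win64)
		goto out_failed;

	ret = of_add_property(pdn, win64);
	if (ret)
		goto out_free_prop;

	window = ddw_list_new_entry(pdn, win64->value);
	if (!window)
		goto out_del_prop;	/* of_remove_property() undoes the add */

	/* most expensive step last, only runs if everything else worked */
	ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
				    win64->value,
				    tce_setrange_multi_pSeriesLP_walk);
	if (ret)
		goto out_free_window;

	spin_lock(&direct_window_list_lock);
	list_add(&window->list, &direct_window_list);
	spin_unlock(&direct_window_list_lock);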

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 106 +++--
 1 file changed, 63 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index bba8584a8f9d..510ccb0521af 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -177,7 +177,7 @@ static int tce_build_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
-  (npages_start - (npages + 1)));
+  (npages_start - (npages + 1)));
break;
}
 
@@ -215,7 +215,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
   tceshift, npages, uaddr,
-  direction, attrs);
+  direction, attrs);
}
 
local_irq_save(flags);  /* to protect tcep and the page behind it */
@@ -269,7 +269,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
tce_freemulti_pSeriesLP(tbl, tcenum_start,
-   (npages_start - (npages + limit)));
+   (npages_start - (npages + limit)));
return ret;
}
 
@@ -1121,6 +1121,35 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64);
+   kfree(win64->name);
+   kfree(win64->value);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1138,12 +1167,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 max_addr;
+   u64 max_addr, win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
-   struct property *win64;
-   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
 
@@ -1245,72 +1273,64 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
len = order_base_2(max_addr);
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-   dev_info(>dev,
-  

[PATCH v2 08/14] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

enable_ddw() currently returns the address of the DMA window, which is
considered invalid if it has the value 0x00.

Also, an address returned from find_existing_ddw() is only considered
valid if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different from 0x00, but it will not map the whole
partition.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 30 --
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4c17d27dcf3..bba8584a8f9d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -849,24 +849,25 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
-   u64 dma_addr = 0;
+   bool found = false;
 
	spin_lock(&direct_window_list_lock);
	/* check if we already created a window and dupe that config if so */
	list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
-   dma_addr = be64_to_cpu(direct64->dma_base);
+   *dma_addr = be64_to_cpu(direct64->dma_base);
+   found = true;
break;
}
}
	spin_unlock(&direct_window_list_lock);
 
-   return dma_addr;
+   return found;
 }
 
 static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
@@ -1129,15 +1130,15 @@ static void reset_dma_window(struct pci_dev *dev, 
struct device_node *par_dn)
  * pdn: the parent pe node with the ibm,dma_window property
  * Future: also check if we can remap the base window for our base page size
  *
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if it can map all pages (direct mapping), false otherwise.
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
int len, ret;
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 dma_addr, max_addr;
+   u64 max_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
@@ -1148,8 +1149,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
	mutex_lock(&direct_window_init_mutex);
 
-   dma_addr = find_existing_ddw(pdn);
-   if (dma_addr != 0)
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
goto out_unlock;
 
/*
@@ -1296,7 +1296,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
list_add(>list, _window_list);
spin_unlock(_window_list_lock);
 
-   dma_addr = be64_to_cpu(ddwprop->dma_base);
+   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;
 
 out_free_window:
@@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
kfree(win64->name);
kfree(win64->value);
kfree(win64);
+   win64 = NULL;
 
 out_failed:
if (default_win_removed)
@@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
 out_unlock:
	mutex_unlock(&direct_window_init_mutex);
-   return dma_addr;
+   return win64;
 }
 
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct 
pci_dev *pdev, u64 dma_mask)
break;
}
 
-   if (pdn && PCI_DN(pdn)) {
-   pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
-   if (pdev->dev.archdata.dma_offset)
-   return true;
-   }
+   if (pdn && PCI_DN(pdn))
+   return enable_ddw(pdev, pdn);
 
return false;
 }
-- 
2.25.4



[PATCH v2 07/14] powerpc/pseries/iommu: Add ddw_list_new_entry() helper

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

There are two functions creating direct_window_list entries in a
similar way, so create ddw_list_new_entry() to avoid duplication and
simplify those functions.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 32 +-
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 49008d2e217d..e4c17d27dcf3 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -869,6 +869,21 @@ static u64 find_existing_ddw(struct device_node *pdn)
return dma_addr;
 }
 
+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+						const struct dynamic_dma_window_prop *dma64)
+{
+   struct direct_window *window;
+
+   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   if (!window)
+   return NULL;
+
+   window->device = pdn;
+   window->prop = dma64;
+
+   return window;
+}
+
 static int find_existing_ddw_windows(void)
 {
int len;
@@ -881,18 +896,15 @@ static int find_existing_ddw_windows(void)
 
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
	direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64)
-   continue;
-
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
-   if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-   kfree(window);
+   if (!direct64 || len < sizeof(*direct64)) {
remove_ddw(pdn, true);
continue;
}
 
-   window->device = pdn;
-   window->prop = direct64;
+   window = ddw_list_new_entry(pdn, direct64);
+   if (!window)
+   break;
+
	spin_lock(&direct_window_list_lock);
	list_add(&window->list, &direct_window_list);
	spin_unlock(&direct_window_list_lock);
@@ -1261,7 +1273,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   window = ddw_list_new_entry(pdn, ddwprop);
if (!window)
goto out_clear_window;
 
@@ -1280,8 +1292,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_free_window;
}
 
-   window->device = pdn;
-   window->prop = ddwprop;
	spin_lock(&direct_window_list_lock);
	list_add(&window->list, &direct_window_list);
	spin_unlock(&direct_window_list_lock);
-- 
2.25.4



[PATCH v2 05/14] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Having a function to check if the iommu table has any allocation helps
decide whether a tbl can be reset to use a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This makes iommu_table_release_pages()
unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.
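
A minimal sketch of the intended caller in enable_ddw() (per patch 13's
description; the surrounding names and label are assumptions, not code
from this series):

	/* only replace the window if nothing is currently mapped in it */
	if (iommu_table_in_use(tbl)) {
		dev_warn(&dev->dev, "current IOMMU mappings present, not replacing window\n");
		goto out_failed;
	}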

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c  | 61 +++-
 2 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 5032f1593299..2913e5c8b1f8 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ffb2637dc82b..c838da3d8f32 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -655,34 +655,21 @@ static void iommu_table_reserve_pages(struct iommu_table 
*tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
 
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end || res_start < tbl->it_offset ||
+   res_end > (tbl->it_offset + tbl->it_size)) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
+
tbl->it_reserved_start = res_start;
tbl->it_reserved_end = res_end;
 
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-   if (res_start && res_end &&
-   (tbl->it_offset + tbl->it_size < res_start ||
-res_end < tbl->it_offset))
-   return;
-
for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-   int i;
-
-   /*
-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
-
-   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   clear_bit(i - tbl->it_offset, tbl->it_map);
-}
-
 /*
  * Build a iommu_table structure.  This contains a bit map which
  * is used to manage allocation of the tce space.
@@ -743,6 +730,23 @@ struct iommu_table *iommu_init_table(struct iommu_table 
*tbl, int nid,
return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->it_map, end, start) != end)
+   return true;
+
+   start = tbl->it_reserved_end - tbl->it_offset;
+   end = tbl->it_size;
+   return find_next_bit(tbl->it_map, end, start) != end;
+
+}
+
 static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
@@ -759,10 +763,8 @@ static void iommu_table_free(struct kref *kref)
return;
}
 
-   iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
-   if (!bitmap_empty(tbl->it_map, tbl->it_size))
+   if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
@@ -1068,18 +1070,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock(>pools[i].lock);
 
-   iommu_table_release_pages(tbl);
-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-   } else {
-   memset(tbl->it_map, 0xff, sz);
}
 
+   memset(tbl->it_map, 0xff, sz);
+
for (i = 0; i < tbl->nr_pools; i++)
spin_unlock(>pools[i].lock);

[PATCH v2 06/14] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Creates a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 4eff3a6a441e..49008d2e217d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
-   struct iommu_table_group *table_group;
struct iommu_table *tbl;
 
-   table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
-  node);
-   if (!table_group)
-   return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
-   goto free_group;
+   return NULL;
 
INIT_LIST_HEAD_RCU(>it_group_list);
kref_init(>it_kref);
+   return tbl;
+}
 
-   table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+   struct iommu_table_group *table_group;
+
+   table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+   if (!table_group)
+   return NULL;
 
-   return table_group;
+   table_group->tables[0] = iommu_pseries_alloc_table(node);
+   if (table_group->tables[0])
+   return table_group;
 
-free_group:
kfree(table_group);
return NULL;
 }
-- 
2.25.4



[PATCH v2 04/14] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

As of today, an iommu_range_alloc() for !largealloc (npages <= 15) can
only use 3/4 of the available pages, given pages in the largepool are
not available for !largealloc.

This could mean some drivers not being able to fully use all the
available pages for the DMA window.

Add pages on largepool as a last resort for !largealloc, making all pages
of the DMA window available.
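
For reference, a paraphrase of the pool layout set up in
iommu_init_table() (the sizes are the point here, not the exact code):

	/* small pools share the first 3/4 of the table */
	tbl->poolsize = (tbl->it_size * 3 / 4) / tbl->nr_pools;
	/* the large pool owns the remaining 1/4 */
	tbl->large_pool.start = tbl->poolsize * tbl->nr_pools;
	tbl->large_pool.end   = tbl->it_size;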

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7961645a6980..ffb2637dc82b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -261,6 +261,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
pass++;
goto again;
 
+   } else if (pass == tbl->nr_pools + 1) {
+   /* Last resort: try largepool */
+   spin_unlock(>lock);
+   pool = >large_pool;
+   spin_lock(>lock);
+   pool->hint = pool->start;
+   pass++;
+   goto again;
+
} else {
/* Give up */
spin_unlock_irqrestore(&(pool->lock), flags);
-- 
2.25.4



[PATCH v2 03/14] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Currently both iommu_alloc_coherent() and iommu_free_coherent() align the
desired allocation size to PAGE_SIZE, and get system pages and IOMMU
mappings (TCEs) for that value.

When IOMMU_PAGE_SIZE < PAGE_SIZE, this behavior may cause unnecessary
TCEs to be created for mapping the whole system page.

Example:
- PAGE_SIZE = 64k, IOMMU_PAGE_SIZE() = 4k
- iommu_alloc_coherent() is called for 128 bytes
- 1 system page (64k) is allocated
- 16 IOMMU pages (16 x 4k) are allocated (16 TCEs used)

It would be enough to use a single TCE for this, so 15 TCEs are
wasted in the process.

Update iommu_*_coherent() to make sure the size alignment happens only
for IOMMU_PAGE_SIZE() before calling iommu_alloc() and iommu_free().

Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
with IOMMU_PAGE_ALIGN(n, tbl), which is easier to read and does the
same.
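
With the patch, the 128-byte example above works out as follows (a rough
check, with PAGE_SIZE = 64k and IOMMU_PAGE_SIZE() = 4k):

  size    = PAGE_ALIGN(128)            = 64k -> 1 system page allocated
  size_io = IOMMU_PAGE_ALIGN(128, tbl) = 4k  -> nio_pages = 1 (1 TCE)

instead of nio_pages = 64k >> 12 = 16 (16 TCEs) before.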

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kernel/iommu.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..7961645a6980 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
}
 
if (dev)
-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
else
-   boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
 
n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
@@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
unsigned int order;
unsigned int nio_pages, io_order;
struct page *page;
+   size_t size_io = size;
 
size = PAGE_ALIGN(size);
order = get_order(size);
@@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct 
iommu_table *tbl,
memset(ret, 0, size);
 
/* Set up tces to cover the allocated range */
-   nio_pages = size >> tbl->it_page_shift;
-   io_order = get_iommu_order(size, tbl);
+   size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
+   nio_pages = size_io >> tbl->it_page_shift;
+   io_order = get_iommu_order(size_io, tbl);
mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
  mask >> tbl->it_page_shift, io_order, 0);
if (mapping == DMA_MAPPING_ERROR) {
@@ -900,10 +901,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
 void *vaddr, dma_addr_t dma_handle)
 {
if (tbl) {
-   unsigned int nio_pages;
+   size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
+   unsigned int nio_pages = size_io >> tbl->it_page_shift;
 
-   size = PAGE_ALIGN(size);
-   nio_pages = size >> tbl->it_page_shift;
iommu_free(tbl, dma_handle, nio_pages);
size = PAGE_ALIGN(size);
free_pages((unsigned long)vaddr, get_order(size));
-- 
2.25.4



[PATCH v2 02/14] powerpc/pseries/iommu: Makes sure IOMMU_PAGE_SIZE <= PAGE_SIZE

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Having system pagesize < IOMMU pagesize may cause a page owned by another
process/VM to be written by a buggy driver/device.

As it's intended to use DDW for indirect mapping, it's possible to get
a configuration where (PAGE_SIZE = 4k) < (IOMMU_PAGE_SIZE() = 64k).

To avoid this, make sure create_ddw() can only use pagesize <= PAGE_SIZE.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9db3927607a4..4eff3a6a441e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1206,17 +1206,20 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
+
+   /* Choose the biggest pagesize available <= PAGE_SHIFT */
+   if ((query.page_size & 4)) {
+   page_shift = 24; /* 16MB - Hugepages */
+   } else if ((query.page_size & 2) && PAGE_SHIFT >= 16) {
page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
+   } else if ((query.page_size & 1) && PAGE_SHIFT >= 12) {
page_shift = 12; /* 4kB */
} else {
dev_dbg(>dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;
}
+
/* verify the window * number of ptes will map the partition */
/* check largest block * page size > max memory hotplug addr */
max_addr = ddw_memory_hotplug_max();
-- 
2.25.4



[PATCH v2 01/14] powerpc/pseries/iommu: Replace hard-coded page shift

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5, shows
an RPN of 52 bits and considers a 12-bit pageshift, so there should be
no need to use TCE_RPN_MASK, which masks out any bit above 40 in the rpn.
Its usage was removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers do not always have
a tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/tce.h |  8 --
 arch/powerpc/platforms/pseries/iommu.c | 39 +++---
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
 #define TCE_VB 0
 #define TCE_PCI		1
 
-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
 #define TCE_ENTRY_SIZE 8   /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK   0xfffffffffful  /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
 #define TCE_VALID  0x800   /* TCE valid */
 #define TCE_ALLIO  0x400   /* TCE valid for all lpars */
 #define TCE_PCI_WRITE  0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..9db3927607a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long 
index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
 
proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
long index,
 
while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-   *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
 
-   uaddr += TCE_PAGE_SIZE;
+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table 
*tbl, long index)
return be64_to_cpu(*tcep);
 }
 
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
 
while (npages--) {
-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  tbl->it_page_shift, npages, uaddr,
+  tceshift, npages, uaddr,
   direction, attrs);
}
 
@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return 

[PATCH v2 00/14] DDW Indirect Mapping

2020-09-11 Thread Leonardo Bras
Cc: linuxppc-dev@lists.ozlabs.org, linux-ker...@vger.kernel.org, 

##
This patchset is based on top of:
https://github.com/linuxppc/linux/tree/next
that already contains
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=194179&state=%2A&archive=both
##

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

Using a DDW instead of the default DMA window may allow expanding the
amount of memory that can be DMA-mapped, given the number of pages (TCEs)
may stay the same and the default DMA window offers only 4k pages
while a DDW may offer larger pages (64k).

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 makes sure IOMMU_PAGE_SIZE() <= PAGE_SIZE, to avoid mapping
pages from other processes.

Patch #3 will save TCEs for small allocations when
IOMMU_PAGE_SIZE() < PAGE_SIZE.

Patch #4 lets small allocations use the largepool if there is no more
space left in the other pools, thus allowing the whole DMA window to be
used by smaller allocations.

Patch #5 introduces iommu_table_in_use(), and replaces manual bit-field
checking where it's used. It will be used for aborting enable_ddw() if
there is any current iommu allocation and we are trying single-window
indirect mapping.

Patch #6 introduces iommu_pseries_alloc_table() that will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #7 adds helpers for adding DDWs in the list.

Patch #8 refactors enable_ddw() so it returns whether direct mapping is
possible, instead of the DMA offset. It helps the next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #9 brings a new helper to simplify enable_ddw(), allowing
some reorganization for introducing indirect mapping DDW.

Patch #10 adds a new helper, _iommu_table_setparms(), and uses it in the
other *setparms*() functions to fill the iommu_table. It will also be
used for creating a new iommu_table for indirect mapping.

Patch #11 updates remove_dma_window() to accept different property names,
so we can introduce a new property for indirect mapping.

Patch #12 extracts find_existing_ddw_windows() into
find_existing_ddw_windows_named(), and calls it with its property name.
This will be useful when the property for indirect mapping is created,
so we can search the device-tree for both properties.

Patch #13:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window as it improves
performance. Also, update the iommu_table and re-generate the pools.
It introduces a new property name for DDW with indirect DMA mapping.

Patch #14:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now also be used for indirect mapping if direct mapping is
not available.

All patches were tested into an LPAR with an Ethernet VF:
4005:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family
[ConnectX-4 Virtual Function]

Patchset was tested with a 64GB DDW which did not map the whole
partition (128G).

Leonardo Bras (14):
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/pseries/iommu: Makes sure IOMMU_PAGE_SIZE <= PAGE_SIZE
  powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
  powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
helper
  powerpc/pseries/iommu: Update remove_dma_window() to accept property
name
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/include/asm/iommu.h   |   1 +
 arch/powerpc/include/asm/tce.h |   8 -
 arch/powerpc/kernel/iommu.c|  86 ++--
 arch/powerpc/platforms/pseries/iommu.c | 648 ++---
 4 files changed, 417 insertions(+), 326 deletions(-)

-- 
2.25.4



Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Linus Torvalds
On Fri, Sep 11, 2020 at 5:20 AM Alexander Gordeev
 wrote:
>
> What if the entry is still pud_present, but got remapped after
> READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere?

That can't happen.

The GUP walk doesn't hold any locks, but it *is* done with interrupts
disabled, and anybody who is modifying the page tables needs to do the
TLB flush, and/or RCU-free them.

The interrupt disable means that on architectures where the TLB flush
involves an IPI, it will be delayed until afterwards, but it also acts
as a big RCU read lock hammer.

So the page tables can get modified under us, but the old pages won't
be released and re-used.

Linus
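
In code, the safe pattern amounts to this shape (a rough sketch, not the
actual gup.c code; it assumes a pmd_offset() variant that derives the
next level from the snapshot instead of re-reading *pudp, as proposed in
this series):

	pud_t pud = READ_ONCE(*pudp);	/* snapshot the entry once */

	if (!pud_present(pud))
		return 0;
	/* every later use works on 'pud', never on *pudp again */
	pmdp = pmd_offset(&pud, address);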


Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2020-09-11 Thread Michael Ellerman
Michal Suchánek  writes:
> Hello,
>
> does this logic apply to "Unrecoverable System Reset" as well?

Which logic do you mean?

We do call die() before checking MSR_RI in system_reset_exception():

/*
 * No debugger or crash dump registered, print logs then
 * panic.
 */
die("System Reset", regs, SIGABRT);
  
mdelay(2*MSEC_PER_SEC); /* Wait a little while for others to print */
add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
nmi_panic(regs, "System Reset");
  
  out:
  #ifdef CONFIG_PPC_BOOK3S_64
BUG_ON(get_paca()->in_nmi == 0);
if (get_paca()->in_nmi > 1)
die("Unrecoverable nested System Reset", regs, SIGABRT);
  #endif
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
die("Unrecoverable System Reset", regs, SIGABRT);


So you should see the output from die("System Reset", ...) even if
MSR[RI] was clear when you took the system reset.

cheers

> On Tue, Jan 22, 2019 at 02:11:24PM +, Christophe Leroy wrote:
>> Looks like book3s/32 doesn't set RI on machine check, so
>> checking RI before calling die() will always be fatal
>> allthought this is not an issue in most cases.
>> 
>> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
>> interrupt")
>> Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check 
>> interrupts")
>> Signed-off-by: Christophe Leroy 
>> Cc: sta...@vger.kernel.org
>> ---
>>  arch/powerpc/kernel/traps.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
>> index 64936b60d521..c740f8bfccc9 100644
>> --- a/arch/powerpc/kernel/traps.c
>> +++ b/arch/powerpc/kernel/traps.c
>> @@ -763,15 +763,15 @@ void machine_check_exception(struct pt_regs *regs)
>>  if (check_io_access(regs))
>>  goto bail;
>>  
>> -/* Must die if the interrupt is not recoverable */
>> -if (!(regs->msr & MSR_RI))
>> -nmi_panic(regs, "Unrecoverable Machine check");
>> -
>>  if (!nested)
>>  nmi_exit();
>>  
>>  die("Machine check", regs, SIGBUS);
>>  
>> +/* Must die if the interrupt is not recoverable */
>> +if (!(regs->msr & MSR_RI))
>> +nmi_panic(regs, "Unrecoverable Machine check");
>> +
>>  return;
>>  
>>  bail:
>> -- 
>> 2.13.3
>> 


Re: [PATCH v1] pseries/hotplug-memory: hot-add: skip redundant LMB lookup

2020-09-11 Thread Michael Ellerman
Hi Scott,

Scott Cheloha  writes:
> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
> to determine which node id (nid) to use when later calling __add_memory().
>
...
>
> Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000

Nice little machine you got there :P

> LMBs on an upatched kernel took ~3.5 hours while a patched kernel
> completed the same operation in ~2 hours:

...

> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 0ea976d1cac4..9cd572440175 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -595,6 +595,8 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
> u32 drc_index)
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> +extern int of_drconf_to_nid_single(struct drmem_lmb *);
> +

This needs to go in a header.

It should probably go in arch/powerpc/include/asm/topology.h

cheers
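
A minimal sketch of that move (assuming the prototype from the patch
below stays as-is):

	/* arch/powerpc/include/asm/topology.h */
	struct drmem_lmb;
	int of_drconf_to_nid_single(struct drmem_lmb *lmb);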

>  static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  {
>   unsigned long block_sz;
> @@ -611,8 +613,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  
>   block_sz = memory_block_size_bytes();
>  
> - /* Find the node id for this address. */
> - nid = memory_add_physaddr_to_nid(lmb->base_addr);
> + /* Find the node id for this address.  Fake one if necessary. */
> + nid = of_drconf_to_nid_single(lmb);
> + if (nid < 0 || !node_possible(nid))
> + nid = first_online_node;
>  
>   /* Add the memory */
>   rc = __add_memory(nid, lmb->base_addr, block_sz);
> -- 
> 2.24.1


Re: [PATCH] powerpc/powermac: Fix low_sleep_handler with KUAP and KUEP

2020-09-11 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 11/09/2020 à 01:56, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>>> low_sleep_handler() has an hardcoded restore of segment registers
>>> that doesn't take KUAP and KUEP into account.
>>>
>>> Use head_32's load_segment_registers() routine instead.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access 
>>> Protection")
>>> Fixes: 31ed2b13c48d ("powerpc/32s: Implement Kernel Userspace Execution 
>>> Prevention.")
>>> Cc: sta...@vger.kernel.org
>>> ---
>>>   arch/powerpc/platforms/powermac/sleep.S | 9 +
>>>   1 file changed, 1 insertion(+), 8 deletions(-)
>> 
>> Doesn't build? pmac32_defconfig, gcc 9.3.0:
>> 
>> ld: arch/powerpc/platforms/powermac/sleep.o: in function `core99_wake_up':
>> (.text+0x25c): undefined reference to `load_segment_registers'
>> 
>> Missing _GLOBAL() presumably?
>
> Oops .. :(
>
> v2 sent out.

Thanks.

cheers


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Alexander Gordeev
On Thu, Sep 10, 2020 at 07:11:16PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:
> 
> > Or am I way off here, and it really is possible (aside from the current
> > s390 situation) to observe something that "is no longer a page table"?
> 
> Yes, that is the issue. Remember there is no locking for GUP
> fast. While a page table cannot be freed there is nothing preventing
> the page table entry from being concurrently modified.
> 
> Without the stack variable it looks like this:
> 
>pud_t pud = READ_ONCE(*pudp);
>if (!pud_present(pud))
> return
>pmd_offset(pudp, address);
> 
> And pmd_offset() expands to
> 
> return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
> 
> Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
> of *pud can change, eg to !pud_present.
> 
> Then pud_page_vaddr(*pud) will crash. It is not use after free, it
> is using data that has not been validated.

One thing I ask myself and it is probably a good moment to wonder.

What if the entry is still pud_present, but got remapped after
READ_ONCE(*pudp)? IOW, it is still valid, but points elsewhere?

> Jason


Re: [PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

2020-09-11 Thread Thadeu Lima de Souza Cascardo
On Fri, Sep 11, 2020 at 12:53:45PM +0200, Greg Kurz wrote:
> ENOTSUPP is a linux only thingy, the value of which is unknown to
> userspace, not to be confused with ENOTSUP which linux maps to
> EOPNOTSUPP, as permitted by POSIX [1]:
> 
> [EOPNOTSUPP]
> Operation not supported on socket. The type of socket (address family
> or protocol) does not support the requested operation. A conforming
> implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].
> 
> Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
> - KVM_GET_FPU for Book3s and BookE
> - KVM_SET_FPU for Book3s and BookE
> - KVM_GET_DIRTY_LOG for BookE
> 
> This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
> KVM_SET_FPU ioctls on POWER anyway since they are not supported,
> and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
> 
> Signed-off-by: Greg Kurz 

Agreed. ENOTSUPP should never be returned to userspace.

Acked-by: Thadeu Lima de Souza Cascardo 

> ---
>  arch/powerpc/kvm/book3s.c |4 ++--
>  arch/powerpc/kvm/booke.c  |6 +++---
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 1fce9777af1c..44bf567b6589 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -558,12 +558,12 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
> struct kvm_regs *regs)
>  
>  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index 3e1c9f08e302..b1abcb816439 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -1747,12 +1747,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>  
>  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
> @@ -1773,7 +1773,7 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct 
> kvm_memory_slot *memslot)
>  
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
> - return -ENOTSUPP;
> + return -EOPNOTSUPP;
>  }
>  
>  void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> 
> 


Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-09-11 Thread Michael Ellerman
Srikar Dronamraju  writes:
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
>
> Lets stop that assumption. cpu_l2_cache_mask is a superset of
> cpu_sibling_mask if and only if shared_caches is set.

I'm seeing oopses with this:

[0.117392][T1] smp: Bringing up secondary CPUs ...
[0.156515][T1] smp: Brought up 2 nodes, 2 CPUs
[0.158265][T1] numa: Node 0 CPUs: 0
[0.158520][T1] numa: Node 1 CPUs: 1
[0.167453][T1] BUG: Unable to handle kernel data access on read at 
0x800041228298
[0.167992][T1] Faulting instruction address: 0xc018c128
[0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1]
[0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[0.169417][T1] Modules linked in:
[0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.9.0-rc2-00095-g7430ad5aa700 #209
[0.170305][T1] NIP:  c018c128 LR: c018c0cc CTR: 
c004dce0
[0.170498][T1] REGS: c0007e343880 TRAP: 0380   Not tainted  
(5.9.0-rc2-00095-g7430ad5aa700)
[0.170602][T1] MSR:  82009033   CR: 
4400  XER: 
[0.170985][T1] CFAR: c018c288 IRQMASK: 0
[0.170985][T1] GPR00:  c0007e343b10 
c173e400 4000
[0.170985][T1] GPR04:  0800 
0800 
[0.170985][T1] GPR08:  c122c298 
c0003fffc000 c0007fd05ce8
[0.170985][T1] GPR12: c0007e0119f8 c193 
8ade 
[0.170985][T1] GPR16: c0007e3c0640 0917 
c0007e3c0658 0008
[0.170985][T1] GPR20: c15d0bb8 8ade 
c0f57400 c1817c28
[0.170985][T1] GPR24: c176dc80 c0007e3c0890 
c0007e3cfe00 
[0.170985][T1] GPR28: c1772310 c0007e011900 
c0007e3c0800 0001
[0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0
[0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0
[0.173186][T1] Call Trace:
[0.173484][T1] [c0007e343b10] [c018bfe8] 
build_sched_domains+0x6c8/0x14b0 (unreliable)
[0.173821][T1] [c0007e343c50] [c018dcdc] 
sched_init_domains+0xec/0x130
[0.174037][T1] [c0007e343ca0] [c10d59d8] 
sched_init_smp+0x50/0xc4
[0.174207][T1] [c0007e343cd0] [c10b45c4] 
kernel_init_freeable+0x1b4/0x378
[0.174378][T1] [c0007e343db0] [c00129fc] 
kernel_init+0x24/0x158
[0.174740][T1] [c0007e343e20] [c000d9d0] 
ret_from_kernel_thread+0x5c/0x6c
[0.175050][T1] Instruction dump:
[0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e 
913e002c 41820034
[0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> f93e0080 
7d404828 314a0001
[0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]---
[0.178512][T1]
[1.180458][T1] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x000b

That's qemu:

qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \
  -kernel build~/vmlinux \
  -m 2G,slots=2,maxmem=4G \
  -object memory-backend-ram,size=1G,id=m0 \
  -object memory-backend-ram,size=1G,id=m1 \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1 \
  -smp 2,sockets=2,maxcpus=2  \


On mambo I get:

[0.005069][T1] smp: Bringing up secondary CPUs ...
[0.011656][T1] smp: Brought up 2 nodes, 8 CPUs
[0.011682][T1] numa: Node 0 CPUs: 0-3
[0.011709][T1] numa: Node 1 CPUs: 4-7
[0.012015][T1] BUG: arch topology borken
[0.012040][T1]  the SMT domain not a subset of the CACHE domain
[0.012107][T1] BUG: Unable to handle kernel data access on read at 
0x8001012e7398
[0.012142][T1] Faulting instruction address: 0xc01aa4f0
[0.012174][T1] Oops: Kernel access of bad area, sig: 11 [#1]
[0.012206][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[0.012236][T1] Modules linked in:
[0.012264][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.9.0-rc2-00095-g7430ad5aa700 #1
[0.012304][T1] NIP:  c01aa4f0 LR: c01aa498 CTR: 

[0.012341][T1] REGS: c000ef583880 TRAP: 0380   Not tainted  
(5.9.0-rc2-00095-g7430ad5aa700)
[0.012379][T1] MSR:  92009033   
CR: 4400  XER: 0004
[0.012439][T1] CFAR: c00101b0 IRQMASK: 0
[0.012439][T1] GPR00:  c000ef583b10 
c17fd000 4000
[0.012439][T1] GPR04:  0800 
 
[0.012439][T1] GPR08:  c12eb398 
c000c000 
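
The "not a subset" line above is the sched-domain builder's sanity check: every
topology level's cpumask must contain the level below it. A simplified sketch of
the violated invariant (illustrative only; the real check lives in
kernel/sched/topology.c):

	#include <linux/cpumask.h>
	#include <asm/smp.h>

	/*
	 * SMT siblings must be contained in the L2-cache mask, i.e.
	 * cpu_sibling_mask(cpu) must be a subset of cpu_l2_cache_mask(cpu).
	 * The patch under discussion broke this containment when a CPU
	 * reports no l2-cache node.
	 */
	static bool smt_subset_of_cache(int cpu)
	{
		return cpumask_subset(cpu_sibling_mask(cpu),
				      cpu_l2_cache_mask(cpu));
	}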

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread Jason Gunthorpe
On Fri, Sep 11, 2020 at 09:09:39AM +0200, pet...@infradead.org wrote:
> On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote:
> > So, I suggest pXX_offset_unlocked()
> 
> Urgh, no. Elsewhere in gup _unlocked() means it will take the lock
> itself (get_user_pages_unlocked()) -- although often it seems to mean
> the lock is already held (git grep _unlocked and marvel).
>
> What we want is _lockless().

This is clear to me!

Thanks,
Jason 


[PATCH] KVM: PPC: Don't return -ENOTSUPP to userspace in ioctls

2020-09-11 Thread Greg Kurz
ENOTSUPP is a Linux-only thingy, the value of which is unknown to
userspace, not to be confused with ENOTSUP which linux maps to
EOPNOTSUPP, as permitted by POSIX [1]:

[EOPNOTSUPP]
Operation not supported on socket. The type of socket (address family
or protocol) does not support the requested operation. A conforming
implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
- KVM_GET_FPU for Book3s and BookE
- KVM_SET_FPU for Book3s and BookE
- KVM_GET_DIRTY_LOG for BookE

This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
KVM_SET_FPU ioctls on POWER anyway since they are not supported,
and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

Signed-off-by: Greg Kurz 
---
 arch/powerpc/kvm/book3s.c |4 ++--
 arch/powerpc/kvm/booke.c  |6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 1fce9777af1c..44bf567b6589 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -558,12 +558,12 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 3e1c9f08e302..b1abcb816439 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -1747,12 +1747,12 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
@@ -1773,7 +1773,7 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct 
kvm_memory_slot *memslot)
 
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 }
 
 void kvmppc_core_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
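
For background on why ENOTSUPP must never leak to userspace: it is a
kernel-internal value (524) that libc knows nothing about, so error reporting
degenerates into "Unknown error 524". A small standalone demo (ordinary
userspace C, illustrative only):

	#include <stdio.h>
	#include <string.h>
	#include <errno.h>

	int main(void)
	{
		/* 524 is the kernel-internal ENOTSUPP; glibc has no name for it */
		printf("524: %s\n", strerror(524));
		/* EOPNOTSUPP, by contrast, maps to a proper message */
		printf("EOPNOTSUPP: %s\n", strerror(EOPNOTSUPP));
		return 0;
	}

On a glibc system this prints "Unknown error 524" for the first line and
"Operation not supported" for the second.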




[PATCH v4 4/8] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-11 Thread David Hildenbrand
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu 
Reviewed-by: Juergen Gross  # Xen related part
Reviewed-by: Pankaj Gupta 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: David Hildenbrand 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Oliver O'Halloran" 
Cc: Pingfan Liu 
Cc: Nathan Lynch 
Cc: Libor Pechacek 
Cc: Anton Blanchard 
Cc: Leonardo Bras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c  |  3 ++-
 drivers/base/memory.c   |  3 ++-
 drivers/dax/kmem.c  |  2 +-
 drivers/hv/hv_balloon.c |  2 +-
 drivers/s390/char/sclp_cmd.c|  2 +-
 drivers/virtio/virtio_mem.c |  2 +-
 drivers/xen/balloon.c   |  2 +-
 include/linux/memory_hotplug.h  | 16 
 mm/memory_hotplug.c | 14 +++---
 11 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..6828108486f83 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
ent->mem = 0;
}
 
-   if (add_memory(ent->nid, ent->start, ent->size)) {
+   if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac47..e1c9fa0d730f5 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -615,7 +615,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = memory_add_physaddr_to_nid(lmb->base_addr);
 
/* Add the memory */
-   rc = __add_memory(nid, lmb->base_addr, block_sz);
+   rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..2067c3bc55763 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,8 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = __add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length,
+ MHP_NONE);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..b4c297dd04755 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,8 @@ static ssize_t probe_store(struct device *dev, struct 
device_attribute *attr,
 
nid = memory_add_physaddr_to_nid(phys_addr);
ret = __add_memory(nid, phys_addr,
-  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block,
+  MHP_NONE);
 
if (ret)
goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..896cb9444e727 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 * this as RAM automatically.
 */
 	rc = add_memory_driver_managed(numa_node, range.start,
-				       range_len(&range), kmem_name);
+				       range_len(&range), kmem_name, MHP_NONE);
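
The flags type itself comes from the memory_hotplug.h hunk, which is not
visible above. A sketch of the prepared interface, reconstructed from the call
sites in this patch (details abridged and assumed):

	/* include/linux/memory_hotplug.h (sketch) */
	typedef int __bitwise mhp_t;

	/* No special request */
	#define MHP_NONE		((__force mhp_t)0)

	int add_memory(int nid, u64 start, u64 size, mhp_t flags);
	int __add_memory(int nid, u64 start, u64 size, mhp_t flags);
	int add_memory_driver_managed(int nid, u64 start, u64 size,
				      const char *resource_name, mhp_t flags);

Every caller passes MHP_NONE for now, so behaviour is unchanged until real
flags (such as the "mergeable" marking mentioned in the changelog) are
introduced.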

Re: [PATCH] powerpc/powermac: Fix low_sleep_handler with KUAP and KUEP

2020-09-11 Thread Christophe Leroy




On 11/09/2020 at 01:56, Michael Ellerman wrote:

Christophe Leroy  writes:

low_sleep_handler() has a hardcoded restore of segment registers
that doesn't take KUAP and KUEP into account.

Use head_32's load_segment_registers() routine instead.

Signed-off-by: Christophe Leroy 
Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access 
Protection")
Fixes: 31ed2b13c48d ("powerpc/32s: Implement Kernel Userspace Execution 
Prevention.")
Cc: sta...@vger.kernel.org
---
  arch/powerpc/platforms/powermac/sleep.S | 9 +
  1 file changed, 1 insertion(+), 8 deletions(-)


Doesn't build? pmac32_defconfig, gcc 9.3.0:

ld: arch/powerpc/platforms/powermac/sleep.o: in function `core99_wake_up':
(.text+0x25c): undefined reference to `load_segment_registers'

Missing _GLOBAL() presumably?


Oops .. :(

v2 sent out.

Thanks
Christophe


Re: [PATCH v1] pseries/hotplug-memory: hot-add: skip redundant LMB lookup

2020-09-11 Thread kernel test robot
Hi Scott,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on next-20200910]
[cannot apply to v5.9-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Scott-Cheloha/pseries-hotplug-memory-hot-add-skip-redundant-LMB-lookup/20200911-015744
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

>> arch/powerpc/mm/numa.c:433:5: warning: no previous prototype for 'of_drconf_to_nid_single' [-Wmissing-prototypes]
     433 | int of_drconf_to_nid_single(struct drmem_lmb *lmb)
         |     ^~~~~~~~~~~~~~~~~~~~~~~

# https://github.com/0day-ci/linux/commit/e6847f3f58460841a28885578cc3547735cfa50c
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Scott-Cheloha/pseries-hotplug-memory-hot-add-skip-redundant-LMB-lookup/20200911-015744
git checkout e6847f3f58460841a28885578cc3547735cfa50c
vim +/of_drconf_to_nid_single +433 arch/powerpc/mm/numa.c

   428  
   429  /*
   430   * This is like of_node_to_nid_single() for memory represented in the
   431   * ibm,dynamic-reconfiguration-memory node.
   432   */
 > 433  int of_drconf_to_nid_single(struct drmem_lmb *lmb)
   434  {
   435          struct assoc_arrays aa = { .arrays = NULL };
   436          int default_nid = NUMA_NO_NODE;
   437          int nid = default_nid;
   438          int rc, index;
   439  
   440          if ((min_common_depth < 0) || !numa_enabled)
   441                  return default_nid;
   442  
   443          rc = of_get_assoc_arrays(&aa);
   444          if (rc)
   445                  return default_nid;
   446  
   447          if (min_common_depth <= aa.array_sz &&
   448              !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
   449                  index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
   450                  nid = of_read_number(&aa.arrays[index], 1);
   451  
   452                  if (nid == 0xffff || nid >= nr_node_ids)
   453                          nid = default_nid;
   454  
   455                  if (nid > 0) {
   456                          index = lmb->aa_index * aa.array_sz;
   457                          initialize_distance_lookup_table(nid,
   458                                                           &aa.arrays[index]);
   459                  }
   460          }
   461  
   462          return nid;
   463  }
   464  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
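
The usual remedies for -Wmissing-prototypes are to make the function static
when numa.c is its only user, or to declare it in a header that numa.c
includes. A sketch of both options (header location hypothetical):

	/* Option 1: only used within numa.c */
	static int of_drconf_to_nid_single(struct drmem_lmb *lmb);

	/* Option 2: used elsewhere too; declare it in a shared header,
	 * e.g. arch/powerpc/include/asm/topology.h (hypothetical spot) */
	int of_drconf_to_nid_single(struct drmem_lmb *lmb);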




[PATCH v2] powerpc/powermac: Fix low_sleep_handler with KUAP and KUEP

2020-09-11 Thread Christophe Leroy
low_sleep_handler() has a hardcoded restore of segment registers
that doesn't take KUAP and KUEP into account.

Use head_32's load_segment_registers() routine instead.

Signed-off-by: Christophe Leroy 
Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access 
Protection")
Fixes: 31ed2b13c48d ("powerpc/32s: Implement Kernel Userspace Execution 
Prevention.")
Cc: sta...@vger.kernel.org
---
v2: Added missing _GLOBAL() to load_segment_registers in head_32.S
---
 arch/powerpc/kernel/head_32.S   | 2 +-
 arch/powerpc/platforms/powermac/sleep.S | 9 +
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 5624db0e09a1..0f60743474a2 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -1002,7 +1002,7 @@ BEGIN_MMU_FTR_SECTION
 END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
blr
 
-load_segment_registers:
+_GLOBAL(load_segment_registers)
li  r0, NUM_USER_SEGMENTS /* load up user segment register values */
mtctr   r0  /* for context 0 */
li  r3, 0   /* Kp = 0, Ks = 0, VSID = 0 */
diff --git a/arch/powerpc/platforms/powermac/sleep.S 
b/arch/powerpc/platforms/powermac/sleep.S
index f9a680fdd9c4..51bfdfe85058 100644
--- a/arch/powerpc/platforms/powermac/sleep.S
+++ b/arch/powerpc/platforms/powermac/sleep.S
@@ -294,14 +294,7 @@ grackle_wake_up:
 * we do any r1 memory access as we are not sure they
 * are in a sane state above the first 256Mb region
 */
-   li  r0,16   /* load up segment register values */
-   mtctr   r0  /* for context 0 */
-   lis r3,0x2000   /* Ku = 1, VSID = 0 */
-   li  r4,0
-3: mtsrin  r3,r4
-   addir3,r3,0x111 /* increment VSID */
-   addis   r4,r4,0x1000/* address of next segment */
-   bdnz3b
+   bl  load_segment_registers
sync
isync
 
-- 
2.25.0
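
For reference, _GLOBAL() is what makes the label visible outside head_32.o so
that sleep.o can link against it; without it the symbol stays file-local and
the v1 build failed as Michael reported. Roughly (simplified from
arch/powerpc/include/asm/ppc_asm.h; the real macro carries a few more details):

	#define _GLOBAL(n)	\
		.globl n;	\
	n: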



Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2020-09-11 Thread Christophe Leroy

Hello,

On 11/09/2020 at 11:15, Michal Suchánek wrote:

Hello,

does this logic apply to "Unrecoverable System Reset" as well?


I don't know, I don't think I have any way the generate a System Reset 
on my board to check it.


Christophe



Thanks

Michal

On Tue, Jan 22, 2019 at 02:11:24PM +, Christophe Leroy wrote:

Looks like book3s/32 doesn't set RI on machine check, so
checking RI before calling die() will always be fatal
although this is not an issue in most cases.

Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
interrupt")
Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check 
interrupts")
Signed-off-by: Christophe Leroy 
Cc: sta...@vger.kernel.org
---
  arch/powerpc/kernel/traps.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 64936b60d521..c740f8bfccc9 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -763,15 +763,15 @@ void machine_check_exception(struct pt_regs *regs)
 	if (check_io_access(regs))
 		goto bail;
 
-	/* Must die if the interrupt is not recoverable */
-	if (!(regs->msr & MSR_RI))
-		nmi_panic(regs, "Unrecoverable Machine check");
-
 	if (!nested)
 		nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
+	/* Must die if the interrupt is not recoverable */
+	if (!(regs->msr & MSR_RI))
+		nmi_panic(regs, "Unrecoverable Machine check");
+
 	return;
 
 bail:

--
2.13.3



Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2020-09-11 Thread Michal Suchánek
Hello,

does this logic apply to "Unrecoverable System Reset" as well?

Thanks

Michal

On Tue, Jan 22, 2019 at 02:11:24PM +, Christophe Leroy wrote:
> Looks like book3s/32 doesn't set RI on machine check, so
> checking RI before calling die() will always be fatal
> although this is not an issue in most cases.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
> interrupt")
> Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check 
> interrupts")
> Signed-off-by: Christophe Leroy 
> Cc: sta...@vger.kernel.org
> ---
>  arch/powerpc/kernel/traps.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 64936b60d521..c740f8bfccc9 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -763,15 +763,15 @@ void machine_check_exception(struct pt_regs *regs)
>   if (check_io_access(regs))
>   goto bail;
>  
> - /* Must die if the interrupt is not recoverable */
> - if (!(regs->msr & MSR_RI))
> - nmi_panic(regs, "Unrecoverable Machine check");
> -
>   if (!nested)
>   nmi_exit();
>  
>   die("Machine check", regs, SIGBUS);
>  
> + /* Must die if the interrupt is not recoverable */
> + if (!(regs->msr & MSR_RI))
> + nmi_panic(regs, "Unrecoverable Machine check");
> +
>   return;
>  
>  bail:
> -- 
> 2.13.3
> 


Re: [PATCH 1/7] powerpc/sysfs: Fix W=1 compile warning

2020-09-11 Thread Cédric Le Goater
On 9/11/20 7:26 AM, Christophe Leroy wrote:
> 
> 
> On 10/09/2020 at 23:02, Cédric Le Goater wrote:
>> arch/powerpc/kernel/sysfs.c: In function ‘sysfs_create_dscr_default’:
>> arch/powerpc/kernel/sysfs.c:228:7: error: variable ‘err’ set but not used 
>> [-Werror=unused-but-set-variable]
>>     int err = 0;
>>     ^~~
>> cc1: all warnings being treated as errors
> 
> A small sentence explaining how this is fixed would be welcome, so that you 
> don't need to read the code to know what the commit does to fix the warning. 
> Even the subject should be more explicit: rather than saying "Fix W=1 
> compile warning", I think it should say something like "remove unused err 
> variable"

Yes. I will respin a v2 with better commit logs for all.

Thanks,

C. 


Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz
On Fri, 11 Sep 2020 10:08:10 +0200
Michal Suchánek  wrote:

> On Fri, Sep 11, 2020 at 10:01:33AM +0200, Greg Kurz wrote:
> > On Fri, 11 Sep 2020 09:45:36 +0200
> > Greg Kurz  wrote:
> > 
> > > On Fri, 11 Sep 2020 01:16:07 -0300
> > > Fabiano Rosas  wrote:
> > > 
> > > > The current nested KVM code does not support HPT guests. This is
> > > > informed/enforced in some ways:
> > > > 
> > > > - Hosts < P9 will not be able to enable the nested HV feature;
> > > > 
> > > > - The nested hypervisor MMU capabilities will not contain
> > > >   KVM_CAP_PPC_MMU_HASH_V3;
> > > > 
> > > > - QEMU reflects the MMU capabilities in the
> > > >   'ibm,arch-vec-5-platform-support' device-tree property;
> > > > 
> > > > - The nested guest, at 'prom_parse_mmu_model' ignores the
> > > >   'disable_radix' kernel command line option if HPT is not supported;
> > > > 
> > > > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > > > 
> > > > There is, however, still a way to start a HPT guest by using
> > > > max-cpu-compat=power8 at the QEMU machine options. This leads to the
> > > > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > > > ioctl.
> > > > 
> > > > With the guest set to hash, the nested hypervisor goes through the
> > > > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > > > crashes when it tries to execute a hypervisor-privileged (mtspr
> > > > HDEC) instruction at __kvmppc_vcore_entry:
> > > > 
> > > > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > > > 
> > > > 
> > > > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 
> > > > #1
> > > > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > > > c01e5ec0
> > > > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  
> > > > (5.9.0-rc4)
> > > > [  538.543470] MSR:  82843033   
> > > > CR: 22422882  XER: 2004
> > > > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> > > >GPR00: c008075397a0 c013e91e3640 
> > > > c0080755e600 8000
> > > >GPR04:  c013eab19800 
> > > > c01394de 0043a054db72
> > > >GPR08: 003b1652  
> > > >  c008075502e0
> > > >GPR12: c01e5ec0 c007ffa74200 
> > > > c013eab19800 0008
> > > >GPR16:  c0139676c6c0 
> > > > c1d23948 c013e91e38b8
> > > >GPR20: 0053  
> > > > 0001 
> > > >GPR24: 0001 0001 
> > > >  0001
> > > >GPR28: 0001 0053 
> > > > c013eab19800 0001
> > > > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > > > [kvm_hv]
> > > > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > > > [kvm_hv]
> > > > [  538.544173] Call Trace:
> > > > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > > > (unreliable)
> > > > [  538.544260] [c013e91e3820] [c008075397a0] 
> > > > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > > > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > > > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > > > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > > > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > > > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > > > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > > > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > > > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > > > [  538.544605] [c013e91e3ce0] [c04b8c2c] 
> > > > sys_ioctl+0x1dc/0xc90
> > > > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > > > system_call_exception+0xe4/0x1c0
> > > > [  538.544726] [c013e91e3e20] [c000d140] 
> > > > system_call_common+0xf0/0x27c
> > > > [  538.544787] Instruction dump:
> > > > [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 
> > > > e9264140 75290002
> > > > [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> > > > f90d10a0 480104fd
> > > > [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> > > > 
> > > > This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> > > > the nested hypervisor, causing QEMU to abort.
> > > > 
> > > > Reported-by: Satheesh Rajendran 
> > > > Signed-off-by: Fabiano Rosas 
> > > > ---
> > > 
> > > LGTM
> > > 
> > > Reviewed-by: Greg Kurz 
> > > 
> > > >  arch/powerpc/kvm/book3s_hv.c | 6 ++
> > > >  1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > > index 4ba06a2a306c..764b6239ef72 100644
> > > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > > @@ -5250,6 +5250,12 @@ static long 

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Michal Suchánek
On Fri, Sep 11, 2020 at 10:01:33AM +0200, Greg Kurz wrote:
> On Fri, 11 Sep 2020 09:45:36 +0200
> Greg Kurz  wrote:
> 
> > On Fri, 11 Sep 2020 01:16:07 -0300
> > Fabiano Rosas  wrote:
> > 
> > > The current nested KVM code does not support HPT guests. This is
> > > informed/enforced in some ways:
> > > 
> > > - Hosts < P9 will not be able to enable the nested HV feature;
> > > 
> > > - The nested hypervisor MMU capabilities will not contain
> > >   KVM_CAP_PPC_MMU_HASH_V3;
> > > 
> > > - QEMU reflects the MMU capabilities in the
> > >   'ibm,arch-vec-5-platform-support' device-tree property;
> > > 
> > > - The nested guest, at 'prom_parse_mmu_model' ignores the
> > >   'disable_radix' kernel command line option if HPT is not supported;
> > > 
> > > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > > 
> > > There is, however, still a way to start a HPT guest by using
> > > max-cpu-compat=power8 at the QEMU machine options. This leads to the
> > > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > > ioctl.
> > > 
> > > With the guest set to hash, the nested hypervisor goes through the
> > > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > > crashes when it tries to execute a hypervisor-privileged (mtspr
> > > HDEC) instruction at __kvmppc_vcore_entry:
> > > 
> > > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > > 
> > > 
> > > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> > > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > > c01e5ec0
> > > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  
> > > (5.9.0-rc4)
> > > [  538.543470] MSR:  82843033   CR: 
> > > 22422882  XER: 2004
> > > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> > >GPR00: c008075397a0 c013e91e3640 c0080755e600 
> > > 8000
> > >GPR04:  c013eab19800 c01394de 
> > > 0043a054db72
> > >GPR08: 003b1652   
> > > c008075502e0
> > >GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> > > 0008
> > >GPR16:  c0139676c6c0 c1d23948 
> > > c013e91e38b8
> > >GPR20: 0053  0001 
> > > 
> > >GPR24: 0001 0001  
> > > 0001
> > >GPR28: 0001 0053 c013eab19800 
> > > 0001
> > > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > > [kvm_hv]
> > > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > > [kvm_hv]
> > > [  538.544173] Call Trace:
> > > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > > (unreliable)
> > > [  538.544260] [c013e91e3820] [c008075397a0] 
> > > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > > [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> > > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > > system_call_exception+0xe4/0x1c0
> > > [  538.544726] [c013e91e3e20] [c000d140] 
> > > system_call_common+0xf0/0x27c
> > > [  538.544787] Instruction dump:
> > > [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 
> > > e9264140 75290002
> > > [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> > > f90d10a0 480104fd
> > > [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> > > 
> > > This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> > > the nested hypervisor, causing QEMU to abort.
> > > 
> > > Reported-by: Satheesh Rajendran 
> > > Signed-off-by: Fabiano Rosas 
> > > ---
> > 
> > LGTM
> > 
> > Reviewed-by: Greg Kurz 
> > 
> > >  arch/powerpc/kvm/book3s_hv.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index 4ba06a2a306c..764b6239ef72 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
> > >   case KVM_PPC_ALLOCATE_HTAB: {
> > >   u32 htab_order;
> > >  
> > > + /* If we're a nested hypervisor, we currently only support 
> > > radix */
> > > + if (kvmhv_on_pseries()) {
> > > + r = -EOPNOTSUPP;
> 
> According to 

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz
On Fri, 11 Sep 2020 09:45:36 +0200
Greg Kurz  wrote:

> On Fri, 11 Sep 2020 01:16:07 -0300
> Fabiano Rosas  wrote:
> 
> > The current nested KVM code does not support HPT guests. This is
> > informed/enforced in some ways:
> > 
> > - Hosts < P9 will not be able to enable the nested HV feature;
> > 
> > - The nested hypervisor MMU capabilities will not contain
> >   KVM_CAP_PPC_MMU_HASH_V3;
> > 
> > - QEMU reflects the MMU capabilities in the
> >   'ibm,arch-vec-5-platform-support' device-tree property;
> > 
> > - The nested guest, at 'prom_parse_mmu_model' ignores the
> >   'disable_radix' kernel command line option if HPT is not supported;
> > 
> > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > 
> > There is, however, still a way to start a HPT guest by using
> > max-cpu-compat=power8 at the QEMU machine options. This leads to the
> > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > ioctl.
> > 
> > With the guest set to hash, the nested hypervisor goes through the
> > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > crashes when it tries to execute a hypervisor-privileged (mtspr
> > HDEC) instruction at __kvmppc_vcore_entry:
> > 
> > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > 
> > 
> > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > c01e5ec0
> > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> > [  538.543470] MSR:  82843033   CR: 
> > 22422882  XER: 2004
> > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> >GPR00: c008075397a0 c013e91e3640 c0080755e600 
> > 8000
> >GPR04:  c013eab19800 c01394de 
> > 0043a054db72
> >GPR08: 003b1652   
> > c008075502e0
> >GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> > 0008
> >GPR16:  c0139676c6c0 c1d23948 
> > c013e91e38b8
> >GPR20: 0053  0001 
> > 
> >GPR24: 0001 0001  
> > 0001
> >GPR28: 0001 0053 c013eab19800 
> > 0001
> > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > [kvm_hv]
> > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > [kvm_hv]
> > [  538.544173] Call Trace:
> > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > (unreliable)
> > [  538.544260] [c013e91e3820] [c008075397a0] 
> > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > system_call_exception+0xe4/0x1c0
> > [  538.544726] [c013e91e3e20] [c000d140] 
> > system_call_common+0xf0/0x27c
> > [  538.544787] Instruction dump:
> > [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 
> > e9264140 75290002
> > [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> > f90d10a0 480104fd
> > [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> > 
> > This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> > the nested hypervisor, causing QEMU to abort.
> > 
> > Reported-by: Satheesh Rajendran 
> > Signed-off-by: Fabiano Rosas 
> > ---
> 
> LGTM
> 
> Reviewed-by: Greg Kurz 
> 
> >  arch/powerpc/kvm/book3s_hv.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 4ba06a2a306c..764b6239ef72 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
> > case KVM_PPC_ALLOCATE_HTAB: {
> > u32 htab_order;
> >  
> > +   /* If we're a nested hypervisor, we currently only support 
> > radix */
> > +   if (kvmhv_on_pseries()) {
> > +   r = -EOPNOTSUPP;

According to POSIX [1]:

[ENOTSUP]
Not supported. The implementation does not support the requested feature or 
value.

[EOPNOTSUPP]
Operation not supported on socket. The type of socket (address family or 
protocol) does not support the requested operation. A conforming implementation 
may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Greg Kurz
On Fri, 11 Sep 2020 01:16:07 -0300
Fabiano Rosas  wrote:

> The current nested KVM code does not support HPT guests. This is
> informed/enforced in some ways:
> 
> - Hosts < P9 will not be able to enable the nested HV feature;
> 
> - The nested hypervisor MMU capabilities will not contain
>   KVM_CAP_PPC_MMU_HASH_V3;
> 
> - QEMU reflects the MMU capabilities in the
>   'ibm,arch-vec-5-platform-support' device-tree property;
> 
> - The nested guest, at 'prom_parse_mmu_model' ignores the
>   'disable_radix' kernel command line option if HPT is not supported;
> 
> - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> 
> There is, however, still a way to start a HPT guest by using
> max-cpu-compat=power8 at the QEMU machine options. This leads to the
> guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> ioctl.
> 
> With the guest set to hash, the nested hypervisor goes through the
> entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> crashes when it tries to execute a hypervisor-privileged (mtspr
> HDEC) instruction at __kvmppc_vcore_entry:
> 
> root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> 
> 
> [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> c01e5ec0
> [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  (5.9.0-rc4)
> [  538.543470] MSR:  82843033   CR: 
> 22422882  XER: 2004
> [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
>GPR00: c008075397a0 c013e91e3640 c0080755e600 
> 8000
>GPR04:  c013eab19800 c01394de 
> 0043a054db72
>GPR08: 003b1652   
> c008075502e0
>GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> 0008
>GPR16:  c0139676c6c0 c1d23948 
> c013e91e38b8
>GPR20: 0053  0001 
> 
>GPR24: 0001 0001  
> 0001
>GPR28: 0001 0053 c013eab19800 
> 0001
> [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
> [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
> [  538.544173] Call Trace:
> [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> (unreliable)
> [  538.544260] [c013e91e3820] [c008075397a0] 
> kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> [  538.544325] [c013e91e39e0] [c0080753d99c] 
> kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> kvmppc_vcpu_run+0x34/0x48 [kvm]
> [  538.544472] [c013e91e3af0] [c008072d61b8] 
> kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> [  538.544539] [c013e91e3b80] [c008072c7450] 
> kvm_vcpu_ioctl+0x298/0x778 [kvm]
> [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> [  538.544662] [c013e91e3dc0] [c002f9a4] 
> system_call_exception+0xe4/0x1c0
> [  538.544726] [c013e91e3e20] [c000d140] 
> system_call_common+0xf0/0x27c
> [  538.544787] Instruction dump:
> [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 e9264140 
> 75290002
> [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> f90d10a0 480104fd
> [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> 
> This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> the nested hypervisor, causing QEMU to abort.
> 
> Reported-by: Satheesh Rajendran 
> Signed-off-by: Fabiano Rosas 
> ---

LGTM

Reviewed-by: Greg Kurz 

>  arch/powerpc/kvm/book3s_hv.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 4ba06a2a306c..764b6239ef72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
>   case KVM_PPC_ALLOCATE_HTAB: {
>   u32 htab_order;
>  
> + /* If we're a nested hypervisor, we currently only support 
> radix */
> + if (kvmhv_on_pseries()) {
> + r = -EOPNOTSUPP;
> + break;
> + }
> +
>   r = -EFAULT;
>   if (get_user(htab_order, (u32 __user *)argp))
>   break;
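
For context, kvmhv_on_pseries() is true when this KVM-HV instance is itself
running as a pseries guest, i.e. as a nested hypervisor rather than on
bare-metal PowerNV. A simplified sketch of the helper (the real one lives in
arch/powerpc/include/asm/kvm_book3s_64.h):

	static inline bool kvmhv_on_pseries(void)
	{
		/* no hypervisor mode => we are a guest acting as nested HV */
		return !cpu_has_feature(CPU_FTR_HVMODE);
	}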



Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-11 Thread peterz
On Thu, Sep 10, 2020 at 06:59:21PM -0300, Jason Gunthorpe wrote:
> So, I suggest pXX_offset_unlocked()

Urgh, no. Elsewhere in gup _unlocked() means it will take the lock
itself (get_user_pages_unlocked()) -- although often it seems to mean
the lock is already held (git grep _unlocked and marvel).

What we want is _lockless().
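
Illustrative prototypes of what the _lockless() variants look like in this
series (signatures assumed from the RFC; sketch only):

	/* Walk page tables without taking any lock; gup_fast relies on a
	 * previously READ_ONCE()-loaded entry value being passed down, so
	 * folded levels dereference the local copy, not the live pgd. */
	p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address);
	pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address);
	pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address);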


Re: [PATCH 5/7] powerpc/powernv/pci: Fix W=1 compile warning

2020-09-11 Thread Oliver O'Halloran
On Fri, Sep 11, 2020 at 7:02 AM Cédric Le Goater  wrote:
>
>   CC  arch/powerpc/platforms/powernv/pci-ioda.o
> ../arch/powerpc/platforms/powernv/pci-ioda.c: In function 
> ‘pnv_ioda_configure_pe’:
> ../arch/powerpc/platforms/powernv/pci-ioda.c:897:18: error: variable ‘parent’ 
> set but not used [-Werror=unused-but-set-variable]
>   struct pci_dev *parent;
>   ^~
>
> Cc: Oliver O'Halloran 
> Signed-off-by: Cédric Le Goater 

Come to think of it a fix for this might already be in -next, see
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=193967&state=*

If not,

Reviewed-by: Oliver O'Halloran 


Re: [PATCH 3/7] powerpc/sstep: Fix W=1 compile warning

2020-09-11 Thread Christophe Leroy




On 11/09/2020 at 07:59, Cédric Le Goater wrote:

On 9/11/20 7:38 AM, Christophe Leroy wrote:



On 10/09/2020 at 23:02, Cédric Le Goater wrote:

../arch/powerpc/lib/sstep.c: In function ‘mlsd_8lsd_ea’:
../arch/powerpc/lib/sstep.c:225:3: error: suggest braces around empty body in 
an ‘if’ statement [-Werror=empty-body]
     ; /* Invalid form. Should already be checked for by caller! */
     ^


A small sentence explaining how this is fixed would be welcome, so that you 
don't need to read the code to know what the commit does to fix the warning. 
Also the subject should be more explicit.





Cc: Jordan Niethe 
Signed-off-by: Cédric Le Goater 
---
   arch/powerpc/lib/sstep.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index caee8cc77e19..14572af16e55 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -221,8 +221,9 @@ static nokprobe_inline unsigned long mlsd_8lsd_ea(unsigned 
int instr,
   ; /* Leave ea as is */
   else if (prefix_r && !ra)
   ea += regs->nip;
-    else if (prefix_r && ra)
+    else if (prefix_r && ra) {
   ; /* Invalid form. Should already be checked for by caller! */
+    }


You can't do that. Now checkpatch will complain that you don't have braces on 
all legs of the if/else dance.


Should we fix checkpatch ?


Why not, but then fix 
https://www.kernel.org/doc/html/latest/process/coding-style.html#placing-braces-and-spaces 
first :)


Christophe


Re: [PATCH 2/7] powerpc/prom: Fix W=1 compile warning

2020-09-11 Thread Cédric Le Goater
On 9/11/20 7:33 AM, Christophe Leroy wrote:
> 
> 
> On 10/09/2020 at 23:02, Cédric Le Goater wrote:
>> arch/powerpc/kernel/prom.c: In function ‘early_reserve_mem’:
>> arch/powerpc/kernel/prom.c:625:10: error: variable ‘reserve_map’ set but not 
>> used [-Werror=unused-but-set-variable]
>>    __be64 *reserve_map;
>>    ^~~
>> cc1: all warnings being treated as errors
> 
> A small sentence explaining how this is fixes would be welcome, so that you 
> don't need to read the code to know what the commit does to fix the warning. 
> Also the subject should be more explicit.
> 
> 
>>
>> Cc: Christophe Leroy 
>> Signed-off-by: Cédric Le Goater 
>> ---
>>   arch/powerpc/kernel/prom.c | 51 +++---
>>   1 file changed, 26 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index d8a2fb87ba0c..4bae9ebc7d0b 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -622,11 +622,6 @@ static void __init early_reserve_mem_dt(void)
>>     static void __init early_reserve_mem(void)
>>   {
>> -    __be64 *reserve_map;
>> -
>> -    reserve_map = (__be64 *)(((unsigned long)initial_boot_params) +
>> -    fdt_off_mem_rsvmap(initial_boot_params));
>> -
>>   /* Look for the new "reserved-regions" property in the DT */
>>   early_reserve_mem_dt();
>>   @@ -639,28 +634,34 @@ static void __init early_reserve_mem(void)
>>   }
>>   #endif /* CONFIG_BLK_DEV_INITRD */
>>   -#ifdef CONFIG_PPC32
> 
> Instead of such a big change, you could simply do the following in addition 
> to the move of reserve_map allocation after it.
> 
> if (!IS_ENABLED(CONFIG_PPC32))
>     return;

yes. I will include a change for CONFIG_BLK_DEV_INITRD also.

Thanks,

C.  

> 
>> -    /*
>> - * Handle the case where we might be booting from an old kexec
>> - * image that setup the mem_rsvmap as pairs of 32-bit values
>> - */
>> -    if (be64_to_cpup(reserve_map) > 0xffffffffull) {
>> -    u32 base_32, size_32;
>> -    __be32 *reserve_map_32 = (__be32 *)reserve_map;
>> -
>> -    DBG("Found old 32-bit reserve map\n");
>> -
>> -    while (1) {
>> -    base_32 = be32_to_cpup(reserve_map_32++);
>> -    size_32 = be32_to_cpup(reserve_map_32++);
>> -    if (size_32 == 0)
>> -    break;
>> -    DBG("reserving: %x -> %x\n", base_32, size_32);
>> -    memblock_reserve(base_32, size_32);
>> +    if (IS_ENABLED(CONFIG_PPC32)) {
>> +    __be64 *reserve_map;
>> +
>> +    reserve_map = (__be64 *)(((unsigned long)initial_boot_params) +
>> + fdt_off_mem_rsvmap(initial_boot_params));
>> +
>> +    /*
>> + * Handle the case where we might be booting from an
>> + * old kexec image that setup the mem_rsvmap as pairs
>> + * of 32-bit values
>> + */
>> +    if (be64_to_cpup(reserve_map) > 0xffffffffull) {
>> +    u32 base_32, size_32;
>> +    __be32 *reserve_map_32 = (__be32 *)reserve_map;
>> +
>> +    DBG("Found old 32-bit reserve map\n");
>> +
>> +    while (1) {
>> +    base_32 = be32_to_cpup(reserve_map_32++);
>> +    size_32 = be32_to_cpup(reserve_map_32++);
>> +    if (size_32 == 0)
>> +    break;
>> +    DBG("reserving: %x -> %x\n", base_32, size_32);
>> +    memblock_reserve(base_32, size_32);
>> +    }
>> +    return;
>>   }
>> -    return;
>>   }
>> -#endif
>>   }
>>     #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>>
> 
> 
> Christophe
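
A minimal sketch of the smaller change Christophe suggests, with the
reserve_map setup moved below the early exits (illustrative only, not the
respun patch):

	static void __init early_reserve_mem(void)
	{
		__be64 *reserve_map;

		/* Look for the new "reserved-regions" property in the DT */
		early_reserve_mem_dt();

		/* ... CONFIG_BLK_DEV_INITRD handling, as before ... */

		if (!IS_ENABLED(CONFIG_PPC32))
			return;

		reserve_map = (__be64 *)(((unsigned long)initial_boot_params) +
					 fdt_off_mem_rsvmap(initial_boot_params));

		/* ... old 32-bit kexec reserve map handling, unchanged ... */
	}

Since reserve_map is now computed and used only on the PPC32 path, the
set-but-unused warning disappears without reindenting the whole block.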



Re: [PATCH 3/7] powerpc/sstep: Fix W=1 compile warning

2020-09-11 Thread Cédric Le Goater
On 9/11/20 7:38 AM, Christophe Leroy wrote:
> 
> 
> On 10/09/2020 at 23:02, Cédric Le Goater wrote:
>> ../arch/powerpc/lib/sstep.c: In function ‘mlsd_8lsd_ea’:
>> ../arch/powerpc/lib/sstep.c:225:3: error: suggest braces around empty body 
>> in an ‘if’ statement [-Werror=empty-body]
>>     ; /* Invalid form. Should already be checked for by caller! */
>>     ^
> 
> A small sentence explaining how this is fixed would be welcome, so that you 
> don't need to read the code to know what the commit does to fix the warning. 
> Also the subject should be more explicit.
> 
> 
> 
>>
>> Cc: Jordan Niethe 
>> Signed-off-by: Cédric Le Goater 
>> ---
>>   arch/powerpc/lib/sstep.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
>> index caee8cc77e19..14572af16e55 100644
>> --- a/arch/powerpc/lib/sstep.c
>> +++ b/arch/powerpc/lib/sstep.c
>> @@ -221,8 +221,9 @@ static nokprobe_inline unsigned long 
>> mlsd_8lsd_ea(unsigned int instr,
>>   ; /* Leave ea as is */
>>   else if (prefix_r && !ra)
>>   ea += regs->nip;
>> -    else if (prefix_r && ra)
>> +    else if (prefix_r && ra) {
>>   ; /* Invalid form. Should already be checked for by caller! */
>> +    }
> 
> You can't do that. Now checkpatch will complain that you don't have braces on 
> all legs of the if/else dance.

Should we fix checkpatch ?

> I think the last 'else if' should simply be removed entirely as it does 
> nothing. Alternatively, just leave the comment, something like:
> 
> /* (prefix_r && ra) is Invalid form. Should already be checked for by 
> caller! */
> 
> And if (prefix_r && ra) is not possible, then the previous if should just be 
> 'if (prefx_r)'

Indeed. I will rework that one.

Thanks,

C. 


> Christophe
> 
> 
>>     return ea;
>>   }
>>
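
A sketch of the rework Christophe suggests, with the dead else-if dropped and
its information kept as a comment (illustrative only):

	if (prefix_r && !ra)
		ea += regs->nip;
	/*
	 * (prefix_r && ra) is an invalid form: callers must already have
	 * rejected it, so ea is deliberately left as is, exactly like the
	 * !prefix_r && !ra case.
	 */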