PPC64: G5 & 4k/64k page size (was: Re: Call for report - G5/PPC970 status)

2019-12-12 Thread Romain Dolbeau
On Thu, Dec 12, 2019 at 22:40, Andreas Schwab wrote:
> I'm using 4K pages, in case that matters

Yes it does matter, as it seems to be the difference between "working"
and "not working" :-)
Thank you for the config & pointing out the culprit!

With your config, my machine boots (though it's missing some features
as the config seems quite tuned).

Moving from 64k pages to 4k pages on 'my' config (essentially,
Debian's 5.3 with default values for changes since), my machine boots
as well & everything seems to work fine.

So, question to Aneesh - did you try 64k pages on your G5, or only 4k?
If only 4k, could you try 64k to see whether you can reproduce the
crash?

To Jeroen - is your iMac booting with 4k or 64k pages? Same question
for the crashing G5, though I assume the answer is going to be 64k
there.

Thanks & cordially,

-- 
Romain Dolbeau


[PATCH] Fix vm selftest to run tests with make

2019-12-12 Thread Harish
The last patch overrides the ARCH env variable, so runs using make fail
with the following (there is no arch/ppc64le directory in the kernel
tree, so headers_install cannot find the arch Makefile):

$ make -C vm/
make: Entering directory '/home/harish/linux/tools/testing/selftests/vm'
make --no-builtin-rules ARCH=ppc64le -C ../../../.. headers_install
make[1]: Entering directory '/home/harish/linux'
Makefile:652: arch/ppc64le/Makefile: No such file or directory
make[1]: *** No rule to make target 'arch/ppc64le/Makefile'.  Stop.
make[1]: Leaving directory '/home/harish/linux'
make: *** [../lib.mk:50: khdr] Error 2
make: Leaving directory '/home/harish/linux/tools/testing/selftests/vm'

This patch fixes that, and also handles the ppc64/ppc64le arches so
that the 64-bit tests are built and run.

Signed-off-by: Harish 
---
 tools/testing/selftests/vm/Makefile    | 5 +++--
 tools/testing/selftests/vm/run_vmtests | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 7f9a8a8c31da..ef6218283b68 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 # Makefile for vm selftests
 uname_M := $(shell uname -m 2>/dev/null || echo not)
-ARCH ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/')
+ARCH_USED ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/')
 
 CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
@@ -19,11 +19,12 @@ TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 
-ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
+ifneq (,$(filter $(ARCH_USED),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
 endif
 
+
 TEST_PROGS := run_vmtests
 
 TEST_FILES := test_vmalloc.sh
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..da63dfb9713a 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -61,7 +61,7 @@ fi
 #filter 64bit architectures
 ARCH64STR="arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64"
 if [ -z $ARCH ]; then
-  ARCH=`uname -m 2>/dev/null | sed -e 's/aarch64.*/arm64/'`
+  ARCH=`uname -m 2>/dev/null | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/'`
 fi
 VADDR64=0
 echo "$ARCH64STR" | grep $ARCH && VADDR64=1
-- 
2.21.0



Re: [PATCH 00/58] serial/sysrq: Cleanup ifdeffery

2019-12-12 Thread Christophe Leroy




On 13/12/2019 at 01:05, Dmitry Safonov wrote:

The original purpose of the patch set was to add a way to enable
sysrq on a uart where currently it can be constantly either on or off
(CONFIG_MAGIC_SYSRQ_SERIAL), see the last patch:
   "serial/sysrq: Add MAGIC_SYSRQ_SERIAL_SEQUENCE"

But to do that, I had to add uart_try_toggle_sysrq() and I didn't want
to bloat serial_core.h even more. So, I did a cleanup by removing
SUPPORT_SYSRQ, resulting in a nice diffstat and less ifdeffery.

Most patches are one-liners; I kept them separated per driver so that
reviewers can more easily follow the purpose.

Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
Cc: Vasiliy Khoruzhick 
Cc: linux-ser...@vger.kernel.org

Dmitry Safonov (58):
   sysrq: Remove sysrq_handler_registered
   serial: Move sysrq members above
   serial_core: Un-ifdef sysrq SUPPORT_SYSRQ
   tty/serial: Migrate aspeed_vuart to use has_sysrq
   tty/serial: Migrate 8250_fsl to use has_sysrq
   tty/serial: Migrate bcm63xx_uart to use has_sysrq
   tty/serial: Migrate 8250_omap to use has_sysrq
   tty/serial: Migrate 8250_port to use has_sysrq
   tty/serial: Migrate amba-pl01* to use has_sysrq
   tty/serial: Migrate apbuart to use has_sysrq
   tty/serial: Migrate arc_uart to use has_sysrq
   tty/serial: Migrate atmel_serial to use has_sysrq
   tty/serial: Migrate clps711x to use has_sysrq
   tty/serial: Migrate cpm_uart to use has_sysrq
   tty/serial: Migrate dz to use has_sysrq
   tty/serial: Migrate efm32-uart to use has_sysrq
   tty/serial: Migrate fsl_linflexuart to use has_sysrq
   tty/serial: Migrate fsl_lpuart to use has_sysrq
   tty/serial: Migrate imx to use has_sysrq
   tty/serial: Migrate ip22zilog to use has_sysrq
   tty/serial: Migrate meson_uart to use has_sysrq
   tty/serial: Migrate milbeaut_usio to use has_sysrq
   tty/serial: Migrate mpc52xx_uart to use has_sysrq
   tty/serial: Don't zero port->sysrq
   tty/serial: Migrate msm_serial to use has_sysrq
   tty/serial: Migrate mux to use has_sysrq
   tty/serial: Migrate mxs-auart to use has_sysrq
   tty/serial: Migrate omap-serial to use has_sysrq
   tty/serial: Migrate pch_uart to use has_sysrq
   tty/serial: Don't check port->sysrq
   tty/serial: Migrate pmac_zilog to use has_sysrq
   tty/serial: Migrate pnx8xxx_uart to use has_sysrq
   serial/f81534: Don't check port->sysrq
   tty/serial: Migrate pxa to use has_sysrq
   tty/serial: Migrate qcom_geni_serial to use has_sysrq
   tty/serial: Migrate sa1100 to use has_sysrq
   tty/serial: Migrate samsung_tty to use has_sysrq
   tty/serial: Migrate sb1250-duart to use has_sysrq
   tty/serial: Migrate sccnxp to use has_sysrq
   tty/serial: Migrate serial_txx9 to use has_sysrq
   tty/serial: Migrate sh-sci to use has_sysrq
   tty/serial: Migrate sprd_serial to use has_sysrq
   tty/serial: Migrate st-asc to use has_sysrq
   tty/serial: Migrate stm32-usart to use has_sysrq
   tty/serial: Migrate sunhv to use has_sysrq
   tty/serial: Migrate sunsab to use has_sysrq
   tty/serial: Migrate sunsu to use has_sysrq
   tty/serial: Migrate sunzilog to use has_sysrq
   serial/ucc_uart: Remove ifdef SUPPORT_SYSRQ
   tty/serial: Migrate vr41xx_siu to use has_sysrq
   tty/serial: Migrate vt8500_serial to use has_sysrq
   tty/serial: Migrate xilinx_uartps to use has_sysrq
   tty/serial: Migrate zs to use has_sysrq
   serial_core: Remove SUPPORT_SYSRQ ifdeffery
   usb/serial: Don't handle break when CONFIG_MAGIC_SYSRQ is disabled
   serial_core: Move sysrq functions from header file
   sysctl/sysrq: Remove __sysrq_enabled copy
   serial/sysrq: Add MAGIC_SYSRQ_SERIAL_SEQUENCE


powerpc patchwork didn't get the full series, see 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=148198


Can't find them on linux-serial patchwork either 
(https://patches.linaro.org/project/linux-serial/list/)


It is impossible to review/test the powerpc bits without the first 
patches of the series. Where can the entire series be found?


Christophe



  arch/powerpc/kernel/legacy_serial.c |   4 +-
  drivers/tty/serial/8250/8250_aspeed_vuart.c |   5 +-
  drivers/tty/serial/8250/8250_fsl.c  |   4 -
  drivers/tty/serial/8250/8250_of.c   |   4 +-
  drivers/tty/serial/8250/8250_omap.c |   5 +-
  drivers/tty/serial/8250/8250_port.c |   5 +-
  drivers/tty/serial/amba-pl010.c |   5 +-
  drivers/tty/serial/amba-pl011.c |   6 +-
  drivers/tty/serial/apbuart.c|   5 +-
  drivers/tty/serial/arc_uart.c   |   5 +-
  drivers/tty/serial/atmel_serial.c   |   9 +-
  drivers/tty/serial/bcm63xx_uart.c   |   5 +-
  drivers/tty/serial/clps711x.c   |   5 +-
  drivers/tty/serial/cpm_uart/cpm_uart_core.c |   9 +-
  drivers/tty/serial/dz.c |   5 +-
  drivers/tty/serial/efm32-uart.c |   5 +-
  drivers/tty/serial/fsl_linflexuart.c|   8 +-
  drivers/tty/serial/fsl_lpuart.c |   9 +-
  drivers/tty/serial/imx.c|   7 

Re: [PATCH v5 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Srikar Dronamraju
* Michael Ellerman  [2019-12-13 14:50:35]:

> Waiman Long suggested using static_keys.
> 
> Fixes: 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs")
> Cc: sta...@vger.kernel.org # v4.18+
> Reported-by: Parth Shah 
> Reported-by: Ihor Pasichnyk 
> Tested-by: Juri Lelli 
> Acked-by: Waiman Long 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> Acked-by: Phil Auld 
> Reviewed-by: Vaidyanathan Srinivasan 
> Tested-by: Parth Shah 
> [mpe: Move the key and setting of the key to pseries/setup.c]
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/spinlock.h    | 4 +++-
>  arch/powerpc/platforms/pseries/setup.c | 7 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 

Tested with your version of the patch and it's working as expected.

Latency percentiles (usec)
50.0th: 45
75.0th: 63
90.0th: 74
95.0th: 78
*99.0th: 82
99.5th: 83
99.9th: 86
min=0, max=96
Latency percentiles (usec)
50.0th: 46
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 83
99.5th: 85
99.9th: 91
min=0, max=117
Latency percentiles (usec)
50.0th: 46
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 83
99.5th: 84
99.9th: 90
min=0, max=95
Latency percentiles (usec)
50.0th: 47
75.0th: 65
90.0th: 75
95.0th: 79
*99.0th: 84
99.5th: 85
99.9th: 90
min=0, max=117
Latency percentiles (usec)
50.0th: 45
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 82
99.5th: 83
99.9th: 93
min=0, max=111


-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v5 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-12 Thread Michael Ellerman
From: Srikar Dronamraju 

With the shared_processor static key available, is_shared_processor()
can return without having to query the lppaca structure.

Signed-off-by: Srikar Dronamraju 
Acked-by: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191205083218.25824-2-sri...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

v5: No change.

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index cac95a3f30c2..1b55fc08f853 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -112,13 +112,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#ifdef CONFIG_PPC_SPLPAR
+	return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.21.0
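
For context, patch 1/2 adds the key itself and flips it at boot on
pseries. A minimal sketch of that side, assuming the key name used in
the hunk above and the placement in pseries/setup.c mentioned in the
patch 1/2 changelog (illustrative, not the actual hunk):

	DEFINE_STATIC_KEY_FALSE(shared_processor);
	EXPORT_SYMBOL_GPL(shared_processor);

	/* During pseries setup, once the lppaca can be read: */
	if (firmware_has_feature(FW_FEATURE_SPLPAR) &&
	    lppaca_shared_proc(get_lppaca()))
		static_branch_enable(&shared_processor);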



[PATCH v5 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Michael Ellerman
From: Srikar Dronamraju 

With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on
pre-empted vCPUs"), the scheduler avoids scheduling tasks on preempted
vCPUs at wakeup. This leads to a wrong choice of CPU, which in turn
leads to larger wakeup latencies. Eventually, it leads to performance
regressions in latency-sensitive benchmarks like soltp, schbench, etc.

On Powerpc, vcpu_is_preempted() only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is incremented whenever the LPAR enters CEDE state (idle).
So any CPU that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
the right of first use, since the LPAR is supposed to own the vCPU.
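
For reference, the pre-patch powerpc implementation is roughly the
following (a sketch reconstructed from the description above; the
odd/even yield_count test is the key detail):

	static inline bool vcpu_is_preempted(int cpu)
	{
		if (!firmware_has_feature(FW_FEATURE_SPLPAR))
			return false;
		/* Odd yield_count means "not currently running" - but CEDE
		 * (idle) also leaves yield_count odd, hence the bug. */
		return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
	}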

On a Power9 System with 32 cores
  # lscpu
  Architecture:ppc64le
  Byte Order:  Little Endian
  CPU(s):  128
  On-line CPU(s) list: 0-127
  Thread(s) per core:  8
  Core(s) per socket:  1
  Socket(s):   16
  NUMA node(s):2
  Model:   2.2 (pvr 004e 0202)
  Model name:  POWER9 (architected), altivec supported
  Hypervisor vendor:   pHyp
  Virtualization type: para
  L1d cache:   32K
  L1i cache:   32K
  L2 cache:512K
  L3 cache:10240K
  NUMA node0 CPU(s):   0-63
  NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
  v5.4                           v5.4 + patch
  Latency percentiles (usec)     Latency percentiles (usec)
    50.0000th: 45                  50.0000th: 39
    75.0000th: 62                  75.0000th: 53
    90.0000th: 71                  90.0000th: 67
    95.0000th: 77                  95.0000th: 76
    *99.0000th: 91                 *99.0000th: 89
    99.5000th: 707                 99.5000th: 93
    99.9000th: 6920                99.9000th: 118
    min=0, max=10048               min=0, max=211
  Latency percentiles (usec)     Latency percentiles (usec)
    50.0000th: 45                  50.0000th: 34
    75.0000th: 61                  75.0000th: 45
    90.0000th: 72                  90.0000th: 53
    95.0000th: 79                  95.0000th: 56
    *99.0000th: 691                *99.0000th: 61
    99.5000th: 3972                99.5000th: 63
    99.9000th: 8368                99.9000th: 78
    min=0, max=16606               min=0, max=228
  Latency percentiles (usec)     Latency percentiles (usec)
    50.0000th: 45                  50.0000th: 34
    75.0000th: 61                  75.0000th: 45
    90.0000th: 71                  90.0000th: 53
    95.0000th: 77                  95.0000th: 57
    *99.0000th: 106                *99.0000th: 63
    99.5000th: 2364                99.5000th: 68
    99.9000th: 7480                99.9000th: 100
    min=0, max=10001               min=0, max=134
  Latency percentiles (usec)     Latency percentiles (usec)
    50.0000th: 45                  50.0000th: 34
    75.0000th: 62                  75.0000th: 46
    90.0000th: 72                  90.0000th: 53
    95.0000th: 78                  95.0000th: 56
    *99.0000th: 93                 *99.0000th: 61
    99.5000th: 108                 99.5000th: 64
    99.9000th: 6792                99.9000th: 85
    min=0, max=17681               min=0, max=121
  Latency percentiles (usec)     Latency percentiles (usec)
    50.0000th: 46                  50.0000th: 33
    75.0000th: 62                  75.0000th: 44
    90.0000th: 73                  90.0000th: 51
    95.0000th: 79                  95.0000th: 54
    *99.0000th: 113                *99.0000th: 61
    99.5000th: 2724                99.5000th: 64
    99.9000th: 6184                99.9000th: 82
    min=0, max=9887                min=0, max=121

   Performance counter stats for 'system wide' (5 runs):

  context-switches    43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
  cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
  page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Fixes: 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs")
Cc: sta...@vger.kernel.org # v4.18+
Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Tested-by: Juri Lelli 
Acked-by: Waiman Long 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
Acked-by: Phil Auld 
Reviewed-by: Vaidyanathan Srinivasan 
Tested-by: Parth Shah 
[mpe: Move the key and setting of the 

Re: [PATCH v4 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Michael Ellerman
Srikar Dronamraju  writes:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at wakeup.
> This leads to a wrong choice of CPU, which in turn leads to larger wakeup
> latencies. Eventually, it leads to performance regressions in latency-
> sensitive benchmarks like soltp, schbench etc.
>
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However,
> yield_count is increased whenever the LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
>
> Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
> the right of first use, since the LPAR is supposed to own the vCPU.
>
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):   16
> NUMA node(s):2
> Model:   2.2 (pvr 004e 0202)
> Model name:  POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:512K
> L3 cache:10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
>
>   # perf stat -a -r 5 ./schbench
> v5.4                           v5.4 + patch
> Latency percentiles (usec)     Latency percentiles (usec)
>   50.0000th: 45                  50.0000th: 39
>   75.0000th: 62                  75.0000th: 53
>   90.0000th: 71                  90.0000th: 67
>   95.0000th: 77                  95.0000th: 76
>   *99.0000th: 91                 *99.0000th: 89
>   99.5000th: 707                 99.5000th: 93
>   99.9000th: 6920                99.9000th: 118
>   min=0, max=10048               min=0, max=211
> Latency percentiles (usec)     Latency percentiles (usec)
>   50.0000th: 45                  50.0000th: 34
>   75.0000th: 61                  75.0000th: 45
>   90.0000th: 72                  90.0000th: 53
>   95.0000th: 79                  95.0000th: 56
>   *99.0000th: 691                *99.0000th: 61
>   99.5000th: 3972                99.5000th: 63
>   99.9000th: 8368                99.9000th: 78
>   min=0, max=16606               min=0, max=228
> Latency percentiles (usec)     Latency percentiles (usec)
>   50.0000th: 45                  50.0000th: 34
>   75.0000th: 61                  75.0000th: 45
>   90.0000th: 71                  90.0000th: 53
>   95.0000th: 77                  95.0000th: 57
>   *99.0000th: 106                *99.0000th: 63
>   99.5000th: 2364                99.5000th: 68
>   99.9000th: 7480                99.9000th: 100
>   min=0, max=10001               min=0, max=134
> Latency percentiles (usec)     Latency percentiles (usec)
>   50.0000th: 45                  50.0000th: 34
>   75.0000th: 62                  75.0000th: 46
>   90.0000th: 72                  90.0000th: 53
>   95.0000th: 78                  95.0000th: 56
>   *99.0000th: 93                 *99.0000th: 61
>   99.5000th: 108                 99.5000th: 64
>   99.9000th: 6792                99.9000th: 85
>   min=0, max=17681               min=0, max=121
> Latency percentiles (usec)     Latency percentiles (usec)
>   50.0000th: 46                  50.0000th: 33
>   75.0000th: 62                  75.0000th: 44
>   90.0000th: 73                  90.0000th: 51
>   95.0000th: 79                  95.0000th: 54
>   *99.0000th: 113                *99.0000th: 61
>   99.5000th: 2724                99.5000th: 64
>   99.9000th: 6184                99.9000th: 82
>   min=0, max=9887                min=0, max=121
>
>  Performance counter stats for 'system wide' (5 runs):
>
> context-switches    43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
> cpu-migrations   1,211  ( +-  5.04% )  220 ( +-  6.23% )
> page-faults 15,983  ( +-  5.21% )   15,360 ( +-  3.38% )
>
> Waiman Long suggested using static_keys.
>
> Fixes: 41946c86876e ("locking/core, powerpc: Implement vcpu_is_preempted(cpu)")
>
> Cc: Parth Shah 
> Cc: Ihor Pasichnyk 
> Cc: Juri Lelli 
> Cc: Phil Auld 
> Cc: Waiman Long 
> Cc: Gautham R. Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Reported-by: Parth Shah 
> Reported-by: Ihor Pasichnyk 
> Tested-by: Juri Lelli 
> Tested-by: Parth Shah 
> Acked-by: Waiman Long 
> 

Isn't LD=ld.lld enough to get a kernel build started?

2019-12-12 Thread Itaru Kitayama
Hi,
Or should I set LD to the full path of the LLD linker?


Re: HPT allocation failures on POWER8 KVM hosts

2019-12-12 Thread Roman Bolshakov
On Mon, Nov 18, 2019 at 02:42:42PM +0300, Roman Bolshakov wrote:
> On Mon, Nov 18, 2019 at 01:02:00PM +1100, Daniel Axtens wrote:
> > Hi Roman,
> > 
> > > We're running a lot of KVM virtual machines on POWER8 hosts and
> > > sometimes new VMs can't be started because there are no contiguous
> > > regions for HPT because of CMA region fragmentation.
> > >
> > > The issue is covered in the LWN article: https://lwn.net/Articles/684611/
> > > The article points that you raised the problem on LSFMM 2016. However I
> > > couldn't find a follow up article on the issue.
> > >
> > > Looking at the kernel commit log I've identified a few commits that
> > > might reduce CMA fragmentation and overcome HPT allocation failure:
> > >   - bd2e75633c801 ("dma-contiguous: use fallback alloc_pages for single 
> > > pages")
> > >   - 678e174c4c16a ("powerpc/mm/iommu: allow migration of cma allocated
> > > pages during mm_iommu_do_alloc")
> > >   - 9a4e9f3b2d739 ("mm: update get_user_pages_longterm to migrate pages 
> > > allocated from
> > > CMA region")
> > >   - d7fefcc8de914 ("mm/cma: add PF flag to force non cma alloc")
> > >
> > > Are there any other commits that address the issue? What is the first
> > > kernel version that shouldn't have the HPT allocation problem due to CMA
> > > fragmentation?
> > 
> > I've had some success increasing the CMA allocation with the
> > kvm_cma_resv_ratio boot parameter - see
> > arch/powerpc/kvm/book3s_hv_builtin.c
> > 
> > The default is 5%. In a support case in a former job we had a customer
> > who increased this to I think 7 or 8% and saw the symptoms subside
> > dramatically.
> > 
> 
> Hi Daniel,
> 
> Thank you, I'll try to increase kvm_cma_resv_ratio for now, but even 5%
> CMA reserve should be more than enough, given the size of HPT as 1/128th
> of VM max memory.
> 
> For a 16GB RAM VM without balloon device, only 128MB is going to be
> reserved for HPT using CMA. So, a 5% CMA reserve should allow provisioning
> VMs with over 1.5TB of RAM on a 256GB RAM host. In other words, the default
> CMA reserve allows overprovisioning 6 times more memory for VMs than is
> present on the host.
> 
> We rarely add balloon device and sometimes don't add it at all. Therefore
> I'm still looking for commits that would help to avoid the issue with
> the default CMA reserve.
> 

FWIW, I have noticed the following. My host has 4 NUMA nodes with 4 CPUs
per node; only one of the nodes has CMA pages, and only two of the nodes
have memory according to /proc/zoneinfo. The error can be reliably
reproduced if I attempt to place vCPUs on the node with CMA pages.

Roman


Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-12 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 16:47:30)
> 
> 
> On 12/12/2019 07:31, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> >>
> >>
> >> On 11/12/2019 02:35, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 10/12/2019 16:12, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 07/12/2019 12:12, Ram Pai wrote:
> >>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one 
> >>> of
> >>> its parameters.  On secure VMs, hypervisor cannot access the contents 
> >>> of
> >>> this page since it gets encrypted.  Hence share the page with the
> >>> hypervisor, and unshare when done.
> >>
> >>
> >> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
> >> pages. There is small problem that when DDW is enabled,
> >> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains
> >> about the performance on slack but this is caused by initial cleanup of
> >> the default TCE window (which we do not use anyway) and to battle this
> >> we can simply reduce its size by adding
> >
> > something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> > 13secs per device for H_PUT_TCE approach, during boot. This is with a
> > 30GB guest. With larger guest, the time will further deteriorate.
> 
> 
>  No it will not, I checked. The time is the same for 2GB and 32GB guests-
>  the delay is caused by clearing the small DMA window which is small by
>  the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
>  for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
>  (depends on the system) so the number of TCEs is much smaller.
> >>>
> >>> I cant get your results.  What changes did you make to get it?
> >>
> >>
> >> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
> >> in clearing the default window and the huge window took a fraction of a
> >> second to create and map.
> > 
> > Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
> > of H_PUT_TCE everywhere?
> 
> 
> Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
> fixed (I have it in my local branch), this does not make a difference.
> 
> 
> > 
> > In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
> > iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
> > instant),
> 
> PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
> exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
> and H_PUT_TCE_INDIRECT are present or neither.

That was my interpretation (or maybe I just went by what your
implementation did :), but just because they are available doesn't mean
the guest has to use them. I agree it's ugly to condition it on
is_secure_guest(), but to me that seems better than sharing memory
unnecessarily, or potentially leaving stale mappings in the default IOMMU.

Not sure if there are other alternatives, though.

> 
> 
> > and then force H_PUT_TCE for new mappings via something like:
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 6ba081dd61c9..85d092baf17d 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> > unsigned long flags;
> >  
> > if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> > +   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || is_secure_guest()) {
> 
> 
> Nobody (including myself) seems to like the idea of having
> is_secure_guest() all over the place.
> 
> And with KVM acceleration enabled, it is pretty fast anyway. Just now we
> do not have H_PUT_TCE in KVM/UV for secure guests but we will have to
> fix this for secure PCI passhtrough anyway.
> 
> 
> > return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> >direction, attrs);
> > }
> > 
> > That seems like it would avoid the extra 13s.
> 
> Or move around iommu_table_clear() which imho is just the right thing to do.
> 
> 
> > If we take the additional step of only mapping SWIOTLB range in
> > enable_ddw() for is_secure_guest() that might further improve things
> > (though the bigger motivation with that is the extra isolation it would
> > grant us for stuff behind the IOMMU, since it apparently doesn't affect
> > boot-time all that much)
> 
> 
> Sure, we just need to confirm how many of these swiotlb banks we are
> going to have (just one or many and at what location). Thanks,
> 
> 
> 
> > 
> >>
> >>
> >>
> >> -global
> >> spapr-pci-host-bridge.dma_win_size=0x400
> >
> > 

Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-12 Thread Michael Roth
Quoting Ram Pai (2019-12-12 00:45:02)
> On Tue, Dec 10, 2019 at 07:43:24PM -0600, Michael Roth wrote:
> > Quoting Ram Pai (2019-12-06 19:12:39)
> > > Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> > > secure guests")
> > > disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> > > path for secure VMs, helped enable dma_direct path.  This enabled
> > > support for bounce-buffering through SWIOTLB.  However it fails to
> > > operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> > > 
> > > Re-enable the dma_iommu_ops path for pseries Secure VMs. It handles all
> > > cases, including TCE-mapping I/O pages in the presence of an IOMMU.
> > 
> > Wasn't clear to me at first, but I guess the main gist of this series is
> > that we want to continue to use SWIOTLB, but also need to create mappings
> > of it's bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> > checks in DMA API functions to do the same.
> > 
> > That makes sense, but one issue I see with that is that
> > dma_iommu_map_bypass() only tests true if all the following are true:
> > 
> > 1) the device requests a 64-bit DMA mask via
> >dma_set_mask/dma_set_coherent_mask
> > 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> > 
> > dma_is_direct() checks don't have this limitation, so I think for
> > other cases, such as devices that use a smaller DMA mask, we'll
> > end up falling back to the non-bypass functions in dma_iommu_ops, which
> > will likely break for things like dma_alloc_coherent/dma_map_single
> > since they won't use SWIOTLB pages and won't do the necessary calls to
> > set_memory_unencrypted() to share those non-SWIOTLB buffers with
> > hypervisor.
> > 
> > Maybe that's ok, but I think we should be clearer about how to
> > fail/handle these cases.
> 
> Yes, makes sense. Devices that cannot handle a 64-bit DMA mask will not work.
> 
> > 
> > Though I also agree with some concerns Alexey stated earlier: it seems
> > wasteful to map the entire DDW window just so these bounce buffers can be
> > mapped.  Especially if you consider the lack of a mapping to be an 
> > additional
> > safe-guard against things like buggy device implementations on the QEMU
> > side. E.g. if we leaked pages to the hypervisor on accident, those pages
> > wouldn't be immediately accessible to a device, and would still require
> > additional work get past the IOMMU.
> 
> Well, an accidental, unintended page leak to the hypervisor is a very
> bad thing, regardless of any DMA mapping. The device may not be able to
> access it, but the hypervisor still can access it.

Agreed, but if IOMMU can provide additional isolation we should make use
of it when reasonable.

> 
> > 
> > What would it look like if we try to make all this work with disable_ddw 
> > passed
> > to kernel command-line (or forced for is_secure_guest())?
> > 
> >   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* 
> > ops,
> >  but an additional case or hook that considers is_secure_guest() might 
> > do
> >  it.
> >  
> >   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
> >  io_tlb_start/io_tlb_end. We could do it once, on-demand via
> >  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
> >  maybe in some init function.
> 
> Hmm... I'm not sure how to accomplish (2). We need to use some DDW window
> to set up the mappings, right? If disable_ddw is set, there won't be any
> DDW. What am I missing?

We have 2 windows, the default 32-bit window that normally covers DMA addresses
0..1GB, and an additional DDW window that's created on demand for 64-bit
devices (since they can address a full linear mapping of all guest
memory at DMA address 1<<59). Your current patch uses the latter, but we
could potentially use the 32-bit window since we only need to map the
SWIOTLB pages.

Not saying that's necessarily better, but one upside is it only requires
devices to support 32-bit DMA addressing, which is a larger class of
devices than those that support 64-bit (since 64-bit device drivers
generally have a 32-bit fallback). Required changes are a bit more
pervasive though.
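
(For reference, the 64-bit-with-32-bit-fallback idiom referred to above
is the standard driver pattern - a sketch, with dev being the usual
struct device pointer:

	/* Prefer 64-bit DMA; fall back to the 32-bit window. */
	if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) &&
	    dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)))
		return -EIO;	/* device cannot do DMA at all */

Devices that only succeed with the 32-bit mask are exactly the larger
class that a default-window mapping would cover.)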

It might make sense to get both scenarios working at some point, but
maybe it's okay to handle 64-bit first. 64-bit does give us more legroom
if we anticipate changes in where the SWIOTLB memory is allocated from,
or expect it to become larger than 1GB.

> 
> > 
> > That also has the benefit of not requiring devices to support 64-bit DMA.
> > 
> > Alternatively, we could continue to rely on the 64-bit DDW window, but
> > modify call to enable_ddw() to only map the io_tlb_start/end range in
> > the case of is_secure_guest(). This is a little cleaner implementation-wise
> > since we can rely on the 

[PATCH 49/58] serial/ucc_uart: Remove ifdef SUPPORT_SYSRQ

2019-12-12 Thread Dmitry Safonov
ucc_uart doesn't seem to support a console on itself, so the sysrq
handling can probably be deleted from the file, together with the
uart_handle_sysrq_char() call.

Cc: Timur Tabi 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Dmitry Safonov 
---
 drivers/tty/serial/ucc_uart.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/tty/serial/ucc_uart.c b/drivers/tty/serial/ucc_uart.c
index a0555ae2b1ef..ff7784047156 100644
--- a/drivers/tty/serial/ucc_uart.c
+++ b/drivers/tty/serial/ucc_uart.c
@@ -551,9 +551,7 @@ static void qe_uart_int_rx(struct uart_qe_port *qe_port)
/* Overrun does not affect the current character ! */
if (status & BD_SC_OV)
tty_insert_flip_char(tport, 0, TTY_OVERRUN);
-#ifdef SUPPORT_SYSRQ
port->sysrq = 0;
-#endif
goto error_return;
 }
 
-- 
2.24.0



[PATCH 31/58] tty/serial: Migrate pmac_zilog to use has_sysrq

2019-12-12 Thread Dmitry Safonov
The SUPPORT_SYSRQ ifdeffery is not nice, as it:
- may create a misunderstanding about sizeof(struct uart_port) between
  different objects
- prevents moving functions out of serial_core.h
- reduces readability (well, it's ifdeffery - it's hard to follow)

In order to remove SUPPORT_SYSRQ, a has_sysrq member has been added.
Initialise it in the driver's probe and remove the ifdeffery.

Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Dmitry Safonov 
---
 drivers/tty/serial/pmac_zilog.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
index bcb5bf70534e..ba65a3bbd72a 100644
--- a/drivers/tty/serial/pmac_zilog.c
+++ b/drivers/tty/serial/pmac_zilog.c
@@ -61,10 +61,6 @@
 #define of_machine_is_compatible(x) (0)
 #endif
 
-#if defined (CONFIG_SERIAL_PMACZILOG_CONSOLE) && defined(CONFIG_MAGIC_SYSRQ)
-#define SUPPORT_SYSRQ
-#endif
-
 #include 
 #include 
 
@@ -1721,6 +1717,7 @@ static int __init pmz_init_port(struct uart_pmac_port *uap)
uap->control_reg   = uap->port.membase;
uap->data_reg  = uap->control_reg + 4;
uap->port_type = 0;
+   uap->port.has_sysrq = IS_ENABLED(CONFIG_SERIAL_PMACZILOG_CONSOLE);
 
pmz_convert_to_zs(uap, CS8, 0, 9600);
 
-- 
2.24.0



[PATCH 05/58] tty/serial: Migrate 8250_fsl to use has_sysrq

2019-12-12 Thread Dmitry Safonov
The SUPPORT_SYSRQ ifdeffery is not nice, as it:
- may create a misunderstanding about sizeof(struct uart_port) between
  different objects
- prevents moving functions out of serial_core.h
- reduces readability (well, it's ifdeffery - it's hard to follow)

In order to remove SUPPORT_SYSRQ, a has_sysrq member has been added.
Initialise it in the driver's probe and remove the ifdeffery.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Dmitry Safonov 
---
 arch/powerpc/kernel/legacy_serial.c | 4 +++-
 drivers/tty/serial/8250/8250_fsl.c  | 4 
 drivers/tty/serial/8250/8250_of.c   | 4 +++-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c b/arch/powerpc/kernel/legacy_serial.c
index 7cea5978f21f..f061e06e9f51 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -479,8 +479,10 @@ static void __init fixup_port_irq(int index,
port->irq = virq;
 
 #ifdef CONFIG_SERIAL_8250_FSL
-   if (of_device_is_compatible(np, "fsl,ns16550"))
+   if (of_device_is_compatible(np, "fsl,ns16550")) {
port->handle_irq = fsl8250_handle_irq;
+   port->has_sysrq = IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+   }
 #endif
 }
 
diff --git a/drivers/tty/serial/8250/8250_fsl.c b/drivers/tty/serial/8250/8250_fsl.c
index aa0e216d5ead..0d0c80905c58 100644
--- a/drivers/tty/serial/8250/8250_fsl.c
+++ b/drivers/tty/serial/8250/8250_fsl.c
@@ -1,8 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-#if defined(CONFIG_SERIAL_8250_CONSOLE) && defined(CONFIG_MAGIC_SYSRQ)
-#define SUPPORT_SYSRQ
-#endif
-
 #include 
 #include 
 
diff --git a/drivers/tty/serial/8250/8250_of.c b/drivers/tty/serial/8250/8250_of.c
index 92fbf46ce3bd..531ad67395e0 100644
--- a/drivers/tty/serial/8250/8250_of.c
+++ b/drivers/tty/serial/8250/8250_of.c
@@ -222,8 +222,10 @@ static int of_platform_serial_setup(struct platform_device *ofdev,
 
if (IS_ENABLED(CONFIG_SERIAL_8250_FSL) &&
(of_device_is_compatible(np, "fsl,ns16550") ||
-of_device_is_compatible(np, "fsl,16550-FIFO64")))
+of_device_is_compatible(np, "fsl,16550-FIFO64"))) {
port->handle_irq = fsl8250_handle_irq;
+   port->has_sysrq = IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+   }
 
return 0;
 err_unprepare:
-- 
2.24.0



[PATCH 00/58] serial/sysrq: Cleanup ifdeffery

2019-12-12 Thread Dmitry Safonov
The original purpose of the patch set was to add a way to enable
sysrq on a uart where currently it can be constantly either on or off
(CONFIG_MAGIC_SYSRQ_SERIAL), see the last patch:
  "serial/sysrq: Add MAGIC_SYSRQ_SERIAL_SEQUENCE"

But to do that, I had to add uart_try_toggle_sysrq() and I didn't want
to bloat serial_core.h even more. So, I did a cleanup by removing
SUPPORT_SYSRQ, resulting in a nice diffstat and less ifdeffery.

Most patches are one-liners; I kept them separated per driver so that
reviewers can more easily follow the purpose.

Cc: Greg Kroah-Hartman 
Cc: Jiri Slaby 
Cc: Vasiliy Khoruzhick 
Cc: linux-ser...@vger.kernel.org

Dmitry Safonov (58):
  sysrq: Remove sysrq_handler_registered
  serial: Move sysrq members above
  serial_core: Un-ifdef sysrq SUPPORT_SYSRQ
  tty/serial: Migrate aspeed_vuart to use has_sysrq
  tty/serial: Migrate 8250_fsl to use has_sysrq
  tty/serial: Migrate bcm63xx_uart to use has_sysrq
  tty/serial: Migrate 8250_omap to use has_sysrq
  tty/serial: Migrate 8250_port to use has_sysrq
  tty/serial: Migrate amba-pl01* to use has_sysrq
  tty/serial: Migrate apbuart to use has_sysrq
  tty/serial: Migrate arc_uart to use has_sysrq
  tty/serial: Migrate atmel_serial to use has_sysrq
  tty/serial: Migrate clps711x to use has_sysrq
  tty/serial: Migrate cpm_uart to use has_sysrq
  tty/serial: Migrate dz to use has_sysrq
  tty/serial: Migrate efm32-uart to use has_sysrq
  tty/serial: Migrate fsl_linflexuart to use has_sysrq
  tty/serial: Migrate fsl_lpuart to use has_sysrq
  tty/serial: Migrate imx to use has_sysrq
  tty/serial: Migrate ip22zilog to use has_sysrq
  tty/serial: Migrate meson_uart to use has_sysrq
  tty/serial: Migrate milbeaut_usio to use has_sysrq
  tty/serial: Migrate mpc52xx_uart to use has_sysrq
  tty/serial: Don't zero port->sysrq
  tty/serial: Migrate msm_serial to use has_sysrq
  tty/serial: Migrate mux to use has_sysrq
  tty/serial: Migrate mxs-auart to use has_sysrq
  tty/serial: Migrate omap-serial to use has_sysrq
  tty/serial: Migrate pch_uart to use has_sysrq
  tty/serial: Don't check port->sysrq
  tty/serial: Migrate pmac_zilog to use has_sysrq
  tty/serial: Migrate pnx8xxx_uart to use has_sysrq
  serial/f81534: Don't check port->sysrq
  tty/serial: Migrate pxa to use has_sysrq
  tty/serial: Migrate qcom_geni_serial to use has_sysrq
  tty/serial: Migrate sa1100 to use has_sysrq
  tty/serial: Migrate samsung_tty to use has_sysrq
  tty/serial: Migrate sb1250-duart to use has_sysrq
  tty/serial: Migrate sccnxp to use has_sysrq
  tty/serial: Migrate serial_txx9 to use has_sysrq
  tty/serial: Migrate sh-sci to use has_sysrq
  tty/serial: Migrate sprd_serial to use has_sysrq
  tty/serial: Migrate st-asc to use has_sysrq
  tty/serial: Migrate stm32-usart to use has_sysrq
  tty/serial: Migrate sunhv to use has_sysrq
  tty/serial: Migrate sunsab to use has_sysrq
  tty/serial: Migrate sunsu to use has_sysrq
  tty/serial: Migrate sunzilog to use has_sysrq
  serial/ucc_uart: Remove ifdef SUPPORT_SYSRQ
  tty/serial: Migrate vr41xx_siu to use has_sysrq
  tty/serial: Migrate vt8500_serial to use has_sysrq
  tty/serial: Migrate xilinx_uartps to use has_sysrq
  tty/serial: Migrate zs to use has_sysrq
  serial_core: Remove SUPPORT_SYSRQ ifdeffery
  usb/serial: Don't handle break when CONFIG_MAGIC_SYSRQ is disabled
  serial_core: Move sysrq functions from header file
  sysctl/sysrq: Remove __sysrq_enabled copy
  serial/sysrq: Add MAGIC_SYSRQ_SERIAL_SEQUENCE

 arch/powerpc/kernel/legacy_serial.c |   4 +-
 drivers/tty/serial/8250/8250_aspeed_vuart.c |   5 +-
 drivers/tty/serial/8250/8250_fsl.c  |   4 -
 drivers/tty/serial/8250/8250_of.c   |   4 +-
 drivers/tty/serial/8250/8250_omap.c |   5 +-
 drivers/tty/serial/8250/8250_port.c |   5 +-
 drivers/tty/serial/amba-pl010.c |   5 +-
 drivers/tty/serial/amba-pl011.c |   6 +-
 drivers/tty/serial/apbuart.c|   5 +-
 drivers/tty/serial/arc_uart.c   |   5 +-
 drivers/tty/serial/atmel_serial.c   |   9 +-
 drivers/tty/serial/bcm63xx_uart.c   |   5 +-
 drivers/tty/serial/clps711x.c   |   5 +-
 drivers/tty/serial/cpm_uart/cpm_uart_core.c |   9 +-
 drivers/tty/serial/dz.c |   5 +-
 drivers/tty/serial/efm32-uart.c |   5 +-
 drivers/tty/serial/fsl_linflexuart.c|   8 +-
 drivers/tty/serial/fsl_lpuart.c |   9 +-
 drivers/tty/serial/imx.c|   7 +-
 drivers/tty/serial/ip22zilog.c  |   7 +-
 drivers/tty/serial/meson_uart.c |   5 +-
 drivers/tty/serial/milbeaut_usio.c  |   5 +-
 drivers/tty/serial/mpc52xx_uart.c   |  11 +-
 drivers/tty/serial/msm_serial.c |   5 +-
 drivers/tty/serial/mux.c|   5 +-
 drivers/tty/serial/mxs-auart.c  |   5 +-
 drivers/tty/serial/omap-serial.c|   5 +-
 drivers/tty/serial/pch_uart.c   |  12 +-
 

Re: [PATCH v3 3/3] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-12 Thread Jordan Niethe
On Fri, Dec 13, 2019 at 2:19 AM Daniel Axtens  wrote:
>
> KASAN support on Book3S is a bit tricky to get right:
>
>  - It would be good to support inline instrumentation so as to be able to
>catch stack issues that cannot be caught with outline mode.
>
>  - Inline instrumentation requires a fixed offset.
>
>  - Book3S runs code in real mode after booting. Most notably a lot of KVM
>runs in real mode, and it would be good to be able to instrument it.
>
>  - Because code runs in real mode after boot, the offset has to point to
>valid memory both in and out of real mode.
>
>[For those not immersed in ppc64, in real mode, the top nibble or 2 bits
>(depending on radix/hash mmu) of the address is ignored. The linear
>    mapping is placed at 0xc000000000000000. This means that a pointer to
>part of the linear mapping will work both in real mode, where it will be
>interpreted as a physical address of the form 0x000..., and out of real
>mode, where it will go via the linear mapping.]
>
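
(For readers following along: generic KASAN maps an address to its
shadow roughly as below - one shadow byte per 8 bytes, i.e. a 1/8-sized
shadow region - which is why the series needs an offset that yields a
valid pointer both in and out of real mode. This is the generic
mm/kasan mapping, quoted for reference, not code from this patch:

	static inline void *kasan_mem_to_shadow(const void *addr)
	{
		return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
			+ KASAN_SHADOW_OFFSET;
	}
)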

How does hash or radix mmu mode affect how many bits are ignored in real mode?

> One approach is just to give up on inline instrumentation. This way all
> checks can be delayed until after everything set is up correctly, and the
> address-to-shadow calculations can be overridden. However, the features and
> speed boost provided by inline instrumentation are worth trying to do
> better.
>
> If _at compile time_ it is known how much contiguous physical memory a
> system has, the top 1/8th of the first block of physical memory can be set
> aside for the shadow. This is a big hammer and comes with 3 big
> consequences:
>
>  - there's no nice way to handle physically discontiguous memory, so only
>the first physical memory block can be used.
>
>  - kernels will simply fail to boot on machines with less memory than
>specified when compiling.
>
>  - kernels running on machines with more memory than specified when
>compiling will simply ignore the extra memory.
>
> Implement and document KASAN this way. The current implementation is Radix
> only.
>
> Despite the limitations, it can still find bugs,
> e.g. http://patchwork.ozlabs.org/patch/1103775/
>
> At the moment, this physical memory limit must be set _even for outline
> mode_. This may be changed in a later series - a different implementation
> could be added for outline mode that dynamically allocates shadow at a
> fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/
>
> Suggested-by: Michael Ellerman 
> Cc: Balbir Singh  # ppc64 out-of-line radix version
> Cc: Christophe Leroy  # ppc32 version
> Signed-off-by: Daniel Axtens 
>
> ---
> Changes since v2:
>
>  - Address feedback from Christophe around cleanups and docs.
>  - Address feedback from Balbir: at this point I don't have a good solution
>for the issues you identify around the limitations of the inline 
> implementation
>but I think that it's worth trying to get the stack instrumentation 
> support.
>I'm happy to have an alternative and more flexible outline mode - I had
>envisoned this would be called 'lightweight' mode as it imposes fewer 
> restrictions.
>I've linked to your implementation. I think it's best to add it in a 
> follow-up series.
>  - Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think most people 
> have
>guests with at least that much memory in the Radix 64s case so it's a much
>saner default - it means that if you just turn on KASAN without reading the
>docs you're much more likely to have a bootable kernel, which you will 
> never
>have if the value is set to zero! I'm happy to bikeshed the value if we 
> want.
>
> Changes since v1:
>  - Landed kasan vmalloc support upstream
>  - Lots of feedback from Christophe.
>
> Changes since the rfc:
>
>  - Boots real and virtual hardware, kvm works.
>
>  - disabled reporting when we're checking the stack for exception
>frames. The behaviour isn't wrong, just incompatible with KASAN.
>
>  - Documentation!
>
>  - Dropped old module stuff in favour of KASAN_VMALLOC.
>
> The bugs with ftrace and kuap were due to kernel bloat pushing
> prom_init calls to be done via the plt. Because we did not have
> a relocatable kernel, and they are done very early, this caused
> everything to explode. Compile with CONFIG_RELOCATABLE!
> ---
>  Documentation/dev-tools/kasan.rst |   8 +-
>  Documentation/powerpc/kasan.txt   | 112 +-
>  arch/powerpc/Kconfig  |   3 +
>  arch/powerpc/Kconfig.debug|  21 
>  arch/powerpc/Makefile |  11 ++
>  arch/powerpc/include/asm/book3s/64/hash.h |   4 +
>  arch/powerpc/include/asm/book3s/64/pgtable.h  |   7 ++
>  arch/powerpc/include/asm/book3s/64/radix.h|   5 +
>  arch/powerpc/include/asm/kasan.h  |  21 +++-
>  arch/powerpc/kernel/process.c |   8 ++
>  arch/powerpc/kernel/prom.c 

[PATCH v2 1/1] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-12 Thread Sukadev Bhattiprolu


This patch is based on Bharata's v11 KVM patches for secure guests:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-November/200918.html

---
From c0826bac72a658312f3d87e0bb18ecaf08ac2b2e Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu 
Date: Fri, 27 Sep 2019 14:30:36 -0500
Subject: [PATCH v2 1/1] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

Implement the H_SVM_INIT_ABORT hcall, which the Ultravisor can use to
abort an SVM after it has issued the H_SVM_INIT_START and before the
H_SVM_INIT_DONE hcalls. This hcall could be used when the Ultravisor
encounters security violations or other errors when starting an SVM.

Note that this hcall is different from the UV_SVM_TERMINATE ucall, which
is used by the HV to terminate/clean up a VM that has become secure.

H_SVM_INIT_ABORT should basically undo the operations that were done
since the H_SVM_INIT_START hcall - i.e. page out all the VM pages back
to normal memory, unregister memslots, etc.

(If we do not bring the pages back to normal memory, the text/data
of the VM would be stuck in secure memory, and since the SVM did not
go secure, its MSR_S bit will be clear and the VM won't be able to
access its pages even to do a clean exit.)
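
A rough sketch of the flow described above (all names here are
illustrative, based on the flags from Bharata's series, and are not the
actual patch body):

	long kvmppc_h_svm_init_abort(struct kvm_vcpu *vcpu,
				     u64 guest_pc, u64 guest_msr)
	{
		struct kvm *kvm = vcpu->kvm;

		/* Only valid after H_SVM_INIT_START, before H_SVM_INIT_DONE. */
		if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START) ||
		    (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE))
			return H_UNSUPPORTED;

		/* Undo H_SVM_INIT_START: page all VM pages back to normal
		 * memory, unregister memslots, etc. (elided). */

		/* Return directly to the VM at the UV_ESM call site, with
		 * the failure visible as the UV_ESM result. */
		kvmppc_set_pc(vcpu, guest_pc);
		kvmppc_set_msr(vcpu, guest_msr);
		return H_PARAMETER;
	}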

Based on patches and discussion with Paul Mackerras, Ram Pai and
Bharata Rao.

Signed-off-by: Ram Pai 
Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Bharata B Rao 
---
Changelog[v2]:
[Paul Mackerras] avoid returning to UV "one last time" after
the state is cleaned up.  So, we now have H_SVM_INIT_ABORT:
- take the VM's NIP/MSR register states as parameters
- inherit the state of other registers as at UV_ESM call.
After cleaning up the partial state, HV uses these to return
directly to the VM with a failed UV_ESM call.
---
 Documentation/powerpc/ultravisor.rst| 56 +
 arch/powerpc/include/asm/hvcall.h   |  1 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 10 +++-
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  2 +-
 arch/powerpc/kvm/book3s_hv.c|  5 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 48 +-
 7 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
index 730854f73830..ef49c9192775 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -948,6 +948,62 @@ Use cases
 up its internal state for this virtual machine.
 
 
+H_SVM_INIT_ABORT
+----------------
+
+Abort the process of securing an SVM.
+
+Syntax
+~~
+
+.. code-block:: c
+
+   uint64_t hypercall(const uint64_t H_SVM_INIT_ABORT,
+   uint64_t guest_pc,  /* guest NIP to return to */
+   uint64_t guest_msr, /* guest MSR value */
+
+Return values
+~~~~~~~~~~~~~
+
+One of the following values:
+
+   * H_PARAMETER   on successfully cleaning up the state, Hypervisor will
+ return this value to the **guest**, to indicate that the underlying
+ UV_ESM ultra call failed.
+
+   * H_UNSUPPORTED if called from the wrong context (e.g. from an SVM
+ or before an H_SVM_INIT_START hypercall). This will return to the
+ Ultravisor which incorrectly issued the hcall.
+
+
+Description
+~~~~~~~~~~~
+
+Abort the process of securing a virtual machine. This call must
+be made after a prior call to ``H_SVM_INIT_START`` hypercall and
+before a call to ``H_SVM_INIT_DONE``.
+
+This hcall will clean up any partial state that was established for
+the VM since the prior ``H_SVM_INIT_START`` hcall, including paging
+out pages that were paged into secure memory, unregistering memory
+slots, etc.
+
+After the partial state is cleaned up, control returns to the address
+specified in ``guest_pc`` with the MSR values set to ``guest_msr``.
+These parameters are expected to match the state of NIP and MSR
+registers of the VM at the time it issued the ``UV_ESM`` ultra call
+to transition to a secure VM.
+
+Use cases
+~~~~~~~~~
+
+If after a successful call to ``H_SVM_INIT_START``, the Ultravisor
+encounters an error while securing a virtual machine, either due
+to lack of resources or because the VM's security information could
+not be validated, Ultravisor informs the Hypervisor about it.
+Hypervisor should use this call to clean up any internal state for
+this virtual machine and return to the VM.
+
 H_SVM_PAGE_IN
 -------------
 
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 13bd870609c3..e90c073e437e 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -350,6 +350,7 @@
 #define H_SVM_PAGE_OUT 0xEF04
 #define H_SVM_INIT_START   0xEF08
 #define H_SVM_INIT_DONE0xEF0C
+#define H_SVM_INIT_ABORT   0xEF14
 
 /* Values for 2nd argument to 

Re: Call for report - G5/PPC970 status

2019-12-12 Thread Andreas Schwab
On Dec 12 2019, Romain Dolbeau wrote:

> Can you share your kernel config, compiler version, and other details?

I'm using 4K pages, in case that matters, and gcc 4.8.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."
#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 5.5.0-rc1 Kernel Configuration
#

#
# Compiler: gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=40801
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_WARN_MAYBE_UNINITIALIZED=y
CONFIG_CC_DISABLE_WARN_MAYBE_UNINITIALIZED=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_XZ is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_SHOW_LEVEL=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_ARCH_HAS_TICK_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
# end of CPU/Task time and stats accounting

# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=15
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13

#
# Scheduler features
#
CONFIG_UCLAMP_TASK=y
CONFIG_UCLAMP_BUCKETS_COUNT=5
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_CC_HAS_INT128=y
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
# CONFIG_MEMCG_SWAP_ENABLED is not set
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_UCLAMP_TASK_GROUP is not set
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
# CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is not set
CONFIG_SYSCTL=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_BPF_SYSCALL=y
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# end of Kernel Performance 

Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Peter Zijlstra
On Thu, Dec 12, 2019 at 09:21:57PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 12, 2019 at 07:34:01PM +, Will Deacon wrote:
> > void ool_store_release(volatile unsigned long *ptr, unsigned long val)
> > {
> > smp_store_release(ptr, val);
> > }
> > 
> > 0000000000000000 <ool_store_release>:
> >    0:   a9be7bfd    stp  x29, x30, [sp, #-32]!
> >    4:   9002        adrp x2, 0 <__stack_chk_guard>
> >    8:   9142        add  x2, x2, #0x0
> >    c:   910003fd    mov  x29, sp
> >   10:   f9400043    ldr  x3, [x2]
> >   14:   f9000fa3    str  x3, [x29, #24]
> >   18:   d283        mov  x3, #0x0                   // #0
> >   1c:   c89ffc01    stlr x1, [x0]
> >   20:   f9400fa1    ldr  x1, [x29, #24]
> >   24:   f9400040    ldr  x0, [x2]
> >   28:   ca20        eor  x0, x1, x0
> >   2c:   b560        cbnz x0, 38 <ool_store_release+0x38>
> >   30:   a8c27bfd    ldp  x29, x30, [sp], #32
> >   34:   d65f03c0    ret
> >   38:   9400        bl   0 <__stack_chk_fail>
> > 
> > It's a mess, and fixing READ_ONCE() doesn't help this case, which is why
> > I was looking at getting rid of volatile where it's not strictly needed.
> > I'm certainly open to other suggestions, I just haven't managed to think
> > of anything else.
> 
> We could move the kernel to C++ and write:
> 
>   std::remove_volatile<decltype(p)>::type __p = (p);
> 
> /me runs like hell...

Also, the GCC __auto_type thing strips _Atomic and const qualifiers but
for some obscure raisin forgets to strip volatile :/

  https://gcc.gnu.org/ml/gcc-patches/2013-11/msg01378.html

Now, looking at the current GCC source:

  
https://github.com/gcc-mirror/gcc/blob/97d7270f894395e513667a031a0c309d1819d05e/gcc/c/c-parser.c#L3707

it seems that __typeof__() is supposed to strip all qualifiers from
_Atomic types. That led me to try:

typeof(_Atomic typeof(p)) __p = (p);

But alas, I still get the same junk you got for ool_store_release() :/


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Linus Torvalds
On Thu, Dec 12, 2019 at 11:34 AM Will Deacon  wrote:
>
> The root of my concern in all of this, and what started me looking at it in
> the first place, is the interaction with 'typeof()'. Inheriting 'volatile'
> for a pointer means that local variables in macros declared using typeof()
> suddenly start generating *hideous* code, particularly when pointless stack
> spills get stackprotector all excited.

Yeah, removing volatile can be a bit annoying.

For the particular case of the bitops, though, it's not an issue.
Since you know the type there, you can just cast it.

And if we had the rule that READ_ONCE() was an arithmetic type, you could do

typeof(0+(*p)) __var;

since you might as well get the integer promotion anyway (on the
non-volatile result).

But that doesn't work with structures or unions, of course.

I'm not entirely sure we have READ_ONCE() with a struct. I do know we
have it with 64-bit entities on 32-bit machines, but that's ok with
the "0+" trick.
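
A stand-alone sketch of the trick (illustration only, GNU C, not code
from the original mail): the usual arithmetic conversions on '0 + *p'
yield the unqualified promoted type, so typeof() of that expression
drops the volatile that typeof(*p) would keep.

#include <stdio.h>

static volatile long shared = 42;

int main(void)
{
        volatile long *p = &shared;

        typeof(*p) a = *p;      /* 'a' is volatile long */
        typeof(0 + *p) b = *p;  /* '0 + *p' promotes to plain long */

        printf("%ld %ld\n", (long)a, (long)b);
        return 0;
}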

   Linus


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Peter Zijlstra
On Thu, Dec 12, 2019 at 07:34:01PM +, Will Deacon wrote:
> void ool_store_release(volatile unsigned long *ptr, unsigned long val)
> {
>   smp_store_release(ptr, val);
> }
> 
> <ool_store_release>:
>    0:   a9be7bfd        stp     x29, x30, [sp, #-32]!
>    4:   90000002        adrp    x2, 0 <__stack_chk_guard>
>    8:   91000042        add     x2, x2, #0x0
>    c:   910003fd        mov     x29, sp
>   10:   f9400043        ldr     x3, [x2]
>   14:   f9000fa3        str     x3, [x29, #24]
>   18:   d2800003        mov     x3, #0x0                // #0
>   1c:   c89ffc01        stlr    x1, [x0]
>   20:   f9400fa1        ldr     x1, [x29, #24]
>   24:   f9400040        ldr     x0, [x2]
>   28:   ca000020        eor     x0, x1, x0
>   2c:   b5000060        cbnz    x0, 38 <ool_store_release+0x38>
>   30:   a8c27bfd        ldp     x29, x30, [sp], #32
>   34:   d65f03c0        ret
>   38:   94000000        bl      0 <__stack_chk_fail>
> 
> It's a mess, and fixing READ_ONCE() doesn't help this case, which is why
> I was looking at getting rid of volatile where it's not strictly needed.
> I'm certainly open to other suggestions, I just haven't managed to think
> of anything else.

We could move the kernel to C++ and write:

std::remove_volatile<decltype(p)>::type __p = (p);

/me runs like hell...


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Will Deacon
Hi Linus,

On Thu, Dec 12, 2019 at 10:43:05AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 10:06 AM Will Deacon  wrote:
> >
> > I'm currently trying to solve the issue by removing volatile from the bitop
> > function signatures
> 
> I really think that's the wrong thing to do.
> 
> The bitop signature really should be "volatile" (and it should be
> "const volatile" for test_bit, but I'm not sure anybody cares).

Agreed on the "const" part, although I do think the "volatile" aspect has
nasty side-effects despite being a visual indicator that we're eliding
locks. More below.

> Exactly because it's simply valid to say "hey, my data is volatile,
> but do an atomic test of this bit". So it might be volatile in the
> caller.

That's fair, although the cases I've run into so far for the bitops are
usually just that the functions have been wrapped, and volatile could easily
be dropped from the caller as well (e.g. assign_bit(), __node_clear(),
linkmode_test_bit()).

> Now, I generally frown on actual volatile data structures - because
> the data structure volatility often depends on _context_. The same
> data might be volatile in one context (when you do some optimistic
> test on it without locking), but 100% stable in another (when you do
> have a lock).

There are cases in driver code where it looks as though data members are
being declared volatile specifically because of the bitops type signatures
(e.g. 'wrapped' in 'struct mdp5_mdss', 'context_flag' in 'struct
drm_device', 'state' in 'struct s2io_nic'). Yeah, it's bogus, but I think
that having the modifier in the function signature is still leading people
astray.

> So I don't want to see "volatile" on data definitions ("jiffies" being
> the one traditional exception), but marking things volatile in code
> (because you know you're working with unlocked data) and then passing
> them down to various helper functions - including the bitops ones - is
> quite traditional and accepted.
> 
> In other words, "volatile" should be treated the same way "const" is
> largely treated in C.
> 
> A pointer to "const" data doesn't mean that the data is read-only, or
> that it cannot be modified _elsewhere_, it means that within this
> particular context and this copy of the pointer we promise not to
> write to it.
> 
> Similarly, a pointer to "volatile" data doesn't mean that the data
> might not be stable once you take a lock, for example. So it's ok to
> have volatile pointers even if the data declaration itself isn't
> volatile - you're stating something about the context, not something
> fundamental about the data.
> 
> And in the context of the bit operations, "volatile" is the correct thing
> to do.

The root of my concern in all of this, and what started me looking at it in
the first place, is the interaction with 'typeof()'. Inheriting 'volatile'
for a pointer means that local variables in macros declared using typeof()
suddenly start generating *hideous* code, particularly when pointless stack
spills get stackprotector all excited. Even if we simplify READ_ONCE() back
to its old incantation, the acquire/release accessors will have the exact
same issues on architectures that implement them.

For example, consider this code on arm64:

void ool_store_release(unsigned long *ptr, unsigned long val)
{
smp_store_release(ptr, val);
}

This compiles to a single instruction plus return, which is what we want:

 <ool_store_release>:
    0:   c89ffc01        stlr    x1, [x0]
    4:   d65f03c0        ret

Now, see what happens if we make the 'ptr' argument volatile:

void ool_store_release(volatile unsigned long *ptr, unsigned long val)
{
smp_store_release(ptr, val);
}

 <ool_store_release>:
    0:   a9be7bfd        stp     x29, x30, [sp, #-32]!
    4:   90000002        adrp    x2, 0 <__stack_chk_guard>
    8:   91000042        add     x2, x2, #0x0
    c:   910003fd        mov     x29, sp
   10:   f9400043        ldr     x3, [x2]
   14:   f9000fa3        str     x3, [x29, #24]
   18:   d2800003        mov     x3, #0x0                // #0
   1c:   c89ffc01        stlr    x1, [x0]
   20:   f9400fa1        ldr     x1, [x29, #24]
   24:   f9400040        ldr     x0, [x2]
   28:   ca000020        eor     x0, x1, x0
   2c:   b5000060        cbnz    x0, 38 <ool_store_release+0x38>
   30:   a8c27bfd        ldp     x29, x30, [sp], #32
   34:   d65f03c0        ret
   38:   94000000        bl      0 <__stack_chk_fail>

It's a mess, and fixing READ_ONCE() doesn't help this case, which is why
I was looking at getting rid of volatile where it's not strictly needed.
I'm certainly open to other suggestions, I just haven't managed to think
of anything else.

Will


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Linus Torvalds
On Thu, Dec 12, 2019 at 10:06 AM Will Deacon  wrote:
>
> I'm currently trying to solve the issue by removing volatile from the bitop
> function signatures

I really think that's the wrong thing to do.

The bitop signature really should be "volatile" (and it should be
"const volatile" for test_bit, but I'm not sure anybody cares).

Exactly because it's simply valid to say "hey, my data is volatile,
but do an atomic test of this bit". So it might be volatile in the
caller.

Now, I generally frown on actual volatile data structures - because
the data structure volatility often depends on _context_. The same
data might be volatile in one context (when you do some optimistic
test on it without locking), but 100% stable in another (when you do
have a lock).

So I don't want to see "volatile" on data definitions ("jiffies" being
the one traditional exception), but marking things volatile in code
(because you know you're working with unlocked data) and then passing
them down to various helper functions - including the bitops ones - is
quite traditional and accepted.

In other words, "volatile" should be treated the same way "const" is
largely treated in C.

A pointer to "const" data doesn't mean that the data is read-only, or
that it cannot be modified _elsewhere_, it means that within this
particular context and this copy of the pointer we promise not to
write to it.

Similarly, a pointer to "volatile" data doesn't mean that the data
might not be stable once you take a lock, for example. So it's ok to
have volatile pointers even if the data declaration itself isn't
volatile - you're stating something about the context, not something
fundamental about the data.

And in the context of the bit operations, "volatile" is the correct thing to do.

 Linus


Re: Call for report - G5/PPC970 status

2019-12-12 Thread Romain Dolbeau
Le jeu. 12 déc. 2019 à 19:20, Andreas Schwab  a écrit :
> My PowerMac7,3 (DP 2.0GHz) can boot 5.5-rc1 without issues.

This is weird, as except for the frequency it should be the same
system as Jeroen's crashing G5!

Can you share your kernel config, compiler version, and other details?
Perhaps even the binary...

Thanks & cordially,

-- 
Romain Dolbeau


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Christian Borntraeger



On 12.12.19 19:06, Will Deacon wrote:
> On Thu, Dec 12, 2019 at 09:41:32AM -0800, Linus Torvalds wrote:
>> On Thu, Dec 12, 2019 at 2:46 AM Peter Zijlstra  wrote:
>>>
>>> +#ifdef GCC_VERSION < 40800
>>
>> Where does that 4.8 version check come from, and why?
>>
>> Yeah, I know, but this really wants a comment. Sadly it looks like gcc
>> bugzilla is down, so
>>
>>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
>>
>> currently gives an "Internal Server Error" for me.
>>
>> [ Delete the horrid code we have because of gcc bugs ]
>>
>>> +#else /* GCC_VERSION < 40800 */
>>> +
>>> +#define READ_ONCE_NOCHECK(x)   \
>>> +({ \
>>> +   typeof(x) __x = *(volatile typeof(x))&(x);  \
>>
>> I think we can/should just do this unconditionally if it helps the issue.
> 
> I'm currently trying to solve the issue by removing volatile from the bitop
> function signatures, but it's grotty because there are quite a few callers
> to fix up. I'm still trying to do it, because removing volatile fields from
> structures is generally a "good thing", but I'd be keen to simplify
> READ_ONCE() as you suggest regardless.

As I am the one who added the foundation of READ_ONCE's ugliness, I am now in
favour of re-simplifying it again. I was first a bit scared about re-introducing
bugs, but the gcc testsuite has this particular case covered, so hopefully we
should not see the issue with volatile and aggregate types again.

Christian



Re: Call for report - G5/PPC970 status

2019-12-12 Thread Andreas Schwab
On Dez 11 2019, Romain Dolbeau wrote:

> Same question to anyone else with a G5 / PPC970 - what is it and does
> it boot recent PPC64 Linux kernel ?

My PowerMac7,3 (DP 2.0GHz) can boot 5.5-rc1 without issues.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: Call for report - G5/PPC970 status

2019-12-12 Thread Romain Dolbeau
Le jeu. 12 déc. 2019 à 09:08, John Paul Adrian Glaubitz
 a écrit :
> I suggest booting the machine with a netconsole to get a dump of the crash
> over the network, see [1].

I added netconsole (not as a module, directly in the kernel), but I
get nothing on my receiver 'nc'.
I don't see the network interface identified anywhere prior to the
crash, which could explain it
(I might also have misconfigured the command-line...). The crash is
probably too early.
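
For reference, the syntax documented in
Documentation/networking/netconsole.txt is (the addresses below are
made-up examples):

  netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]

e.g. on the kernel command line:

  netconsole=6665@192.168.1.10/eth0,6666@192.168.1.2/00:11:22:33:44:55

with the receiver running something like 'nc -u -l 6666'.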

I also tried compiling w/o SMP to see if that changed anything, but
the kernel won't compile:
first the ppc_watchdog seems to rely on an SMP symbol, and once I
disabled it, the MM subsystem was missing some 'numa' symbols.

Cordially,

-- 
Romain Dolbeau


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Will Deacon
On Thu, Dec 12, 2019 at 09:41:32AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 2:46 AM Peter Zijlstra  wrote:
> >
> > +#ifdef GCC_VERSION < 40800
> 
> Where does that 4.8 version check come from, and why?
> 
> Yeah, I know, but this really wants a comment. Sadly it looks like gcc
> bugzilla is down, so
> 
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
> 
> currently gives an "Internal Server Error" for me.
> 
> [ Delete the horrid code we have because of gcc bugs ]
> 
> > +#else /* GCC_VERSION < 40800 */
> > +
> > +#define READ_ONCE_NOCHECK(x)   \
> > +({ \
> > +   typeof(x) __x = *(volatile typeof(x))&(x);  \
> 
> I think we can/should just do this unconditionally if it helps the issue.

I'm currently trying to solve the issue by removing volatile from the bitop
function signatures, but it's grotty because there are quite a few callers
to fix up. I'm still trying to do it, because removing volatile fields from
structures is generally a "good thing", but I'd be keen to simplify
READ_ONCE() as you suggest regardless.

> Maybe add a warning about how gcc < 4.8 might mis-compile the kernel -
> those versions are getting close to being unacceptable for kernel
> builds anyway.
> 
> We could also look at being stricter for the normal READ/WRITE_ONCE(),
> and require that they are
> 
>  (a) regular integer types
> 
>  (b) fit in an atomic word
> 
> We actually did (b) for a while, until we noticed that we do it on
> loff_t's etc and relaxed the rules. But maybe we could have a
> "non-atomic" version of READ/WRITE_ONCE() that is used for the
> questionable cases?

That makes a lot of sense to me, and it would allow us to use
compiletime_assert_atomic_type() as we do for the acquire/release accessors.

Will


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Segher Boessenkool
On Thu, Dec 12, 2019 at 09:41:32AM -0800, Linus Torvalds wrote:
> Yeah, I know, but this really wants a comment. Sadly it looks like gcc
> bugzilla is down, so
> 
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
> 
> currently gives an "Internal Server Error" for me.

We're being DoSsed again.  Reload, it will work after a while :-/


Segher


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Linus Torvalds
On Thu, Dec 12, 2019 at 2:46 AM Peter Zijlstra  wrote:
>
> +#ifdef GCC_VERSION < 40800

Where does that 4.8 version check come from, and why?

Yeah, I know, but this really wants a comment. Sadly it looks like gcc
bugzilla is down, so

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145

currently gives an "Internal Server Error" for me.

[ Delete the horrid code we have because of gcc bugs ]

> +#else /* GCC_VERSION < 40800 */
> +
> +#define READ_ONCE_NOCHECK(x)   \
> +({ \
> +   typeof(x) __x = *(volatile typeof(x))&(x);  \

I think we can/should just do this unconditionally if it helps the issue.

Maybe add a warning about how gcc < 4.8 might mis-compile the kernel -
those versions are getting close to being unacceptable for kernel
builds anyway.

We could also look at being stricter for the normal READ/WRITE_ONCE(),
and require that they are

 (a) regular integer types

 (b) fit in an atomic word

We actually did (b) for a while, until we noticed that we do it on
loff_t's etc and relaxed the rules. But maybe we could have a
"non-atomic" version of READ/WRITE_ONCE() that is used for the
questionable cases?
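
A hedged sketch of that stricter rule (illustration only, names
hypothetical, not code from the original mail): the _Static_assert
rejects over-wide types at compile time, while '0 +' both rejects
non-arithmetic, non-pointer types and strips volatile from the result.

#define READ_ONCE_STRICT(x)                                             \
({                                                                      \
        _Static_assert(sizeof(x) <= sizeof(long),                       \
                       "READ_ONCE_STRICT: type wider than one word");   \
        (typeof(0 + (x)))*(const volatile typeof(x) *)&(x);             \
})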

  Linus


[PATCH] spi: fsl: use platform_get_irq() instead of of_irq_to_resource()

2019-12-12 Thread Christophe Leroy
Unlike irq_of_parse_and_map() which has a dummy definition on SPARC,
of_irq_to_resource() hasn't.

But as platform_get_irq() can be used instead and is generic, use it.

Reported-by: kbuild test robot 
Suggested-by: Mark Brown 
Fixes:  3194d2533eff ("spi: fsl: don't map irq during probe")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 drivers/spi/spi-fsl-spi.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index 2d85c81983b1..c76128cadf0c 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -765,9 +765,9 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
if (ret)
goto err;
 
-   irq = of_irq_to_resource(np, 0, NULL);
-   if (irq <= 0) {
-   ret = -EINVAL;
+   irq = platform_get_irq(ofdev, 0);
+   if (irq < 0) {
+   ret = irq;
goto err;
}
 
-- 
2.13.3



Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Will Deacon
On Thu, Dec 12, 2019 at 05:04:27PM +, Will Deacon wrote:
> On Thu, Dec 12, 2019 at 11:46:10AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 12, 2019 at 10:07:56AM +, Will Deacon wrote:
> > 
> > > > So your proposed change _should_ be fine. Will, I'm assuming you never
> > > > saw this on your ARGH64 builds when you did this code ?
> > > 
> > > I did see it, but (a) looking at the code out-of-line makes it look a lot
> > > worse than it actually is (so the ext4 example is really helpful -- thanks
> > > Michael!) and (b) I chalked it up to a crappy compiler.
> > > 
> > > However, see this comment from Arnd on my READ_ONCE series from the other
> > > day:
> > > 
> > > https://lore.kernel.org/lkml/CAK8P3a0f=wvsqsbq4t0fmekcfe_mc3oarxaetvitsksa-d2...@mail.gmail.com
> > > 
> > > In which case, I'm thinking that we should be doing better in READ_ONCE()
> > > for non-buggy compilers which would also keep the KCSAN folks happy for 
> > > this
> > > code (and would help with [1] too).
> > 
> > So something like this then? Although I suppose that should be moved
> > into compiler-gcc.h and then guarded by #ifndef READ_ONCE or so.
> 
> Ah wait, I think we've been looking at this wrong. The volatile pointer
> argument is actually the problem here, not READ_ONCE()! The use of typeof()
> means that the temporary variable to which __READ_ONCE_SIZE writes ends up
> being a volatile store, so it can't be optimised away. This is why we get
> a stack access and why stack protector then wrecks the codegen for us.

Hmm, it's actually probably the volatile read which is causing the problem,
since __READ_ONCE_SIZE has cast that away and just uses "void *", but you
get the idea.

Will


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Will Deacon
On Thu, Dec 12, 2019 at 11:46:10AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 12, 2019 at 10:07:56AM +, Will Deacon wrote:
> 
> > > So your proposed change _should_ be fine. Will, I'm assuming you never
> > > saw this on your ARGH64 builds when you did this code ?
> > 
> > I did see it, but (a) looking at the code out-of-line makes it look a lot
> > worse than it actually is (so the ext4 example is really helpful -- thanks
> > Michael!) and (b) I chalked it up to a crappy compiler.
> > 
> > However, see this comment from Arnd on my READ_ONCE series from the other
> > day:
> > 
> > https://lore.kernel.org/lkml/CAK8P3a0f=wvsqsbq4t0fmekcfe_mc3oarxaetvitsksa-d2...@mail.gmail.com
> > 
> > In which case, I'm thinking that we should be doing better in READ_ONCE()
> > for non-buggy compilers which would also keep the KCSAN folks happy for this
> > code (and would help with [1] too).
> 
> So something like this then? Although I suppose that should be moved
> into compiler-gcc.h and then guarded by #ifndef READ_ONCE or so.

Ah wait, I think we've been looking at this wrong. The volatile pointer
argument is actually the problem here, not READ_ONCE()! The use of typeof()
means that the temporary variable to which __READ_ONCE_SIZE writes ends up
being a volatile store, so it can't be optimised away. This is why we get
a stack access and why stack protector then wrecks the codegen for us.

I'll cook a patch getting rid of those volatiles.

Will


[PATCH 3/3] soc: fsl: dpio: Replace QMAN array mode by ring mode enqueue.

2019-12-12 Thread Youri Querry
This change of algorithm will enable faster bulk enqueue.
This will greatly benefit XDP bulk enqueue.

Signed-off-by: Youri Querry 
---
 drivers/soc/fsl/dpio/qbman-portal.c | 410 +++-
 drivers/soc/fsl/dpio/qbman-portal.h |  13 ++
 2 files changed, 328 insertions(+), 95 deletions(-)

diff --git a/drivers/soc/fsl/dpio/qbman-portal.c b/drivers/soc/fsl/dpio/qbman-portal.c
index 0ffe018..740ee0d 100644
--- a/drivers/soc/fsl/dpio/qbman-portal.c
+++ b/drivers/soc/fsl/dpio/qbman-portal.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "qbman-portal.h"
@@ -21,6 +22,7 @@
 
 /* CINH register offsets */
 #define QBMAN_CINH_SWP_EQCR_PI  0x800
+#define QBMAN_CINH_SWP_EQCR_CI 0x840
 #define QBMAN_CINH_SWP_EQAR0x8c0
 #define QBMAN_CINH_SWP_CR_RT0x900
 #define QBMAN_CINH_SWP_VDQCR_RT 0x940
@@ -44,6 +46,8 @@
 #define QBMAN_CENA_SWP_CR  0x600
 #define QBMAN_CENA_SWP_RR(vb)  (0x700 + ((u32)(vb) >> 1))
 #define QBMAN_CENA_SWP_VDQCR   0x780
+#define QBMAN_CENA_SWP_EQCR_CI 0x840
+#define QBMAN_CENA_SWP_EQCR_CI_MEMBACK 0x1840
 
 /* CENA register offsets in memory-backed mode */
 #define QBMAN_CENA_SWP_DQRR_MEM(n)  (0x800 + ((u32)(n) << 6))
@@ -71,6 +75,12 @@
 /* opaque token for static dequeues */
 #define QMAN_SDQCR_TOKEN0xbb
 
+#define QBMAN_EQCR_DCA_IDXMASK  0x0f
+#define QBMAN_ENQUEUE_FLAG_DCA  (1ULL << 31)
+
+#define EQ_DESC_SIZE_WITHOUT_FD 29
+#define EQ_DESC_SIZE_FD_START 32
+
 enum qbman_sdqcr_dct {
qbman_sdqcr_dct_null = 0,
qbman_sdqcr_dct_prio_ics,
@@ -215,6 +225,15 @@ static inline u32 qbman_set_swp_cfg(u8 max_fill, u8 wn, u8 est, u8 rpm, u8 dcm,
 
 #define QMAN_RT_MODE  0x0100
 
+static inline u8 qm_cyc_diff(u8 ringsize, u8 first, u8 last)
+{
+   /* 'first' is included, 'last' is excluded */
+   if (first <= last)
+   return last - first;
+   else
+   return (2 * ringsize) - (first - last);
+}
+
 /**
  * qbman_swp_init() - Create a functional object representing the given
  *QBMan portal descriptor.
@@ -227,6 +246,10 @@ struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
 {
struct qbman_swp *p = kzalloc(sizeof(*p), GFP_KERNEL);
u32 reg;
+   u32 mask_size;
+   u32 eqcr_pi;
+
+   spin_lock_init(&p->access_spinlock);
 
if (!p)
return NULL;
@@ -255,25 +278,38 @@ struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
p->addr_cena = d->cena_bar;
p->addr_cinh = d->cinh_bar;
 
-   if ((p->desc->qman_version & QMAN_REV_MASK) >= QMAN_REV_5000)
-   memset(p->addr_cena, 0, 64 * 1024);
+   if ((p->desc->qman_version & QMAN_REV_MASK) < QMAN_REV_5000) {
 
-   reg = qbman_set_swp_cfg(p->dqrr.dqrr_size,
-   1, /* Writes Non-cacheable */
-   0, /* EQCR_CI stashing threshold */
-   3, /* RPM: Valid bit mode, RCR in array mode */
-   2, /* DCM: Discrete consumption ack mode */
-   3, /* EPM: Valid bit mode, EQCR in array mode */
-   1, /* mem stashing drop enable == TRUE */
-   1, /* mem stashing priority == TRUE */
-   1, /* mem stashing enable == TRUE */
-   1, /* dequeue stashing priority == TRUE */
-   0, /* dequeue stashing enable == FALSE */
-   0); /* EQCR_CI stashing priority == FALSE */
-   if ((p->desc->qman_version & QMAN_REV_MASK) >= QMAN_REV_5000)
+   reg = qbman_set_swp_cfg(p->dqrr.dqrr_size,
+   1, /* Writes Non-cacheable */
+   0, /* EQCR_CI stashing threshold */
+   3, /* RPM: RCR in array mode */
+   2, /* DCM: Discrete consumption ack */
+   2, /* EPM: EQCR in ring mode */
+   1, /* mem stashing drop enable */
+   1, /* mem stashing priority enable */
+   1, /* mem stashing enable */
+   1, /* dequeue stashing priority enable */
+   0, /* dequeue stashing enable */
+   0); /* EQCR_CI stashing priority enable */
+   } else {
+   memset(p->addr_cena, 0, 64 * 1024);
+   reg = qbman_set_swp_cfg(p->dqrr.dqrr_size,
+   1, /* Writes Non-cacheable */
+   1, /* EQCR_CI stashing threshold */
+   3, /* RPM: RCR in array mode */
+   2, /* DCM: Discrete consumption ack */
+   0, /* EPM: EQCR in ring mode */
+   1, /* mem stashing drop enable */
+   1, /* mem stashing priority enable */
+   

[PATCH 2/3] soc: fsl: dpio: QMAN performance improvement. Function pointer indirection.

2019-12-12 Thread Youri Querry
We are making the access decision in the initialization and
setting the function pointers accordingly.

Signed-off-by: Youri Querry 
---
 drivers/soc/fsl/dpio/qbman-portal.c | 451 +++-
 drivers/soc/fsl/dpio/qbman-portal.h | 129 ++-
 2 files changed, 507 insertions(+), 73 deletions(-)

diff --git a/drivers/soc/fsl/dpio/qbman-portal.c b/drivers/soc/fsl/dpio/qbman-portal.c
index 5a37ac8..0ffe018 100644
--- a/drivers/soc/fsl/dpio/qbman-portal.c
+++ b/drivers/soc/fsl/dpio/qbman-portal.c
@@ -83,6 +83,82 @@ enum qbman_sdqcr_fc {
qbman_sdqcr_fc_up_to_3 = 1
 };
 
+/* Internal Function declaration */
+static int qbman_swp_enqueue_direct(struct qbman_swp *s,
+   const struct qbman_eq_desc *d,
+   const struct dpaa2_fd *fd);
+static int qbman_swp_enqueue_mem_back(struct qbman_swp *s,
+ const struct qbman_eq_desc *d,
+ const struct dpaa2_fd *fd);
+static int qbman_swp_enqueue_multiple_direct(struct qbman_swp *s,
+const struct qbman_eq_desc *d,
+const struct dpaa2_fd *fd,
+uint32_t *flags,
+int num_frames);
+static int qbman_swp_enqueue_multiple_mem_back(struct qbman_swp *s,
+  const struct qbman_eq_desc *d,
+  const struct dpaa2_fd *fd,
+  uint32_t *flags,
+  int num_frames);
+static int
+qbman_swp_enqueue_multiple_desc_direct(struct qbman_swp *s,
+  const struct qbman_eq_desc *d,
+  const struct dpaa2_fd *fd,
+  int num_frames);
+static
+int qbman_swp_enqueue_multiple_desc_mem_back(struct qbman_swp *s,
+const struct qbman_eq_desc *d,
+const struct dpaa2_fd *fd,
+int num_frames);
+static int qbman_swp_pull_direct(struct qbman_swp *s,
+struct qbman_pull_desc *d);
+static int qbman_swp_pull_mem_back(struct qbman_swp *s,
+  struct qbman_pull_desc *d);
+
+const struct dpaa2_dq *qbman_swp_dqrr_next_direct(struct qbman_swp *s);
+const struct dpaa2_dq *qbman_swp_dqrr_next_mem_back(struct qbman_swp *s);
+
+static int qbman_swp_release_direct(struct qbman_swp *s,
+   const struct qbman_release_desc *d,
+   const u64 *buffers,
+   unsigned int num_buffers);
+static int qbman_swp_release_mem_back(struct qbman_swp *s,
+ const struct qbman_release_desc *d,
+ const u64 *buffers,
+ unsigned int num_buffers);
+
+/* Function pointers */
+int (*qbman_swp_enqueue_ptr)(struct qbman_swp *s,
+const struct qbman_eq_desc *d,
+const struct dpaa2_fd *fd)
+   = qbman_swp_enqueue_direct;
+
+int (*qbman_swp_enqueue_multiple_ptr)(struct qbman_swp *s,
+ const struct qbman_eq_desc *d,
+ const struct dpaa2_fd *fd,
+ uint32_t *flags,
+int num_frames)
+   = qbman_swp_enqueue_multiple_direct;
+
+int
+(*qbman_swp_enqueue_multiple_desc_ptr)(struct qbman_swp *s,
+  const struct qbman_eq_desc *d,
+  const struct dpaa2_fd *fd,
+  int num_frames)
+   = qbman_swp_enqueue_multiple_desc_direct;
+
+int (*qbman_swp_pull_ptr)(struct qbman_swp *s, struct qbman_pull_desc *d)
+   = qbman_swp_pull_direct;
+
+const struct dpaa2_dq *(*qbman_swp_dqrr_next_ptr)(struct qbman_swp *s)
+   = qbman_swp_dqrr_next_direct;
+
+int (*qbman_swp_release_ptr)(struct qbman_swp *s,
+const struct qbman_release_desc *d,
+const u64 *buffers,
+unsigned int num_buffers)
+   = qbman_swp_release_direct;
+
 /* Portal Access */
 
 static inline u32 qbman_read_register(struct qbman_swp *p, u32 offset)
@@ -218,6 +294,19 @@ struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
 * applied when dequeues from a specific channel are enabled.
 */
qbman_write_register(p, QBMAN_CINH_SWP_SDQCR, 0);
+
+   if ((p->desc->qman_version & QMAN_REV_MASK) >= QMAN_REV_5000) {
+ 

[PATCH 1/3] soc: fsl: dpio: Adding QMAN multiple enqueue interface.

2019-12-12 Thread Youri Querry
Update the QMAN interface to enqueue frames. We now support multiple
enqueue (qbman_swp_enqueue_multiple) and multiple enqueue with
a table of descriptors (qbman_swp_enqueue_multiple_desc).

Signed-off-by: Youri Querry 
---
 drivers/soc/fsl/dpio/dpio-service.c | 69 --
 drivers/soc/fsl/dpio/qbman-portal.c | 83 +++--
 drivers/soc/fsl/dpio/qbman-portal.h | 24 +++
 include/soc/fsl/dpaa2-io.h  |  6 ++-
 4 files changed, 165 insertions(+), 17 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-service.c b/drivers/soc/fsl/dpio/dpio-service.c
index 518a8e0..cd4f641 100644
--- a/drivers/soc/fsl/dpio/dpio-service.c
+++ b/drivers/soc/fsl/dpio/dpio-service.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
 /*
  * Copyright 2014-2016 Freescale Semiconductor Inc.
- * Copyright 2016 NXP
+ * Copyright 2016-2019 NXP
  *
  */
 #include 
@@ -433,6 +433,69 @@ int dpaa2_io_service_enqueue_fq(struct dpaa2_io *d,
 EXPORT_SYMBOL(dpaa2_io_service_enqueue_fq);
 
 /**
+ * dpaa2_io_service_enqueue_multiple_fq() - Enqueue multiple frames
+ * to a frame queue using one fqid.
+ * @d: the given DPIO service.
+ * @fqid: the given frame queue id.
+ * @fd: the frame descriptor which is enqueued.
+ * @nb: number of frames to be enqueued
+ *
+ * Return 0 for successful enqueue, -EBUSY if the enqueue ring is not ready,
+ * or -ENODEV if there is no dpio service.
+ */
+int dpaa2_io_service_enqueue_multiple_fq(struct dpaa2_io *d,
+   u32 fqid,
+   const struct dpaa2_fd *fd,
+   int nb)
+{
+   struct qbman_eq_desc ed;
+
+   d = service_select(d);
+   if (!d)
+   return -ENODEV;
+
+   qbman_eq_desc_clear(&ed);
+   qbman_eq_desc_set_no_orp(&ed, 0);
+   qbman_eq_desc_set_fq(&ed, fqid);
+
+   return qbman_swp_enqueue_multiple(d->swp, &ed, fd, 0, nb);
+}
+EXPORT_SYMBOL(dpaa2_io_service_enqueue_multiple_fq);
+
+/**
+ * dpaa2_io_service_enqueue_multiple_desc_fq() - Enqueue multiple frames
+ * to different frame queues using a list of fqids.
+ * @d: the given DPIO service.
+ * @fqid: the given list of frame queue ids.
+ * @fd: the frame descriptor which is enqueued.
+ * @nb: number of frames to be enqueued
+ *
+ * Return 0 for successful enqueue, -EBUSY if the enqueue ring is not ready,
+ * or -ENODEV if there is no dpio service.
+ */
+int dpaa2_io_service_enqueue_multiple_desc_fq(struct dpaa2_io *d,
+   u32 *fqid,
+   const struct dpaa2_fd *fd,
+   int nb)
+{
+   int i;
+   struct qbman_eq_desc ed[32];
+
+   d = service_select(d);
+   if (!d)
+   return -ENODEV;
+
+   for (i = 0; i < nb; i++) {
+   qbman_eq_desc_clear(&ed[i]);
+   qbman_eq_desc_set_no_orp(&ed[i], 0);
+   qbman_eq_desc_set_fq(&ed[i], fqid[i]);
+   }
+
+   return qbman_swp_enqueue_multiple_desc(d->swp, &ed[0], fd, nb);
+}
+EXPORT_SYMBOL(dpaa2_io_service_enqueue_multiple_desc_fq);
+
+/**
  * dpaa2_io_service_enqueue_qd() - Enqueue a frame to a QD.
  * @d: the given DPIO service.
  * @qdid: the given queuing destination id.
@@ -526,7 +589,7 @@ EXPORT_SYMBOL_GPL(dpaa2_io_service_acquire);
 
 /**
  * dpaa2_io_store_create() - Create the dma memory storage for dequeue result.
- * @max_frames: the maximum number of dequeued result for frames, must be <= 16.
+ * @max_frames: the maximum number of dequeued result for frames, must be <= 32.
  * @dev:the device to allow mapping/unmapping the DMAable region.
  *
  * The size of the storage is "max_frames*sizeof(struct dpaa2_dq)".
@@ -541,7 +604,7 @@ struct dpaa2_io_store *dpaa2_io_store_create(unsigned int max_frames,
struct dpaa2_io_store *ret;
size_t size;
 
-   if (!max_frames || (max_frames > 16))
+   if (!max_frames || (max_frames > 32))
return NULL;
 
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
diff --git a/drivers/soc/fsl/dpio/qbman-portal.c b/drivers/soc/fsl/dpio/qbman-portal.c
index c66f5b7..5a37ac8 100644
--- a/drivers/soc/fsl/dpio/qbman-portal.c
+++ b/drivers/soc/fsl/dpio/qbman-portal.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
 /*
  * Copyright (C) 2014-2016 Freescale Semiconductor, Inc.
- * Copyright 2016 NXP
+ * Copyright 2016-2019 NXP
  *
  */
 
@@ -12,13 +12,6 @@
 
 #include "qbman-portal.h"
 
-#define QMAN_REV_4000   0x0400
-#define QMAN_REV_4100   0x0401
-#define QMAN_REV_4101   0x04010001
-#define QMAN_REV_5000   0x0500
-
-#define QMAN_REV_MASK   0x
-
 /* All QBMan command and result structures use this "valid bit" encoding */
 #define QB_VALID_BIT ((u32)0x80)
 
@@ -156,7 +149,7 @@ static inline u32 qbman_set_swp_cfg(u8 max_fill, u8 wn, u8 est, u8 rpm, u8 dcm,
  */
 struct qbman_swp *qbman_swp_init(const struct qbman_swp_desc *d)
 {
-   struct qbman_swp *p = 

[PATCH 0/3] soc: fsl: dpio: Enable QMAN batch enqueuing

2019-12-12 Thread Youri Querry
This patch set consists of:
- We added an interface to enqueue several packets at a time to
  improve performance.
- Make the algorithm decisions once at initialization and use
  function pointers to improve performance.
- Replaced the QMAN enqueue array mode algorithm with a ring
  mode algorithm. This is to make the enqueue of several frames
  at a time more effective.
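
As a rough illustration of what the bulk interface enables for a caller
(a sketch against the signature added in patch 1/3; illustration only,
error handling elided):

static int example_xmit_burst(struct dpaa2_io *io, u32 fqid,
                              const struct dpaa2_fd *fds, int nb)
{
        /* one call instead of nb calls to dpaa2_io_service_enqueue_fq() */
        return dpaa2_io_service_enqueue_multiple_fq(io, fqid, fds, nb);
}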

Youri Querry (3):
  soc: fsl: dpio: Adding QMAN multiple enqueue interface.
  soc: fsl: dpio: QMAN performance improvement. Function pointer
indirection.
  soc: fsl: dpio: Replace QMAN array mode by ring mode enqueue.

 drivers/soc/fsl/dpio/dpio-service.c |  69 +++-
 drivers/soc/fsl/dpio/qbman-portal.c | 766 
 drivers/soc/fsl/dpio/qbman-portal.h | 158 +++-
 include/soc/fsl/dpaa2-io.h  |   6 +-
 4 files changed, 907 insertions(+), 92 deletions(-)

-- 
2.7.4



Re: [PATCH v5] powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32

2019-12-12 Thread Christophe Leroy




Le 12/12/2019 à 13:52, Christoph Hellwig a écrit :

On Sat, Dec 07, 2019 at 05:20:04PM +, Christophe Leroy wrote:

call_do_irq() and call_do_softirq() are simple enough to be
worth inlining.

Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.
It also allows GCC to keep the saved ksp_limit in a nonvolatile reg.

This is inspired from S390 arch. Several other arches do more or
less the same. The way the sparc arch does it seems odd, though.


Any reason you only do this for 32-bit and not 64-bit as well?



Yes ... There has been a long discussion on this in v4, see 
https://patchwork.ozlabs.org/patch/1174288/


The problem is that on PPC64, the r2 register is used as the TOC pointer and
it is apparently not straightforward to make sure the caller and the callee
are using the same TOC.


On PPC32 it's simpler: r2 is the current task_struct at all times, it
never changes.
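
(For readers unfamiliar with the convention: 32-bit
arch/powerpc/include/asm/current.h pins current to r2 with a global
register variable, roughly

        register struct task_struct *current asm ("r2");

so an inlined call_do_irq()/call_do_softirq() cannot disturb anything
the caller expects in r2, whereas on PPC64 r2 is the TOC pointer and
may legitimately differ between caller and callee.)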


Christophe


Re: [PATCH] powerpc/irq: don't use current_stack_pointer() in do_IRQ()

2019-12-12 Thread Christophe Leroy




Le 12/12/2019 à 13:51, Christoph Hellwig a écrit :

Why can't current_stack_pointer be turned into an inline function using
inline assembly?  That would reduce the overhead for all callers.



In the old days, it was a macro, and it was changed into an assembly 
function by commit 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe9a2cfe91a


It was later renamed from __get_SP() to current_stack_pointer() by 
commit 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=acf620ecf56cfc4edaffaf158250e128539cdd26


But in fact this function is badly named as it doesn't provide the 
current stack pointer but a pointer to the parent's stack frame.


Having it as an extern function forces GCC to set a stack frame in the 
calling function. If inline assembly is used instead, there's a risk of 
not getting a stack frame in the calling function, in which case the 
current_stack_pointer() will return the grandparent's stackframe pointer 
instead of the parent's one.
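
To illustrate the hazard rather than propose a fix, a naive inline
variant (sketch only, not from the original mail) would read r1
directly:

static inline unsigned long current_stack_pointer_naive(void)
{
        unsigned long sp;

        /* copy r1, the PowerPC stack pointer, into sp */
        asm volatile("mr %0, 1" : "=r" (sp));
        return sp;
}

Because nothing forces the caller to establish its own stack frame
here, the frame this points into may belong to the grandparent, which
is exactly the risk described above.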


Christophe


Re: [PATCH v3 1/3] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2019-12-12 Thread Christophe Leroy




Le 12/12/2019 à 16:16, Daniel Axtens a écrit :

powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer compile-time constants, which
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Signed-off-by: Daniel Axtens 


Reviewed-by: Christophe Leroy 


---
  include/linux/kasan.h | 18 +++---
  mm/kasan/init.c   |  6 +++---
  2 files changed, 18 insertions(+), 6 deletions(-)



Christophe


[PATCH v3 3/3] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-12 Thread Daniel Axtens
KASAN support on Book3S is a bit tricky to get right:

 - It would be good to support inline instrumentation so as to be able to
   catch stack issues that cannot be caught with outline mode.

 - Inline instrumentation requires a fixed offset.

 - Book3S runs code in real mode after booting. Most notably a lot of KVM
   runs in real mode, and it would be good to be able to instrument it.

 - Because code runs in real mode after boot, the offset has to point to
   valid memory both in and out of real mode.

   [For those not immersed in ppc64, in real mode, the top nibble or 2 bits
   (depending on radix/hash mmu) of the address is ignored. The linear
   mapping is placed at 0xc000000000000000. This means that a pointer to
   part of the linear mapping will work both in real mode, where it will be
   interpreted as a physical address of the form 0x000..., and out of real
   mode, where it will go via the linear mapping.]
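
   [As further background, generic KASAN maps every address to a shadow
   byte with fixed linear arithmetic; a sketch of the usual calculation
   (an illustration of the generic scheme, not code from this patch; the
   offset value is a made-up example):

   #define KASAN_SHADOW_SCALE_SHIFT 3   /* 1 shadow byte per 8 bytes */
   #define KASAN_SHADOW_OFFSET      0xc00e000000000000UL  /* example */

   static inline void *kasan_mem_to_shadow(const void *addr)
   {
       return (void *)(((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                       + KASAN_SHADOW_OFFSET);
   }

   Inline instrumentation compiles this arithmetic into every access,
   which is why the offset must be a compile-time constant that works
   both in and out of real mode.]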

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything set is up correctly, and the
address-to-shadow calculations can be overridden. However, the features and
speed boost provided by inline instrumentation are worth trying to do
better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can be set
aside for the shadow. This is a big hammer and comes with 3 big
consequences:

 - there's no nice way to handle physically discontiguous memory, so only
   the first physical memory block can be used.

 - kernels will simply fail to boot on machines with less memory than
   specified when compiling.

 - kernels running on machines with more memory than specified when
   compiling will simply ignore the extra memory.

Implement and document KASAN this way. The current implementation is Radix
only.

Despite the limitations, it can still find bugs,
e.g. http://patchwork.ozlabs.org/patch/1103775/

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different implementation
could be added for outline mode that dynamically allocates shadow at a
fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/

Suggested-by: Michael Ellerman 
Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Christophe Leroy  # ppc32 version
Signed-off-by: Daniel Axtens 

---
Changes since v2:

 - Address feedback from Christophe around cleanups and docs.
 - Address feedback from Balbir: at this point I don't have a good solution
   for the issues you identify around the limitations of the inline 
implementation
   but I think that it's worth trying to get the stack instrumentation support.
   I'm happy to have an alternative and more flexible outline mode - I had
   envisioned this would be called 'lightweight' mode as it imposes fewer
restrictions.
   I've linked to your implementation. I think it's best to add it in a 
follow-up series.
 - Made the default PHYS_MEM_SIZE_FOR_KASAN value 1024MB. I think most people 
have
   guests with at least that much memory in the Radix 64s case so it's a much
   saner default - it means that if you just turn on KASAN without reading the
   docs you're much more likely to have a bootable kernel, which you will never
   have if the value is set to zero! I'm happy to bikeshed the value if we want.

Changes since v1:
 - Landed kasan vmalloc support upstream
 - Lots of feedback from Christophe.

Changes since the rfc:

 - Boots real and virtual hardware, kvm works.

 - disabled reporting when we're checking the stack for exception
   frames. The behaviour isn't wrong, just incompatible with KASAN.

 - Documentation!

 - Dropped old module stuff in favour of KASAN_VMALLOC.

The bugs with ftrace and kuap were due to kernel bloat pushing
prom_init calls to be done via the plt. Because we did not have
a relocatable kernel, and they are done very early, this caused
everything to explode. Compile with CONFIG_RELOCATABLE!
---
 Documentation/dev-tools/kasan.rst |   8 +-
 Documentation/powerpc/kasan.txt   | 112 +-
 arch/powerpc/Kconfig  |   3 +
 arch/powerpc/Kconfig.debug|  21 
 arch/powerpc/Makefile |  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h |   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   7 ++
 arch/powerpc/include/asm/book3s/64/radix.h|   5 +
 arch/powerpc/include/asm/kasan.h  |  21 +++-
 arch/powerpc/kernel/process.c |   8 ++
 arch/powerpc/kernel/prom.c|  64 +-
 arch/powerpc/mm/kasan/Makefile|   3 +-
 .../mm/kasan/{kasan_init_32.c => init_32.c}   |   0
 arch/powerpc/mm/kasan/init_book3s_64.c|  72 +++
 14 files changed, 330 insertions(+), 9 deletions(-)
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)
 create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c

[PATCH v3 2/3] kasan: Document support on 32-bit powerpc

2019-12-12 Thread Daniel Axtens
KASAN is supported on 32-bit powerpc and the docs should reflect this.

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 Documentation/dev-tools/kasan.rst |  3 ++-
 Documentation/powerpc/kasan.txt   | 12 
 2 files changed, 14 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst 
b/Documentation/dev-tools/kasan.rst
index e4d66e7c50de..4af2b5d2c9b4 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -22,7 +22,8 @@ global variables yet.
 Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa and s390
-architectures, and tag-based KASAN is supported only for arm64.
+architectures. It is also supported on 32-bit powerpc kernels. Tag-based KASAN
+is supported only on arm64.
 
 Usage
 -
diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..a85ce2ff8244
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is not currently supported, but modules
+are.
-- 
2.20.1



[PATCH v3 0/3] KASAN for powerpc64 radix

2019-12-12 Thread Daniel Axtens
Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU.

This provides full inline instrumentation on radix, but does require
that you be able to specify the amount of physically contiguous memory
on the system at compile time. More details in patch 3.

v3: Reduce the overly ambitious scope of the MAX_PTRS change.
Document more things, including around why some of the
restrictions apply.
Clean up the code more, thanks Christophe.

v2: The big change is the introduction of tree-wide(ish)
MAX_PTRS_PER_{PTE,PMD,PUD} macros in preference to the previous
approach, which was for the arch to override the page table array
definitions with their own. (And I squashed the annoying
intermittent crash!)

Apart from that there's just a lot of cleanup. Christophe, I've
addressed most of what you asked for and I will reply to your v1
emails to clarify what remains unchanged.


Daniel Axtens (3):
  kasan: define and use MAX_PTRS_PER_* for early shadow tables
  kasan: Document support on 32-bit powerpc
  powerpc: Book3S 64-bit "heavyweight" KASAN support

 Documentation/dev-tools/kasan.rst |   7 +-
 Documentation/powerpc/kasan.txt   | 122 ++
 arch/powerpc/Kconfig  |   3 +
 arch/powerpc/Kconfig.debug|  21 +++
 arch/powerpc/Makefile |  11 ++
 arch/powerpc/include/asm/book3s/64/hash.h |   4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |   7 +
 arch/powerpc/include/asm/book3s/64/radix.h|   5 +
 arch/powerpc/include/asm/kasan.h  |  21 ++-
 arch/powerpc/kernel/process.c |   8 ++
 arch/powerpc/kernel/prom.c|  64 -
 arch/powerpc/mm/kasan/Makefile|   3 +-
 .../mm/kasan/{kasan_init_32.c => init_32.c}   |   0
 arch/powerpc/mm/kasan/init_book3s_64.c|  72 +++
 include/linux/kasan.h |  18 ++-
 mm/kasan/init.c   |   6 +-
 16 files changed, 359 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)
 create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c

-- 
2.20.1



[PATCH v3 1/3] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2019-12-12 Thread Daniel Axtens
powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer compile-time constants, which
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Signed-off-by: Daniel Axtens 
---
 include/linux/kasan.h | 18 +++---
 mm/kasan/init.c   |  6 +++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index e18fe54969e9..70865810d0e7 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -14,10 +14,22 @@ struct task_struct;
 #include <asm/kasan.h>
 #include <asm/pgtable.h>
 
+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
 
 int kasan_populate_early_shadow(const void *shadow_start,
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491ebcd..8b54a96d3b3e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
 {
-- 
2.20.1



Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Segher Boessenkool
Hi,

On Thu, Dec 12, 2019 at 04:42:13PM +1100, Michael Ellerman wrote:
> Some of the generic versions don't generate good code compared to our
> versions, but that's because READ_ONCE() is triggering stack protector
> to be enabled.

The *big* difference is the generic code has a special path that does not
do an atomic access at all.  Either that is a good idea or not, but we
probably should not change the behaviour here, not without benchmarking
anyway.

> For example, comparing an out-of-line copy of the generic and ppc
> versions of test_and_set_bit_lock():

(With what GCC version, and what exact flags?)

(A stand-alone testcase would be nice too, btw).

(Michael gave me one, thanks!)

> If you squint, the generated code for the actual logic is pretty similar, but
> the stack protector gunk makes a big mess.

And with stack protector it cannot shrink-wrap the exit, one of the bigger
performance costs of the stack protector.  The extra branch in the generic
code isn't fun either (but maybe it is good for performance?)

> It's particularly bad here
> because the ppc version doesn't even need a stack frame.

You are hit by this:

  if (... || (RECORD_OR_UNION_TYPE_P (var_type)
  && record_or_union_type_has_array_p (var_type)) ...)

(in the GCC code, stack_protect_decl_p (), cfgexpand.c)

for the variable __u from

#define __READ_ONCE(x, check)   \
({  \
union { typeof(x) __val; char __c[1]; } __u;\
__read_once_size(&(x), __u.__c, sizeof(x)); \
smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
__u.__val;  \
})

This is all optimised away later, but at the point this decision is made
GCC does not know that.
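
A stand-alone reproducer of the heuristic (a sketch, not from the
original mail; try gcc -O2 -fstack-protector-strong -S): the char-array
member is what trips stack_protect_decl_p(), even though the union is
later optimised away, while the array-free version stays canary-free.

unsigned long with_union(unsigned long *p)
{
        union { unsigned long __val; char __c[1]; } __u;

        __u.__val = *(volatile unsigned long *)p;
        return __u.__val;
}

unsigned long plain_read(unsigned long *p)
{
        return *(volatile unsigned long *)p;
}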

> So READ_ONCE() + STACKPROTECTOR_STRONG is problematic. The root cause is
> presumably that READ_ONCE() does an access to an on-stack variable,
> which triggers the heuristics in the compiler that the stack needs
> protecting.

Not exactly, but the problem is READ_ONCE alright.

> It seems like a compiler "mis-feature" that a constant-sized access to the stack
> triggers the stack protector logic, especially when the access is eventually
> optimised away. But I guess that's probably what we get for doing tricks like
> READ_ONCE() in the first place :/

__c is an array.  That is all that matters.  I don't think it is very
reasonable to fault GCC for this.

> I tried going back to the version of READ_ONCE() that doesn't use a
> union, ie. effectively reverting dd36929720f4 ("kernel: make READ_ONCE()
> valid on const arguments") to get:
> 
> #define READ_ONCE(x)  \
>   ({ typeof(x) __val; __read_once_size(&(x), &__val, sizeof(__val)); __val; })

With that, it is that the address of __val is taken:

  ...
  || TREE_ADDRESSABLE (var)
  ...

> But it makes no difference, the stack protector stuff still triggers. So
> I guess it's simply taking the address of a stack variable that triggers
> it.

Not in the earlier testcase.  Btw, there is no such thing as a "stack
variable" at that point in the compiler: it just is a local var.

> There seems to be a function attribute to enable stack protector for a
> function, but not one to disable it:
>   
> https://gcc.gnu.org/onlinedocs/gcc-9.2.0/gcc/Common-Function-Attributes.html#index-stack_005fprotect-function-attribute

Yes.

> That may not be a good solution even if it did exist, because it would
> potentially disable stack protector in places where we do want it
> enabled.

Right, I don't think we want that, such an attribute invites people to
write dangerous code.  (You already can just put the functions that you
want to be unsafe in a separate source file...  It sounds even sillier
that way, heh).


Segher


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-12 Thread Daniel Axtens
Hi Christophe,

I think I've covered everything you've mentioned in the v3 I'm about to
send, except for:

>> +/* mark early shadow region as RO and wipe */
>> +pte = __pte(__pa(kasan_early_shadow_page) |
>> +pgprot_val(PAGE_KERNEL_RO) | _PAGE_PTE);
>
> Any reason for _PAGE_PTE being required here and not being included in 
> PAGE_KERNEL_RO ?

I'm not 100% sure quite what you mean here. I think you're asking: why
do we need to supply _PAGE_PTE here, shouldn't PAGE_KERNEL_RO set that
bit or cover that case?

_PAGE_PTE is defined by section 5.7.10.2 of Book III of ISA 3.0: bit 1
(linux bit 62) is 'Leaf (entry is a PTE)'. I originally had this because
it was set in Balbir's original implementation, but the bit is also set
by pte_mkpte which is called in set_pte_at, so I also think it's right
to set it.

I don't know why it's not included in the permission classes; I suspect
it's because it's not conceptually a permission, it's set and cleared in
things like swp entry code.

Does that answer your question?

Regards,
Daniel


Re: [PATCH V2 08/13] powerpc/vas: Update CSB and notify process for fault CRBs

2019-12-12 Thread Christoph Hellwig
On Sun, Dec 08, 2019 at 07:33:37PM -0800, Haren Myneni wrote:
> +static void notify_process(pid_t pid, u64 fault_addr)
> +{
> + int rc;
> + struct kernel_siginfo info;
> +
> + memset(&info, 0, sizeof(info));
> +
> + info.si_signo = SIGSEGV;
> + info.si_errno = EFAULT;
> + info.si_code = SEGV_MAPERR;
> + info.si_addr = (void *)fault_addr;
> + /*
> +  * process will be polling on csb.flags after request is sent to
> +  * NX. So generally CSB update should not fail except when an
> +  * application does not follow the process properly. So an error
> +  * message will be displayed and leave it to user space whether
> +  * to ignore or handle this signal.
> +  */
> + rcu_read_lock();
> + rc = kill_pid_info(SIGSEGV, &info, find_vpid(pid));
> + rcu_read_unlock();
> +
> + pr_devel("%s(): pid %d kill_proc_info() rc %d\n", __func__, pid, rc);
> +}

I think you want to pass in the struct pid * here instead of looking
up again, given that..

> + if (tsk) {
> + if (tsk->flags & PF_EXITING)
> + task_exit = 1;
> + put_task_struct(tsk);
> + pid = vas_window_pid(window);

We already have the struct pid in the window structure here.

> + } else {
> + pid = window->tgid;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(pid);
> + if (!tsk) {

.. and could have easily stored it here.  Or at least only do the
look up once, given that the code above already looks it up.

> + /* Do not notify if the task is exiting. */
> + if (!task_exit) {
> + pr_err("Invalid CSB address 0x%p signalling pid(%d)\n",
> + csb_addr, pid);
> + notify_process(pid, (u64)csb_addr);
> + }

I suspect inlining notify_process and just exiting early for the
task_exit case also makes the code a bit easier to follow.
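
Roughly this shape (a sketch of the suggestion, names hypothetical, not
a tested patch):

static void maybe_notify_process(struct pid *pid, bool task_exit,
                                 void *csb_addr, u64 fault_addr)
{
        struct kernel_siginfo info;

        if (task_exit)
                return;

        pr_err("Invalid CSB address 0x%p signalling pid(%d)\n",
               csb_addr, pid_vnr(pid));

        memset(&info, 0, sizeof(info));
        info.si_signo = SIGSEGV;
        info.si_errno = EFAULT;
        info.si_code = SEGV_MAPERR;
        info.si_addr = (void *)fault_addr;

        rcu_read_lock();
        kill_pid_info(SIGSEGV, &info, pid);
        rcu_read_unlock();
}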


Re: [PATCH V2 07/13] powerpc/vas: Take reference to PID and mm for user space windows

2019-12-12 Thread Christoph Hellwig
> + if (txwin->user_win) {
> + /*
> +  * Window opened by child thread may not be closed when
> +  * it exits. So take reference to its pid and release it
> +  * when the window is free by parent thread.
> +  * Acquire a reference to the task's pid to make sure
> +  * pid will not be re-used.
> +  */
> + txwin->pid = get_task_pid(current, PIDTYPE_PID);
> + /*
> +  * Acquire a reference to the task's mm.
> +  */
> + txwin->mm = get_task_mm(current);
> +
> + if (txwin->mm) {
> + mmput(txwin->mm);
> + mmgrab(txwin->mm);

Doesn't the mmgrab need to come before the mmput?

> + mm_context_add_copro(txwin->mm);
> + } else {
> + put_pid(txwin->pid);
> + pr_err("VAS: pid(%d): mm_struct is not found\n",
> + current->pid);
> + rc = -EPERM;
> + goto free_window;
> + }

Also the code is much easier to follow if you handle the error
first and avoid the else:

txwin->mm = get_task_mm(current);
if (!txwin->mm) {
put_pid(txwin->pid);
pr_err("VAS: pid(%d): mm_struct is not found\n",
current->pid);
rc = -EPERM;
goto free_window;
}
mmgrab(txwin->mm);
mmput(txwin->mm);

Also don't you need to take a reference to the struct pid for the
tgid as well?


Re: [PATCH V2 05/13] powerpc/vas: Setup thread IRQ handler per VAS instance

2019-12-12 Thread Christoph Hellwig
On Sun, Dec 08, 2019 at 07:31:24PM -0800, Haren Myneni wrote:
> 
> Setup thread IRQ handler per each VAS instance. When NX sees a fault
> on CRB, kernel gets an interrupt and vas_fault_handler will be
> executed to process fault CRBs. Read all valid CRBs from fault FIFO,
> determine the corresponding send window from CRB and process fault
> requests.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> Signed-off-by: Haren Myneni 
> ---
>  arch/powerpc/platforms/powernv/vas-fault.c  | 83 +
>  arch/powerpc/platforms/powernv/vas-window.c | 60 +
>  arch/powerpc/platforms/powernv/vas.c| 15 +-
>  arch/powerpc/platforms/powernv/vas.h|  4 ++
>  4 files changed, 161 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
> b/arch/powerpc/platforms/powernv/vas-fault.c
> index b0258ed..e1e34c6 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "vas.h"
> @@ -25,6 +26,88 @@
>  #define VAS_FAULT_WIN_FIFO_SIZE  (4 << 20)
>  
>  /*
> + * Process CRBs that we receive on the fault window.
> + */
> +irqreturn_t vas_fault_handler(int irq, void *data)
> +{
> + struct vas_instance *vinst = (struct vas_instance *)data;

No need for the cast.
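
That is, simply:

	struct vas_instance *vinst = data;

since a void pointer converts implicitly in C.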

> + crb = (struct coprocessor_request_block *)fifo;

Or this one.

> + if (vinst->fault_crbs == vinst->fault_fifo_size/CRB_SIZE)

Missing whitespace before and after the /
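
That is:

	if (vinst->fault_crbs == vinst->fault_fifo_size / CRB_SIZE)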

> + rc = request_threaded_irq(vinst->virq, NULL, vas_fault_handler,
> + IRQF_ONESHOT, devname, vinst);
> + if (rc) {
> + pr_err("VAS[%d]: Request IRQ(%d) failed with %d\n",
> + vinst->vas_id, vinst->virq, rc);
> + } else {
> + rc = vas_setup_fault_window(vinst);
> + if (rc)
> + free_irq(vinst->virq, vinst);
> + }
>  
>   return rc;

This would be a tad cleaner with proper goto unwinding:

	rc = request_threaded_irq(vinst->virq, NULL, vas_fault_handler,
				  IRQF_ONESHOT, devname, vinst);
	if (rc) {
		pr_err("VAS[%d]: Request IRQ(%d) failed with %d\n",
		       vinst->vas_id, vinst->virq, rc);
		goto out;
	}

	rc = vas_setup_fault_window(vinst);
	if (rc)
		goto out_free_irq;

	return 0;

out_free_irq:
	free_irq(vinst->virq, vinst);
out:
	return rc;


Re: [PATCH V2 04/13] powerpc/vas: Setup fault window per VAS instance

2019-12-12 Thread Christoph Hellwig
On Sun, Dec 08, 2019 at 07:30:33PM -0800, Haren Myneni wrote:
> +static int vas_irq_fault_window_setup(struct vas_instance *vinst)
> +{
> + int rc = 0;
> +
> + rc = vas_setup_fault_window(vinst);
> +
> + return rc;
> +}

No real need for the local variable here.
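
That is, it can collapse to:

	static int vas_irq_fault_window_setup(struct vas_instance *vinst)
	{
		return vas_setup_fault_window(vinst);
	}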


Re: [PATCH v5] powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32

2019-12-12 Thread Christoph Hellwig
On Sat, Dec 07, 2019 at 05:20:04PM +, Christophe Leroy wrote:
> call_do_irq() and call_do_softirq() are simple enough to be
> worth inlining.
> 
> Inlining them avoids an mflr/mtlr pair plus a save/reload on stack.
> It also allows GCC to keep the saved ksp_limit in a nonvolatile reg.
> 
> This is inspired by the s390 arch. Several other arches do more or
> less the same. The way the sparc arch does it seems odd though.

Any reason you only do this for 32-bit and not 64-bit as well?


Re: [PATCH] powerpc/irq: don't use current_stack_pointer() in do_IRQ()

2019-12-12 Thread Christoph Hellwig
Why can't current_stack_pointer be turned into an inline function using
inline assembly?  That would reduce the overhead for all callers.
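
For the record, a minimal sketch of what such a helper could look like
(hypothetical, not from any posted patch; r1 is the stack pointer register
in the powerpc ABIs):

	static __always_inline unsigned long current_stack_pointer(void)
	{
		unsigned long sp;

		/* copy r1 (the stack pointer) into sp */
		asm volatile("mr %0, 1" : "=r" (sp));
		return sp;
	}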


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Peter Zijlstra
On Thu, Dec 12, 2019 at 10:07:56AM +, Will Deacon wrote:

> > So your proposed change _should_ be fine. Will, I'm assuming you never
> > saw this on your ARGH64 builds when you did this code ?
> 
> I did see it, but (a) looking at the code out-of-line makes it look a lot
> worse than it actually is (so the ext4 example is really helpful -- thanks
> Michael!) and (b) I chalked it up to a crappy compiler.
> 
> However, see this comment from Arnd on my READ_ONCE series from the other
> day:
> 
> https://lore.kernel.org/lkml/CAK8P3a0f=wvsqsbq4t0fmekcfe_mc3oarxaetvitsksa-d2...@mail.gmail.com
> 
> In which case, I'm thinking that we should be doing better in READ_ONCE()
> for non-buggy compilers which would also keep the KCSAN folks happy for this
> code (and would help with [1] too).

So something like this then? Although I suppose that should be moved
into compiler-gcc.h and then guarded by #ifndef READ_ONCE or so.

---

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index ad8c76144a3c..8326e2cf28b4 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -179,20 +179,8 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
int val,
 
 #include 
 #include 
-
-#define __READ_ONCE_SIZE   \
-({ \
-   switch (size) { \
-   case 1: *(__u8 *)res = *(volatile __u8 *)p; break;  \
-   case 2: *(__u16 *)res = *(volatile __u16 *)p; break;\
-   case 4: *(__u32 *)res = *(volatile __u32 *)p; break;\
-   case 8: *(__u64 *)res = *(volatile __u64 *)p; break;\
-   default:\
-   barrier();  \
-   __builtin_memcpy((void *)res, (const void *)p, size);   \
-   barrier();  \
-   }   \
-})
+#include <asm/barrier.h>
+#include <linux/kasan-checks.h>
 
 #ifdef CONFIG_KASAN
 /*
@@ -222,6 +210,22 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
int val,
 #define __no_sanitize_or_inline __always_inline
 #endif
 
+#if GCC_VERSION < 40800
+
+#define __READ_ONCE_SIZE   \
+({ \
+   switch (size) { \
+   case 1: *(__u8 *)res = *(volatile __u8 *)p; break;  \
+   case 2: *(__u16 *)res = *(volatile __u16 *)p; break;\
+   case 4: *(__u32 *)res = *(volatile __u32 *)p; break;\
+   case 8: *(__u64 *)res = *(volatile __u64 *)p; break;\
+   default:\
+   barrier();  \
+   __builtin_memcpy((void *)res, (const void *)p, size);   \
+   barrier();  \
+   }   \
+})
+
 static __no_kcsan_or_inline
 void __read_once_size(const volatile void *p, void *res, int size)
 {
@@ -274,9 +278,6 @@ void __write_once_size(volatile void *p, void *res, int 
size)
  * with an explicit memory barrier or atomic instruction that provides the
  * required ordering.
  */
-#include <asm/barrier.h>
-#include <linux/kasan-checks.h>
-
 #define __READ_ONCE(x, check)  \
 ({ \
union { typeof(x) __val; char __c[1]; } __u;\
@@ -295,6 +296,23 @@ void __write_once_size(volatile void *p, void *res, int 
size)
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
 
+#else /* GCC_VERSION < 40800 */
+
+#define READ_ONCE_NOCHECK(x)					\
+({								\
+	typeof(x) __x = *(volatile typeof(x) *)&(x);		\
+	smp_read_barrier_depends();				\
+	__x;							\
+})
+
+#define READ_ONCE(x)   \
+({ \
+   kcsan_check_atomic_read(&(x), sizeof(x));   \
+   READ_ONCE_NOCHECK(x);   \
+})
+
+#endif /* GCC_VERSION < 40800 */
+
 static __no_kasan_or_inline
 unsigned long read_word_at_a_time(const void *addr)
 {


Re: [RESEND PATCH v5 5/5] Documentation/vmcoreinfo: Add documentation for 'TCR_EL1.T1SZ'

2019-12-12 Thread James Morse
Hi Bhupesh,

On 29/11/2019 19:59, Bhupesh Sharma wrote:
> Add documentation for TCR_EL1.T1SZ variable being added to
> vmcoreinfo.
> 
> It indicates the size offset of the memory region addressed by TTBR1_EL1

> and hence can be used for determining the vabits_actual value.

used for determining random-internal-kernel-variable, that might not exist 
tomorrow.

Could you describe how this is useful/necessary if a debugger wants to walk
the page tables from the core file? I think this is a better argument.

Wouldn't the documentation be better as part of the patch that adds the export?
(... unless these have to go via different trees? ..)


> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
> b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 447b64314f56..f9349f9d3345 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -398,6 +398,12 @@ KERNELOFFSET
>  The kernel randomization offset. Used to compute the page offset. If
>  KASLR is disabled, this value is zero.
>  
> +TCR_EL1.T1SZ
> +------------
> +
> +Indicates the size offset of the memory region addressed by TTBR1_EL1

> +and hence can be used for determining the vabits_actual value.

'vabits_actual' may not exist when the next person comes to read this
documentation (it's going to rot really quickly).

I think the first half of this text is enough to say what this is for. You
should include words to the effect that it's the hardware value that goes
with swapper_pg_dir. You may want to point readers to the Arm ARM for more
details on what the value means.


Thanks,

James


Re: [RESEND PATCH v5 2/5] arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo

2019-12-12 Thread James Morse
Hi Bhupesh,

On 29/11/2019 19:59, Bhupesh Sharma wrote:
> vabits_actual variable on arm64 indicates the actual VA space size,
> and allows a single binary to support both 48-bit and 52-bit VA
> spaces.
> 
> If the ARMv8.2-LVA optional feature is present, and we are running
> with a 64KB page size; then it is possible to use 52-bits of address
> space for both userspace and kernel addresses. However, any kernel
> binary that supports 52-bit must also be able to fall back to 48-bit
> at early boot time if the hardware feature is not present.
> 
> Since TCR_EL1.T1SZ indicates the size offset of the memory region
> addressed by TTBR1_EL1 (and hence can be used for determining the
> vabits_actual value) it makes more sense to export the same in
> vmcoreinfo rather than vabits_actual variable, as the name of the
> variable can change in future kernel versions, but the architectural
> constructs like TCR_EL1.T1SZ can be used better to indicate intended
> specific fields to user-space.
> 
> User-space utilities like makedumpfile and crash-utility, need to
> read/write this value from/to vmcoreinfo

(write?)

> for determining if a virtual address lies in the linear map range.

I think this is a fragile example. The debugger shouldn't need to know this.


> The user-space computation for determining whether an address lies in
> the linear map range is the same as we have in kernel-space:
> 
>   #define __is_lm_address(addr)   (!(((u64)addr) & BIT(vabits_actual - 1)))

This was changed with 14c127c957c1 ("arm64: mm: Flip kernel VA space"). If
user-space tools rely on 'knowing' the kernel memory layout, they will
constantly have to be fixed and updated. This is a poor argument for adding
this to something that ends up as ABI.


I think a better argument is walking the kernel page tables from the core dump.
Core code's vmcoreinfo exports the location of the kernel page tables, but in
the example above you can't walk them without knowing how T1SZ was configured.

On older kernels, user-space that needs this would have to assume the value it
computes from VA_BITs (also in vmcoreinfo) is the value in use.
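
(For reference, the relationship is architectural: the VA width of the
TTBR1 region is 64 - T1SZ, so user-space can derive it as, e.g.,
vabits = 64 - tcr_el1_t1sz, without depending on any kernel-internal
variable name.)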


---%<---
> I have sent out user-space patches for makedumpfile and crash-utility
> to add features for obtaining vabits_actual value from TCR_EL1.T1SZ (see
> [0] and [1]).
> 
> Akashi reported that he was able to use this patchset and the user-space
> changes to get user-space working fine with the 52-bit kernel VA
> changes (see [2]).
> 
> [0]. http://lists.infradead.org/pipermail/kexec/2019-November/023966.html
> [1]. http://lists.infradead.org/pipermail/kexec/2019-November/024006.html
> [2]. http://lists.infradead.org/pipermail/kexec/2019-November/023992.html
---%<---

This probably belongs in the cover letter instead of the commit log.

(From-memory: one of vmcore/kcore is virtually addressed, the other physically.
Does this fix your problem in both cases?)


> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index ca4c3e12d8c5..f78310ba65ea 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -7,6 +7,13 @@
>  #include 
>  #include 

You need to include asm/sysreg.h for read_sysreg(), and asm/pgtable-hwdef.h for 
the macros
you added.


> +static inline u64 get_tcr_el1_t1sz(void);

Why do you need to do this?


> +static inline u64 get_tcr_el1_t1sz(void)
> +{
> + return (read_sysreg(tcr_el1) & TCR_T1SZ_MASK) >> TCR_T1SZ_OFFSET;
> +}

(We don't modify this one, and it's always the same on every CPU, so this is
fine. This function is only called once, when the stringy vmcoreinfo elf_note
is created...)


>  void arch_crash_save_vmcoreinfo(void)
>  {
>   VMCOREINFO_NUMBER(VA_BITS);
> @@ -15,5 +22,7 @@ void arch_crash_save_vmcoreinfo(void)
>   kimage_voffset);
>   vmcoreinfo_append_str("NUMBER(PHYS_OFFSET)=0x%llx\n",
>   PHYS_OFFSET);
> + vmcoreinfo_append_str("NUMBER(tcr_el1_t1sz)=0x%llx\n",
> + get_tcr_el1_t1sz());

You document the name as being upper-case.
The two values either side are upper-case.
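
Presumably, then, something like:

	vmcoreinfo_append_str("NUMBER(TCR_EL1_T1SZ)=0x%llx\n",
			      get_tcr_el1_t1sz());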


>   vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
>  }


Thanks,

James


Re: [PATCH v10 23/25] mm/gup: track FOLL_PIN pages

2019-12-12 Thread Jan Kara
On Thu 12-12-19 00:19:15, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Thanks for the patch. As a side note, given this series is rather big, it
may be better to send just individual updated patches (as replies to the
review comments) instead of resending the whole series every time. And then
you can resend the whole series once enough changes accumulate or we reach
seemingly final state.  That way people don't have to crawl through lots of
uninteresting emails...  Just something to keep in mind for the future.

I've spotted just one issue in this patch (see below), the rest are just
small style nits.

> +#define page_ref_zero_or_close_to_bias_overflow(page) \
> + ((unsigned int) page_ref_count(page) + \
> + GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)
> +

...

> +/**
> + * page_dma_pinned() - report if a page is pinned for DMA.
> + *
> + * This function checks if a page has been pinned via a call to
> + * pin_user_pages*().
> + *
> + * The return value is partially fuzzy: false is not fuzzy, because it means
> + * "definitely not pinned for DMA", but true means "probably pinned for DMA, 
> but
> + * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS 
> worth
> + * of normal page references".
> + *
> + * False positives are OK, because: a) it's unlikely for a page to get that 
> many
> + * refcounts, and b) all the callers of this routine are expected to be able 
> to
> + * deal gracefully with a false positive.
> + *
> + * For more information, please see Documentation/vm/pin_user_pages.rst.
> + *
> + * @page:pointer to page to be queried.
> + * @Return:  True, if it is likely that the page has been "dma-pinned".
> + *   False, if the page is definitely not dma-pinned.
> + */
> +static inline bool page_dma_pinned(struct page *page)
> +{
> + return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +

I realized one thing WRT handling of page refcount overflow: Page refcount is
signed and e.g. try_get_page() fails once the refcount is negative. That
means that:

a) page_ref_zero_or_close_to_bias_overflow() is not necessary - special
overflow checks at the places that use pinning (i.e., advance the refcount
by GUP_PIN_COUNTING_BIAS) are not necessary; we should just rely on the
check for a negative value, for consistency.

b) page_dma_pinned() has to be careful and cast page_ref_count() to an
unsigned type for the comparison, as otherwise an overflowed refcount would
suddenly appear as not-pinned.
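
I.e., roughly (a sketch of the suggested fix, not the code as merged):

	static inline bool page_dma_pinned(struct page *page)
	{
		/*
		 * Cast to unsigned so an overflowed (negative) refcount
		 * still reads as pinned, rather than comparing as a small
		 * signed value.
		 */
		return ((unsigned int)page_ref_count(compound_head(page))) >=
			GUP_PIN_COUNTING_BIAS;
	}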

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:pointer to page to be marked
> + * @Return:  the compound head page, with ref appropriately incremented,
> + * or NULL upon failure.
> + */
> +__must_check struct page *try_pin_compound_head(struct page *page, int refs)
> +{
> + struct page *head = try_get_compound_head(page,
> +   GUP_PIN_COUNTING_BIAS * refs);
> + if (!head)
> + return NULL;
> +
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> + return head;
> +}
> +
> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
> + * same time. (That's true throughout the get_user_pages*() and
> + * pin_user_pages*() APIs.) Cases:
> + *
> + *   FOLL_GET: page's refcount will be incremented by 1.
> + *  FOLL_PIN: page's refcount will be incremented by 
> GUP_PIN_COUNTING_BIAS.

Some tab vs space issue here... Generally we don't use tabs inside comments
for indenting so I'd vote for using just spaces.

> + *
> + * Return: head page (with 

Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Will Deacon
On Thu, Dec 12, 2019 at 09:01:05AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 12, 2019 at 04:42:13PM +1100, Michael Ellerman wrote:
> > Peter Zijlstra  writes:
> > > On Fri, Dec 06, 2019 at 11:46:11PM +1100, Michael Ellerman wrote:
> > Some of the generic versions don't generate good code compared to our
> > versions, but that's because READ_ONCE() is triggering stack protector
> > to be enabled.
> 
> Bah, there's never anything simple, is there :/
> 
> > For example, comparing an out-of-line copy of the generic and ppc
> > versions of test_and_set_bit_lock():
> > 
> >1 :   1 
> > :
> >2 addis   r2,r12,361
> >3 addir2,r2,-4240
> >4 stdur1,-48(r1)
> >5 rlwinm  r8,r3,29,3,28
> >6 clrlwi  r10,r3,26   2 rldicl  
> > r10,r3,58,6
> >7 ld  r9,3320(r13)
> >8 std r9,40(r1)
> >9 li  r9,0
> >   10 li  r9,13 li  r9,1
> >  4 clrlwi  r3,r3,26
> >  5 rldicr  
> > r10,r10,3,60
> >   11 sld r9,r9,r10   6 sld r3,r9,r3
> >   12 add r10,r4,r8   7 add r4,r4,r10
> >   13 ldx r8,r4,r8
> >   14 and.r8,r9,r8
> >   15 bne 34f
> >   16 ldarx   r7,0,r108 ldarx   r9,0,r4,1
> >   17 or  r8,r9,r79 or  r10,r9,r3
> >   18 stdcx.  r8,0,r10   10 stdcx.  r10,0,r4
> >   19 bne-16b11 bne-8b
> >   20 isync  12 isync
> >   21 and r9,r7,r9   13 and r3,r3,r9
> >   22 addic   r7,r9,-1   14 addic   r9,r3,-1
> >   23 subfe   r7,r7,r9   15 subfe   r3,r9,r3
> >   24 ld  r9,40(r1)
> >   25 ld  r10,3320(r13)
> >   26 xor.r9,r9,r10
> >   27 li  r10,0
> >   28 mr  r3,r7
> >   29 bne 36f
> >   30 addir1,r1,48
> >   31 blr16 blr
> >   32 nop
> >   33 nop
> >   34 li  r7,1
> >   35 b   24b
> >   36 mflrr0
> >   37 std r0,64(r1)
> >   38 bl  <__stack_chk_fail+0x8>
> > 
> > 
> > If you squint, the generated code for the actual logic is pretty similar, 
> > but
> > the stack protector gunk makes a big mess. It's particularly bad here
> > because the ppc version doesn't even need a stack frame.
> > 
> > I've also confirmed that even when test_and_set_bit_lock() is inlined
> > into an actual call site the stack protector logic still triggers.
> 
> > If I change the READ_ONCE() in test_and_set_bit_lock():
> > 
> > if (READ_ONCE(*p) & mask)
> > return 1;
> > 
> > to a regular pointer access:
> > 
> > if (*p & mask)
> > return 1;
> > 
> > Then the generated code looks more or less the same, except for the extra 
> > early
> > return in the generic version of test_and_set_bit_lock(), and different 
> > handling
> > of the return code by the compiler.
> 
> So given that the function signature is:
> 
> static inline int test_and_set_bit_lock(unsigned int nr,
>   volatile unsigned long *p)
> 
> @p already carries the required volatile qualifier, so READ_ONCE() does
> not add anything here (except for easier to read code and poor code
> generation).
> 
> So your proposed change _should_ be fine. Will, I'm assuming you never
> saw this on your ARGH64 builds when you did this code ?

I did see it, but (a) looking at the code out-of-line makes it look a lot
worse than it actually is (so the ext4 example is really helpful -- thanks
Michael!) and (b) I chalked it up to a crappy compiler.

However, see this comment from Arnd on my READ_ONCE series from the other
day:

https://lore.kernel.org/lkml/CAK8P3a0f=wvsqsbq4t0fmekcfe_mc3oarxaetvitsksa-d2...@mail.gmail.com

In which case, I'm thinking that we should be doing better in READ_ONCE()
for non-buggy compilers which would also keep the KCSAN folks happy for this
code (and would help with [1] too).

Will

[1] https://lkml.org/lkml/2019/11/12/898


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-12 Thread Andrey Ryabinin
On 12/11/19 5:24 PM, Daniel Axtens wrote:
> Hi Balbir,
> 
> +Discontiguous memory can occur when you have a machine with memory spread
> +across multiple nodes. For example, on a Talos II with 64GB of RAM:
> +
> + - 32GB runs from 0x0 to 0x0000_0008_0000_0000,
> + - then there's a gap,
> + - then the final 32GB runs from 0x0000_2000_0000_0000 to 0x0000_2008_0000_0000
> +
> +This can create _significant_ issues:
> +
> + - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
> +   assume that ran from 0x0 to 0x0000_0010_0000_0000. We'd then reserve the
> +   last 1/8th - 0x0000_000e_0000_0000 to 0x0000_0010_0000_0000 as the shadow
> +   region. But when we try to access any of that, we'll try to access pages
> +   that are not physically present.
> +

 If we reserved memory for KASAN from each node (discontig region), we
 might survive this, no? Maybe we need NUMA-aware KASAN? That might be a
 generic change, just thinking out loud.
>>>
>>> The challenge is that - AIUI - in inline instrumentation, the compiler
>>> doesn't generate calls to things like __asan_loadN and
>>> __asan_storeN. Instead it uses -fasan-shadow-offset to compute the
>>> checks, and only calls the __asan_report* family of functions if it
>>> detects an issue. This also matches what I can observe with objdump
>>> across outline and inline instrumentation settings.
>>>
>>> This means that for this sort of thing to work we would need to either
>>> drop back to out-of-line calls, or teach the compiler how to use a
>>> nonlinear, NUMA aware mem-to-shadow mapping.
>>
>> Yes, out of line is expensive, but seems to work well for all use cases.
> 
> I'm not sure this is true. Looking at scripts/Makefile.kasan, allocas,
> stacks and globals will only be instrumented if you can provide
> KASAN_SHADOW_OFFSET. In the case you're proposing, we can't provide a
> static offset. I _think_ this is a compiler limitation, where some of
> those instrumentations only work/make sense with a static offset, but
> perhaps that's not right? Dmitry and Andrey, can you shed some light on
> this?
> 

There is no code in the kernel that poisons/unpoisons
redzones/variables on the stack. That's because it's always done by the
compiler, which inserts some code in the prologue/epilogue of every function.
So the compiler needs to know the shadow offset which will be used to
poison/unpoison stack frames.

There is no such limitation on globals instrumentation. The only reason
globals instrumentation depends on -fasan-shadow-offset is that there was a
bug related to globals in an old gcc version which didn't support
-fasan-shadow-offset.


If you want stack instrumentation with a non-standard mem-to-shadow mapping,
the options are:
1. Patch the compiler to make it possible to poison/unpoison stack frames
via function calls.
2. Use out-of-line instrumentation and do whatever mem-to-shadow mapping you
want, but keep all kernel stacks in some special place for which the
standard mem-to-shadow mapping ((addr >> 3) + offset) works.
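
For reference, that standard mapping in option 2 is what
include/linux/kasan.h implements:

	static inline void *kasan_mem_to_shadow(const void *addr)
	{
		return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
			+ KASAN_SHADOW_OFFSET;
	}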


> Also, as it currently stands, the speed difference between inline and
> outline is approximately 2x, and given that we'd like to run this
> full-time in syzkaller I think there is value in trading off speed for
> some limitations.
> 


Re: [PATCH v2 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2019-12-12 Thread Christophe Leroy




Le 12/12/2019 à 08:42, Balbir Singh a écrit :



On 12/12/19 1:24 am, Daniel Axtens wrote:

Hi Balbir,


+Discontiguous memory can occur when you have a machine with memory spread
+across multiple nodes. For example, on a Talos II with 64GB of RAM:
+
+ - 32GB runs from 0x0 to 0x0000_0008_0000_0000,
+ - then there's a gap,
+ - then the final 32GB runs from 0x0000_2000_0000_0000 to 0x0000_2008_0000_0000
+
+This can create _significant_ issues:
+
+ - If we try to treat the machine as having 64GB of _contiguous_ RAM, we would
+   assume that ran from 0x0 to 0x0000_0010_0000_0000. We'd then reserve the
+   last 1/8th - 0x0000_000e_0000_0000 to 0x0000_0010_0000_0000 as the shadow
+   region. But when we try to access any of that, we'll try to access pages
+   that are not physically present.
+


If we reserved memory for KASAN from each node (discontig region), we might
survive this, no? Maybe we need NUMA-aware KASAN? That might be a generic
change, just thinking out loud.


The challenge is that - AIUI - in inline instrumentation, the compiler
doesn't generate calls to things like __asan_loadN and
__asan_storeN. Instead it uses -fasan-shadow-offset to compute the
checks, and only calls the __asan_report* family of functions if it
detects an issue. This also matches what I can observe with objdump
across outline and inline instrumentation settings.

This means that for this sort of thing to work we would need to either
drop back to out-of-line calls, or teach the compiler how to use a
nonlinear, NUMA aware mem-to-shadow mapping.


Yes, out of line is expensive, but seems to work well for all use cases.


I'm not sure this is true. Looking at scripts/Makefile.kasan, allocas,
stacks and globals will only be instrumented if you can provide
KASAN_SHADOW_OFFSET. In the case you're proposing, we can't provide a
static offset. I _think_ this is a compiler limitation, where some of
those instrumentations only work/make sense with a static offset, but
perhaps that's not right? Dmitry and Andrey, can you shed some light on
this?



 From what I can read, everything should still be supported; the info page
for gcc states that globals and stack ASAN should be enabled by default.
allocas may have limited meaning if stack-protector is turned on (no?)


Where do you read that?

As far as I can see, there is not much detail about
-fsanitize=kernel-address and -fasan-shadow-offset=number in the GCC doc
(https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html)


[...]






I think I got CONFIG_PHYS_MEM_SIZE_FOR_KASAN wrong; honestly, I don't get why
we need this size. The size is in MB and the default is 0.

Why does the powerpc port of KASAN need the SIZE to be explicitly specified?



AFAICS, it is explained in detail in Daniel's commit log. That's
because on book3s64, KVM requires KASAN to also work when the MMU is off.


The 0 default is for when CONFIG_KASAN is not selected, in order to
avoid a forest of #ifdefs in the code.


Christophe


[PATCH v4 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Srikar Dronamraju
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at wakeup.
This leads to a wrong choice of CPU, which in turn leads to larger wakeup
latencies. Eventually, it leads to performance regressions in
latency-sensitive benchmarks like soltp, schbench etc.

On powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is incremented whenever the LPAR enters the CEDE state. So any
CPU that has entered the CEDE state is assumed to be preempted.

Even if a dedicated LPAR's vCPU is preempted/donated, it should have the
right of first use, since the LPAR is supposed to own the vCPU.

On a Power9 System with 32 cores
 # lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
                v5.4                            v5.4 + patch
Latency percentiles (usec)              Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 39
        75.0000th: 62                           75.0000th: 53
        90.0000th: 71                           90.0000th: 67
        95.0000th: 77                           95.0000th: 76
        *99.0000th: 91                          *99.0000th: 89
        99.5000th: 707                          99.5000th: 93
        99.9000th: 6920                         99.9000th: 118
        min=0, max=10048                        min=0, max=211
Latency percentiles (usec)              Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 61                           75.0000th: 45
        90.0000th: 72                           90.0000th: 53
        95.0000th: 79                           95.0000th: 56
        *99.0000th: 691                         *99.0000th: 61
        99.5000th: 3972                         99.5000th: 63
        99.9000th: 8368                         99.9000th: 78
        min=0, max=16606                        min=0, max=228
Latency percentiles (usec)              Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 61                           75.0000th: 45
        90.0000th: 71                           90.0000th: 53
        95.0000th: 77                           95.0000th: 57
        *99.0000th: 106                         *99.0000th: 63
        99.5000th: 2364                         99.5000th: 68
        99.9000th: 7480                         99.9000th: 100
        min=0, max=10001                        min=0, max=134
Latency percentiles (usec)              Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 62                           75.0000th: 46
        90.0000th: 72                           90.0000th: 53
        95.0000th: 78                           95.0000th: 56
        *99.0000th: 93                          *99.0000th: 61
        99.5000th: 108                          99.5000th: 64
        99.9000th: 6792                         99.9000th: 85
        min=0, max=17681                        min=0, max=121
Latency percentiles (usec)              Latency percentiles (usec)
        50.0000th: 46                           50.0000th: 33
        75.0000th: 62                           75.0000th: 44
        90.0000th: 73                           90.0000th: 51
        95.0000th: 79                           95.0000th: 54
        *99.0000th: 113                         *99.0000th: 61
        99.5000th: 2724                         99.5000th: 64
        99.9000th: 6184                         99.9000th: 82
        min=0, max=9887                         min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

context-switches            43,373  ( +-  0.40% )        44,597  ( +-  0.55% )
cpu-migrations               1,211  ( +-  5.04% )           220  ( +-  6.23% )
page-faults                 15,983  ( +-  5.21% )        15,360  ( +-  3.38% )

Waiman Long suggested using static_keys.
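
The resulting helper looks roughly like this (a sketch of the approach; the
shared_processor static key is assumed to be enabled only when the LPAR
actually runs in shared-processor mode):

	static inline bool vcpu_is_preempted(int cpu)
	{
		if (!static_branch_unlikely(&shared_processor))
			return false;
		return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
	}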

Fixes: 41946c86876e ("locking/core, powerpc: Implement vcpu_is_preempted(cpu)")

Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Cc: Gautham R. Shenoy 
Cc: Vaidyanathan Srinivasan 
Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Tested-by: Juri Lelli 
Tested-by: Parth Shah 
Acked-by: Waiman Long 
Acked-by: Phil Auld 
Reviewed-by: Gautham R. Shenoy 
Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 

[PATCH v10 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-12-12 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
&umem->sg_nents);
-
-   up_read(&mm->mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.0



[PATCH v10 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

2019-12-12 Thread John Hubbard
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 30 +-
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(&mm->mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   return 0;
+   ret = 0;
+   goto done;
}
 
-   down_read(&mm->mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (is_invalid_reserved_pfn(*pfn))
ret = 0;
}
-
+done:
up_read(&mm->mmap_sem);
return ret;
 }
-- 
2.24.0



[PATCH v10 23/25] mm/gup: track FOLL_PIN pages

2019-12-12 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/

Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h|  77 -
 include/linux/mmzone.h|   2 +
 include/linux/page_ref.h  |  10 +
 mm/gup.c  | 399 +-
 mm/huge_memory.c  |  45 ++-
 mm/hugetlb.c  |  38 ++-
 mm/vmstat.c   |   2 +
 8 files changed, 436 insertions(+), 139 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pins pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  --------
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..1765332f27e8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1005,6 +1005,10 @@ static inline bool is_pci_p2pdma_page(const struct page 
*page)
 #define page_ref_zero_or_close_to_overflow(page) \
((unsigned int) page_ref_count(page) + 127u <= 127u)
 
+#define page_ref_zero_or_close_to_bias_overflow(page) \
+   ((unsigned int) page_ref_count(page) + \
+   GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)
+
 static inline void get_page(struct page *page)
 {
page = compound_head(page);
@@ -1016,6 +1020,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
page = compound_head(page);
@@ -1044,29 +1050,70 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original 
page
+ * reference count, and also a new count of how many pin_user_pages() calls 
were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. 
In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Choice of value:
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
+ *
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because 

[PATCH v10 24/25] mm/gup_benchmark: support pin_user_pages() and related calls

2019-12-12 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():  via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 65 --
 tools/testing/selftests/vm/gup_benchmark.c | 15 -
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..76d32db48af8 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +21,42 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long 
nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   unpin_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -65,6 +103,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags,
+pages + i);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -75,15 +121,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -101,6 +154,8 @@ static long gup_benchmark_ioctl(struct file *filep, 
unsigned int cmd,
case GUP_FAST_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
break;
default:
return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 

[PATCH v10 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-12 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *locked)
 {
/*
-* FIXME: Current FOLL_LONGTERM behavior is incompatible with
+* Parts of FOLL_LONGTERM behavior are incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-* vmas.  As there are no users of this flag in this call we simply
-* disallow this option for now.
+* vmas. However, this only comes up if locked is set, and there are
+* callers that do request FOLL_LONGTERM, but do not set locked. So,
+* allow what we can.
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-   return -EINVAL;
+   if (gup_flags & FOLL_LONGTERM) {
+   if (WARN_ON_ONCE(locked))
+   return -EINVAL;
+   /*
+* This will check the vmas (even if our vmas arg is NULL)
+* and return -ENOTSUPP if DAX isn't allowed in this case:
+*/
+   return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+vmas, gup_flags | FOLL_TOUCH |
+FOLL_REMOTE);
+   }
 
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
   locked,
-- 
2.24.0



[PATCH v10 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-12-12 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "--"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.24.0



[PATCH v10 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-12 Thread John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index c0c56888e7cc..958ab0757389 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned long addr, len, end;
int nr = 0, ret = 0;
 
-   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+  FOLL_FORCE)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.0



[PATCH v10 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-12 Thread John Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)

Reviewed-by: Jan Kara 
Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++
 include/linux/mm.h|  63 --
 mm/gup.c  | 161 +--
 4 files changed, 423 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..71849830cd48
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+pin_user_pages() and related calls
+====================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for 
problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper 
calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pins pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages          FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast     FOLL_PIN is always set internally by this function.
+ 

[PATCH v10 19/25] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-12 Thread John Hubbard
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.24.0



[PATCH v10 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-12 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 17 +++++++++++++----
 mm/memremap.c      | 16 ++--------------
 mm/swap.c          | 24 ++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..77a4df06c8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
	if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
 }
 
+bool put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+   return false;
+}
+
 static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
		wake_up_var(&page->_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..49f7c2eea0ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+   }
+
+   return is_devmap;
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.0



[PATCH v10 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-12-12 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c                         | 9 ++++++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE	0x01	/* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.0



[PATCH v10 12/25] IB/{core, hw, umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2019-12-12 Thread John Hubbard
Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.
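
Schematically, the two models referred to above then pair up as follows (a
summary for orientation, not literal code from the patch):

    umem.c (DMA pinning):  pin_user_pages_fast()  ...  put_user_page*()
    umem_odp.c (ODP):      get_user_pages()       ...  put_page()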

Reviewed-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++++++-------
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
+   ret = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-			put_user_pages(&local_page_list[j+1],
-				       npages - (j + 1));
+			release_pages(&local_page_list[j+1],
+				      npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
int ret;
unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+   ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret < 0)
goto out;
diff --git 

[PATCH v10 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2019-12-12 Thread John Hubbard
In order to provide a clearer, more symmetric API for pinning
and unpinning DMA pages. This way, pin_user_pages*() calls
match up with unpin_user_pages*() calls, and the API is a lot
closer to being self-explanatory.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c        |  4 ++--
 drivers/gpu/drm/via/via_dmablit.c           |  4 ++--
 drivers/infiniband/core/umem.c              |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c         |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 ++--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 ++--
 drivers/vfio/vfio_iommu_type1.c             |  2 +-
 fs/io_uring.c                               |  4 ++--
 include/linux/mm.h                          | 26 +++++++++++++-------------
 mm/gup.c                                    | 32 ++++++++++++++++----------------
 mm/process_vm_access.c                      |  4 ++--
 net/xdp/xdp_umem.c                          |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 71849830cd48..1d490155ecd7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat entries: ::
 /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 free_exit:
/* free the references taken */
-   put_user_pages(mem->hpages, pinned);
+   unpin_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_user_page(page);
+   unpin_user_page(page);
 
mem->hpas[i] = 0;
}
diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
-   put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
- (vsg->direction == DMA_FROM_DEVICE));
+   unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+  (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 
	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
		page = sg_page_iter_page(&sg_iter);
-		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
}
 
	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   put_user_pages_dirty_lock(p, npages, dirty);
+   unpin_user_pages_dirty_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
		atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index 

[PATCH v10 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free()

2019-12-12 Thread John Hubbard
From: Dan Williams 

After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: Jan Kara 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
Signed-off-by: John Hubbard 
---
 drivers/nvdimm/pmem.c |  6 ------
 mm/memremap.c | 80 ---
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-	wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..e899fa876a62 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-   if (!pgmap->ops || !pgmap->ops->page_free) {
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+   (!pgmap->ops || !pgmap->ops->page_free)) {
WARN(1, "Missing page_free method\n");
return -EINVAL;
}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
int count = page_ref_dec_return(page);
 
-   /*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
-*/
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
+   /* still busy */
+   if (count > 1)
+   return;
 
-   mem_cgroup_uncharge(page);
+   /* only triggered by the dev_pagemap shutdown path */
+   if (count == 0) {
+   __put_page(page);
+   return;
+   }
 
-   /*
-* When a device_private page is freed, the page->mapping field
-* may still contain a (stale) mapping value. For example, the
-* lower bits of page->mapping may still identify the page as
-* an anonymous page. Ultimately, this entire field is just
-* stale and wrong, and it will cause errors if not cleared.
-* One example is:
-*
-*  migrate_vma_pages()
-*migrate_vma_insert_page()
-*  page_add_new_anon_rmap()
-*__page_set_anon_rmap()
-*  ...checks page->mapping, via PageAnon(page) call,
-*and incorrectly concludes that the page is an
-*anonymous page. Therefore, it incorrectly,
-*silently fails to set up the new anon rmap.
-*
-* For other types of ZONE_DEVICE pages, migration is either
-* handled differently or not done at all, so there is no need
-* to clear page->mapping.
-*/
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   /* notify page idle for dax */
+   if (!is_device_private_page(page)) {
+		wake_up_var(&page->_refcount);
+   return;
+   }
 
-   page->pgmap->ops->page_free(page);
-   } else if (!count)
-   __put_page(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
+
+   /*
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as an
+* anonymous page. Ultimately, this entire field is just stale
+* and wrong, and it will cause errors if not cleared.  One
+* example is:
+*
+*  

[PATCH v10 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-12 Thread John Hubbard
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Christoph Hellwig, a point of interest:

a) I've moved the bulk of the code out of the inline functions, as
   requested, for the devmap changes (patch 4: "mm: devmap: refactor
   1-based refcounting for ZONE_DEVICE pages").

Changes since v9: Fixes resulting from Jan Kara's and Jonathan Corbet's
reviews:

* Removed reviewed-by tags from the "mm/gup: track FOLL_PIN pages" patch
  (those were improperly inherited from the much smaller refactoring patch
  that was merged into it).

* Made try_grab_compound_head() and try_grab_page() consistent in their
  behavior with flags, in order to avoid "gotchas" later.

* follow_trans_huge_pmd(): moved the try_grab_page() to earlier in the
  routine, in order to avoid having to undo mlock_vma_page().

* follow_hugetlb_page(): removed a refcount overflow check that is now
  extraneous (and weaker than what try_grab_page() provides a few lines
  further down).

* Fixed up two Documentation flaws, pointed out by Jonathan Corbet's
  review.

Changes since v8:

* Merged the "mm/gup: pass flags arg to __gup_device_* functions" patch
  into the "mm/gup: track FOLL_PIN pages" patch, as requested by
  Christoph and Jan.

* Changed void grab_page() to bool try_grab_page(), and handled errors
  at the call sites. (From Jan's review comments.) try_grab_page()
  attempts to avoid page refcount overflows, even when counting up with
  GUP_PIN_COUNTING_BIAS increments.

* Fixed a bug that I'd introduced, when changing a BUG() to a WARN().

* Added Jan's reviewed-by tag to the "mm/gup: allow FOLL_FORCE for
  get_user_pages_fast()" patch.

* Documentation: pin_user_pages.rst: fixed an incorrect gup_benchmark
  invocation, left over from the pin_longterm days, spotted while preparing
  this version.

* Rebased onto today's linux.git (-rc1), and re-tested.

Changes since v7:

* Rebased onto Linux 5.5-rc1

* Reworked the grab_page() and try_grab_compound_head(), for API
  consistency and less diffs (thanks to Jan Kara's reviews).

* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
  patches.

* patch 4 refactoring changes, as mentioned above.

There is a git repo and branch, for convenience:

git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8

For the remaining list of "changes since version N", those are all in
v7, which is here:

  https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com


Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:

pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
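
As a concrete illustration (a sketch, not a hunk from this series; everything
except the gup calls themselves is a placeholder):

    /* Before: FOLL_GET references, dropped one page at a time: */
    ret = get_user_pages(addr, nr_pages, FOLL_WRITE, pages, NULL);
    ...
    for (i = 0; i < ret; i++)
            put_page(pages[i]);

    /* After: FOLL_PIN references, dropped via the unpin API: */
    ret = pin_user_pages(addr, nr_pages, FOLL_WRITE, pages, NULL);
    ...
    unpin_user_pages(pages, ret);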


Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
 passes.
* infiniband: ran "ib_write_bw", which exercises the umem.c changes,
  but not the other changes.
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's

[PATCH v10 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-12-12 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.0



[PATCH v10 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-12 Thread John Hubbard
1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 flags | FOLL_LONGTERM, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v10 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-12 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig 
Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Cc: 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v10 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-12 Thread John Hubbard
1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+   ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_page(page);
+   put_user_page(page);
+
mem->hpas[i] = 0;
}
 }
-- 
2.24.0



[PATCH v10 05/25] goldfish_pipe: rename local pin_user_pages() routine

2019-12-12 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
	if (mutex_lock_interruptible(&pipe->lock))
		return -ERESTARTSYS;

-	pages_count = pin_user_pages(first_page, last_page,
-				     last_page_size, is_write,
-				     pipe->pages, &iter_last_page_size);
+	pages_count = goldfish_pin_pages(first_page, last_page,
+					 last_page_size, is_write,
+					 pipe->pages, &iter_last_page_size);
	if (pages_count < 0) {
		mutex_unlock(&pipe->lock);
		return pages_count;
-- 
2.24.0



[PATCH v10 16/25] net/xdp: set FOLL_PIN via pin_user_pages()

2019-12-12 Thread John Hubbard
Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
	down_read(&current->mm->mmap_sem);
-	npgs = get_user_pages(umem->address, umem->npgs,
+	npgs = pin_user_pages(umem->address, umem->npgs,
			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
	up_read(&current->mm->mmap_sem);
 
-- 
2.24.0



[PATCH v10 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-12-12 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-	if (vm_write) {
+	if (vm_write)
		copied = copy_page_from_iter(page, offset, copy, iter);
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
		down_read(&mm->mmap_sem);
-		pages = get_user_pages_remote(task, mm, pa, pages, flags,
-					      process_pages, NULL, &locked);
+		pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+						     flags, process_pages,
+						     NULL, &locked);
		if (locked)
			up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.0



[PATCH v10 11/25] goldfish_pipe: convert to pin_user_pages() and put_user_page()

2019-12-12 Thread John Hubbard
1. Call the new global pin_user_pages_fast() from goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
	mutex_unlock(&pipe->lock);
return 0;
-- 
2.24.0



[PATCH v10 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-12-12 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 405be10da73d..9639ebc21e8a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4521,7 +4521,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
ret = 0;
	down_read(&current->mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
+   pret = pin_user_pages(ubuf, nr_pages,
  FOLL_WRITE | FOLL_LONGTERM,
  pages, vmas);
if (pret == nr_pages) {
-- 
2.24.0



[PATCH v10 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-12-12 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f764432914c4..3ecce297a47f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.0



[PATCH v10 01/25] mm/gup: factor out duplicate code from four routines

2019-12-12 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 91 ++--
 1 file changed, 36 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..f764432914c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+  unsigned long end, struct page **pages)
+{
+   int nr;
+
+   for (nr = 0; addr != end; addr += PAGE_SIZE)
+   pages[nr++] = page++;
+
+   return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -2007,32 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2079,28 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2120,28 +2118,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2157,28 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,

Re: Call for report - G5/PPC970 status

2019-12-12 Thread John Paul Adrian Glaubitz
On 12/12/19 9:00 AM, Romain Dolbeau wrote:
> Could you share the screen/log of the crash?
> 
> For my G5 with 5.5rc1 I have one, but the photo is terrible:
> 
> Timestamps overlap, as after the 'crash' (backtrace) there were
> more messages from the (S)ATA subsystem.

I suggest booting the machine with a netconsole to get a dump of the crash
over the network, see [1].
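
A typical setup (for reference; the addresses, ports, MAC and interface name
below are placeholders for your own network) is to boot the crashing machine
with something like:

  netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

and capture the UDP stream on the receiving host (10.0.0.2 here), e.g. with
"nc -u -l 9353" or similar.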

[1] https://wiki.archlinux.org/index.php/Netconsole

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-12 Thread Peter Zijlstra
On Thu, Dec 12, 2019 at 04:42:13PM +1100, Michael Ellerman wrote:
> [ trimmed CC a bit ]
> 
> Peter Zijlstra  writes:
> > On Fri, Dec 06, 2019 at 11:46:11PM +1100, Michael Ellerman wrote:
> ...
> > you write:
> >
> >   "Currently bitops-instrumented.h assumes that the architecture provides
> > atomic, non-atomic and locking bitops (e.g. both set_bit and __set_bit).
> > This is true on x86 and s390, but is not always true: there is a
> > generic bitops/non-atomic.h header that provides generic non-atomic
> > operations, and also a generic bitops/lock.h for locking operations."
> >
> > Is there any actual benefit for PPC to using their own atomic bitops
> > over bitops/lock.h ? I'm thinking that the generic code is fairly
> > optimal for most LL/SC architectures.
> 
> Yes and no :)
> 
> Some of the generic versions don't generate good code compared to our
> versions, but that's because READ_ONCE() is triggering stack protector
> to be enabled.

Bah, there's never anything simple, is there :/

> For example, comparing an out-of-line copy of the generic and ppc
> versions of test_and_set_bit_lock():
> 
>    1 :                                 1 :
>    2 addis   r2,r12,361
>    3 addi    r2,r2,-4240
>    4 stdu    r1,-48(r1)
>    5 rlwinm  r8,r3,29,3,28
>    6 clrlwi  r10,r3,26                 2 rldicl  r10,r3,58,6
>    7 ld      r9,3320(r13)
>    8 std     r9,40(r1)
>    9 li      r9,0
>   10 li      r9,1                      3 li      r9,1
>                                        4 clrlwi  r3,r3,26
>                                        5 rldicr  r10,r10,3,60
>   11 sld     r9,r9,r10                 6 sld     r3,r9,r3
>   12 add     r10,r4,r8                 7 add     r4,r4,r10
>   13 ldx     r8,r4,r8
>   14 and.    r8,r9,r8
>   15 bne     34f
>   16 ldarx   r7,0,r10                  8 ldarx   r9,0,r4,1
>   17 or      r8,r9,r7                  9 or      r10,r9,r3
>   18 stdcx.  r8,0,r10                 10 stdcx.  r10,0,r4
>   19 bne-    16b                      11 bne-    8b
>   20 isync                            12 isync
>   21 and     r9,r7,r9                 13 and     r3,r3,r9
>   22 addic   r7,r9,-1                 14 addic   r9,r3,-1
>   23 subfe   r7,r7,r9                 15 subfe   r3,r9,r3
>   24 ld      r9,40(r1)
>   25 ld      r10,3320(r13)
>   26 xor.    r9,r9,r10
>   27 li      r10,0
>   28 mr      r3,r7
>   29 bne     36f
>   30 addi    r1,r1,48
>   31 blr                              16 blr
>   32 nop
>   33 nop
>   34 li      r7,1
>   35 b       24b
>   36 mflr    r0
>   37 std     r0,64(r1)
>   38 bl      <__stack_chk_fail+0x8>
> 
> 
> If you squint, the generated code for the actual logic is pretty similar, but
> the stack protector gunk makes a big mess. It's particularly bad here
> because the ppc version doesn't even need a stack frame.
> 
> I've also confirmed that even when test_and_set_bit_lock() is inlined
> into an actual call site the stack protector logic still triggers.

> If I change the READ_ONCE() in test_and_set_bit_lock():
> 
>   if (READ_ONCE(*p) & mask)
>           return 1;
> 
> to a regular pointer access:
> 
>   if (*p & mask)
>           return 1;
> 
> Then the generated code looks more or less the same, except for the extra
> early return in the generic version of test_and_set_bit_lock(), and
> different handling of the return code by the compiler.

So given that the function signature is:

  static inline int test_and_set_bit_lock(unsigned int nr,
					  volatile unsigned long *p)

@p already carries the required volatile qualifier, so READ_ONCE() does
not add anything here (except for easier to read code and poor code
generation).
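
As a minimal illustration of that point (a sketch, not code from this thread;
it assumes a naturally aligned word, for which READ_ONCE() amounts to a
volatile load), both functions below perform the same volatile access, so the
macro adds no guarantee that the qualifier does not already provide:

	unsigned long via_read_once(unsigned long *p)
	{
		return READ_ONCE(*p);	/* forces a volatile access */
	}

	unsigned long via_qualifier(volatile unsigned long *p)
	{
		return *p;		/* @p is already volatile-qualified */
	}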

So your proposed change _should_ be fine. Will, I'm assuming you never
saw this on your ARGH64 builds when you did this code ?

---
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index dd90c9792909..10264e8808f8 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -35,7 +35,7 @@ static inline int test_and_set_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
 
p += BIT_WORD(nr);
-   if (READ_ONCE(*p) & mask)
+   if (*p & mask)
return 1;
 
old = atomic_long_fetch_or(mask, (atomic_long_t *)p);
@@ -48,7 +48,7 @@ static inline int test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
unsigned long mask = BIT_MASK(nr);
 
p += BIT_WORD(nr);
-   if (!(READ_ONCE(*p) 

Re: Call for report - G5/PPC970 status

2019-12-12 Thread Romain Dolbeau
On Thu, Dec 12, 2019 at 08:32, jjhdiederen  wrote:
> PowerMac 7,3 G5 2.5 DP PCI-X Mid-2004 is affected with this bug. The
> machine freezes at boot due to the new ppc64 kernel.

Thanks for the reports!

So it's not all G5s, but it is across generations - the iMac is PCIe.
Perhaps it's specific to multiprocessor machines?

Could you share the screen/log of the crash?

For my G5 with 5.5rc1 I have one, but the photo is terrible:

Timestamps overlap, as after the 'crash' (backtrace) there were
more messages from the (S)ATA subsystem.

Cordially,

-- 
Romain Dolbeau