Re: [PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-08 Thread Jürgen Groß

On 08.09.20 22:10, David Hildenbrand wrote:

We soon want to pass flags, e.g., to mark added System RAM resources.
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: David Hildenbrand 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Oliver O'Halloran" 
Cc: Pingfan Liu 
Cc: Nathan Lynch 
Cc: Libor Pechacek 
Cc: Anton Blanchard 
Cc: Leonardo Bras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand 


Reviewed-by: Juergen Gross  (Xen related part)


Juergen


---
  arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
  arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
  drivers/acpi/acpi_memhotplug.c  |  2 +-
  drivers/base/memory.c   |  2 +-
  drivers/dax/kmem.c  |  2 +-
  drivers/hv/hv_balloon.c |  2 +-
  drivers/s390/char/sclp_cmd.c|  2 +-
  drivers/virtio/virtio_mem.c |  2 +-
  drivers/xen/balloon.c   |  2 +-
  include/linux/memory_hotplug.h  | 10 ++
  mm/memory_hotplug.c | 15 ---
  11 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..a7475d18c671c 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
ent->mem = 0;
}
  
-		if (add_memory(ent->nid, ent->start, ent->size)) {

+   if (add_memory(ent->nid, ent->start, ent->size, 0)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b78111f9..54a888ea7f751 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -606,7 +606,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
block_sz = memory_block_size_bytes();
  
  	/* Add the memory */

-   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
+   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz, 0);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..d91b3584d4b2b 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
  
-		result = __add_memory(node, info->start_addr, info->length);

+   result = __add_memory(node, info->start_addr, info->length, 0);
  
  		/*

 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..2287bcf86480e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,7 @@ static ssize_t probe_store(struct device *dev, struct 
device_attribute *attr,
  
  	nid = memory_add_physaddr_to_nid(phys_addr);

ret = __add_memory(nid, phys_addr,
-  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
  
  	if (ret)

goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..8e66b28ef5bc6 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 * this as RAM automatically.
 */
rc = add_memory_driver_managed(numa_node, range.start,
-   range_len(), kmem_name);
+   range_len(), kmem_name, 0);
  
  		res->flags |= IORESOURCE_BUSY;

if (rc) {

Re: [PATCH v2] kbuild: preprocess module linker script

2020-09-08 Thread Palmer Dabbelt

On Mon, 07 Sep 2020 21:27:08 PDT (-0700), masahi...@kernel.org wrote:

There was a request to preprocess the module linker script like we
do for the vmlinux one. (https://lkml.org/lkml/2020/8/21/512)

The difference between vmlinux.lds and module.lds is that the latter
is needed for external module builds, thus must be cleaned up by
'make mrproper' instead of 'make clean'. Also, it must be created
by 'make modules_prepare'.

You cannot put it in arch/$(SRCARCH)/kernel/, which is cleaned up by
'make clean'. I moved arch/$(SRCARCH)/kernel/module.lds to
arch/$(SRCARCH)/include/asm/module.lds.h, which is included from
scripts/module.lds.S.

scripts/module.lds is fine because 'make clean' keeps all the
build artifacts under scripts/.

You can add arch-specific sections in .


for the arch/riscv stuff

Acked-by: Palmer Dabbelt 

Thanks!


Signed-off-by: Masahiro Yamada 
Tested-by: Jessica Yu 
Acked-by: Will Deacon 
---

Changes in v2:
  - Fix the race between the two targets 'scripts' and 'asm-generic'

 Makefile   | 10 ++
 arch/arm/Makefile  |  4 
 .../{kernel/module.lds => include/asm/module.lds.h}|  2 ++
 arch/arm64/Makefile|  4 
 .../{kernel/module.lds => include/asm/module.lds.h}|  2 ++
 arch/ia64/Makefile |  1 -
 arch/ia64/{module.lds => include/asm/module.lds.h} |  0
 arch/m68k/Makefile |  1 -
 .../{kernel/module.lds => include/asm/module.lds.h}|  0
 arch/powerpc/Makefile  |  1 -
 .../{kernel/module.lds => include/asm/module.lds.h}|  0
 arch/riscv/Makefile|  3 ---
 .../{kernel/module.lds => include/asm/module.lds.h}|  3 ++-
 arch/um/include/asm/Kbuild |  1 +
 include/asm-generic/Kbuild |  1 +
 include/asm-generic/module.lds.h   | 10 ++
 scripts/.gitignore |  1 +
 scripts/Makefile   |  3 +++
 scripts/Makefile.modfinal  |  5 ++---
 scripts/{module-common.lds => module.lds.S}|  3 +++
 scripts/package/builddeb   |  2 +-
 21 files changed, 34 insertions(+), 23 deletions(-)
 rename arch/arm/{kernel/module.lds => include/asm/module.lds.h} (72%)
 rename arch/arm64/{kernel/module.lds => include/asm/module.lds.h} (76%)
 rename arch/ia64/{module.lds => include/asm/module.lds.h} (100%)
 rename arch/m68k/{kernel/module.lds => include/asm/module.lds.h} (100%)
 rename arch/powerpc/{kernel/module.lds => include/asm/module.lds.h} (100%)
 rename arch/riscv/{kernel/module.lds => include/asm/module.lds.h} (84%)
 create mode 100644 include/asm-generic/module.lds.h
 rename scripts/{module-common.lds => module.lds.S} (93%)

diff --git a/Makefile b/Makefile
index 37739ee53f27..97b1dae1783b 100644
--- a/Makefile
+++ b/Makefile
@@ -505,7 +505,6 @@ KBUILD_CFLAGS_KERNEL :=
 KBUILD_AFLAGS_MODULE  := -DMODULE
 KBUILD_CFLAGS_MODULE  := -DMODULE
 KBUILD_LDFLAGS_MODULE :=
-export KBUILD_LDS_MODULE := $(srctree)/scripts/module-common.lds
 KBUILD_LDFLAGS :=
 CLANG_FLAGS :=

@@ -1395,7 +1394,7 @@ endif
 # using awk while concatenating to the final file.

 PHONY += modules
-modules: $(if $(KBUILD_BUILTIN),vmlinux) modules_check
+modules: $(if $(KBUILD_BUILTIN),vmlinux) modules_check modules_prepare
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modpost

 PHONY += modules_check
@@ -1412,6 +1411,7 @@ targets += modules.order
 # Target to prepare building external modules
 PHONY += modules_prepare
 modules_prepare: prepare
+   $(Q)$(MAKE) $(build)=scripts scripts/module.lds

 # Target to install modules
 PHONY += modules_install
@@ -1743,7 +1743,9 @@ help:
@echo  '  clean   - remove generated files in module directory 
only'
@echo  ''

-PHONY += prepare
+# no-op for external module builds
+PHONY += prepare modules_prepare
+
 endif # KBUILD_EXTMOD

 # Single targets
@@ -1776,7 +1778,7 @@ MODORDER := .modules.tmp
 endif

 PHONY += single_modpost
-single_modpost: $(single-no-ko)
+single_modpost: $(single-no-ko) modules_prepare
$(Q){ $(foreach m, $(single-ko), echo $(extmod-prefix)$m;) } > 
$(MODORDER)
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modpost

diff --git a/arch/arm/Makefile b/arch/arm/Makefile
index 4e877354515f..a0cb15de9677 100644
--- a/arch/arm/Makefile
+++ b/arch/arm/Makefile
@@ -16,10 +16,6 @@ LDFLAGS_vmlinux  += --be8
 KBUILD_LDFLAGS_MODULE  += --be8
 endif

-ifeq ($(CONFIG_ARM_MODULE_PLTS),y)
-KBUILD_LDS_MODULE  += $(srctree)/arch/arm/kernel/module.lds
-endif
-
 GZFLAGS:=-9
 #KBUILD_CFLAGS +=-pipe

diff --git a/arch/arm/kernel/module.lds b/arch/arm/include/asm/module.lds.h
similarity index 72%
rename from arch/arm/kernel/module.lds
rename to 

Re: [PATCH 00/29] treewide: Convert comma separated statements

2020-09-08 Thread Martin K. Petersen
On Mon, 24 Aug 2020 21:55:57 -0700, Joe Perches wrote:

> There are many comma separated statements in the kernel.
> See:https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008201856110.2524@hadrien/
> 
> Convert the comma separated statements that are in if/do/while blocks
> to use braces and semicolons.
> 
> Many comma separated statements still exist but those are changes for
> another day.
> 
> [...]

Applied to 5.10/scsi-queue, thanks!

[01/29] coding-style.rst: Avoid comma statements
(no commit info)
[02/29] alpha: Avoid comma separated statements
(no commit info)
[03/29] ia64: Avoid comma separated statements
(no commit info)
[04/29] sparc: Avoid comma separated statements
(no commit info)
[05/29] ata: Avoid comma separated statements
(no commit info)
[06/29] drbd: Avoid comma separated statements
(no commit info)
[07/29] lp: Avoid comma separated statements
(no commit info)
[08/29] dma-buf: Avoid comma separated statements
(no commit info)
[09/29] drm/gma500: Avoid comma separated statements
(no commit info)
[10/29] drm/i915: Avoid comma separated statements
(no commit info)
[11/29] hwmon: (scmi-hwmon): Avoid comma separated statements
(no commit info)
[12/29] Input: MT - Avoid comma separated statements
(no commit info)
[13/29] bcache: Avoid comma separated statements
(no commit info)
[14/29] media: Avoid comma separated statements
(no commit info)
[15/29] mtd: Avoid comma separated statements
(no commit info)
[16/29] 8390: Avoid comma separated statements
(no commit info)
[17/29] fs_enet: Avoid comma separated statements
(no commit info)
[18/29] wan: sbni: Avoid comma separated statements
(no commit info)
[19/29] s390/tty3270: Avoid comma separated statements
(no commit info)
[20/29] scsi: arm: Avoid comma separated statements
https://git.kernel.org/mkp/scsi/c/a08a07326510
[21/29] media: atomisp: Avoid comma separated statements
(no commit info)
[22/29] video: fbdev: Avoid comma separated statements
(no commit info)
[23/29] fuse: Avoid comma separated statements
(no commit info)
[24/29] reiserfs: Avoid comma separated statements
(no commit info)
[25/29] lib/zlib: Avoid comma separated statements
(no commit info)
[26/29] lib: zstd: Avoid comma separated statements
(no commit info)
[27/29] ipv6: fib6: Avoid comma separated statements
(no commit info)
[28/29] sunrpc: Avoid comma separated statements
(no commit info)
[29/29] tools: Avoid comma separated statements
(no commit info)

-- 
Martin K. Petersen  Oracle Linux Engineering


[powerpc:next-test] BUILD SUCCESS 1caa0de62d8cfeb6874fba2a6af5fbe67ebfca70

2020-09-08 Thread kernel test robot
   randconfig-a004-20200907
x86_64   randconfig-a003-20200907
x86_64   randconfig-a005-20200907
x86_64   randconfig-a001-20200907
x86_64   randconfig-a002-20200907
i386 randconfig-a004-20200908
i386 randconfig-a005-20200908
i386 randconfig-a006-20200908
i386 randconfig-a002-20200908
i386 randconfig-a001-20200908
i386 randconfig-a003-20200908
i386 randconfig-a004-20200907
i386 randconfig-a005-20200907
i386 randconfig-a006-20200907
i386 randconfig-a002-20200907
i386 randconfig-a003-20200907
i386 randconfig-a001-20200907
x86_64   randconfig-a013-20200908
x86_64   randconfig-a016-20200908
x86_64   randconfig-a011-20200908
x86_64   randconfig-a012-20200908
x86_64   randconfig-a015-20200908
x86_64   randconfig-a014-20200908
i386 randconfig-a016-20200907
i386 randconfig-a015-20200907
i386 randconfig-a011-20200907
i386 randconfig-a013-20200907
i386 randconfig-a014-20200907
i386 randconfig-a012-20200907
i386 randconfig-a016-20200908
i386 randconfig-a015-20200908
i386 randconfig-a011-20200908
i386 randconfig-a013-20200908
i386 randconfig-a014-20200908
i386 randconfig-a012-20200908
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20200908
x86_64   randconfig-a006-20200908
x86_64   randconfig-a003-20200908
x86_64   randconfig-a001-20200908
x86_64   randconfig-a005-20200908
x86_64   randconfig-a002-20200908
x86_64   randconfig-a013-20200907
x86_64   randconfig-a011-20200907
x86_64   randconfig-a016-20200907
x86_64   randconfig-a012-20200907
x86_64   randconfig-a015-20200907
x86_64   randconfig-a014-20200907

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:next] BUILD SUCCESS dc462267d2d7aacffc3c1d99b02d7a7c59db7c66

2020-09-08 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  
next
branch HEAD: dc462267d2d7aacffc3c1d99b02d7a7c59db7c66  powerpc/64s: handle ISA 
v3.1 local copy-paste context switches

elapsed time: 744m

configs tested: 151
configs skipped: 12

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm defconfig
arm64allyesconfig
arm64   defconfig
arm  allyesconfig
arm  allmodconfig
parisc  defconfig
powerpcmvme5100_defconfig
sh microdev_defconfig
c6x dsk6455_defconfig
powerpc  allmodconfig
arm   omap2plus_defconfig
powerpcadder875_defconfig
mipsar7_defconfig
arm   omap1_defconfig
sh   se7619_defconfig
mipsmalta_qemu_32r6_defconfig
powerpc ep8248e_defconfig
xtensa  defconfig
arm  pxa910_defconfig
armspear3xx_defconfig
mipsmalta_kvm_guest_defconfig
m68km5407c3_defconfig
c6x  allyesconfig
mipsgpr_defconfig
sh   se7780_defconfig
powerpc   allnoconfig
c6xevmc6457_defconfig
m68kq40_defconfig
sparcallyesconfig
shsh7763rdp_defconfig
armzeus_defconfig
sh   se7705_defconfig
riscvnommu_k210_defconfig
arm   spear13xx_defconfig
arc haps_hs_smp_defconfig
m68k   m5275evb_defconfig
h8300   defconfig
c6xevmc6474_defconfig
mips   sb1250_swarm_defconfig
armqcom_defconfig
sh   sh7724_generic_defconfig
mips loongson1b_defconfig
pariscgeneric-32bit_defconfig
armspear6xx_defconfig
ia64  tiger_defconfig
microblaze  mmu_defconfig
powerpc mpc512x_defconfig
arm  collie_defconfig
arm hackkit_defconfig
powerpc linkstation_defconfig
arm  footbridge_defconfig
mips tb0287_defconfig
mips  bmips_stb_defconfig
arm rpc_defconfig
mipsnlm_xlr_defconfig
m68kstmark2_defconfig
h8300alldefconfig
shsh7785lcr_defconfig
arc  axs103_defconfig
arm shannon_defconfig
armpleb_defconfig
arc  axs101_defconfig
nios2allyesconfig
shapsh4ad0a_defconfig
c6xevmc6472_defconfig
riscv  rv32_defconfig
ia64 allmodconfig
ia64defconfig
ia64 allyesconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
s390 allyesconfig
parisc   allyesconfig
s390defconfig
i386 allyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
x86_64   randconfig-a006-20200907
x86_64   randconfig-a004-20200907
x86_64   randconfig-a003-20200907
x86_64   randconfig-a005-20200907
x86_64   randconfig-a001-20200907
x86_64   randconfig-a002-20200907
i386 randconfig-a004-20200908
i386

[powerpc:fixes-test] BUILD SUCCESS 1ae9eb3074160f102fc2eb13d7b3d776e9b9e462

2020-09-08 Thread kernel test robot
-20200907
x86_64   randconfig-a001-20200907
x86_64   randconfig-a002-20200907
i386 randconfig-a004-20200908
i386 randconfig-a005-20200908
i386 randconfig-a006-20200908
i386 randconfig-a002-20200908
i386 randconfig-a001-20200908
i386 randconfig-a003-20200908
i386 randconfig-a004-20200907
i386 randconfig-a005-20200907
i386 randconfig-a006-20200907
i386 randconfig-a002-20200907
i386 randconfig-a003-20200907
i386 randconfig-a001-20200907
x86_64   randconfig-a013-20200908
x86_64   randconfig-a016-20200908
x86_64   randconfig-a011-20200908
x86_64   randconfig-a012-20200908
x86_64   randconfig-a015-20200908
x86_64   randconfig-a014-20200908
i386 randconfig-a016-20200907
i386 randconfig-a015-20200907
i386 randconfig-a011-20200907
i386 randconfig-a013-20200907
i386 randconfig-a014-20200907
i386 randconfig-a012-20200907
i386 randconfig-a016-20200908
i386 randconfig-a015-20200908
i386 randconfig-a011-20200908
i386 randconfig-a013-20200908
i386 randconfig-a014-20200908
i386 randconfig-a012-20200908
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20200908
x86_64   randconfig-a006-20200908
x86_64   randconfig-a003-20200908
x86_64   randconfig-a001-20200908
x86_64   randconfig-a005-20200908
x86_64   randconfig-a002-20200908
x86_64   randconfig-a013-20200907
x86_64   randconfig-a011-20200907
x86_64   randconfig-a016-20200907
x86_64   randconfig-a012-20200907
x86_64   randconfig-a015-20200907
x86_64   randconfig-a014-20200907

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:merge] BUILD SUCCESS 952478355a3e375c9bc70a8f237900bf88c3f531

2020-09-08 Thread kernel test robot
   randconfig-a006-20200907
x86_64   randconfig-a004-20200907
x86_64   randconfig-a003-20200907
x86_64   randconfig-a005-20200907
x86_64   randconfig-a001-20200907
x86_64   randconfig-a002-20200907
i386 randconfig-a004-20200908
i386 randconfig-a005-20200908
i386 randconfig-a006-20200908
i386 randconfig-a002-20200908
i386 randconfig-a001-20200908
i386 randconfig-a003-20200908
i386 randconfig-a004-20200907
i386 randconfig-a005-20200907
i386 randconfig-a006-20200907
i386 randconfig-a002-20200907
i386 randconfig-a003-20200907
i386 randconfig-a001-20200907
i386 randconfig-a004-20200909
i386 randconfig-a005-20200909
i386 randconfig-a006-20200909
i386 randconfig-a002-20200909
i386 randconfig-a001-20200909
i386 randconfig-a003-20200909
x86_64   randconfig-a013-20200908
x86_64   randconfig-a016-20200908
x86_64   randconfig-a011-20200908
x86_64   randconfig-a012-20200908
x86_64   randconfig-a015-20200908
x86_64   randconfig-a014-20200908
i386 randconfig-a016-20200907
i386 randconfig-a015-20200907
i386 randconfig-a011-20200907
i386 randconfig-a013-20200907
i386 randconfig-a014-20200907
i386 randconfig-a012-20200907
i386 randconfig-a016-20200908
i386 randconfig-a015-20200908
i386 randconfig-a011-20200908
i386 randconfig-a013-20200908
i386 randconfig-a014-20200908
i386 randconfig-a012-20200908
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20200908
x86_64   randconfig-a006-20200908
x86_64   randconfig-a003-20200908
x86_64   randconfig-a001-20200908
x86_64   randconfig-a005-20200908
x86_64   randconfig-a002-20200908
x86_64   randconfig-a013-20200909
x86_64   randconfig-a016-20200909
x86_64   randconfig-a011-20200909
x86_64   randconfig-a012-20200909
x86_64   randconfig-a015-20200909
x86_64   randconfig-a014-20200909
x86_64   randconfig-a013-20200907
x86_64   randconfig-a011-20200907
x86_64   randconfig-a016-20200907
x86_64   randconfig-a012-20200907
x86_64   randconfig-a015-20200907
x86_64   randconfig-a014-20200907

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v3 1/2] scsi: ibmvfc: use compiler attribute defines instead of __attribute__()

2020-09-08 Thread Martin K. Petersen


Tyrel,

> Update ibmvfc.h structs to use the preferred  __packed and __aligned()
> attribute macros defined in include/linux/compiler_attributes.h in place
> of __attribute__().

Applied 1+2 to my 5.10 staging tree. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] powerpc/64: Make VDSO32 track COMPAT on 64-bit

2020-09-08 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 08/09/2020 à 14:58, Michael Ellerman a écrit :
>> When we added the VDSO32 kconfig symbol, which controls building of
>> the 32-bit VDSO, we made it depend on CPU_BIG_ENDIAN (for 64-bit).
>> 
>> That was because back then COMPAT was always enabled for 64-bit, so
>> depending on it would have left the 32-bit VDSO always enabled, which
>> we didn't want.
>> 
>> But since then we have made COMPAT selectable, and off by default for
>> ppc64le, so VDSO32 should really depend on that.
>> 
>> For most people this makes no difference, none of the defconfigs
>> change, it's only if someone is building ppc64le with COMPAT=y, they
>> will now also get VDSO32. If they've enabled COMPAT in order to run
>> 32-bit binaries they presumably also want the 32-bit VDSO.
>> 
>> Signed-off-by: Michael Ellerman 
>
>
> Reviewed-by: Christophe Leroy 
>
> Michael, please note that christophe.le...@c-s.fr is a deprecated 
> address that will one day not work anymore. Please use the new one 
> whenever possible.

OK, I had the old one in my ~/.mailrc, fixed now.

cheers


Re: [PATCH] selftests/seccomp: fix ptrace tests on powerpc

2020-09-08 Thread Kees Cook
On Tue, Jun 30, 2020 at 01:47:39PM -0300, Thadeu Lima de Souza Cascardo wrote:
> As pointed out by Michael Ellerman, the ptrace ABI on powerpc does not
> allow or require the return code to be set on syscall entry when
> skipping the syscall. It will always return ENOSYS and the return code
> must be set on syscall exit.
> 
> This code does that, behaving more similarly to strace. It still sets
> the return code on entry, which is overridden on powerpc, and it will
> always repeat the same on exit. Also, on powerpc, the errno is not
> inverted, and depends on ccr.so being set.
> 
> This has been tested on powerpc and amd64.
> 
> Cc: Michael Ellerman 
> Cc: Kees Cook 
> Signed-off-by: Thadeu Lima de Souza Cascardo 

Yikes, I missed this from a while ago. I apologize for responding so
late!

This appears still unfixed; is that correct?

> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 24 +++
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 252140a52553..b90a9190ba88 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1738,6 +1738,14 @@ void change_syscall(struct __test_metadata *_metadata,
>   TH_LOG("Can't modify syscall return on this architecture");
>  #else
>   regs.SYSCALL_RET = result;
> +# if defined(__powerpc__)
> + if (result < 0) {
> + regs.SYSCALL_RET = -result;
> + regs.ccr |= 0x1000;
> + } else {
> + regs.ccr &= ~0x1000;
> + }
> +# endif
>  #endif

Just so I understand correctly: for ppc to "see" this result, it needs
to be both negative AND have this specific register set?

>  
>  #ifdef HAVE_GETREGS
> @@ -1796,6 +1804,7 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> pid_t tracee,
>   int ret, nr;
>   unsigned long msg;
>   static bool entry;
> + int *syscall_nr = args;
>  
>   /*
>* The traditional way to tell PTRACE_SYSCALL entry/exit
> @@ -1809,10 +1818,15 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> pid_t tracee,
>   EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>   : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>  
> - if (!entry)
> + if (!entry && !syscall_nr)
>   return;
>  
> - nr = get_syscall(_metadata, tracee);
> + if (entry)
> + nr = get_syscall(_metadata, tracee);
> + else
> + nr = *syscall_nr;

This is weird? Shouldn't get_syscall() be modified to do the right thing
here instead of depending on the extra arg?

> + if (syscall_nr)
> + *syscall_nr = nr;
>  
>   if (nr == __NR_getpid)
>   change_syscall(_metadata, tracee, __NR_getppid, 0);
> @@ -1889,9 +1903,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_redirected)
>  
>  TEST_F(TRACE_syscall, ptrace_syscall_errno)
>  {
> + int syscall_nr = -1;
>   /* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */
>   teardown_trace_fixture(_metadata, self->tracer);
> - self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL,
> + self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, 
> _nr,
>  true);
>  
>   /* Tracer should skip the open syscall, resulting in ESRCH. */
> @@ -1900,9 +1915,10 @@ TEST_F(TRACE_syscall, ptrace_syscall_errno)
>  
>  TEST_F(TRACE_syscall, ptrace_syscall_faked)
>  {
> + int syscall_nr = -1;
>   /* Swap SECCOMP_RET_TRACE tracer for PTRACE_SYSCALL tracer. */
>   teardown_trace_fixture(_metadata, self->tracer);
> - self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, NULL,
> + self->tracer = setup_trace_fixture(_metadata, tracer_ptrace, 
> _nr,
>  true);
>  
>   /* Tracer should skip the gettid syscall, resulting fake pid. */
> -- 
> 2.25.1
> 

-- 
Kees Cook


Re: [PATCH v2] dt-bindings: vendor-prefixes: Add Cisco Meraki vendor prefix

2020-09-08 Thread Rob Herring
On Sat, 22 Aug 2020 17:40:45 +0200, Christian Lamparter wrote:
> Meraki was founded in 2006. The start-up quickly rose to prominence
> by being based in part on the MIT Roofnet Project.
> In December 2012, Cisco Systems, Inc. bought Meraki.
> The "Meraki" branding is still around to this day.
> 
> Web site of the company: https://meraki.cisco.com/
> 
> Signed-off-by: Christian Lamparter 
> ---
> 
> v1 -> v2:
>   Split from Meraki MR32 upstreaming attempt. (Florian Fainelli)
>   (This patch will be needed for the MR24 upstreaming series as well)
> ---
>  Documentation/devicetree/bindings/vendor-prefixes.yaml | 2 ++
>  1 file changed, 2 insertions(+)
> 

Applied, thanks!


Re: [PATCH v2] powerpc/papr_scm: Limit the readability of 'perf_stats' sysfs attribute

2020-09-08 Thread Ira Weiny
On Mon, Sep 07, 2020 at 04:35:40PM +0530, Vaibhav Jain wrote:
> The newly introduced 'perf_stats' attribute uses the default access
> mode of 0444 letting non-root users access performance stats of an
> nvdimm and potentially force the kernel into issuing large number of
> expensive HCALLs. Since the information exposed by this attribute
> cannot be cached hence its better to ward of access to this attribute
   ^^^
   off?

> from users who don't need to access these performance statistics.
> 
> Hence this patch updates access mode of 'perf_stats' attribute to
> be only readable by root users.

Generally it is bad form to say "this patch".  See 4c here:

-- https://www.ozlabs.org/~akpm/stuff/tpp.txt

But I'm not picky...  :-D

With the s/of/off/ change:

Reviewed-by: Ira Weiny 

> 
> Fixes: 2d02bf835e573 ('powerpc/papr_scm: Fetch nvdimm performance stats from 
> PHYP')
> Reported-by: Aneesh Kumar K.V 
> Signed-off-by: Vaibhav Jain 
> ---
> Change-log:
> 
> v2:
> * Instead of checking for perfmon_capable() inside show_perf_stats()
>   set the attribute as DEVICE_ATTR_ADMIN_RO [ Aneesh ]
> * Update patch description
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index f439f0dfea7d1..a88a707a608aa 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -822,7 +822,7 @@ static ssize_t perf_stats_show(struct device *dev,
>   kfree(stats);
>   return rc ? rc : seq_buf_used();
>  }
> -DEVICE_ATTR_RO(perf_stats);
> +DEVICE_ATTR_ADMIN_RO(perf_stats);
>  
>  static ssize_t flags_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> -- 
> 2.26.2
> 


[PATCH v2 3/7] mm/memory_hotplug: prepare passing flags to add_memory() and friends

2020-09-08 Thread David Hildenbrand
We soon want to pass flags, e.g., to mark added System RAM resources.
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:

https://lkml.kernel.org/r/20190625075227.15193-3-osalva...@suse.de

Acked-by: Wei Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: David Hildenbrand 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: "Oliver O'Halloran" 
Cc: Pingfan Liu 
Cc: Nathan Lynch 
Cc: Libor Pechacek 
Cc: Anton Blanchard 
Cc: Leonardo Bras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-a...@vger.kernel.org
Cc: linux-nvd...@lists.01.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/platforms/powernv/memtrace.c   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |  2 +-
 drivers/acpi/acpi_memhotplug.c  |  2 +-
 drivers/base/memory.c   |  2 +-
 drivers/dax/kmem.c  |  2 +-
 drivers/hv/hv_balloon.c |  2 +-
 drivers/s390/char/sclp_cmd.c|  2 +-
 drivers/virtio/virtio_mem.c |  2 +-
 drivers/xen/balloon.c   |  2 +-
 include/linux/memory_hotplug.h  | 10 ++
 mm/memory_hotplug.c | 15 ---
 11 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c 
b/arch/powerpc/platforms/powernv/memtrace.c
index 13b369d2cc454..a7475d18c671c 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -224,7 +224,7 @@ static int memtrace_online(void)
ent->mem = 0;
}
 
-   if (add_memory(ent->nid, ent->start, ent->size)) {
+   if (add_memory(ent->nid, ent->start, ent->size, 0)) {
pr_err("Failed to add trace memory to node %d\n",
ent->nid);
ret += 1;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5d545b78111f9..54a888ea7f751 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -606,7 +606,7 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
block_sz = memory_block_size_bytes();
 
/* Add the memory */
-   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz);
+   rc = __add_memory(lmb->nid, lmb->base_addr, block_sz, 0);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index e294f44a78504..d91b3584d4b2b 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -207,7 +207,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
-   result = __add_memory(node, info->start_addr, info->length);
+   result = __add_memory(node, info->start_addr, info->length, 0);
 
/*
 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4db3c660de831..2287bcf86480e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -432,7 +432,7 @@ static ssize_t probe_store(struct device *dev, struct 
device_attribute *attr,
 
nid = memory_add_physaddr_to_nid(phys_addr);
ret = __add_memory(nid, phys_addr,
-  MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+  MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
 
if (ret)
goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7dcb2902e9b1b..8e66b28ef5bc6 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -95,7 +95,7 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 * this as RAM automatically.
 */
rc = add_memory_driver_managed(numa_node, range.start,
-   range_len(), kmem_name);
+   range_len(), kmem_name, 0);
 
res->flags |= IORESOURCE_BUSY;
if (rc) {
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c

Re: [PATCH] powerpc/boot/dts: Fix dtc "pciex" warnings

2020-09-08 Thread Christian Lamparter
Hello,

On Tue, Sep 8, 2020 at 9:12 AM Michael Ellerman  wrote:
> Christian Lamparter  writes:
> > On 2020-06-23 15:03, Michael Ellerman wrote:
> >> With CONFIG_OF_ALL_DTBS=y, as set by eg. allmodconfig, we see lots of
> >> warnings about our dts files, such as:
> >>
> >>arch/powerpc/boot/dts/glacier.dts:492.26-532.5:
> >>Warning (pci_bridge): /plb/pciex@d: node name is not "pci"
> >>or "pcie"
> >>
> >
> >
> > Unfortunately yes. This patch will break uboot code in Meraki MX60(W) / 
> > MX60.
> >
> >  > 
> > https://github.com/riptidewave93/meraki-uboot/blob/mx60w-20180413/board/amcc/bluestone/bluestone.c#L1178
> >
> > | if (!pci_available()) {
> > | fdt_find_and_setprop(blob, "/plb/pciex@d", "status",
> > |   "disabled", sizeof("disabled"), 1);
> > | }
> >
> >
> > Backstory: There are two version of the Meraki MX60. The MX60
> > and the MX60W. The difference is that the MX60W has a populated
> > mini-pcie slot on the PCB for a >W >
> > That said, this is not earth shattering.
>
> I'm happy to revert that hunk if you think any one is actually booting
> mainline on those.

The MX60(W) or APM82181 in general?

The APM82181 devices run really well on the kernel 5.8. The APM82181
had some bitrot and missing pieces. But I started at around 4.4 with
upstreaming various bits and stuff. For proof, I can post a bootlog from
my MyBook Live dev station running my mbl-debian on this weekend:
.

This WD MyBook Live project really only needs
 - vanilla 5.8 (I got rid of the make-kpkg hack by switching to make bindeb-pkg)
 - the MyBookLive DTS.
 - kernel config is based on arch/powerpc/configs/44x/bluestone_defconfig +
   (I enabled dwc2 (USB-Port on the DUO), SATA, ext4(+squashfs),
   gpio-support, leds-support, crypto44x)
 - a powerpc32 userspace (debian's sid still builds up-to-date powerpc packages)

For the MX60(W): We have those two supported in OpenWrt. Currently they
are running a OpenWrt kernel based on stable 5.4 series. The missing "bit"
is upstream support for the AR8327 ethernet switch. I know that the chip
can be supported by qca8k: 



But of course: My future work with the MX60(W) (and AR8327) depends on how
this series goes. I'm testing the waters with the Meraki MR24 AP and the
WD MyBook Live series. Reason being that both devices are well supported.
They are available in great quantities... and all the core functions
are working.

Cheers,
Christian


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen  wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack, it is the generic logic
from gup_fast that passes over pXd values (read once before), and using
pointers to such (stack) variables instead of real pXd pointers.
That, combined with the fact that we just return the passed in pointer in
pXd_offset() for folded levels.

That works similar on x86 IIUC, but with static folding, and thus also
proper pXd_addr_end() results because of statically (and correspondingly)
defined Pxd_INDEX/SHIFT. We always have static 5-level PxD_INDEX/SHIFT, and
that cannot really be made dynamic, so we just make pXd_addr_end()
dynamic instead, and that requires the pXd value to determine the correct
pagetable level.

Still makes my head spin when trying to explain, sorry. It is a very
special s390 oddity, or lets call it "feature", because I don't think any
other architecture has "dynamic pagetable folding" capability, depending
on process requirements, for whatever it is worth...

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long 
> > addr, unsigned long end,
> > do {
> > pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -   next = pmd_addr_end(addr, end);
> > +   next = pmd_addr_end_folded(pmd, addr, end);
> > if (!pmd_present(pmd))
> > return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have some lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) were defined.
But that was moved to arch/s390/include/asm/pgtable.h in this version,
probably because we already had the generalization in mind, where we
would not need such explanation in common header any more.

So, it might help better understand the issue that we have with
dynamic page table folding and READ_ONCE-style pagetable walkers
when looking at that comment.

Thanks for pointing out, that comment should definitely go into
include/linux/pgtable.h again. At least if we would still go for
that "s390 fix first, generalization second" approach, but it
seems we have other / better options now.


Re: [PATCH -next] fork: silence a false postive warning in __mmdrop

2020-09-08 Thread peterz
On Tue, Sep 08, 2020 at 12:50:44PM -0400, Qian Cai wrote:
> > No, you're talking nonsense. We must not free @mm when
> > 'current->active_mm == mm', never.
> 
> Yes, you are right. It still trigger this below on powerpc with today's
> linux-next by fuzzing for a while (saw a few times on recent linux-next before
> as well but so far mostly reproducible on powerpc here). Any idea?

If you can reliably reproduce this, the next thing is to trace mm_count
and figure out where it goes side-ways. I suppose we're looking for an
'extra' decrement.

Mark tried this for a while but gave up because he couldn't reliably
reproduce.


Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:22:39 +0200
Christophe Leroy  wrote:

> 
> 
> Le 07/09/2020 à 22:12, Mike Rapoport a écrit :
> > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> >> This is v2 of an RFC previously discussed here:
> >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/
> >>
> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> >> to common gup_fast code. It will introduce special helper functions
> >> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> >> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> >>
> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> >> themselves by adding an extra pXd value parameter. That was suggested by
> >> Jason during v1 discussion, because he is already thinking of some other
> >> places where he might want to switch to the READ_ONCE logic for pagetable
> >> walks. In general, that would be the cleanest / safest solution, but there
> >> is some impact on other architectures and common code, hence the new and
> >> greatly enlarged recipient list.
> >>
> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> >> functions instead of #defines, so that we get some type checking for the
> >> new pXd value parameter.
> >>
> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> >> still be nice to have in stable, to ease future backports, but I guess
> >> "nice to have" does not really qualify for stable backports.
> > 
> > I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> > way and with this patch 1 is not really required. I would even merge
> > patches 2 and 3 into a single patch and use only it as the fix.
> 
> Why not merging patches 2 and 3, but I would keep patch 1 separate but 
> after the generic changes, so that we first do the generic changes, then 
> we do the specific S390 use of it.

Yes, we thought about that approach too. It would at least allow to
get all into stable, more or less nicely, as prerequisite for the s390
fix.

Two concerns kept us from going that way. For once, it might not be
the nicest way to get it all in stable, and we would not want to risk
further objections due to the imminent and rather scary data corruption
issue that we want to fix asap.

For the same reason, we thought that the generalization part might
need more time and agreement from various people, so that we could at
least get the first patch as short-term solution.

It seems now that the generalization is very well accepted so far,
apart from some apparent issues on arm. Also, merging 2 + 3 and
putting them first seems to be acceptable, so we could do that for
v3, if there are no objections.

Of course, we first need to address the few remaining issues for
arm(32?), which do look quite confusing to me so far. BTW, sorry for
the compile error with patch 3, I guess we did the cross-compile only
for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
patch 3 already proved its usefulness by that :-)


Re: [PATCH V2] ASoC: fsl: imx-es8328: add missing put_device() call in imx_es8328_probe()

2020-09-08 Thread Mark Brown
On Tue, 25 Aug 2020 21:02:24 +0800, Yu Kuai wrote:
> if of_find_device_by_node() succeed, imx_es8328_probe() doesn't have
> a corresponding put_device(). Thus add a jump target to fix the exception
> handling for this function implementation.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl: imx-es8328: add missing put_device() call in imx_es8328_probe()
  commit: e525db7e4b44c5b2b5aac0dad24e23cb58c54d22

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 17:48, Alexander Gordeev a écrit :

On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]


diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
   */
  #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)   \
-({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end); \
-})
+#define pgd_addr_end pgd_addr_end


I think that #define is pointless, usually there is no such #define
for the default case.


Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
unsigned long end)
{
return rste_addr_end_folded(pgd_val(pgd), addr, end);
}


Yes, there in s390 the #define is needed to hit the #ifndef pgd_addr_end 
that's in include/linux/pgtable.h


But in include/linux/pgtable.h, there is no need of an #define 
pgd_addr_end pgd_addr_end I think





+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
unsigned long end)
+{  unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}



Christophe


Re: [PATCH -next] fork: silence a false postive warning in __mmdrop

2020-09-08 Thread Qian Cai
0d4d0da23: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.555848][T191552] Object 545dfae3: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.555965][T191552] Object c686086a: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556093][T191552] Object 76efef7b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556213][T191552] Object 7642cc9f: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556313][T191552] Object a2c7182e: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556431][T191552] Object d8508993: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556570][T191552] Object 07078b31: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556676][T191552] Object 228f: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556769][T191552] Object 96a989ba: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.556904][T191552] Object 078fa309: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557025][T191552] Object b68d0e77: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557158][T191552] Object 144b15b3: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557247][T191552] Object a806800d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557351][T191552] Object 5edb4355: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557484][T191552] Object 49aaca1e: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557600][T191552] Object 0eb0b7f9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[12802.557727][T191552] Object 8fdb29be: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b a5  kkk.
[12802.557831][T191552] Redzone c5a61231: bb bb bb bb bb bb bb bb   
   
[12802.557947][T191552] Padding 3163b13a: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[12802.558076][T191552] Padding 92412b1a: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[12802.558187][T191552] Padding 319fa8cb: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[12802.558314][T191552] Padding 963c7ce8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[12802.558432][T191552] CPU: 71 PID: 191552 Comm: trinity-main Tainted: GB  
O  5.9.0-rc4-next-20200908+ #1
[12802.558551][T191552] Call Trace:
[12802.558590][T191552] [c000201cb3427620] [c0701758] 
dump_stack+0xec/0x144 (unreliable)
[12802.558691][T191552] [c000201cb3427660] [c03cb53c] 
print_trailer+0x278/0x2a0
[12802.558794][T191552] [c000201cb34276f0] [c03c0d14] 
check_bytes_and_report+0x184/0x1b0
[12802.558900][T191552] [c000201cb34277a0] [c03c1000] 
check_object+0x2c0/0x330
[12802.558990][T191552] [c000201cb3427800] [c03c11ec] 
alloc_debug_processing+0x17c/0x1e0
[12802.559096][T191552] [c000201cb3427880] [c03c5468] 
___slab_alloc+0xb78/0xc60
[12802.559190][T191552] [c000201cb3427980] [c03c55f4] 
__slab_alloc+0xa4/0xf0
[12802.559284][T191552] [c000201cb34279d0] [c03c5954] 
kmem_cache_alloc+0x314/0x4a0
[12802.559362][T191552] [c000201cb3427a50] [c00c4818] dup_mm+0x48/0x6d0
[12802.559445][T191552] [c000201cb3427b00] [c00c665c] 
copy_process+0x11bc/0x19a0
[12802.559528][T191552] [c000201cb3427c20] [c00c7210] 
kernel_clone+0x120/0xb80
[12802.559630][T191552] [c000201cb3427d00] [c00c7cf8] 
__do_sys_clone+0x88/0xd0
[12802.559714][T191552] [c000201cb3427dc0] [c002c748] 
system_call_exception+0xf8/0x1d0
[12802.559810][T191552] [c000201cb3427e20] [c000d0a8] 
system_call_common+0xe8/0x218
[12802.559906][T191552] FIX mm_struct: Restoring 
0x0e2a54ec-0x0e2a54ec=0x6b
[12802.559906][T191552] 
[12802.560030][T191552] FIX mm_struct: Marking all objects used


Re: [RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:19:38AM +0200, Christophe Leroy wrote:

[...]

> >diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >index 67ebc22cf83d..d9e7d16c2263 100644
> >--- a/include/linux/pgtable.h
> >+++ b/include/linux/pgtable.h
> >@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   */
> >  #ifndef pgd_addr_end
> >-#define pgd_addr_end(pgd, addr, end)
> >\
> >-({  unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
> >-(__boundary - 1 < (end) - 1)? __boundary: (end);\
> >-})
> >+#define pgd_addr_end pgd_addr_end
> 
> I think that #define is pointless, usually there is no such #define
> for the default case.

Default pgd_addr_end() gets overriden on s390 (arch/s390/include/asm/pgtable.h):

#define pgd_addr_end pgd_addr_end
static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
unsigned long end)
{
return rste_addr_end_folded(pgd_val(pgd), addr, end);
}

> >+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
> >unsigned long end)
> >+{   unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
> >+return (__boundary - 1 < end - 1) ? __boundary : end;
> >+}


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-08 Thread Gerald Schaefer
On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer  wrote:

[...]
> 
> BTW2, a quick test with this change (so far) made the issues on s390
> go away:
> 
> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> spin_unlock(ptl);
> 
>  #ifndef CONFIG_PPC_BOOK3S_64
> -   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, 
> prot);
>  #endif
> 
> spin_lock(>page_table_lock);
> 
> That would more match the "pte_t pointer" usage for hugetlb code,
> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> but I think the root cause is the pte_t pointer.
> 
> Not entirely sure though if that would really be the correct fix.
> I somehow lost whatever little track I had about what these tests
> really want to check, and if that would still be valid with that
> change.

Uh oh, wasn't aware that this (or some predecessor) already went
upstream, and broke our debug kernel today.

I found out now what goes (horribly) wrong on s390, see below for
more details. In short, using hugetlb primitives with ptep pointers
that do _not_ point to a pmd or pud entry will not work on s390.
It also seems to make no sense to verify / test such a thing in general,
as it would also be a severe bug if any kernel code would do that.
After all, with hugepages, there are no pte tables, only pmd etc.
tables.

My change above would fix the issue for s390, but I can still not
completely judge if that would not break other things for your
tests. In general, for normal kernel code, much of what you do would
be very broken, but I guess your tests are doing such "special" things
because they can. E.g. because they operate on some "sandbox" mm
and page tables, and you also do not need properly populated page
tables for some exit / free cleanup, you just throw them away
explicitly with pXd_free at the end. So it might just be "the right
thing" to pass a casted pmd pointer to hugetlb_advanced_tests(),
to simulate and test (proper) usage of the hugetlb primitives.
I also see no other way to make this work for s390, than using a
proper pmd/pud pointer. If not possible, please add us to the
#ifndef.

So, for all those interested, here is what goes wrong on s390.
huge_ptep_get_and_clear() uses the "idte" instruction for the
clearing (and TLB invalidation) part. That instruction expects
a "region or segment table" origin, which is a pmd/pud/p4d/pgd,
but not a pte table. Even worse, when we calculate the table
origin from the given ptep (which *should* not point to a pte),
due to different table sizes for pte / pXd tables, we end up
at some place before the given pte table.

The "idte" instruction also gets the virtual address, and does
corresponding index addition to the given table origin. Depending
on the pmd_index we now end up either within the pte table again,
in which case we see a panic because idte complains about seeing
a pte value. If we are unlucky, then we end up outside the pte
table, and depending on the content of that memory location, idte
might succeed, effectively corrupting that memory.

That explains why we only see the panic sometimes, depending on
random vaddr, other symptoms other times, and probably completely
silent memory corruption for the rest...


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> x86:
> add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
> Function old new   delta
> vmemmap_populate 587 592  +5
> munlock_vma_pages_range  556 561  +5
> Total: Before=15534694, After=15534704, chg +0.00%
...
>  arch/x86/mm/init_64.c| 15 -
>  arch/x86/mm/kasan_init_64.c  | 16 +-

I didn't do a super thorough review on this, but it generally looks OK
and the benefits of sharing more code between arches certainly outweigh
a few bytes of binary growth.  For the x86 bits at least, feel free to
add my ack.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Dave Hansen
On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.

Would it be fair to say that the "fake" page table entries s390
allocates on the stack are what's causing the trouble here?  That might
be a nice thing to open up with here.  "Dynamic page table folding"
really means nothing to me.

> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, 
> unsigned long end,
>   do {
>   pmd_t pmd = READ_ONCE(*pmdp);
>  
> - next = pmd_addr_end(addr, end);
> + next = pmd_addr_end_folded(pmd, addr, end);
>   if (!pmd_present(pmd))
>   return 0;

It looks like you fix this up later, but this would be a problem if left
this way.  There's no documentation for whether I use
pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.



Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
[...]
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

If this one would be okay?

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c 
b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16..3690d22 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned 
long addr,
 static void subpage_prot_clear(unsigned long addr, unsigned long len)
 {
struct mm_struct *mm = current->mm;
+   pmd_t *pmd = pmd_off(mm, addr);
struct subpage_prot_table *spt;
u32 **spm, *spp;
unsigned long i;
@@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned 
long len)
limit = addr + len;
if (limit > spt->maxaddr)
limit = spt->maxaddr;
-   for (; addr < limit; addr = next) {
-   next = pmd_addr_end(addr, limit);
+   for (; addr < limit; addr = next, pmd++) {
+   next = pmd_addr_end(*pmd, addr, limit);
if (addr < 0x1UL) {
spm = spt->low_prot;
} else {
@@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, 
unsigned long addr,
unsigned long, len, u32 __user *, map)
 {
struct mm_struct *mm = current->mm;
+   pmd_t *pmd = pmd_off(mm, addr);
struct subpage_prot_table *spt;
u32 **spm, *spp;
unsigned long i;
@@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, 
unsigned long addr,
}
 
subpage_mark_vma_nohuge(mm, addr, len);
-   for (limit = addr + len; addr < limit; addr = next) {
-   next = pmd_addr_end(addr, limit);
+   for (limit = addr + len; addr < limit; addr = next, pmd++) {
+   next = pmd_addr_end(*pmd, addr, limit);
err = -ENOMEM;
if (addr < 0x1UL) {
spm = spt->low_prot;

Thanks!

> Christophe


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> >Yes, and also two more sources :/
> > arch/powerpc/mm/kasan/8xx.c
> > arch/powerpc/mm/kasan/kasan_init_32.c
> >
> >But these two are not quite obvious wrt pgd_addr_end() used
> >while traversing pmds. Could you please clarify a bit?
> >
> >
> >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> >index 2784224..89c5053 100644
> >--- a/arch/powerpc/mm/kasan/8xx.c
> >+++ b/arch/powerpc/mm/kasan/8xx.c
> >@@ -15,8 +15,8 @@
> > for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
> > += SZ_8M) {
> > pte_basic_t *new;
> >-k_next = pgd_addr_end(k_cur, k_end);
> >-k_next = pgd_addr_end(k_next, k_end);
> >+k_next = pmd_addr_end(k_cur, k_end);
> >+k_next = pmd_addr_end(k_next, k_end);
> 
> No, I don't think so.
> On powerpc32 we have only two levels, so pgd and pmd are more or
> less the same.
> But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> is a no-op, so I don't think it will work.
> 
> It is likely that this function should iterate on pgd, then you get
> pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

It looks like the code iterates over single pmd table while using
pgd_addr_end() only to skip all the middle levels and bail out
from the loop.

I would be wary for switching from pmds to pgds, since we are
trying to minimize impact (especially functional) and the
rework does not seem that obvious.

Assuming pmd and pgd are the same would actually such approach
work for now?

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..94466cc 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
+= SZ_8M) {
pte_basic_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
-   k_next = pgd_addr_end(k_next, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, 
k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..c0bcd64 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned long k_
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
pte_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
+   k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
do {
-   next = pgd_addr_end(addr, end);
+   next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end);
pmd_populate_kernel(_mm, pmd, kasan_early_shadow_pte);
} while (pmd++, addr = next, addr != end);
 

Alternatively we could pass invalid pgd to keep the code structure
intact, but that of course is less nice.

Thanks!

> Christophe


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy  wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev 
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded 
> >>> pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough 
> >> I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in 
> >> the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you 
> >> should switch patch 1 and patch 2 to avoid the step through 
> >> pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go 
> > forward
> > with this patch ASAP and then handle the other patches on top with all the 
> > time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Jason Gunthorpe
On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev 
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightfoward, only
__munlock_pagevec_fill() had any real increased complexity.

Thanks,
Jason


Re: [PATCH] powerpc/64: Make VDSO32 track COMPAT on 64-bit

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 14:58, Michael Ellerman a écrit :

When we added the VDSO32 kconfig symbol, which controls building of
the 32-bit VDSO, we made it depend on CPU_BIG_ENDIAN (for 64-bit).

That was because back then COMPAT was always enabled for 64-bit, so
depending on it would have left the 32-bit VDSO always enabled, which
we didn't want.

But since then we have made COMPAT selectable, and off by default for
ppc64le, so VDSO32 should really depend on that.

For most people this makes no difference, none of the defconfigs
change, it's only if someone is building ppc64le with COMPAT=y, they
will now also get VDSO32. If they've enabled COMPAT in order to run
32-bit binaries they presumably also want the 32-bit VDSO.

Signed-off-by: Michael Ellerman 



Reviewed-by: Christophe Leroy 

Michael, please note that christophe.le...@c-s.fr is a deprecated 
address that will one day not work anymore. Please use the new one 
whenever possible.


Christophe



---
  arch/powerpc/platforms/Kconfig.cputype | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 87737ec86d39..a80ad0ef436e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -490,13 +490,12 @@ endmenu
  
  config VDSO32

def_bool y
-   depends on PPC32 || CPU_BIG_ENDIAN
+   depends on PPC32 || COMPAT
help
  This symbol controls whether we build the 32-bit VDSO. We obviously
  want to do that if we're building a 32-bit kernel. If we're building
- a 64-bit kernel then we only want a 32-bit VDSO if we're building for
- big endian. That is because the only little endian configuration we
- support is ppc64le which is 64-bit only.
+ a 64-bit kernel then we only want a 32-bit VDSO if we're also enabling
+ COMPAT.
  
  choice

prompt "Endianness selection"



[PATCH] powerpc/64: Make VDSO32 track COMPAT on 64-bit

2020-09-08 Thread Michael Ellerman
When we added the VDSO32 kconfig symbol, which controls building of
the 32-bit VDSO, we made it depend on CPU_BIG_ENDIAN (for 64-bit).

That was because back then COMPAT was always enabled for 64-bit, so
depending on it would have left the 32-bit VDSO always enabled, which
we didn't want.

But since then we have made COMPAT selectable, and off by default for
ppc64le, so VDSO32 should really depend on that.

For most people this makes no difference, none of the defconfigs
change, it's only if someone is building ppc64le with COMPAT=y, they
will now also get VDSO32. If they've enabled COMPAT in order to run
32-bit binaries they presumably also want the 32-bit VDSO.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/platforms/Kconfig.cputype | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 87737ec86d39..a80ad0ef436e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -490,13 +490,12 @@ endmenu
 
 config VDSO32
def_bool y
-   depends on PPC32 || CPU_BIG_ENDIAN
+   depends on PPC32 || COMPAT
help
  This symbol controls whether we build the 32-bit VDSO. We obviously
  want to do that if we're building a 32-bit kernel. If we're building
- a 64-bit kernel then we only want a 32-bit VDSO if we're building for
- big endian. That is because the only little endian configuration we
- support is ppc64le which is 64-bit only.
+ a 64-bit kernel then we only want a 32-bit VDSO if we're also enabling
+ COMPAT.
 
 choice
prompt "Endianness selection"
-- 
2.25.1



Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :



On 08.09.20 07:06, Christophe Leroy wrote:



Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :

From: Alexander Gordeev 

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.



[...]



Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.


Not sure pXd_addr_end_folded() is the best understandable name, allthough I 
don't have any alternative suggestion at the moment.
Maybe could be something like pXd_addr_end_fixup() as it will disappear in the 
next patch, or pXd_addr_end_gup() ?

Also, if it happens to be acceptable to get patch 2 in stable, I think you 
should switch patch 1 and patch 2 to avoid the step through 
pXd_addr_end_folded()


given that this fixes a data corruption issue, wouldnt it be the best to go 
forward
with this patch ASAP and then handle the other patches on top with all the time 
that
we need?


I have no strong opinion on this, but I feel rather tricky to have to 
change generic part of GUP to use a new fonction then revert that change 
in the following patch, just because you want the first patch in stable 
and not the second one.


Regardless, I was wondering, why do we need a reference to the pXd at 
all when calling pXd_addr_end() ?


Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
passed addr ?


Christophe


Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

2020-09-08 Thread Christoph Hellwig
On Tue, Sep 08, 2020 at 10:06:56PM +1000, Alexey Kardashevskiy wrote:
> On 08/09/2020 15:44, Christoph Hellwig wrote:
>> On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
>>> What is dma_get_required_mask() for anyway? What "requires" what here?
>>
>> Yes, it is a really odd API.  It comes from classic old PCI where
>> 64-bit addressing required an additional bus cycle, and various devices
>> had different addressing schemes, with the smaller addresses beeing
>> more efficient.  So this allows the driver to request the "required"
>> addressing mode to address all memory.  "preferred" might be a better
>> name as we'll bounce buffer if it isn't met.  I also don't really see
>> why a driver would ever want to use it for a modern PCIe device.
>
>
> a-ha, this makes more sense, thanks. Then I guess we need to revert that 
> one bit from yours f1565c24b596, do not we?

Why?  The was the original intent of the API, but now we also use
internally to check the addressing capabilities.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Christian Borntraeger



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
>> From: Alexander Gordeev 
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the best understandable name, allthough I 
> don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in 
> the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you 
> should switch patch 1 and patch 2 to avoid the step through 
> pXd_addr_end_folded()

given that this fixes a data corruption issue, wouldnt it be the best to go 
forward
with this patch ASAP and then handle the other patches on top with all the time 
that
we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast 
>> code")
>> Cc:  # 5.2+
>> Reviewed-by: Gerald Schaefer 
>> Signed-off-by: Alexander Gordeev 
>> Signed-off-by: Gerald Schaefer 
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +
>>   include/linux/pgtable.h | 16 +
>>   mm/gup.c    |  8 +++
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h 
>> b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() 
>> functions
>> + * will not return corresponding dynamic boundaries. This is no problem as 
>> long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and 
>> the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values 
>> instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a 
>> pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the 
>> correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine 
>> the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, 
>> unsigned long addr, unsigned long end)
> 
> What does 'rste' stands for ?
> 
> Isn't this line a bit long ?

this is region/segment table entry according to the architecture. 
On our platform we do have the pagetables with a different format that
next levels (segment table -> 1MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4TB granularity, region 1st table -> 8 PB
granularity. ST,R3,R2,R1 have the same format and are thus often called
crste (combined region and segment table entry).


Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

2020-09-08 Thread Alexey Kardashevskiy




On 08/09/2020 15:44, Christoph Hellwig wrote:

On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:

What is dma_get_required_mask() for anyway? What "requires" what here?


Yes, it is a really odd API.  It comes from classic old PCI where
64-bit addressing required an additional bus cycle, and various devices
had different addressing schemes, with the smaller addresses beeing
more efficient.  So this allows the driver to request the "required"
addressing mode to address all memory.  "preferred" might be a better
name as we'll bounce buffer if it isn't met.  I also don't really see
why a driver would ever want to use it for a modern PCIe device.



a-ha, this makes more sense, thanks. Then I guess we need to revert that 
one bit from yours f1565c24b596, do not we?



--
Alexey


Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

2020-09-08 Thread Cédric Le Goater
On 9/8/20 3:51 AM, Alexey Kardashevskiy wrote:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
> 
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
> 
> This fixes DMA for aacraid, both with and without "iommu=nobypass"
> in the kernel command line. Verified with "stress-ng -d 4".
> 
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
> Signed-off-by: Alexey Kardashevskiy 


The boston system looks solid with this patch.

Tested-by: Cédric Le Goater 

Thanks a lot ! 

C. 


> ---
> 
> The original code came Jun 24 2011:
> 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and 
> ops")
> 
> 
> What is dma_get_required_mask() for anyway? What "requires" what here?
> 
> Even though it works for now (due to huge - >4GB - default DMA window),
> I am still not convinced we do not want this chunk here
> (this is what f1565c24b596 removed):
> 
> if (dev_is_pci(dev)) {
> u64 bypass_mask = dma_direct_get_required_mask(dev);
> 
> if (dma_iommu_bypass_supported(dev, bypass_mask))
> return bypass_mask;
> }
> ---
>  arch/powerpc/kernel/dma-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 569fecd7b5b2..9053fc9d20c7 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -120,7 +120,8 @@ u64 dma_iommu_get_required_mask(struct device *dev)
>   if (!tbl)
>   return 0;
>  
> - mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
> + mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
> + tbl->it_page_shift - 1);
>   mask += mask - 1;
>  
>   return mask;
> 



Re: [PATCH v1 5/5] powerpc/fault: Perform exception fixup in do_page_fault()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Exception fixup doesn't require the heady full regs saving,
  heavy

> do it from do_page_fault() directly.
> 
> For that, split bad_page_fault() in two parts.
> 
> As bad_page_fault() can also be called from other places than
> handle_page_fault(), it will still perform exception fixup and
> fallback on __bad_page_fault().
> 
> handle_page_fault() directly calls __bad_page_fault() as the
> exception fixup will now be done by do_page_fault()

Looks good. We can probably get rid of bad_page_fault completely after 
this too.

Hmm, the alignment exception might(?) hit user copies if the user points 
it to CI memory. Then you could race and the memory gets unmapped. In 
that case the exception table check might be better to be explicit there
with comments.

The first call in do_hash_fault is not required (copy user will never
be in nmi context). The second one and the one in slb_fault could be
made explicit too. Anyway for now this is fine.

Thanks,
Nick

Reviewed-by: Nicholas Piggin 


> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/entry_32.S   |  2 +-
>  arch/powerpc/kernel/exceptions-64e.S |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S |  2 +-
>  arch/powerpc/mm/fault.c  | 33 
>  4 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..c198786591f9 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -678,7 +678,7 @@ handle_page_fault:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   lwz r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   ret_from_except_full
>  
>  #ifdef CONFIG_PPC_BOOK3S_32
> diff --git a/arch/powerpc/kernel/exceptions-64e.S 
> b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..dd9161ea5da8 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1024,7 +1024,7 @@ storage_fault_common:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   ld  r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   ret_from_except
>  
>  /*
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index f7d748b88705..2cb3bcfb896d 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -3254,7 +3254,7 @@ handle_page_fault:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   ld  r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   interrupt_return
>  
>  /* We have a data breakpoint exception - handle it */
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index edde169ba3a6..bd6e397eb84a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -542,10 +542,20 @@ NOKPROBE_SYMBOL(__do_page_fault);
>  int do_page_fault(struct pt_regs *regs, unsigned long address,
> unsigned long error_code)
>  {
> + const struct exception_table_entry *entry;
>   enum ctx_state prev_state = exception_enter();
>   int rc = __do_page_fault(regs, address, error_code);
>   exception_exit(prev_state);
> - return rc;
> + if (likely(!rc))
> + return 0;
> +
> + entry = search_exception_tables(regs->nip);
> + if (unlikely(!entry))
> + return rc;
> +
> + instruction_pointer_set(regs, extable_fixup(entry));
> +
> + return 0;
>  }
>  NOKPROBE_SYMBOL(do_page_fault);
>  
> @@ -554,17 +564,10 @@ NOKPROBE_SYMBOL(do_page_fault);
>   * It is called from the DSI and ISI handlers in head.S and from some
>   * of the procedures in traps.c.
>   */
> -void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
> +void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
>  {
> - const struct exception_table_entry *entry;
>   int is_write = page_fault_is_write(regs->dsisr);
>  
> - /* Are we prepared to handle this fault?  */
> - if ((entry = search_exception_tables(regs->nip)) != NULL) {
> - regs->nip = extable_fixup(entry);
> - return;
> - }
> -
>   /* kernel has accessed a bad area */
>  
>   switch (TRAP(regs)) {
> @@ -598,3 +601,15 @@ void bad_page_fault(struct pt_regs *regs, unsigned long 
> address, int sig)
>  
>   die("Kernel access of bad area", regs, sig);
>  }
> +
> +void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
> +{
> + const struct exception_table_entry *entry;
> +
> + /* Are we prepared to handle this fault?  */
> + entry = search_exception_tables(instruction_pointer(regs));
> + if (entry)
> + instruction_pointer_set(regs, extable_fixup(entry));
> + 

Re: [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 10:43, Nicholas Piggin a écrit :

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:

The verification and message introduced by commit 374f3f5979f9
("powerpc/mm/hash: Handle user access of kernel address gracefully")
applies to all platforms, it should not be limited to BOOK3S.

Make the BOOK3S version of sanity_check_fault() the one for all,
and bail out earlier if not BOOK3S.

Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address 
gracefully")
Signed-off-by: Christophe Leroy 
---
  arch/powerpc/mm/fault.c | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 925a7231abb3..2efa34d7e644 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
  static inline void cmo_account_page_fault(void) { }
  #endif /* CONFIG_PPC_SMLPAR */
  
-#ifdef CONFIG_PPC_BOOK3S

  static void sanity_check_fault(bool is_write, bool is_user,
   unsigned long error_code, unsigned long address)
  {
@@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool is_user,
return;
}
  
+	if (!IS_ENABLED(CONFIG_PPC_BOOK3S))

+   return;


Seems okay. Why is address == -1 special though? I guess it's because
it may not be an exploit kernel reference but a buggy pointer underflow?
In that case -1 doesn't seem like it would catch very much. Would it be
better to test for high bit set for example ((long)address < 0) ?


See 
https://github.com/linuxppc/linux/commit/0f9aee0cb9da7db7d96f63cfa2dc5e4f1bffeb87#diff-f9658f412252f3bb3093e0a95b37f3ac


-1 is what mmap() returns on error, if the app uses that as a pointer 
that's a programming error not an exploit.


Euh .. If you test (long)address < 0, then the entire kernel falls into 
that range as usually it goes from 0xc000 to 0x


But we could skip the top page entirely, anyway it is never mapped.



Anyway for your patch

Reviewed-by: Nicholas Piggin 


Thanks
Christophe



Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> search_exception_tables() is an heavy operation, we have to avoid it.
> When KUAP is selected, we'll know the fault has been blocked by KUAP.
> Otherwise, it behaves just as if the address was already in the TLBs
> and no fault was generated.
> 
> Signed-off-by: Christophe Leroy 

Sorry I missed reviewing this. Yes, we discussed this and decided
that it's not effective I think (and KUAP solves it properly).

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/mm/fault.c | 20 +---
>  1 file changed, 5 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 525e0c2b5406..edde169ba3a6 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>   if (address >= TASK_SIZE)
>   return true;
>  
> - if (!is_exec && (error_code & DSISR_PROTFAULT) &&
> - !search_exception_tables(regs->nip)) {
> + // Read/write fault blocked by KUAP is bad, it can never succeed.
> + if (bad_kuap_fault(regs, address, is_write)) {
>   pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
> - exploit attempt? (uid: %d)\n",
> - address,
> - from_kuid(_user_ns, current_uid()));
> - }
> -
> - // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
> bad
> - if (!search_exception_tables(regs->nip))
> - return true;
> -
> - // Read/write fault in a valid region (the exception table search passed
> - // above), but blocked by KUAP is bad, it can never succeed.
> - if (bad_kuap_fault(regs, address, is_write))
> + address, from_kuid(_user_ns, 
> current_uid()));
>   return true;
> + }
>  
> - // What's left? Kernel fault on user in well defined regions (extable
> - // matched), and allowed by KUAP in the faulting context.
> + // What's left? Kernel fault on user and allowed by KUAP in the 
> faulting context.
>   return false;
>  }
>  
> -- 
> 2.25.0
> 
> 


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 10:29, Christophe Leroy a écrit :



Le 08/09/2020 à 09:48, Nicholas Piggin a écrit :

Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:

On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:


Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :

Make interrupt handlers all just take the pt_regs * argument and load
DAR/DSISR etc from that. Make those that return a value return long.


I like this, it will likely simplify a bit the VMAP_STACK mess.

Not sure it is that easy. My board is stuck after the start of init.


On the 8xx, on Instruction TLB Error exception, we do

andis.    r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 
bits */


On book3s/32, on ISI exception we do:
andis.    r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 
bits */


On 40x and bookE, on ISI exception we do:
li    r5,0    /* Pass zero as arg3 */


And regs->dsisr will just contain nothing

So it means we should at least write back r5 into regs->dsisr from 
there

? The performance impact should be minimal as we already write _DAR so
the cache line should already be in the cache.

A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick,
allthough we don't want to do it for both ISI and DSI at the end, so
you'll have to do it in every head_xxx.S


To get you series build and work, I did the following hacks:


Great, thanks for this.


diff --git a/arch/powerpc/include/asm/interrupt.h
b/arch/powerpc/include/asm/interrupt.h
index acfcc7d5779b..c11045d3113a 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
pt_regs *regs, struct inter
  {
  nmi_exit();
+#ifdef CONFIG_PPC64
  this_cpu_set_ftrace_enabled(state->ftrace_enabled);
+#endif


This seems okay, not a hack.


  #ifdef CONFIG_PPC_BOOK3S_64
  /* Check we didn't change the pending interrupt mask. */
diff --git a/arch/powerpc/kernel/entry_32.S
b/arch/powerpc/kernel/entry_32.S
index f4d0af8e1136..66f7adbe1076 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -663,6 +663,7 @@ ppc_swapcontext:
   */
  .globl    handle_page_fault
  handle_page_fault:
+    stw    r5,_DSISR(r1)
  addi    r3,r1,STACK_FRAME_OVERHEAD
  #ifdef CONFIG_PPC_BOOK3S_32
  andis.  r0,r5,DSISR_DABRMATCH@h


Is this what you want to do for 32, or do you want to seperate
ISI and DSI sides?



No I think we want to separate ISI and DSI sides.

And I think the specific filtering done in ISI could be done in 
do_page_fault() in C. Ok, it would make a special handling for is_exec 
but there are already several places where the behaviour differs based 
on is_exec.
The only thing we need to keep at ASM level is the DABR stuff for 
calling do_break() in handle_page_fault(), because it is used to decide 
whether full regs are saved or not. But it could be a test done earlier 
in the prolog and the result being kept in one of the CR bits.


Looking at it once more, I'm wondering whether we really need a full regs.

Before commit 
https://github.com/linuxppc/linux/commit/d300627c6a53693fb01479b59b0cdd293761b1fa#diff-f9658f412252f3bb3093e0a95b37f3ac 
do_break() was called from do_page_fault() without a full regs set.


Christophe


Re: [PATCH v1 3/5] powerpc/fault: Reorder tests in bad_kernel_fault()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Check address earlier to simplify the following test.

Good logic reduction.

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 9ef9ee244f72..525e0c2b5406 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -210,17 +210,17 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>   return true;
>   }
>  
> - if (!is_exec && address < TASK_SIZE && (error_code & DSISR_PROTFAULT) &&
> + // Kernel fault on kernel address is bad
> + if (address >= TASK_SIZE)
> + return true;
> +
> + if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>   !search_exception_tables(regs->nip)) {
>   pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
> - exploit attempt? (uid: %d)\n",
>   address,
>   from_kuid(_user_ns, current_uid()));
>   }
>  
> - // Kernel fault on kernel address is bad
> - if (address >= TASK_SIZE)
> - return true;
> -
>   // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
> bad
>   if (!search_exception_tables(regs->nip))
>   return true;
> -- 
> 2.25.0
> 
> 


Re: [PATCH v1 2/5] powerpc/fault: Unnest definition of page_fault_is_write() and page_fault_is_bad()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> To make it more readable, separate page_fault_is_write() and 
> page_fault_is_bad()
> to avoir several levels of #ifdefs

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 2efa34d7e644..9ef9ee244f72 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -363,17 +363,19 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>   */
>  #if (defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
>  #define page_fault_is_write(__err)   ((__err) & ESR_DST)
> -#define page_fault_is_bad(__err) (0)
>  #else
>  #define page_fault_is_write(__err)   ((__err) & DSISR_ISSTORE)
> -#if defined(CONFIG_PPC_8xx)
> +#endif
> +
> +#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
> +#define page_fault_is_bad(__err) (0)
> +#elif defined(CONFIG_PPC_8xx)
>  #define page_fault_is_bad(__err) ((__err) & DSISR_NOEXEC_OR_G)
>  #elif defined(CONFIG_PPC64)
>  #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_64S)
>  #else
>  #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_32S)
>  #endif
> -#endif
>  
>  /*
>   * For 600- and 800-family processors, the error_code parameter is DSISR
> -- 
> 2.25.0
> 
> 


Re: [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> The verification and message introduced by commit 374f3f5979f9
> ("powerpc/mm/hash: Handle user access of kernel address gracefully")
> applies to all platforms, it should not be limited to BOOK3S.
> 
> Make the BOOK3S version of sanity_check_fault() the one for all,
> and bail out earlier if not BOOK3S.
> 
> Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address 
> gracefully")
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 925a7231abb3..2efa34d7e644 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
>  static inline void cmo_account_page_fault(void) { }
>  #endif /* CONFIG_PPC_SMLPAR */
>  
> -#ifdef CONFIG_PPC_BOOK3S
>  static void sanity_check_fault(bool is_write, bool is_user,
>  unsigned long error_code, unsigned long address)
>  {
> @@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>   return;
>   }
>  
> + if (!IS_ENABLED(CONFIG_PPC_BOOK3S))
> + return;

Seems okay. Why is address == -1 special though? I guess it's because 
it may not be an exploit kernel reference but a buggy pointer underflow? 
In that case -1 doesn't seem like it would catch very much. Would it be 
better to test for high bit set for example ((long)address < 0) ?

Anyway for your patch

Reviewed-by: Nicholas Piggin 

> +
>   /*
>* For hash translation mode, we should never get a
>* PROTFAULT. Any update to pte to reduce access will result in us
> @@ -354,10 +356,6 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>  
>   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
>  }
> -#else
> -static void sanity_check_fault(bool is_write, bool is_user,
> -unsigned long error_code, unsigned long address) 
> { }
> -#endif /* CONFIG_PPC_BOOK3S */
>  
>  /*
>   * Define the correct "is_write" bit in error_code based
> -- 
> 2.25.0
> 
> 


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 09:48, Nicholas Piggin a écrit :

Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:

On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:


Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :

Make interrupt handlers all just take the pt_regs * argument and load
DAR/DSISR etc from that. Make those that return a value return long.


I like this, it will likely simplify a bit the VMAP_STACK mess.

Not sure it is that easy. My board is stuck after the start of init.


On the 8xx, on Instruction TLB Error exception, we do

andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */

On book3s/32, on ISI exception we do:
andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */

On 40x and bookE, on ISI exception we do:
li  r5,0/* Pass zero as arg3 */


And regs->dsisr will just contain nothing

So it means we should at least write back r5 into regs->dsisr from there
? The performance impact should be minimal as we already write _DAR so
the cache line should already be in the cache.

A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick,
allthough we don't want to do it for both ISI and DSI at the end, so
you'll have to do it in every head_xxx.S


To get you series build and work, I did the following hacks:


Great, thanks for this.


diff --git a/arch/powerpc/include/asm/interrupt.h
b/arch/powerpc/include/asm/interrupt.h
index acfcc7d5779b..c11045d3113a 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
pt_regs *regs, struct inter
  {
nmi_exit();
  
+#ifdef CONFIG_PPC64

this_cpu_set_ftrace_enabled(state->ftrace_enabled);
+#endif


This seems okay, not a hack.


  #ifdef CONFIG_PPC_BOOK3S_64
/* Check we didn't change the pending interrupt mask. */
diff --git a/arch/powerpc/kernel/entry_32.S
b/arch/powerpc/kernel/entry_32.S
index f4d0af8e1136..66f7adbe1076 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -663,6 +663,7 @@ ppc_swapcontext:
   */
.globl  handle_page_fault
  handle_page_fault:
+   stw r5,_DSISR(r1)
addir3,r1,STACK_FRAME_OVERHEAD
  #ifdef CONFIG_PPC_BOOK3S_32
andis.  r0,r5,DSISR_DABRMATCH@h


Is this what you want to do for 32, or do you want to seperate
ISI and DSI sides?



No I think we want to separate ISI and DSI sides.

And I think the specific filtering done in ISI could be done in 
do_page_fault() in C. Ok, it would make a special handling for is_exec 
but there are already several places where the behaviour differs based 
on is_exec.
The only thing we need to keep at ASM level is the DABR stuff for 
calling do_break() in handle_page_fault(), because it is used to decide 
whether full regs are saved or not. But it could be a test done earlier 
in the prolog and the result being kept in one of the CR bits.


That way we can avoid reloading dsisr and dar from the stack, especially 
for VMAP stack as they are saved quite early in the prolog.


Or we can take them out of the thread struct and save them in the stack 
a little latter in the prolog, but then we have to keep the RI bit a bit 
cleared longer.


While we are playing with do_page_fault(), did you have a look at the 
series I sent 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.le...@csgroup.eu/


Christophe


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Christophe Leroy




Le 08/09/2020 à 09:46, Alexander Gordeev a écrit :

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:

You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.


Yes, and also two more sources :/
arch/powerpc/mm/kasan/8xx.c
arch/powerpc/mm/kasan/kasan_init_32.c

But these two are not quite obvious wrt pgd_addr_end() used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
+= SZ_8M) {
pte_basic_t *new;
  
-		k_next = pgd_addr_end(k_cur, k_end);

-   k_next = pgd_addr_end(k_next, k_end);
+   k_next = pmd_addr_end(k_cur, k_end);
+   k_next = pmd_addr_end(k_next, k_end);


No, I don't think so.
On powerpc32 we have only two levels, so pgd and pmd are more or less 
the same.
But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h is 
a no-op, so I don't think it will work.


It is likely that this function should iterate on pgd, then you get pmd 
= pmd_offset(pud_offset(p4d_offset(pgd)));



if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
  
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c

index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned long k_
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
pte_t *new;
  
-		k_next = pgd_addr_end(k_cur, k_end);

+   k_next = pmd_addr_end(k_cur, k_end);


Same here I get, iterate on pgd then get pmd = 
pmd_offset(pud_offset(p4d_offset(pgd)));



if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
  
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)

kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
  
  	do {

-   next = pgd_addr_end(addr, end);
+   next = pmd_addr_end(addr, end);
pmd_populate_kernel(_mm, pmd, kasan_early_shadow_pte);
} while (pmd++, addr = next, addr != end);
  



Christophe


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:
> On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:
>> 
>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> > Make interrupt handlers all just take the pt_regs * argument and load
>> > DAR/DSISR etc from that. Make those that return a value return long.
>> 
>> I like this, it will likely simplify a bit the VMAP_STACK mess.
>> 
>> Not sure it is that easy. My board is stuck after the start of init.
>> 
>> 
>> On the 8xx, on Instruction TLB Error exception, we do
>> 
>>  andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On book3s/32, on ISI exception we do:
>>  andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On 40x and bookE, on ISI exception we do:
>>  li  r5,0/* Pass zero as arg3 */
>> 
>> 
>> And regs->dsisr will just contain nothing
>> 
>> So it means we should at least write back r5 into regs->dsisr from there 
>> ? The performance impact should be minimal as we already write _DAR so 
>> the cache line should already be in the cache.
>> 
>> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick, 
>> allthough we don't want to do it for both ISI and DSI at the end, so 
>> you'll have to do it in every head_xxx.S
> 
> To get you series build and work, I did the following hacks:

Great, thanks for this.

> diff --git a/arch/powerpc/include/asm/interrupt.h
> b/arch/powerpc/include/asm/interrupt.h
> index acfcc7d5779b..c11045d3113a 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
> pt_regs *regs, struct inter
>  {
>   nmi_exit();
>  
> +#ifdef CONFIG_PPC64
>   this_cpu_set_ftrace_enabled(state->ftrace_enabled);
> +#endif

This seems okay, not a hack.

>  #ifdef CONFIG_PPC_BOOK3S_64
>   /* Check we didn't change the pending interrupt mask. */
> diff --git a/arch/powerpc/kernel/entry_32.S
> b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..66f7adbe1076 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -663,6 +663,7 @@ ppc_swapcontext:
>   */
>   .globl  handle_page_fault
>  handle_page_fault:
> + stw r5,_DSISR(r1)
>   addir3,r1,STACK_FRAME_OVERHEAD
>  #ifdef CONFIG_PPC_BOOK3S_32
>   andis.  r0,r5,DSISR_DABRMATCH@h

Is this what you want to do for 32, or do you want to seperate
ISI and DSI sides?

Thanks,
Nick


Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-08 Thread Alexander Gordeev
On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Yes, and also two more sources :/
arch/powerpc/mm/kasan/8xx.c
arch/powerpc/mm/kasan/kasan_init_32.c

But these two are not quite obvious wrt pgd_addr_end() used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block 
+= SZ_8M) {
pte_basic_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
-   k_next = pgd_addr_end(k_next, k_end);
+   k_next = pmd_addr_end(k_cur, k_end);
+   k_next = pmd_addr_end(k_next, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned long k_
for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
pte_t *new;
 
-   k_next = pgd_addr_end(k_cur, k_end);
+   k_next = pmd_addr_end(k_cur, k_end);
if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
do {
-   next = pgd_addr_end(addr, end);
+   next = pmd_addr_end(addr, end);
pmd_populate_kernel(_mm, pmd, kasan_early_shadow_pte);
} while (pmd++, addr = next, addr != end);
 

> Christophe


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 7, 2020 7:20 pm:
> 
> 
> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> Make interrupt handlers all just take the pt_regs * argument and load
>> DAR/DSISR etc from that. Make those that return a value return long.
> 
> I like this, it will likely simplify a bit the VMAP_STACK mess.
> 
> Not sure it is that easy. My board is stuck after the start of init.
> 
> 
> On the 8xx, on Instruction TLB Error exception, we do
> 
>   andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
> 
> On book3s/32, on ISI exception we do:
>   andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
> 
> On 40x and bookE, on ISI exception we do:
>   li  r5,0/* Pass zero as arg3 */
> 
> 
> And regs->dsisr will just contain nothing
> 
> So it means we should at least write back r5 into regs->dsisr from there 
> ? The performance impact should be minimal as we already write _DAR so 
> the cache line should already be in the cache.

Yes, I think that would be required. Sorry I didn't look closely at
32 bit.

> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick, 
> allthough we don't want to do it for both ISI and DSI at the end, so 
> you'll have to do it in every head_xxx.S
> 
> 
> While you are at it, it would probably also make sense to do remove the 
> address param of bad_page_fault(), there is no point in loading back 
> regs->dar in handle_page_fault() and machine_check_8xx() and 
> alignment_exception(), just read regs->dar in bad_page_fault()
> 
> The case of do_break() should also be looked at.

Yeah that's valid, I didn't do that because bad_page_fault was also
being called from asm, but an incremental patch should be quite easy.

> Why changing return code from int to long ?

Oh it's to make the next patch work without any changes to function
prototypes. Some handlers are returning int, others long. There is
no reason not to just return long AFAIKS so that's what I changed to.

Thanks,
Nick


Re: [PATCH] selftests/powerpc: Skip PROT_SAO test in guests/LPARS

2020-09-08 Thread Michael Ellerman
Sachin Sant  writes:
>> On 01-Sep-2020, at 6:16 PM, Michael Ellerman  wrote:
>> 
>> In commit 9b725a90a8f1 ("powerpc/64s: Disallow PROT_SAO in LPARs by
>> default") PROT_SAO was disabled in guests/LPARs by default. So skip
>> the test if we are running in a guest to avoid a spurious failure.
>> 
>> Signed-off-by: Michael Ellerman 
>> —
>
> Tested-by: Sachin Sant 
>
> With the fix test is skipped while running in a guest
>
> # ./prot_sao 
> test: prot-sao
> tags: git_version:unknown
> [SKIP] Test skipped on line 25
> skip: prot-sao
> #

Thanks. Sorry I missed adding your Tested-by tag.

cheers


Re: [PATCH] powerpc/boot/dts: Fix dtc "pciex" warnings

2020-09-08 Thread Michael Ellerman
Christian Lamparter  writes:
> On 2020-06-23 15:03, Michael Ellerman wrote:
>> With CONFIG_OF_ALL_DTBS=y, as set by eg. allmodconfig, we see lots of
>> warnings about our dts files, such as:
>>
>>arch/powerpc/boot/dts/glacier.dts:492.26-532.5:
>>Warning (pci_bridge): /plb/pciex@d: node name is not "pci"
>>or "pcie"
>>
>> The node name should not particularly matter, it's just a name, and
>> AFAICS there's no kernel code that cares whether nodes are *named*
>> "pciex" or "pcie". So shutup these warnings by converting to the name
>> dtc wants.
>>
>> As always there's some risk this could break something obscure that
>> does rely on the name, in which case we can revert.
>
> Hmm, I noticed this when I was looking up why nobody commented
> on my series of adding more devices to the APM82181/bluestone series:
>
> 
> (I'll post a v3 "soonish".)
>
>
> Unfortunately yes. This patch will break uboot code in Meraki MX60(W) / MX60.
>
>  > 
> https://github.com/riptidewave93/meraki-uboot/blob/mx60w-20180413/board/amcc/bluestone/bluestone.c#L1178
>
> | if (!pci_available()) {
> |     fdt_find_and_setprop(blob, "/plb/pciex@d", "status",
> |   "disabled", sizeof("disabled"), 1);
> | }
>
>
> Backstory: There are two version of the Meraki MX60. The MX60
> and the MX60W. The difference is that the MX60W has a populated
> mini-pcie slot on the PCB for a >W
> That said, this is not earth shattering.

I'm happy to revert that hunk if you think any one is actually booting
mainline on those.

cheers

> (In theory, this can also cause problems for the bluestone and canyonlands
> dev boards that have the option to be configured as either dual sata or
> pcie+sata But this is probably not a problem for customer boards)
>
> OT: Please note that the plb, opb and ebc node paths (/plb/opb/ebc) are
> hardcoded too :(. Amending the proper unit-addresses will lead to no-longer
> working DTBs as the "ranges" are missing.
>
> Cheers,
> Christian
>> Signed-off-by: Michael Ellerman 
>> ---
>>
>> diff --git a/arch/powerpc/boot/dts/bluestone.dts 
>> b/arch/powerpc/boot/dts/bluestone.dts
>> index cc965a1816b6..aa1ae94cd776 100644
>> --- a/arch/powerpc/boot/dts/bluestone.dts
>> +++ b/arch/powerpc/boot/dts/bluestone.dts
>> @@ -325,7 +325,7 @@ EMAC0: ethernet@ef600c00 {
>>  };
>>  };
>>   
>> -PCIE0: pciex@d {
>> +PCIE0: pcie@d {
>>  device_type = "pci";
>>  #interrupt-cells = <1>;
>>  #size-cells = <2>;


Re: [PATCH v2] kbuild: preprocess module linker script

2020-09-08 Thread Geert Uytterhoeven
On Tue, Sep 8, 2020 at 6:29 AM Masahiro Yamada  wrote:
> There was a request to preprocess the module linker script like we
> do for the vmlinux one. (https://lkml.org/lkml/2020/8/21/512)
>
> The difference between vmlinux.lds and module.lds is that the latter
> is needed for external module builds, thus must be cleaned up by
> 'make mrproper' instead of 'make clean'. Also, it must be created
> by 'make modules_prepare'.
>
> You cannot put it in arch/$(SRCARCH)/kernel/, which is cleaned up by
> 'make clean'. I moved arch/$(SRCARCH)/kernel/module.lds to
> arch/$(SRCARCH)/include/asm/module.lds.h, which is included from
> scripts/module.lds.S.
>
> scripts/module.lds is fine because 'make clean' keeps all the
> build artifacts under scripts/.
>
> You can add arch-specific sections in .
>
> Signed-off-by: Masahiro Yamada 
> Tested-by: Jessica Yu 
> Acked-by: Will Deacon 

>  arch/m68k/Makefile |  1 -
>  .../{kernel/module.lds => include/asm/module.lds.h}|  0

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] mm: check for memory's node later during boot

2020-09-08 Thread Laurent Dufour

Le 03/09/2020 à 23:35, Andrew Morton a écrit :

On Wed,  2 Sep 2020 11:09:11 +0200 Laurent Dufour  wrote:


register_mem_sect_under_nodem() is checking the memory block's node id only
if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
registered while the system state is "SYSTEM_SCHEDULING", the one before
SYSTEM_RUNNING.

The consequence on PowerPC guest with interleaved memory node's ranges is
that some memory block could be assigned to multiple nodes on sysfs. This
lately prevents some memory hot-plug and hot-unplug to succeed because
links are remaining. Such a panic is then displayed:

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul 
binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP:  c0403f34 LR: c0403f2c CTR: 
REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
MSR:  8282b033   CR: 24000448  XER: 
2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680  0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280  
GPR16:   0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24:  c11f6cc0  
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 
(unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

To prevent this multiple links, make the node checking done for states
prior to SYSTEM_RUNNING.


Did you consider adding a cc:stable to this fix?


I should have, but now I've to review the fix based on David's comment.



Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

2020-09-08 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
>
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
>
> This fixes DMA for aacraid, both with and without "iommu=nobypass"
> in the kernel command line. Verified with "stress-ng -d 4".
>
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")

I think it'd be better to point the Fixes tag at 6a5c7be5e484, which
originally introduced the bug, even if we didn't notice it until
f1565c24b596 exposed it (or made it more likely).

cheers

> Signed-off-by: Alexey Kardashevskiy 
> ---
>
> The original code came Jun 24 2011:
> 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and 
> ops")
>
>
> What is dma_get_required_mask() for anyway? What "requires" what here?
>
> Even though it works for now (due to huge - >4GB - default DMA window),
> I am still not convinced we do not want this chunk here
> (this is what f1565c24b596 removed):
>
> if (dev_is_pci(dev)) {
> u64 bypass_mask = dma_direct_get_required_mask(dev);
>
> if (dma_iommu_bypass_supported(dev, bypass_mask))
> return bypass_mask;
> }
> ---
>  arch/powerpc/kernel/dma-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 569fecd7b5b2..9053fc9d20c7 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -120,7 +120,8 @@ u64 dma_iommu_get_required_mask(struct device *dev)
>   if (!tbl)
>   return 0;
>  
> - mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
> + mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
> + tbl->it_page_shift - 1);
>   mask += mask - 1;
>  
>   return mask;
> -- 
> 2.17.1