Re: [PATCH] crypto/testmgr: fix skcipher test with CONFIG_VMAP_STACK

2018-12-10 Thread Christophe Leroy




Le 11/12/2018 à 06:07, Herbert Xu a écrit :

On Fri, Dec 07, 2018 at 09:26:15PM +0100, Ard Biesheuvel wrote:

On Fri, 7 Dec 2018 at 18:33, Christophe Leroy  wrote:


[2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 
dma_nommu_map_page+0x44/0xd4
[2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: GW 
4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
[2.384740] NIP:  c000c540 LR: c000c584 CTR: 
[2.389743] REGS: c95abab0 TRAP: 0700   Tainted: GW  
(4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[2.400042] MSR:  00029032   CR: 24042204  XER: 
[2.406669]
[2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 
0001 0001
[2.406669] GPR08:  2000 0010 0010 24042202  
0100 c95abd88
[2.406669] GPR16:  c05569d4 0001 0010 c95abc88 c0615664 
0004 
[2.406669] GPR24: 0010 c95abc88 c95abc88  c61ae210 c7ff6d40 
c61ae210 3d68
[2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
[2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
[2.451762] Call Trace:
[2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
[2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
[2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
[2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
[2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
[2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
[2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
[2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
[2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
[2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
[2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[2.515532] Instruction dump:
[2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 
7c84e850
[2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 
7c84fa14
[2.533960] ---[ end trace bf78d94af73fe3b8 ]---
[2.539123] talitos ff02.crypto: master data transfer error
[2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
[2.551625] alg: skcipher: encryption failed on test 1 for ecb-aes-talitos: 
ret=22

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
cannot be DMA mapped anymore.

This patch allocates it with kmalloc()



This looks like a driver bug to me. Other accelerators in
drivers/crypto all take a copy of the IV before mapping it for DMA, so
it would be better if talitos did the same.


Yes please fix the driver.  In general, if a paramter is coming in
as a pointer, you must assume that it may lie on the stack.


Yes, just sent a patch for the talitos driver for it.

But note that the IV is received via areq->info and not via a standalone 
pointer.


Christophe


[PATCH] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK

2018-12-10 Thread Christophe Leroy
[2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 
dma_nommu_map_page+0x44/0xd4
[2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: GW 
4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
[2.384740] NIP:  c000c540 LR: c000c584 CTR: 
[2.389743] REGS: c95abab0 TRAP: 0700   Tainted: GW  
(4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[2.400042] MSR:  00029032   CR: 24042204  XER: 
[2.406669]
[2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 
0001 0001
[2.406669] GPR08:  2000 0010 0010 24042202  
0100 c95abd88
[2.406669] GPR16:  c05569d4 0001 0010 c95abc88 c0615664 
0004 
[2.406669] GPR24: 0010 c95abc88 c95abc88  c61ae210 c7ff6d40 
c61ae210 3d68
[2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
[2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
[2.451762] Call Trace:
[2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
[2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
[2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
[2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
[2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
[2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
[2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
[2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
[2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
[2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
[2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[2.515532] Instruction dump:
[2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 
7c84e850
[2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 
7c84fa14
[2.533960] ---[ end trace bf78d94af73fe3b8 ]---
[2.539123] talitos ff02.crypto: master data transfer error
[2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
[2.551625] alg: skcipher: encryption failed on test 1 for ecb-aes-talitos: 
ret=22

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
cannot be DMA mapped anymore.

This patch copies the IV from areq->info to the context ctx->iv.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 drivers/crypto/talitos.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 6988012deca4..385ec970b639 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1668,8 +1668,11 @@ static struct talitos_edesc 
*ablkcipher_edesc_alloc(struct ablkcipher_request *
struct talitos_ctx *ctx = crypto_ablkcipher_ctx(cipher);
unsigned int ivsize = crypto_ablkcipher_ivsize(cipher);
 
+   if (ivsize)
+   memcpy(ctx->iv, areq->info, ivsize);
+
return talitos_edesc_alloc(ctx->dev, areq->src, areq->dst,
-  areq->info, 0, areq->nbytes, 0, ivsize, 0,
+  ctx->iv, 0, areq->nbytes, 0, ivsize, 0,
   areq->base.flags, encrypt);
 }
 
-- 
2.13.3



Re: [PATCH] powerpc: Fix bogus usage of MSR_RI on BookE and 40x

2018-12-10 Thread kbuild test robot
Hi Benjamin,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.20-rc6 next-20181210]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Herrenschmidt/powerpc-Fix-bogus-usage-of-MSR_RI-on-BookE-and-40x/20181211-110017
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ppa8548_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/sysdev/fsl_rio.c: In function 'fsl_rio_mcheck_exception':
>> arch/powerpc/sysdev/fsl_rio.c:115:17: error: 'MSR_RI' undeclared (first use 
>> in this function); did you mean 'MSR_BE'?
   regs->msr |= MSR_RI;
^~
MSR_BE
   arch/powerpc/sysdev/fsl_rio.c:115:17: note: each undeclared identifier is 
reported only once for each function it appears in

vim +115 arch/powerpc/sysdev/fsl_rio.c

a52c8f52 Alexandre Bounine 2010-05-26   96  
ff33f182 Li Yang   2010-06-18   97  #ifdef CONFIG_E500
cce1f106 Shaohui Xie   2010-11-18   98  int fsl_rio_mcheck_exception(struct 
pt_regs *regs)
a52c8f52 Alexandre Bounine 2010-05-26   99  {
82a9a480 Scott Wood2011-06-16  100  const struct 
exception_table_entry *entry;
82a9a480 Scott Wood2011-06-16  101  unsigned long reason;
82a9a480 Scott Wood2011-06-16  102  
82a9a480 Scott Wood2011-06-16  103  if (!rio_regs_win)
82a9a480 Scott Wood2011-06-16  104  return 0;
a52c8f52 Alexandre Bounine 2010-05-26  105  
a52c8f52 Alexandre Bounine 2010-05-26  106  reason = in_be32((u32 
*)(rio_regs_win + RIO_LTLEDCSR));
a52c8f52 Alexandre Bounine 2010-05-26  107  if (reason & (RIO_LTLEDCSR_IER 
| RIO_LTLEDCSR_PRT)) {
a52c8f52 Alexandre Bounine 2010-05-26  108  /* Check if we are 
prepared to handle this fault */
a52c8f52 Alexandre Bounine 2010-05-26  109  entry = 
search_exception_tables(regs->nip);
a52c8f52 Alexandre Bounine 2010-05-26  110  if (entry) {
a52c8f52 Alexandre Bounine 2010-05-26  111  pr_debug("RIO: 
%s - MC Exception handled\n",
a52c8f52 Alexandre Bounine 2010-05-26  112   
__func__);
a52c8f52 Alexandre Bounine 2010-05-26  113  out_be32((u32 
*)(rio_regs_win + RIO_LTLEDCSR),
a52c8f52 Alexandre Bounine 2010-05-26  114   0);
a52c8f52 Alexandre Bounine 2010-05-26 @115  regs->msr |= 
MSR_RI;
61a92f70 Nicholas Piggin   2016-10-14  116  regs->nip = 
extable_fixup(entry);
a52c8f52 Alexandre Bounine 2010-05-26  117  return 1;
a52c8f52 Alexandre Bounine 2010-05-26  118  }
a52c8f52 Alexandre Bounine 2010-05-26  119  }
a52c8f52 Alexandre Bounine 2010-05-26  120  
cce1f106 Shaohui Xie   2010-11-18  121  return 0;
a52c8f52 Alexandre Bounine 2010-05-26  122  }
cce1f106 Shaohui Xie   2010-11-18  123  
EXPORT_SYMBOL_GPL(fsl_rio_mcheck_exception);
ff33f182 Li Yang   2010-06-18  124  #endif
a52c8f52 Alexandre Bounine 2010-05-26  125  

:: The code at line 115 was first introduced by commit
:: a52c8f521fed43bce53451d7dfddf2b42a2af689 rapidio, powerpc/85xx: Add MChk 
handler for SRIO port

:: TO: Alexandre Bounine 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] crypto/testmgr: fix skcipher test with CONFIG_VMAP_STACK

2018-12-10 Thread Herbert Xu
On Fri, Dec 07, 2018 at 09:26:15PM +0100, Ard Biesheuvel wrote:
> On Fri, 7 Dec 2018 at 18:33, Christophe Leroy  wrote:
> >
> > [2.364486] WARNING: CPU: 0 PID: 60 at 
> > ./arch/powerpc/include/asm/io.h:837 dma_nommu_map_page+0x44/0xd4
> > [2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: GW  
> >4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
> > [2.384740] NIP:  c000c540 LR: c000c584 CTR: 
> > [2.389743] REGS: c95abab0 TRAP: 0700   Tainted: GW  
> > (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
> > [2.400042] MSR:  00029032   CR: 24042204  XER: 
> > [2.406669]
> > [2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 
> > 0001 0001
> > [2.406669] GPR08:  2000 0010 0010 24042202  
> > 0100 c95abd88
> > [2.406669] GPR16:  c05569d4 0001 0010 c95abc88 c0615664 
> > 0004 
> > [2.406669] GPR24: 0010 c95abc88 c95abc88  c61ae210 c7ff6d40 
> > c61ae210 3d68
> > [2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
> > [2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
> > [2.451762] Call Trace:
> > [2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
> > [2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
> > [2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
> > [2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
> > [2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
> > [2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
> > [2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
> > [2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
> > [2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
> > [2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
> > [2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
> > [2.515532] Instruction dump:
> > [2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 
> > 3d20c076 7c84e850
> > [2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 
> > 54847022 7c84fa14
> > [2.533960] ---[ end trace bf78d94af73fe3b8 ]---
> > [2.539123] talitos ff02.crypto: master data transfer error
> > [2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
> > [2.551625] alg: skcipher: encryption failed on test 1 for 
> > ecb-aes-talitos: ret=22
> >
> > IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
> > cannot be DMA mapped anymore.
> >
> > This patch allocates it with kmalloc()
> >
> 
> This looks like a driver bug to me. Other accelerators in
> drivers/crypto all take a copy of the IV before mapping it for DMA, so
> it would be better if talitos did the same.

Yes please fix the driver.  In general, if a paramter is coming in
as a pointer, you must assume that it may lie on the stack.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH v2 11/11] powerpc/book3s32: Implement Kernel Userspace Access Protection

2018-12-10 Thread Russell Currey
On Wed, 2018-11-28 at 09:27 +, Christophe Leroy wrote:
> This patch implements Kernel Userspace Access Protection for
> book3s/32.
> 
> Due to limitations of the processor page protection capabilities,
> the protection is only against writing. read protection cannot be
> achieved using page protection.
> 
> In order to provide the protection, Ku and Ks keys are modified in
> Userspace Segment registers, and different PP bits are used to:
> 
> PP01 provides RW for Key 0 and RO for Key 1
> PP10 provides RW for all
> PP11 provides RO for all
> 
> Today PP10 is used for RW pages and PP11 for RO pages. This patch
> modifies page protection to PP01 for RW pages.
> 
> Then segment registers are set to Ku 0 and Ks 1. When kernel needs
> to write to RW pages, the associated segment register is changed to
> Ks 0 in order to allow write access to the kernel.
> 
> In order to avoid having the read all segment registers when
> locking/unlocking the access, some data is kept in the thread_struct
> and saved on stack on exceptions. The field identifies both the
> first unlocked segment and the first segment following the last
> unlocked one. When no segment is unlocked, it contains value 0.
> 
> Signed-off-by: Christophe Leroy 

Hey Christophe, I tried to test this and got a machine check after the
kernel starts init.

Vector: 700 (Program Check) at [ef0b5e70]
pc: 0ca4
lr: b7e1a030
sp: ef0b5f30
   msr: 81002
  current = 0xef0b8000
pid   = 1, comm = init

Testing with mac99 model in qemu.

- Russell



Re: [PATCH v3] kbuild: Add support for DT binding schema checks

2018-12-10 Thread Masahiro Yamada
Hi Rob,

On Tue, Dec 11, 2018 at 5:50 AM Rob Herring  wrote:
>
> This adds the build infrastructure for checking DT binding schema
> documents and validating dts files using the binding schema.
>
> Check DT binding schema documents:
> make dt_binding_check
>
> Build dts files and check using DT binding schema:
> make dtbs_check
>
> Optionally, DT_SCHEMA_FILES can passed in with a schema file(s) to use


Perhaps, "can be passed" ?



> for validation. This makes it easier to find and fix errors generated by
> a specific schema.
>
> Currently, the validation targets are separate from a normal build to
> avoid a hard dependency on the external DT schema project and because
> there are lots of warnings generated.
>
> Cc: Jonathan Corbet 
> Cc: Mark Rutland 
> Cc: Masahiro Yamada 
> Cc: Michal Marek 
> Cc: linux-...@vger.kernel.org
> Cc: devicet...@vger.kernel.org
> Cc: linux-kbu...@vger.kernel.org
> Signed-off-by: Rob Herring 
> ---
> v3:
> - Fix error causing only 1st schema file to get used.
> - Add a more useful error message when dtc is missing YAML support
> telling the user they need to install libyaml devel package.
>
>
>  .gitignore   |  1 +
>  Documentation/Makefile   |  2 +-
>  Documentation/devicetree/bindings/.gitignore |  1 +
>  Documentation/devicetree/bindings/Makefile   | 33 +
>  Makefile | 11 --
>  scripts/Makefile.lib | 37 ++--
>  6 files changed, 80 insertions(+), 5 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/.gitignore
>  create mode 100644 Documentation/devicetree/bindings/Makefile
>
> diff --git a/.gitignore b/.gitignore
> index 97ba6b79834c..a20ac26aa2f5 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -15,6 +15,7 @@
>  *.bin
>  *.bz2
>  *.c.[012]*.*
> +*.dt.yaml
>  *.dtb
>  *.dtb.S
>  *.dwo
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 2ca77ad0f238..9786957c6a35 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -2,7 +2,7 @@
>  # Makefile for Sphinx documentation
>  #
>
> -subdir-y :=
> +subdir-y := devicetree/bindings/
>
>  # You can set these variables from the command line.
>  SPHINXBUILD   = sphinx-build
> diff --git a/Documentation/devicetree/bindings/.gitignore 
> b/Documentation/devicetree/bindings/.gitignore
> new file mode 100644
> index ..d9194c02dd08
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/.gitignore
> @@ -0,0 +1 @@
> +*.example.dts
> diff --git a/Documentation/devicetree/bindings/Makefile 
> b/Documentation/devicetree/bindings/Makefile
> new file mode 100644
> index ..43f8657ab070
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/Makefile
> @@ -0,0 +1,33 @@
> +# SPDX-License-Identifier: GPL-2.0
> +DT_DOC_CHECKER ?= dt-doc-validate
> +DT_EXTRACT_EX ?= dt-extract-example
> +DT_MK_SCHEMA ?= dt-mk-schema
> +DT_MK_SCHEMA_FLAGS := $(if $(DT_SCHEMA_FILES), -u)
> +
> +quiet_cmd_chk_binding = CHKDT   $<


In convention, the short log displays the target name
instead of the prerequisite.

If O=foo/bar is given, $< shows the full path,
then log will look like follows:


  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/calxeda.yaml
  DTC Documentation/devicetree/bindings/arm/calxeda.example.dtb
  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/qcom.yaml
  DTC Documentation/devicetree/bindings/arm/qcom.example.dtb
  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/xilinx.yaml
  DTC Documentation/devicetree/bindings/arm/xilinx.example.dtb
  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/ti/nspire.yaml
  DTC Documentation/devicetree/bindings/arm/ti/nspire.example.dtb
  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml
  DTC Documentation/devicetree/bindings/arm/ti/ti,davinci.example.dtb
  CHKDT   
/home/masahiro/ref/linux-devicetree/Documentation/devicetree/bindings/arm/altera.yaml

You might think it is ugly.




> +  cmd_chk_binding = (set -e; \
> + $(DT_DOC_CHECKER) $< ; \
> + mkdir -p $(dir $@) ; \
> + $(DT_EXTRACT_EX) $< > $@ )


- 'set -e' is redundant because if_changed already has it.

- 'mkdir mkdir -p $(dir $@)' is also redundant because
   scripts/Makefile.build automatically creates it.

-  You do not need to execute this in a sub-shell



You can simplify like this:

   cmd_chk_binding = $(DT_DOC_CHECKER) $< ; \
 $(DT_EXTRACT_EX) $< > $@



> +$(obj)/%.example.dts: $(src)/%.yaml FORCE
> +   $(call if_changed,chk_binding)
> +
> +DT_TMP_SCHEMA := .schema.yaml.tmp


BTW, why does this file start with a period?
What is the meaning of '.tmp' extension?



> +extra-y += $(DT_TMP_SCHEMA)
> +
> 

[PATCH] net/ibmvnic: Remove tests of member address

2018-12-10 Thread Wen Yang
The driver was checking for non-NULL address.
- adapter->napi[i]

This is pointless as these will be always non-NULL, since the
'dapter->napi' is allocated in init_napi().
It is safe to get rid of useless checks for addresses to fix the
coccinelle warning:
>>drivers/net/ethernet/ibm/ibmvnic.c: test of a variable/field address
Since such statements always return true, they are redundant.

Signed-off-by: Wen Yang 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Thomas Falcon 
CC: John Allen 
CC: "David S. Miller" 
CC: linuxppc-dev@lists.ozlabs.org
CC: net...@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
 drivers/net/ethernet/ibm/ibmvnic.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index ed50b8dee44f..14d00985f087 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -773,11 +773,8 @@ static void release_napi(struct ibmvnic_adapter *adapter)
return;
 
for (i = 0; i < adapter->num_active_rx_napi; i++) {
-   if (>napi[i]) {
-   netdev_dbg(adapter->netdev,
-  "Releasing napi[%d]\n", i);
-   netif_napi_del(>napi[i]);
-   }
+   netdev_dbg(adapter->netdev, "Releasing napi[%d]\n", i);
+   netif_napi_del(>napi[i]);
}
 
kfree(adapter->napi);
-- 
2.19.1



Re: Is it worth to fix the crashkernel reserved memory blocks the hotplug issue?

2018-12-10 Thread Baoquan He
On 12/10/18 at 12:08pm, Pingfan Liu wrote:
> Hi,
> I found in powerpc code, it is doable to reserve memory region in
> movable zone, such as crashkernel does. But in x86 code, it checks the
> hotpluggable attribute of memory, hence if manually specifying a
> region in hotpluggable region, it will fail.

Yes, it is a problem. In x86, for crashkernel=xx@yy case to specify base
address of crashkernel reservation, it will find the region firstly, if
found region's base is not equal to specified base 'yy', means that
region has been occupied. The reservation will fail. And memblock
finding will only iterate unhotpluggable area, this will avoid reserving
crashkernel memory on hotpluggable region. 

In my opinion, it's worth fixing.

Thanks
Baoquan


> The x86 code:
> /* 0 means: find the address automatically */
> if (crash_base <= 0) {
> /*
> * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
> * as old kexec-tools loads bzImage below that, unless
> * "crashkernel=size[KMG],high" is specified.
> */
> crash_base = memblock_find_in_range(CRASH_ALIGN,
>high ? CRASH_ADDR_HIGH_MAX
> : CRASH_ADDR_LOW_MAX,
>crash_size, CRASH_ALIGN);
> if (!crash_base) {
> pr_info("crashkernel reservation failed - No suitable area found.\n");
> return;
> }
> 
> } else {
> unsigned long long start;
> 
> start = memblock_find_in_range(crash_base,  --> this func will check
> the hotpluggable attribute of memory and return failure if the
> specifying region intersects with it.
>   crash_base + crash_size,
>   crash_size, 1 << 20);
> if (start != crash_base) {
> pr_info("crashkernel reservation failed - memory is in use.\n");
> return;
> }
> }
> 
> Thanks,
> Pingfan


Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2018-12-10 Thread David Gibson
On Thu, Dec 06, 2018 at 08:45:09AM +0200, Leon Romanovsky wrote:
> On Thu, Dec 06, 2018 at 03:19:51PM +1100, David Gibson wrote:
> > Mellanox ConnectX-5 IB cards (MT27800) seem to cause a call trace when
> > unbound from their regular driver and attached to vfio-pci in order to pass
> > them through to a guest.
> >
> > This goes away if the disable_idle_d3 option is used, so it looks like a
> > problem with the hardware handling D3 state.  To fix that more permanently,
> > use a device quirk to disable D3 state for these devices.
> >
> > We do this by renaming the existing quirk_no_ata_d3() more generally and
> > attaching it to the ConnectX-[45] devices (0x15b3:0x1013).
> >
> > Signed-off-by: David Gibson 
> > ---
> >  drivers/pci/quirks.c | 17 +++--
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> >
> 
> Hi David,
> 
> Thank for your patch,
> 
> I would like to reproduce the calltrace before moving forward,
> but have trouble to reproduce the original issue.
> 
> I'm working with vfio-pci and CX-4/5 cards on daily basis,
> tried manually enter into D3 state now, and it worked for me.
> 
> Can you please post your full calltrace, and "lspci -s PCI_ID -vv"
> output?

Sorry, I may have jumped the gun on this.  Using disable_idle_d3 seems
to do _something_ for these cards, but there are some other things
going wrong which are confusing the issue.  This is on POWER, which
might affect the situation.  I'll get back to you once I have some
more information.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PATCH] powerpc/44x/bamboo: Fix PCI range

2018-12-10 Thread Benjamin Herrenschmidt
>From 88509506b80b4960004146280eb740be64513a0b Mon Sep 17 00:00:00 2001
The bamboo dts has a bug: it uses a non-naturally aligned range
for PCI memory space. This isnt' supported by the code, thus
causing PCI to break on this system.

This is due to the fact that while the chip memory map has 1G
reserved for PCI memory, it's only 512M aligned. The code doesn't
know how to split that into 2 different PMMs and fails, so limit
the region to 512M.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/boot/dts/bamboo.dts | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/bamboo.dts b/arch/powerpc/boot/dts/bamboo.dts
index 538e42b..b5861fa 100644
--- a/arch/powerpc/boot/dts/bamboo.dts
+++ b/arch/powerpc/boot/dts/bamboo.dts
@@ -268,8 +268,10 @@
/* Outbound ranges, one memory and one IO,
 * later cannot be changed. Chip supports a second
 * IO range but we don't use it for now
+* The chip also supports a larger memory range but
+* it's not naturally aligned, so our code will break
 */
-   ranges = <0x0200 0x 0xa000 0x 
0xa000 0x 0x4000
+   ranges = <0x0200 0x 0xa000 0x 
0xa000 0x 0x2000
  0x0200 0x 0x 0x 
0xe000 0x 0x0010
  0x0100 0x 0x 0x 
0xe800 0x 0x0001>;
 




[PATCH] powerpc: Fix bogus usage of MSR_RI on BookE and 40x

2018-12-10 Thread Benjamin Herrenschmidt
BookE and 40x processors lack the MSR:RI bit. However, we have a
few common code places that rely on it.

This fixes it by not defining MSR_RI on those processor types and
using the appropriate ifdef's in those locations.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/include/asm/reg.h   | 2 ++
 arch/powerpc/include/asm/reg_booke.h | 4 ++--
 arch/powerpc/kernel/process.c| 2 +-
 arch/powerpc/kernel/traps.c  | 8 ++--
 arch/powerpc/lib/sstep.c | 2 ++
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index de52c31..41d0b2e 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -110,7 +110,9 @@
 #ifndef MSR_PMM
 #define MSR_PMM__MASK(MSR_PMM_LG)  /* Performance monitor 
*/
 #endif
+#if !defined(CONFIG_BOOKE) && !defined(CONFIG_4xx)
 #define MSR_RI __MASK(MSR_RI_LG)   /* Recoverable Exception */
+#endif
 #define MSR_LE __MASK(MSR_LE_LG)   /* Little Endian */
 
 #define MSR_TM __MASK(MSR_TM_LG)   /* Transactional Mem Available 
*/
diff --git a/arch/powerpc/include/asm/reg_booke.h 
b/arch/powerpc/include/asm/reg_booke.h
index eb2a33d..06967f1 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -46,10 +46,10 @@
 #define MSR_USER32 (MSR_ | MSR_PR | MSR_EE)
 #define MSR_USER64 (MSR_USER32 | MSR_64BIT)
 #elif defined (CONFIG_40x)
-#define MSR_KERNEL (MSR_ME|MSR_RI|MSR_IR|MSR_DR|MSR_CE)
+#define MSR_KERNEL (MSR_ME|MSR_IR|MSR_DR|MSR_CE)
 #define MSR_USER   (MSR_KERNEL|MSR_PR|MSR_EE)
 #else
-#define MSR_KERNEL (MSR_ME|MSR_RI|MSR_CE)
+#define MSR_KERNEL (MSR_ME|MSR_CE)
 #define MSR_USER   (MSR_KERNEL|MSR_PR|MSR_EE)
 #endif
 
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 96f3473..77679a7 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1359,7 +1359,7 @@ static struct regbit msr_bits[] = {
{MSR_IR,"IR"},
{MSR_DR,"DR"},
{MSR_PMM,   "PMM"},
-#ifndef CONFIG_BOOKE
+#if !defined(CONFIG_BOOKE) && !defined(CONFIG_40x)
{MSR_RI,"RI"},
{MSR_LE,"LE"},
 #endif
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 9a86572..2d00696 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -429,10 +429,11 @@ void system_reset_exception(struct pt_regs *regs)
if (get_paca()->in_nmi > 1)
nmi_panic(regs, "Unrecoverable nested System Reset");
 #endif
+#ifdef MSR_RI
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
nmi_panic(regs, "Unrecoverable System Reset");
-
+#endif
if (!nested)
nmi_exit();
 
@@ -478,7 +479,9 @@ static inline int check_io_access(struct pt_regs *regs)
printk(KERN_DEBUG "%s bad port %lx at %p\n",
   (*nip & 0x100)? "OUT to": "IN from",
   regs->gpr[rb] - _IO_BASE, nip);
+#ifdef MSR_RI
regs->msr |= MSR_RI;
+#endif
regs->nip = extable_fixup(entry);
return 1;
}
@@ -763,10 +766,11 @@ void machine_check_exception(struct pt_regs *regs)
if (check_io_access(regs))
goto bail;
 
+#ifdef MSR_RI
/* Must die if the interrupt is not recoverable */
if (!(regs->msr & MSR_RI))
nmi_panic(regs, "Unrecoverable Machine check");
-
+#endif
if (!nested)
nmi_exit();
 
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index d81568f..c03c453 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3059,9 +3059,11 @@ int emulate_step(struct pt_regs *regs, unsigned int 
instr)
 
case MTMSR:
val = regs->gpr[op.reg];
+#ifdef MSR_RI
if ((val & MSR_RI) == 0)
/* can't step mtmsr[d] that would clear MSR_RI */
return -1;
+#endif
/* here op.val is the mask of bits to change */
regs->msr = (regs->msr & ~op.val) | (val & op.val);
goto instr_done;




Re: [PATCH net 0/2] net/ibmvnic: Fix reset work item locking bugs

2018-12-10 Thread David Miller
From: Thomas Falcon 
Date: Mon, 10 Dec 2018 15:22:21 -0600

> This patch set fixes issues with scheduling reset work items in
> a tasklet context. Since ibmvnic_reset can called in an interrupt,
> it should not use a mutex or allocate memory non-atomically.

Series applied, thanks.


Re: [PATCH kernel v4 19/19] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-10 Thread Alex Williamson
On Tue, 11 Dec 2018 11:57:20 +1100
Alexey Kardashevskiy  wrote:

> On 11/12/2018 11:08, Alex Williamson wrote:
> > On Fri, 23 Nov 2018 16:53:04 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> >> pluggable PCIe devices but still have PCIe links which are used
> >> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> >> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> >> have a special unit on a die called an NPU which is an NVLink2 host bus
> >> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> >> These systems also support ATS (address translation services) which is
> >> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> >> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> >> cache-coherent access to a GPU RAM.
> >>
> >> This exports GPU RAM to the userspace as a new VFIO device region. This
> >> preregisters the new memory as device memory as it might be used for DMA.
> >> This inserts pfns from the fault handler as the GPU memory is not onlined
> >> until the vendor driver is loaded and trained the NVLinks so doing this
> >> earlier causes low level errors which we fence in the firmware so
> >> it does not hurt the host system but still better be avoided.
> >>
> >> This exports an ATSD (Address Translation Shootdown) register of NPU which
> >> allows TLB invalidations inside GPU for an operating system. The register
> >> conveniently occupies a single 64k page. It is also presented to
> >> the userspace as a new VFIO device region.
> >>
> >> In order to provide the userspace with the information about GPU-to-NVLink
> >> connections, this exports an additional capability called "tgt"
> >> (which is an abbreviated host system bus address). The "tgt" property
> >> tells the GPU its own system address and allows the guest driver to
> >> conglomerate the routing information so each GPU knows how to get directly
> >> to the other GPUs.
> >>
> >> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> >> know LPID (a logical partition ID or a KVM guest hardware ID in other
> >> words) and PID (a memory context ID of a userspace process, not to be
> >> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> >> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> >> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> >>
> >> This requires coherent memory and ATSD to be available on the host as
> >> the GPU vendor only supports configurations with both features enabled
> >> and other configurations are known not to work. Because of this and
> >> because of the ways the features are advertised to the host system
> >> (which is a device tree with very platform specific properties),
> >> this requires enabled POWERNV platform.
> >>
> >> The V100 GPUs do not advertise none of these capabilities via the config  
> > 
> > s/none/any/
> >   
> >> space and there are more than just one device ID so this relies on
> >> the platform to tell whether these GPUs have special abilities such as
> >> NVLinks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v4:
> >> * added nvlink-speed to the NPU bridge capability as this turned out to
> >> be not a constant value
> >> * instead of looking at the exact device ID (which also changes from system
> >> to system), now this (indirectly) looks at the device tree to know
> >> if GPU and NPU support NVLink
> >>
> >> v3:
> >> * reworded the commit log about tgt
> >> * added tracepoints (do we want them enabled for entire vfio-pci?)
> >> * added code comments
> >> * added write|mmap flags to the new regions
> >> * auto enabled VFIO_PCI_NVLINK2 config option
> >> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> >> references; there are required by the NVIDIA driver
> >> * keep notifier registered only for short time
> >> ---
> >>  drivers/vfio/pci/Makefile   |   1 +
> >>  drivers/vfio/pci/trace.h| 102 +++
> >>  drivers/vfio/pci/vfio_pci_private.h |   2 +
> >>  include/uapi/linux/vfio.h   |  27 ++
> >>  drivers/vfio/pci/vfio_pci.c |  37 ++-
> >>  drivers/vfio/pci/vfio_pci_nvlink2.c | 448 
> >>  drivers/vfio/pci/Kconfig|   6 +
> >>  7 files changed, 621 insertions(+), 2 deletions(-)
> >>  create mode 100644 drivers/vfio/pci/trace.h
> >>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> >>
> >> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> >> index 76d8ec0..9662c06 100644
> >> --- a/drivers/vfio/pci/Makefile
> >> +++ b/drivers/vfio/pci/Makefile
> >> @@ -1,5 +1,6 @@
> >>  
> >>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o 
> >> vfio_pci_config.o
> >>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> >> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
> >>  
> >>  

Re: [PATCH] ocxl: Simplify free_spa()

2018-12-10 Thread Andrew Donnellan

On 11/12/18 2:15 am, Greg Kurz wrote:

The only users of free_spa() are alloc_link() and free_link(), and
in both cases:

- link->spa != NULL

- free_spa(link) is immediatly followed by kfree(link)

The check isn't needed, and it doesn't bring much to clear the link->spa
pointer. Drop both.

Signed-off-by: Greg Kurz 


I like defensive programming but for this case I don't really care too 
much either way


Acked-by: Andrew Donnellan 


---
  drivers/misc/ocxl/link.c |7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 31695a078485..eed92055184d 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -352,11 +352,8 @@ static void free_spa(struct link *link)
pr_debug("Freeing SPA for %x:%x:%x\n", link->domain, link->bus,
link->dev);
  
-	if (spa && spa->spa_mem) {

-   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
-   kfree(spa);
-   link->spa = NULL;
-   }
+   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
+   kfree(spa);
  }
  
  static int alloc_link(struct pci_dev *dev, int PE_mask, struct link **out_link)




--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH kernel v4 19/19] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-10 Thread Alexey Kardashevskiy



On 11/12/2018 11:08, Alex Williamson wrote:
> On Fri, 23 Nov 2018 16:53:04 +1100
> Alexey Kardashevskiy  wrote:
> 
>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
>> pluggable PCIe devices but still have PCIe links which are used
>> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
>> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
>> have a special unit on a die called an NPU which is an NVLink2 host bus
>> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
>> These systems also support ATS (address translation services) which is
>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
>> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
>> cache-coherent access to a GPU RAM.
>>
>> This exports GPU RAM to the userspace as a new VFIO device region. This
>> preregisters the new memory as device memory as it might be used for DMA.
>> This inserts pfns from the fault handler as the GPU memory is not onlined
>> until the vendor driver is loaded and trained the NVLinks so doing this
>> earlier causes low level errors which we fence in the firmware so
>> it does not hurt the host system but still better be avoided.
>>
>> This exports an ATSD (Address Translation Shootdown) register of NPU which
>> allows TLB invalidations inside GPU for an operating system. The register
>> conveniently occupies a single 64k page. It is also presented to
>> the userspace as a new VFIO device region.
>>
>> In order to provide the userspace with the information about GPU-to-NVLink
>> connections, this exports an additional capability called "tgt"
>> (which is an abbreviated host system bus address). The "tgt" property
>> tells the GPU its own system address and allows the guest driver to
>> conglomerate the routing information so each GPU knows how to get directly
>> to the other GPUs.
>>
>> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
>> know LPID (a logical partition ID or a KVM guest hardware ID in other
>> words) and PID (a memory context ID of a userspace process, not to be
>> confused with a linux pid). This assigns a GPU to LPID in the NPU and
>> this is why this adds a listener for KVM on an IOMMU group. A PID comes
>> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
>>
>> This requires coherent memory and ATSD to be available on the host as
>> the GPU vendor only supports configurations with both features enabled
>> and other configurations are known not to work. Because of this and
>> because of the ways the features are advertised to the host system
>> (which is a device tree with very platform specific properties),
>> this requires enabled POWERNV platform.
>>
>> The V100 GPUs do not advertise none of these capabilities via the config
> 
> s/none/any/
> 
>> space and there are more than just one device ID so this relies on
>> the platform to tell whether these GPUs have special abilities such as
>> NVLinks.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v4:
>> * added nvlink-speed to the NPU bridge capability as this turned out to
>> be not a constant value
>> * instead of looking at the exact device ID (which also changes from system
>> to system), now this (indirectly) looks at the device tree to know
>> if GPU and NPU support NVLink
>>
>> v3:
>> * reworded the commit log about tgt
>> * added tracepoints (do we want them enabled for entire vfio-pci?)
>> * added code comments
>> * added write|mmap flags to the new regions
>> * auto enabled VFIO_PCI_NVLINK2 config option
>> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
>> references; there are required by the NVIDIA driver
>> * keep notifier registered only for short time
>> ---
>>  drivers/vfio/pci/Makefile   |   1 +
>>  drivers/vfio/pci/trace.h| 102 +++
>>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>>  include/uapi/linux/vfio.h   |  27 ++
>>  drivers/vfio/pci/vfio_pci.c |  37 ++-
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 448 
>>  drivers/vfio/pci/Kconfig|   6 +
>>  7 files changed, 621 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/vfio/pci/trace.h
>>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>>
>> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
>> index 76d8ec0..9662c06 100644
>> --- a/drivers/vfio/pci/Makefile
>> +++ b/drivers/vfio/pci/Makefile
>> @@ -1,5 +1,6 @@
>>  
>>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>>  
>>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> ...
>> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
>> b/drivers/vfio/pci/vfio_pci_private.h
>> index 93c1738..7639241 100644
>> --- a/drivers/vfio/pci/vfio_pci_private.h
>> +++ b/drivers/vfio/pci/vfio_pci_private.h
>> @@ -163,4 

Re: [PATCH] ocxl: Fix endiannes bug in read_afu_name()

2018-12-10 Thread Andrew Donnellan

On 11/12/18 11:05 am, Andrew Donnellan wrote:

On 11/12/18 2:10 am, Greg Kurz wrote:

The double word returned by read_afu_info(OCXL_DVSEC_TEMPL_NAME) contains
four characters of the AFU name, read from the PCI config space, hence
with a little-endian ordering. When composing the string, a big-endian
system must swap the bytes so that the characters appear in the right
order.

Do this with le32_to_cpu().

Signed-off-by: Greg Kurz 


snowpatch reports the following sparse warning:

+drivers/misc/ocxl/config.c:321:24: warning: cast to restricted __le32

You probably need to change val from a u32 to a __le32.


Also does this need to go to stable?

--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH] ocxl: Clarify error path in setup_xsl_irq()

2018-12-10 Thread Andrew Donnellan

On 11/12/18 2:18 am, Greg Kurz wrote:

Implementing rollback with goto and labels is a common practice that
leads to prettier and more maintainable code. FWIW, this design pattern
is already being used in alloc_link() a few lines below in this file.

Do the same in setup_xsl_irq().

Signed-off-by: Greg Kurz 


This is good, thanks.

Acked-by: Andrew Donnellan 


---
  drivers/misc/ocxl/link.c |   23 ++-
  1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index eed92055184d..659977a17405 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -273,9 +273,9 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
spa->irq_name = kasprintf(GFP_KERNEL, "ocxl-xsl-%x-%x-%x",
link->domain, link->bus, link->dev);
if (!spa->irq_name) {
-   unmap_irq_registers(spa);
dev_err(>dev, "Can't allocate name for xsl interrupt\n");
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto err_xsl;
}
/*
 * At some point, we'll need to look into allowing a higher
@@ -283,11 +283,10 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
 */
spa->virq = irq_create_mapping(NULL, hwirq);
if (!spa->virq) {
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"irq_create_mapping failed for translation 
interrupt\n");
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_name;
}
  
  	dev_dbg(>dev, "hwirq %d mapped to virq %d\n", hwirq, spa->virq);

@@ -295,15 +294,21 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
rc = request_irq(spa->virq, xsl_fault_handler, 0, spa->irq_name,
link);
if (rc) {
-   irq_dispose_mapping(spa->virq);
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"request_irq failed for translation interrupt: %d\n",
rc);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_mapping;
}
return 0;
+
+err_mapping:
+   irq_dispose_mapping(spa->virq);
+err_name:
+   kfree(spa->irq_name);
+err_xsl:
+   unmap_irq_registers(spa);
+   return rc;
  }
  
  static void release_xsl_irq(struct link *link)




--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH kernel v4 18/19] vfio_pci: Allow regions to add own capabilities

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:03 +1100
Alexey Kardashevskiy  wrote:

> VFIO regions already support region capabilities with a limited set of
> fields. However the subdriver might have to report to the userspace
> additional bits.
> 
> This adds an add_capability() hook to vfio_pci_regops.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v3:
> * removed confusing rationale for the patch, the next patch makes
> use of it anyway
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 6 ++
>  2 files changed, 9 insertions(+)

Acked-by: Alex Williamson 

> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 86aab05..93c1738 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -62,6 +62,9 @@ struct vfio_pci_regops {
>   int (*mmap)(struct vfio_pci_device *vdev,
>   struct vfio_pci_region *region,
>   struct vm_area_struct *vma);
> + int (*add_capability)(struct vfio_pci_device *vdev,
> +   struct vfio_pci_region *region,
> +   struct vfio_info_cap *caps);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 4a6f7c0..6cb70cf 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
>   if (ret)
>   return ret;
>  
> + if (vdev->region[i].ops->add_capability) {
> + ret = vdev->region[i].ops->add_capability(vdev,
> + >region[i], );
> + if (ret)
> + return ret;
> + }
>   }
>   }
>  



Re: [PATCH] ocxl/afu_irq: Don't include

2018-12-10 Thread Andrew Donnellan

Acked-by: Andrew Donnellan 

On 11/12/18 2:13 am, Greg Kurz wrote:

The AFU irq code doesn't need to reach out to the platform.

Signed-off-by: Greg Kurz 
---
  drivers/misc/ocxl/afu_irq.c |1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
index e70cfa24577f..11ab996657a2 100644
--- a/drivers/misc/ocxl/afu_irq.c
+++ b/drivers/misc/ocxl/afu_irq.c
@@ -2,7 +2,6 @@
  // Copyright 2017 IBM Corp.
  #include 
  #include 
-#include 
  #include "ocxl_internal.h"
  #include "trace.h"
  



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH kernel v4 17/19] vfio_pci: Allow mapping extra regions

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:02 +1100
Alexey Kardashevskiy  wrote:

> So far we only allowed mapping of MMIO BARs to the userspace. However
> there there are GPUs with on-board coherent RAM accessible via side

s/there there/there/

Otherwise:

Acked-by: Alex Williamson 

> channels which we also want to map to the userspace. The first client
> for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
> NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
> to the system address space, we are going to export these as an extra
> PCI region.
> 
> We already support extra PCI regions and this adds support for mapping
> them to the userspace.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> ---
> Changes:
> v2:
> * reverted one of mistakenly removed error checks
> ---
>  drivers/vfio/pci/vfio_pci_private.h | 3 +++
>  drivers/vfio/pci/vfio_pci.c | 9 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index cde3b5d..86aab05 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -59,6 +59,9 @@ struct vfio_pci_regops {
> size_t count, loff_t *ppos, bool iswrite);
>   void(*release)(struct vfio_pci_device *vdev,
>  struct vfio_pci_region *region);
> + int (*mmap)(struct vfio_pci_device *vdev,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma);
>  };
>  
>  struct vfio_pci_region {
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index fef5002..4a6f7c0 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   return -EINVAL;
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
> + if (index >= VFIO_PCI_NUM_REGIONS) {
> + int regnum = index - VFIO_PCI_NUM_REGIONS;
> + struct vfio_pci_region *region = vdev->region + regnum;
> +
> + if (region && region->ops && region->ops->mmap &&
> + (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return region->ops->mmap(vdev, region, vma);
> + return -EINVAL;
> + }
>   if (index >= VFIO_PCI_ROM_REGION_INDEX)
>   return -EINVAL;
>   if (!vdev->bar_mmap_supported[index])



Re: [PATCH kernel v4 19/19] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-10 Thread Alex Williamson
On Fri, 23 Nov 2018 16:53:04 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided.
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise none of these capabilities via the config

s/none/any/

> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 +++
>  drivers/vfio/pci/vfio_pci_private.h |   2 +
>  include/uapi/linux/vfio.h   |  27 ++
>  drivers/vfio/pci/vfio_pci.c |  37 ++-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 448 
>  drivers/vfio/pci/Kconfig|   6 +
>  7 files changed, 621 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/trace.h
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>  
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>  
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
...
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..7639241 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,6 @@ static inline int vfio_pci_igd_init(struct 
> vfio_pci_device *vdev)
>   return -ENODEV;
>  }
>  #endif
> +extern int 

Re: [PATCH] ocxl: Fix endiannes bug in read_afu_name()

2018-12-10 Thread Andrew Donnellan

On 11/12/18 2:10 am, Greg Kurz wrote:

The double word returned by read_afu_info(OCXL_DVSEC_TEMPL_NAME) contains
four characters of the AFU name, read from the PCI config space, hence
with a little-endian ordering. When composing the string, a big-endian
system must swap the bytes so that the characters appear in the right
order.

Do this with le32_to_cpu().

Signed-off-by: Greg Kurz 


snowpatch reports the following sparse warning:

+drivers/misc/ocxl/config.c:321:24: warning: cast to restricted __le32

You probably need to change val from a u32 to a __le32.


---
  drivers/misc/ocxl/config.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 57a6bb1fd3c9..b76198ba8630 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -318,7 +318,7 @@ static int read_afu_name(struct pci_dev *dev, struct 
ocxl_fn_config *fn,
if (rc)
return rc;
ptr = (u32 *) >name[i];
-   *ptr = val;
+   *ptr = le32_to_cpu(val);
}
afu->name[OCXL_AFU_NAME_SZ - 1] = '\0'; /* play safe */
return 0;



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH v3 00/12] perf/core: Generalise event exclusion checking

2018-12-10 Thread Andrew Murray
On Fri, Dec 07, 2018 at 05:25:17PM +, Will Deacon wrote:
> On Thu, Dec 06, 2018 at 04:47:17PM +, Andrew Murray wrote:
> > Many PMU drivers do not have the capability to exclude counting events
> > that occur in specific contexts such as idle, kernel, guest, etc. These
> > drivers indicate this by returning an error in their event_init upon
> > testing the events attribute flags.
> > 
> > However this approach requires that each time a new event modifier is
> > added to perf, all the perf drivers need to be modified to indicate that
> > they don't support the attribute. This results in additional boiler-plate
> > code common to many drivers that needs to be maintained. Furthermore the
> > drivers are not consistent with regards to the error value they return
> > when reporting unsupported attributes.
> > 
> > This patchset allow PMU drivers to advertise their inability to exclude
> > based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This
> > allows the perf core to reject requests for exclusion events where there
> > is no support in the PMU.
> > 
> > This is a functional change, in particular:
> > 
> >  - Some drivers will now additionally (but correctly) report unsupported
> >exclusion flags. It's typical for existing userspace tools such as
> >perf to handle such errors by retrying the system call without the
> >unsupported flags.
> > 
> >  - Drivers that do not support any exclusion that previously reported
> >-EPERM or -EOPNOTSUPP will now report -EINVAL - this is consistent
> >with the majority and results in userspace perf retrying without
> >exclusion.
> > 
> > All drivers touched by this patchset have been compile tested.
> 
> For the bits under arch/arm/ and drivers/perf:
> 
> Acked-by: Will Deacon 
> 
> Note that I've queued the TX2 uncore PMU for 4.21 [1], which could also
> benefit from your new flag.

Ah thanks for pointing this out, I'll send a patch in due course.

Thanks,

Andrwe Murray

> 
> Will
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=for-next/perf=69c32972d59388c041268e8206e8eb1acff29b9a


Re: [PATCH] powerpc: eeh_event: convert semaphore to completion

2018-12-10 Thread Sam Bobroff
On Tue, Dec 11, 2018 at 10:18:31AM +1100, Oliver wrote:
> On Tue, Dec 11, 2018 at 8:52 AM Arnd Bergmann  wrote:
> >
> > For this use case, completions and semaphores are equivalent,
> > but semaphores are an awkward interface that should generally
> > be avoided, so use the completion instead.
> 
> IIRC Sam has been reworking the locking used inside of EEH so this is
> probably going to clash with his changes. Converting to a completion
> is probably a good idea, but we might want to do it as a part of his
> series since it's going to collide with this anyway.
> 
> Sam, what do you think?

It's such a small change, I don't think it will cause any problems for
the rework. Anyway it seems like a good change, so I'd prefer to see it
go in :-)

Cheers,
Sam.

> >
> > Signed-off-by: Arnd Bergmann 
> > ---
> >  arch/powerpc/kernel/eeh_event.c | 9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/eeh_event.c 
> > b/arch/powerpc/kernel/eeh_event.c
> > index 61c9356bf9c9..227e57f980df 100644
> > --- a/arch/powerpc/kernel/eeh_event.c
> > +++ b/arch/powerpc/kernel/eeh_event.c
> > @@ -35,7 +35,7 @@
> >   */
> >
> >  static DEFINE_SPINLOCK(eeh_eventlist_lock);
> > -static struct semaphore eeh_eventlist_sem;
> > +static DECLARE_COMPLETION(eeh_eventlist_event);
> >  static LIST_HEAD(eeh_eventlist);
> >
> >  /**
> > @@ -55,7 +55,7 @@ static int eeh_event_handler(void * dummy)
> > struct eeh_pe *pe;
> >
> > while (!kthread_should_stop()) {
> > -   if (down_interruptible(_eventlist_sem))
> > +   if (wait_for_completion_interruptible(_eventlist_event))
> > break;
> >
> > /* Fetch EEH event from the queue */
> > @@ -102,9 +102,6 @@ int eeh_event_init(void)
> > struct task_struct *t;
> > int ret = 0;
> >
> > -   /* Initialize semaphore */
> > -   sema_init(_eventlist_sem, 0);
> > -
> > t = kthread_run(eeh_event_handler, NULL, "eehd");
> > if (IS_ERR(t)) {
> > ret = PTR_ERR(t);
> > @@ -142,7 +139,7 @@ int eeh_send_failure_event(struct eeh_pe *pe)
> > spin_unlock_irqrestore(_eventlist_lock, flags);
> >
> > /* For EEH deamon to knick in */
> > -   up(_eventlist_sem);
> > +   complete(_eventlist_event);
> >
> > return 0;
> >  }
> > --
> > 2.20.0
> >
> 


signature.asc
Description: PGP signature


Re: [PATCH] powerpc: eeh_event: convert semaphore to completion

2018-12-10 Thread Oliver
On Tue, Dec 11, 2018 at 8:52 AM Arnd Bergmann  wrote:
>
> For this use case, completions and semaphores are equivalent,
> but semaphores are an awkward interface that should generally
> be avoided, so use the completion instead.

IIRC Sam has been reworking the locking used inside of EEH so this is
probably going to clash with his changes. Converting to a completion
is probably a good idea, but we might want to do it as a part of his
series since it's going to collide with this anyway.

Sam, what do you think?

>
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/powerpc/kernel/eeh_event.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
> index 61c9356bf9c9..227e57f980df 100644
> --- a/arch/powerpc/kernel/eeh_event.c
> +++ b/arch/powerpc/kernel/eeh_event.c
> @@ -35,7 +35,7 @@
>   */
>
>  static DEFINE_SPINLOCK(eeh_eventlist_lock);
> -static struct semaphore eeh_eventlist_sem;
> +static DECLARE_COMPLETION(eeh_eventlist_event);
>  static LIST_HEAD(eeh_eventlist);
>
>  /**
> @@ -55,7 +55,7 @@ static int eeh_event_handler(void * dummy)
> struct eeh_pe *pe;
>
> while (!kthread_should_stop()) {
> -   if (down_interruptible(_eventlist_sem))
> +   if (wait_for_completion_interruptible(_eventlist_event))
> break;
>
> /* Fetch EEH event from the queue */
> @@ -102,9 +102,6 @@ int eeh_event_init(void)
> struct task_struct *t;
> int ret = 0;
>
> -   /* Initialize semaphore */
> -   sema_init(_eventlist_sem, 0);
> -
> t = kthread_run(eeh_event_handler, NULL, "eehd");
> if (IS_ERR(t)) {
> ret = PTR_ERR(t);
> @@ -142,7 +139,7 @@ int eeh_send_failure_event(struct eeh_pe *pe)
> spin_unlock_irqrestore(_eventlist_lock, flags);
>
> /* For EEH deamon to knick in */
> -   up(_eventlist_sem);
> +   complete(_eventlist_event);
>
> return 0;
>  }
> --
> 2.20.0
>


Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-10 Thread Daniel Borkmann
On 12/10/2018 06:27 PM, Michael Roth wrote:
> Quoting Daniel Borkmann (2018-12-10 08:26:31)
>> On 12/07/2018 04:36 PM, Michael Roth wrote:
>>> Quoting Michael Ellerman (2018-12-07 06:31:13)
 Michael Roth  writes:

> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> JIT allocations. At compile time it defaults to PAGE_SIZE * 4,
> and is adjusted again at init time if MODULES_VADDR is defined.
>
> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with

 But maybe it should be, I don't know why we don't define it.

> the compile-time default at boot-time, which is 0x9c40 when
> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> value:
>
>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>   -1673527296
>
> and can cause various unexpected failures throughout the network
> stack. In one case `strace dhclient eth0` reported:
>
>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, 
> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
>
> and similar failures can be seen with tools like tcpdump. This doesn't
> always reproduce however, and I'm not sure why. The more consistent
> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> would time out on systemd/netplan configuring a virtio-net NIC with no
> noticeable errors in the logs.
>
> Fix this by limiting the compile-time default for bpf_jit_limit to
> INT_MAX.

 INT_MAX is a lot more than (4k * 4), so I guess I'm not clear on
 whether we should be using PAGE_SIZE here at all. I guess each BPF
 program uses at least one page is the thinking?
>>>
>>> That seems to be the case, at least, the max number of minimum-sized
>>> allocations would be less on ppc64 since the allocations are always at
>>> least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
>>> so it seemed consistent to do that here too.
>>>

 Thanks for tracking this down. For some reason none of my ~10 test boxes
 have hit this, perhaps I don't have new enough userspace?
>>>
>>> I'm not too sure, I would've thought things like the dhclient case in
>>> the commit log would fail every time, but sometimes I need to reboot the
>>> guest before I start seeing the behavior. Maybe there's something special
>>> about when JIT allocations are actually done that can affect
>>> reproducibility?
>>>
>>> In my case at least the virtio-net networking timeout was consistent
>>> enough for a bisect, but maybe it depends on the specific network
>>> configuration (single NIC, basic DHCP through netplan/systemd in my case).
>>>

 You don't mention why you needed to add BPF_MIN(), I assume because the
 kernel version of min() has gotten too complicated to work here?
>>>
>>> I wasn't sure if it was safe here or not, so I tried looking at other
>>> users and came across:
>>>
>>> mm/vmalloc.c:777:#define VMAP_MIN(x, y)   ((x) < (y) ? (x) : 
>>> (y)) /* can't use min() */
>>>
>>> I'm not sure what the reasoning was (or whether it still applies), but I
>>> figured it was safer to do the same here. Maybe Nick still recalls?
>>>

 Daniel I assume you'll merge this via your tree?

 cheers

> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
> allocations")
> Cc: linuxppc-...@ozlabs.org
> Cc: Daniel Borkmann 
> Cc: Sandipan Das 
> Cc: Alexei Starovoitov 
> Signed-off-by: Michael Roth 
>>
>> Thanks for the reports / fixes and sorry for my late reply (bit too
>> swamped last week), some more thoughts below.
>>
>  kernel/bpf/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545d0ec8..55de4746cdfd 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
>  
>  #ifdef CONFIG_BPF_JIT
> -# define BPF_JIT_LIMIT_DEFAULT   (PAGE_SIZE * 4)
> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> +# define BPF_JIT_LIMIT_DEFAULT   BPF_MIN((PAGE_SIZE * 4), 
> INT_MAX)
>  
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable   __read_mostly = 
> IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
>>
>> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
>> define also given for 4.21 arm64 will have its own dedicated area for
>> JIT allocations where neither the above limit nor the MODULES_END/
>> MODULES_VADDR one would fit and I don't want to make this even more
>> ugly with adding further cases into the core. Would the below variant
>> work for you?
> 
> Looks good to me. My one concern (which is probably a separate
> issue) is that the INT_MAX limit is a bit more punishing for larger
> page sizes since the minimum allocations 

[PATCH v2.2 24/34] dt-bindings: arm: Convert Rockchip board/soc bindings to json-schema

2018-12-10 Thread Heiko Stuebner
From: Rob Herring 

Convert Rockchip SoC bindings to DT schema format using json-schema.

Cc: Mark Rutland 
Cc: Heiko Stuebner 
Cc: devicet...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-rockc...@lists.infradead.org
Signed-off-by: Rob Herring 
[move to per-board entries and added recently added boards]
Signed-off-by: Heiko Stuebner 
---
Hi Rob,

thanks for the libyaml hint, now dtc does check my dts nicely and
emits quite a number of little complaints ;-) .

Also that suggestion to move the original firefly release+beta boards
together was great and I just did that.

Should look ok now, if so I'll apply it tomorrow.
Heiko

changes in v2.2:
- use enum to differentiate between rk3288-firefly and rk3288-firefly-beta
  as suggested by Rob (firefly-reload is a completely different board though)
- add the rk3288-evb variants, which were missing from the old binding

changes in v2.1:
- move to one entry per board
- add boards added after the original yaml conversion

 .../devicetree/bindings/arm/rockchip.txt  | 240 --
 .../devicetree/bindings/arm/rockchip.yaml | 423 ++
 2 files changed, 423 insertions(+), 240 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/arm/rockchip.txt
 create mode 100644 Documentation/devicetree/bindings/arm/rockchip.yaml

diff --git a/Documentation/devicetree/bindings/arm/rockchip.txt 
b/Documentation/devicetree/bindings/arm/rockchip.txt
deleted file mode 100644
index 0cc71236d639..
--- a/Documentation/devicetree/bindings/arm/rockchip.txt
+++ /dev/null
@@ -1,240 +0,0 @@
-Rockchip platforms device tree bindings

-
-- 96boards RK3399 Ficus (ROCK960 Enterprise Edition)
-Required root node properties:
-  - compatible = "vamrs,ficus", "rockchip,rk3399";
-
-- 96boards RK3399 Rock960 (ROCK960 Consumer Edition)
-Required root node properties:
-  - compatible = "vamrs,rock960", "rockchip,rk3399";
-
-- Amarula Vyasa RK3288 board
-Required root node properties:
-  - compatible = "amarula,vyasa-rk3288", "rockchip,rk3288";
-
-- Asus Tinker board
-Required root node properties:
-  - compatible = "asus,rk3288-tinker", "rockchip,rk3288";
-
-- Asus Tinker board S
-Required root node properties:
-  - compatible = "asus,rk3288-tinker-s", "rockchip,rk3288";
-
-- Kylin RK3036 board:
-Required root node properties:
-  - compatible = "rockchip,kylin-rk3036", "rockchip,rk3036";
-
-- MarsBoard RK3066 board:
-Required root node properties:
-  - compatible = "haoyu,marsboard-rk3066", "rockchip,rk3066a";
-
-- bq Curie 2 tablet:
-Required root node properties:
-  - compatible = "mundoreader,bq-curie2", "rockchip,rk3066a";
-
-- ChipSPARK Rayeager PX2 board:
-Required root node properties:
-  - compatible = "chipspark,rayeager-px2", "rockchip,rk3066a";
-
-- Radxa Rock board:
-Required root node properties:
-  - compatible = "radxa,rock", "rockchip,rk3188";
-
-- Radxa Rock2 Square board:
-Required root node properties:
-  - compatible = "radxa,rock2-square", "rockchip,rk3288";
-
-- Rikomagic MK808 v1 board:
-Required root node properties:
-  - compatible = "rikomagic,mk808", "rockchip,rk3066a";
-
-- Firefly Firefly-RK3288 board:
-Required root node properties:
-  - compatible = "firefly,firefly-rk3288", "rockchip,rk3288";
-or
-  - compatible = "firefly,firefly-rk3288-beta", "rockchip,rk3288";
-
-- Firefly Firefly-RK3288 Reload board:
-Required root node properties:
-  - compatible = "firefly,firefly-rk3288-reload", "rockchip,rk3288";
-
-- Firefly Firefly-RK3399 board:
-Required root node properties:
-  - compatible = "firefly,firefly-rk3399", "rockchip,rk3399";
-
-- Firefly roc-rk3328-cc board:
-Required root node properties:
-  - compatible = "firefly,roc-rk3328-cc", "rockchip,rk3328";
-
-- Firefly ROC-RK3399-PC board:
-Required root node properties:
-  - compatible = "firefly,roc-rk3399-pc", "rockchip,rk3399";
-
-- ChipSPARK PopMetal-RK3288 board:
-Required root node properties:
-  - compatible = "chipspark,popmetal-rk3288", "rockchip,rk3288";
-
-- Netxeon R89 board:
-Required root node properties:
-  - compatible = "netxeon,r89", "rockchip,rk3288";
-
-- GeekBuying GeekBox:
-Required root node properties:
-  - compatible = "geekbuying,geekbox", "rockchip,rk3368";
-
-- Google Bob (Asus Chromebook Flip C101PA):
-Required root node properties:
-   compatible = "google,bob-rev13", "google,bob-rev12",
-"google,bob-rev11", "google,bob-rev10",
-"google,bob-rev9", "google,bob-rev8",
-"google,bob-rev7", "google,bob-rev6",
-"google,bob-rev5", "google,bob-rev4",
-"google,bob", "google,gru", "rockchip,rk3399";
-
-- Google Brain (dev-board):
-Required root node properties:
-  - compatible = "google,veyron-brain-rev0", 

[PATCH v3 3/3] powerpc: Discard .branch_lt section

2018-12-10 Thread Joel Stanley
When building a 32 bit powerpc kernel with Binutils 2.31.1 this warning
is emitted:

 powerpc-linux-gnu-ld: warning: orphan section `.branch_lt' from
 `arch/powerpc/kernel/head_44x.o' being placed in section `.branch_lt'

As of binutils commit 2d7ad24e8726 ("Support PLT16 relocs against local
symbols")[1], 32 bit targets can produce .branch_lt sections in their
output.

These sections should be empty for the kernel build so discard them for
both PPC64 and PPC32.

[1] 
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=2d7ad24e8726ba4c45c9e67be08223a146a837ce
Signed-off-by: Joel Stanley 
---
v2: Discard instead of keep. Alan said "[discarding] can be a recipe for
finding linker bugs", so lets go with that.
---
 arch/powerpc/kernel/vmlinux.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 2c93a420f456..af0f81df8bb9 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -316,7 +316,6 @@ SECTIONS
DATA_DATA
*(.data.rel*)
*(.toc1)
-   *(.branch_lt)
}
 
.opd : AT(ADDR(.opd) - LOAD_OFFSET) {
@@ -388,6 +387,7 @@ SECTIONS
*(.hash .gnu.hash)
*(.interp)
*(.dynstr)
+   *(.branch_lt)
 #ifndef CONFIG_PPC32
*(.dynsym)
 #endif
-- 
2.19.1



[PATCH v3 2/3] powerpc: Discard dynsym section for !PPC32

2018-12-10 Thread Joel Stanley
Alan Modra  explains:

 > Likely you could discard .interp > and .dynstr too, and .dynsym when
 > !CONFIG_PPC32.

Discarding of interp and dynstr happened in a previous patch. The dynsym
cleanup was a bit less straightforward, so it gets it's own patch.

Signed-off-by: Joel Stanley 
---
See 
https://lore.kernel.org/lkml/CACPK8Xft3n5KkpTjN3=7_vucxhfck7mxvzm2rrqu7tppcbo...@mail.gmail.com/T/#m58532c86cf0c7b4fb01cc1fe724e48d4c7d8e4a7

v3: Move dynstr hunk to patch 1 (it was incorrectly left in this patch)
---
 arch/powerpc/kernel/vmlinux.lds.S | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 779b8b3075a1..2c93a420f456 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -266,13 +266,13 @@ SECTIONS
}
 #ifdef CONFIG_RELOCATABLE
. = ALIGN(8);
+#ifdef CONFIG_PPC32
.dynsym : AT(ADDR(.dynsym) - LOAD_OFFSET)
{
-#ifdef CONFIG_PPC32
__dynamic_symtab = .;
-#endif
*(.dynsym)
}
+#endif
.dynamic : AT(ADDR(.dynamic) - LOAD_OFFSET)
{
__dynamic_start = .;
@@ -388,5 +388,8 @@ SECTIONS
*(.hash .gnu.hash)
*(.interp)
*(.dynstr)
+#ifndef CONFIG_PPC32
+   *(.dynsym)
+#endif
}
 }
-- 
2.19.1



[PATCH v3 1/3] powerpc: Discard more sections in linker script

2018-12-10 Thread Joel Stanley
Building the ppc64 kernel with a modern binutils results in this
warning:

 powerpc64le-linux-gnu-ld: warning: orphan section `.gnu.hash' from
 `linker stubs' being placed in section `.gnu.hash'

Alan Modra  explains:

 > .gnu.hash, like .hash, is used by glibc ld.so for dynamic symbol
 > lookup.  I imagine you don't need either section in a kernel, so
 > discarding both sounds reasonable.  Likely you could discard .interp
 > and .dynstr too, and .dynsym when !CONFIG_PPC32.

Reported-by: Stephen Rothwell 
Signed-off-by: Joel Stanley 
---
See 
https://lore.kernel.org/lkml/CACPK8Xft3n5KkpTjN3=7_vucxhfck7mxvzm2rrqu7tppcbo...@mail.gmail.com/T/#m58532c86cf0c7b4fb01cc1fe724e48d4c7d8e4a7

v3: Add dynstr hunk to this patch (it was incorrectly left in patch 2)
---
 arch/powerpc/kernel/vmlinux.lds.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 434581bcd5b4..779b8b3075a1 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -273,14 +273,11 @@ SECTIONS
 #endif
*(.dynsym)
}
-   .dynstr : AT(ADDR(.dynstr) - LOAD_OFFSET) { *(.dynstr) }
.dynamic : AT(ADDR(.dynamic) - LOAD_OFFSET)
{
__dynamic_start = .;
*(.dynamic)
}
-   .hash : AT(ADDR(.hash) - LOAD_OFFSET) { *(.hash) }
-   .interp : AT(ADDR(.interp) - LOAD_OFFSET) { *(.interp) }
.rela.dyn : AT(ADDR(.rela.dyn) - LOAD_OFFSET)
{
__rela_dyn_start = .;
@@ -388,5 +385,8 @@ SECTIONS
*(.gnu.version*)
*(.gnu.attributes)
*(.eh_frame)
+   *(.hash .gnu.hash)
+   *(.interp)
+   *(.dynstr)
}
 }
-- 
2.19.1



[PATCH v3 0/3] powerpc: Add to linker script discards

2018-12-10 Thread Joel Stanley
v3 fixes up the splitting of the patches, moving the dynstr hunk from
patch 2 to patch 1.

v2 pulls in the branch_lt patch too. No changes to the first two
patches.

Joel Stanley (3):
  powerpc: Discard more sections in linker script
  powerpc: Discard dynsym section for !PPC32
  powerpc: Discard .branch_lt section

 arch/powerpc/kernel/vmlinux.lds.S | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

-- 
2.19.1



Re: [PATCHv3 1/4] dt-bindings: add DT binding for the layerscape PCIe controller with EP mode

2018-12-10 Thread Rob Herring
On Mon,  3 Dec 2018 18:35:02 +0800, Xiaowei Bao wrote:
> Add the documentation for the Device Tree binding for the layerscape PCIe
> controller with EP mode.
> 
> Signed-off-by: Xiaowei Bao 
> ---
> v2:
>  - Add the SoC specific compatibles.
> v3:
>  - modify the commit message.
> 
>  .../devicetree/bindings/pci/layerscape-pci.txt |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 

Reviewed-by: Rob Herring 


[PATCH] powerpc: eeh_event: convert semaphore to completion

2018-12-10 Thread Arnd Bergmann
For this use case, completions and semaphores are equivalent,
but semaphores are an awkward interface that should generally
be avoided, so use the completion instead.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/kernel/eeh_event.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c
index 61c9356bf9c9..227e57f980df 100644
--- a/arch/powerpc/kernel/eeh_event.c
+++ b/arch/powerpc/kernel/eeh_event.c
@@ -35,7 +35,7 @@
  */
 
 static DEFINE_SPINLOCK(eeh_eventlist_lock);
-static struct semaphore eeh_eventlist_sem;
+static DECLARE_COMPLETION(eeh_eventlist_event);
 static LIST_HEAD(eeh_eventlist);
 
 /**
@@ -55,7 +55,7 @@ static int eeh_event_handler(void * dummy)
struct eeh_pe *pe;
 
while (!kthread_should_stop()) {
-   if (down_interruptible(_eventlist_sem))
+   if (wait_for_completion_interruptible(_eventlist_event))
break;
 
/* Fetch EEH event from the queue */
@@ -102,9 +102,6 @@ int eeh_event_init(void)
struct task_struct *t;
int ret = 0;
 
-   /* Initialize semaphore */
-   sema_init(_eventlist_sem, 0);
-
t = kthread_run(eeh_event_handler, NULL, "eehd");
if (IS_ERR(t)) {
ret = PTR_ERR(t);
@@ -142,7 +139,7 @@ int eeh_send_failure_event(struct eeh_pe *pe)
spin_unlock_irqrestore(_eventlist_lock, flags);
 
/* For EEH deamon to knick in */
-   up(_eventlist_sem);
+   complete(_eventlist_event);
 
return 0;
 }
-- 
2.20.0



Re: [PATCH v2 2/3] powerpc: Discard dynsym section for !PPC32

2018-12-10 Thread Joel Stanley
On Wed, 5 Dec 2018 at 04:11, Segher Boessenkool
 wrote:
>
> On Tue, Dec 04, 2018 at 11:24:28AM +1030, Joel Stanley wrote:
> > Alan Modra  explains:
> >
> >  > Likely you could discard .interp > and .dynstr too, and .dynsym when
> >  > !CONFIG_PPC32.
> >
> > Discarding of interp and dynstr happened in a previous patch. The dynsym
> > cleanup was a bit less straightforward, so it gets it's own patch.
>
> > diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
> > b/arch/powerpc/kernel/vmlinux.lds.S
> > index 6570209b0671..2c93a420f456 100644
> > --- a/arch/powerpc/kernel/vmlinux.lds.S
> > +++ b/arch/powerpc/kernel/vmlinux.lds.S
> > @@ -266,14 +266,13 @@ SECTIONS
> >   }
> >  #ifdef CONFIG_RELOCATABLE
> >   . = ALIGN(8);
> > +#ifdef CONFIG_PPC32
> >   .dynsym : AT(ADDR(.dynsym) - LOAD_OFFSET)
> >   {
> > -#ifdef CONFIG_PPC32
> >   __dynamic_symtab = .;
> > -#endif
> >   *(.dynsym)
> >   }
> > - .dynstr : AT(ADDR(.dynstr) - LOAD_OFFSET) { *(.dynstr) }
>
> So this last line belongs in the previous patch then, right?

Correct. Good catch.


[PATCH v03] powerpc/mobility: Fix node detach/rename problem

2018-12-10 Thread Michael Bringmann
The PPC mobility code receives RTAS requests to delete nodes with
platform-/hardware-specific attributes when restarting the kernel
after a migration.  My example is for migration between a P8 Alpine
and a P8 Brazos.   Nodes to be deleted include 'ibm,random-v1',
'ibm,platform-facilities', 'ibm,sym-encryption-v1', and,
'ibm,compression-v1'.

The mobility.c code calls 'of_detach_node' for the nodes and their
children.  This makes calls to detach the properties and to remove
the associated sysfs/kernfs files.

Then new copies of the same nodes are next provided by the PHYP,
local copies are built, and a pointer to the 'struct device_node'
is passed to of_attach_node.  Before the call to of_attach_node,
the phandle is initialized to 0 when the data structure is alloced.
During the call to of_attach_node, it calls __of_attach_node which
pulls the actual name and phandle from just created sub-properties
named something like 'name' and 'ibm,phandle'.

This is all fine for the first migration.  The problem occurs with
the second and subsequent migrations when the PHYP on the new system
wants to replace the same set of nodes again, referenced with the
same names and phandle values.

On the second and subsequent migrations, the PHYP tells the system
to again delete the nodes 'ibm,platform-facilities', 'ibm,random-v1',
'ibm,compression-v1', 'ibm,sym-encryption-v1'.  It specifies these
nodes by its known set of phandle values -- the same handles used
by the PHYP on the source system are known on the target system.
The mobility.c code calls of_find_node_by_phandle() with these values
and ends up locating the first instance of each node that was added
during the original boot, instead of the second instance of each node
created after the first migration.  The detach during the second
migration fails with errors like,

[ 4565.030704] WARNING: CPU: 3 PID: 4787 at drivers/of/dynamic.c:252 
__of_detach_node+0x8/0xa0
[ 4565.030708] Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag 
unix_diag af_packet_diag netlink_diag lockd grace fscache sunrpc xts vmx_crypto 
sg pseries_rng binfmt_misc ip_tables xfs libcrc32c sd_mod ibmveth ibmvscsi 
scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 4565.030733] CPU: 3 PID: 4787 Comm: drmgr Tainted: GW 
4.18.0-rc1-wi107836-v05-120+ #201
[ 4565.030737] NIP:  c07c1ea8 LR: c07c1fb4 CTR: 00655170
[ 4565.030741] REGS: c003f302b690 TRAP: 0700   Tainted: GW  
(4.18.0-rc1-wi107836-v05-120+)
[ 4565.030745] MSR:  80010282b033   
CR: 22288822  XER: 000a
[ 4565.030757] CFAR: c07c1fb0 IRQMASK: 1
[ 4565.030757] GPR00: c07c1fa4 c003f302b910 c114bf00 
c0038e68
[ 4565.030757] GPR04: 0001  80c008e0b4b8 

[ 4565.030757] GPR08:  0001 8003 
2843
[ 4565.030757] GPR12: 8800 c0001ec9ae00 4000 

[ 4565.030757] GPR16:  0008  
f6ff
[ 4565.030757] GPR20: 0007  c003e9f1f034 
0001
[ 4565.030757] GPR24:    

[ 4565.030757] GPR28: c1549d28 c1134828 c0038e68 
c003f302b930
[ 4565.030804] NIP [c07c1ea8] __of_detach_node+0x8/0xa0
[ 4565.030808] LR [c07c1fb4] of_detach_node+0x74/0xd0
[ 4565.030811] Call Trace:
[ 4565.030815] [c003f302b910] [c07c1fa4] of_detach_node+0x64/0xd0 
(unreliable)
[ 4565.030821] [c003f302b980] [c00c33c4] 
dlpar_detach_node+0xb4/0x150
[ 4565.030826] [c003f302ba10] [c00c3ffc] delete_dt_node+0x3c/0x80
[ 4565.030831] [c003f302ba40] [c00c4380] 
pseries_devicetree_update+0x150/0x4f0
[ 4565.030836] [c003f302bb70] [c00c479c] 
post_mobility_fixup+0x7c/0xf0
[ 4565.030841] [c003f302bbe0] [c00c4908] migration_store+0xf8/0x130
[ 4565.030847] [c003f302bc70] [c0998160] kobj_attr_store+0x30/0x60
[ 4565.030852] [c003f302bc90] [c0412f14] sysfs_kf_write+0x64/0xa0
[ 4565.030857] [c003f302bcb0] [c0411cac] 
kernfs_fop_write+0x16c/0x240
[ 4565.030862] [c003f302bd00] [c0355f20] __vfs_write+0x40/0x220
[ 4565.030867] [c003f302bd90] [c0356358] vfs_write+0xc8/0x240
[ 4565.030872] [c003f302bde0] [c03566cc] ksys_write+0x5c/0x100
[ 4565.030880] [c003f302be30] [c000b288] system_call+0x5c/0x70
[ 4565.030884] Instruction dump:
[ 4565.030887] 38210070 3860 e8010010 eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 
ebe1fff8
[ 4565.030895] 7c0803a6 4e800020 e9230098 7929f7e2 <0b09> 2f89 4cde0020 
e9030040
[ 4565.030903] ---[ end trace 5bd54cb1df9d2976 ]---

The mobility.c code continues on during the second migration, accepts
the definitions of the new nodes from the PHYP and ends up renaming
the new properties 

[PATCH net 2/2] ibmvnic: Fix non-atomic memory allocation in IRQ context

2018-12-10 Thread Thomas Falcon
ibmvnic_reset allocated new reset work item objects in a non-atomic
context. This can be called from a tasklet, generating the output below.
Allocate work items with the GFP_ATOMIC flag instead.

BUG: sleeping function called from invalid context at mm/slab.h:421
in_atomic(): 1, irqs_disabled(): 1, pid: 93, name: kworker/0:2
INFO: lockdep is turned off.
irq event stamp: 66049
hardirqs last  enabled at (66048): [] 
tasklet_action_common.isra.12+0x78/0x1c0
hardirqs last disabled at (66049): [] 
_raw_spin_lock_irqsave+0x48/0xf0
softirqs last  enabled at (66044): [] 
dev_deactivate_queue.constprop.28+0xc8/0x160
softirqs last disabled at (66045): [] 
call_do_softirq+0x14/0x24
CPU: 0 PID: 93 Comm: kworker/0:2 Kdump: loaded Not tainted 
4.20.0-rc6-1-g1b50a8f03706 #7
Workqueue: events linkwatch_event
Call Trace:
[c003fffe7ae0] [c0bc83e4] dump_stack+0xe8/0x164 (unreliable)
[c003fffe7b30] [c015ba0c] ___might_sleep+0x2dc/0x320
[c003fffe7bb0] [c0391514] kmem_cache_alloc_trace+0x3e4/0x440
[c003fffe7c30] [d5b2309c] ibmvnic_reset+0x16c/0x360 [ibmvnic]
[c003fffe7cc0] [d5b29834] ibmvnic_tasklet+0x1054/0x2010 [ibmvnic]
[c003fffe7e00] [c01224c8] tasklet_action_common.isra.12+0xd8/0x1c0
[c003fffe7e60] [c0bf1238] __do_softirq+0x1a8/0x64c
[c003fffe7f90] [c00306e0] call_do_softirq+0x14/0x24
[c003f3967980] [c001ba50] do_softirq_own_stack+0x60/0xb0
[c003f39679c0] [c01218a8] do_softirq+0xa8/0x100
[c003f39679f0] [c0121a74] __local_bh_enable_ip+0x174/0x180
[c003f3967a60] [c0bf003c] _raw_spin_unlock_bh+0x5c/0x80
[c003f3967a90] [c0a8ac78] 
dev_deactivate_queue.constprop.28+0xc8/0x160
[c003f3967ad0] [c0a8c8b0] dev_deactivate_many+0xd0/0x520
[c003f3967b70] [c0a8cd40] dev_deactivate+0x40/0x60
[c003f3967ba0] [c0a5e0c4] linkwatch_do_dev+0x74/0xd0
[c003f3967bd0] [c0a5e694] __linkwatch_run_queue+0x1a4/0x1f0
[c003f3967c30] [c0a5e728] linkwatch_event+0x48/0x60
[c003f3967c50] [c01444e8] process_one_work+0x238/0x710
[c003f3967d20] [c0144a48] worker_thread+0x88/0x4e0
[c003f3967db0] [c014e3a8] kthread+0x178/0x1c0
[c003f3967e20] [c000bfd0] ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index ffc0cab05b0f..67cc6d9c8fd7 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2055,7 +2055,7 @@ static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
}
}
 
-   rwi = kzalloc(sizeof(*rwi), GFP_KERNEL);
+   rwi = kzalloc(sizeof(*rwi), GFP_ATOMIC);
if (!rwi) {
spin_unlock_irqrestore(>rwi_lock, flags);
ibmvnic_close(netdev);
-- 
2.12.3



[PATCH net 1/2] ibmvnic: Convert reset work item mutex to spin lock

2018-12-10 Thread Thomas Falcon
ibmvnic_reset can create and schedule a reset work item from
an IRQ context, so do not use a mutex, which can sleep. Convert
the reset work item mutex to a spin lock. Locking debugger generated
the trace output below.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:908
in_atomic(): 1, irqs_disabled(): 1, pid: 120, name: kworker/8:1
4 locks held by kworker/8:1/120:
 #0: 17c05720 ((wq_completion)"events"){+.+.}, at: 
process_one_work+0x188/0x710
 #1: ace90706 ((linkwatch_work).work){+.+.}, at: 
process_one_work+0x188/0x710
 #2: 7632871f (rtnl_mutex){+.+.}, at: rtnl_lock+0x30/0x50
 #3: fc36813a (&(>lock)->rlock){..-.}, at: 
ibmvnic_tasklet+0x88/0x2010 [ibmvnic]
irq event stamp: 26293
hardirqs last  enabled at (26292): [] 
tasklet_action_common.isra.12+0x78/0x1c0
hardirqs last disabled at (26293): [] 
_raw_spin_lock_irqsave+0x48/0xf0
softirqs last  enabled at (26288): [] 
dev_deactivate_queue.constprop.28+0xc8/0x160
softirqs last disabled at (26289): [] 
call_do_softirq+0x14/0x24
CPU: 8 PID: 120 Comm: kworker/8:1 Kdump: loaded Not tainted 4.20.0-rc6 #6
Workqueue: events linkwatch_event
Call Trace:
[c003fffa7a50] [c0bc83e4] dump_stack+0xe8/0x164 (unreliable)
[c003fffa7aa0] [c015ba0c] ___might_sleep+0x2dc/0x320
[c003fffa7b20] [c0be960c] __mutex_lock+0x8c/0xb40
[c003fffa7c30] [d6202ac8] ibmvnic_reset+0x78/0x330 [ibmvnic]
[c003fffa7cc0] [d62097f4] ibmvnic_tasklet+0x1054/0x2010 [ibmvnic]
[c003fffa7e00] [c01224c8] tasklet_action_common.isra.12+0xd8/0x1c0
[c003fffa7e60] [c0bf1238] __do_softirq+0x1a8/0x64c
[c003fffa7f90] [c00306e0] call_do_softirq+0x14/0x24
[c003f3f87980] [c001ba50] do_softirq_own_stack+0x60/0xb0
[c003f3f879c0] [c01218a8] do_softirq+0xa8/0x100
[c003f3f879f0] [c0121a74] __local_bh_enable_ip+0x174/0x180
[c003f3f87a60] [c0bf003c] _raw_spin_unlock_bh+0x5c/0x80
[c003f3f87a90] [c0a8ac78] 
dev_deactivate_queue.constprop.28+0xc8/0x160
[c003f3f87ad0] [c0a8c8b0] dev_deactivate_many+0xd0/0x520
[c003f3f87b70] [c0a8cd40] dev_deactivate+0x40/0x60
[c003f3f87ba0] [c0a5e0c4] linkwatch_do_dev+0x74/0xd0
[c003f3f87bd0] [c0a5e694] __linkwatch_run_queue+0x1a4/0x1f0
[c003f3f87c30] [c0a5e728] linkwatch_event+0x48/0x60
[c003f3f87c50] [c01444e8] process_one_work+0x238/0x710
[c003f3f87d20] [c0144a48] worker_thread+0x88/0x4e0
[c003f3f87db0] [c014e3a8] kthread+0x178/0x1c0
[c003f3f87e20] [c000bfd0] ret_from_kernel_thread+0x5c/0x6c

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 16 +---
 drivers/net/ethernet/ibm/ibmvnic.h |  2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index ed50b8dee44f..ffc0cab05b0f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1939,8 +1939,9 @@ static int do_hard_reset(struct ibmvnic_adapter *adapter,
 static struct ibmvnic_rwi *get_next_rwi(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_rwi *rwi;
+   unsigned long flags;
 
-   mutex_lock(>rwi_lock);
+   spin_lock_irqsave(>rwi_lock, flags);
 
if (!list_empty(>rwi_list)) {
rwi = list_first_entry(>rwi_list, struct ibmvnic_rwi,
@@ -1950,7 +1951,7 @@ static struct ibmvnic_rwi *get_next_rwi(struct 
ibmvnic_adapter *adapter)
rwi = NULL;
}
 
-   mutex_unlock(>rwi_lock);
+   spin_unlock_irqrestore(>rwi_lock, flags);
return rwi;
 }
 
@@ -2025,6 +2026,7 @@ static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
struct list_head *entry, *tmp_entry;
struct ibmvnic_rwi *rwi, *tmp;
struct net_device *netdev = adapter->netdev;
+   unsigned long flags;
int ret;
 
if (adapter->state == VNIC_REMOVING ||
@@ -2041,13 +2043,13 @@ static int ibmvnic_reset(struct ibmvnic_adapter 
*adapter,
goto err;
}
 
-   mutex_lock(>rwi_lock);
+   spin_lock_irqsave(>rwi_lock, flags);
 
list_for_each(entry, >rwi_list) {
tmp = list_entry(entry, struct ibmvnic_rwi, list);
if (tmp->reset_reason == reason) {
netdev_dbg(netdev, "Skipping matching reset\n");
-   mutex_unlock(>rwi_lock);
+   spin_unlock_irqrestore(>rwi_lock, flags);
ret = EBUSY;
goto err;
}
@@ -2055,7 +2057,7 @@ static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
 
rwi = kzalloc(sizeof(*rwi), GFP_KERNEL);
if (!rwi) {
-   mutex_unlock(>rwi_lock);
+   spin_unlock_irqrestore(>rwi_lock, flags);
ibmvnic_close(netdev);
ret = ENOMEM;
  

[PATCH net 0/2] net/ibmvnic: Fix reset work item locking bugs

2018-12-10 Thread Thomas Falcon
This patch set fixes issues with scheduling reset work items in
a tasklet context. Since ibmvnic_reset can called in an interrupt,
it should not use a mutex or allocate memory non-atomically.

Thomas Falcon (2):
  ibmvnic: Convert reset work item mutex to spin lock
  ibmvnic: Fix non-atomic memory allocation in IRQ context

 drivers/net/ethernet/ibm/ibmvnic.c | 18 ++
 drivers/net/ethernet/ibm/ibmvnic.h |  2 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

-- 
2.12.3



Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Benjamin Herrenschmidt
On Mon, 2018-12-10 at 20:33 +0100, Christoph Hellwig wrote:
> On Mon, Dec 10, 2018 at 05:04:46PM +, Rui Salvaterra wrote:
> > Hi, Christoph and Ben,
> > 
> > It just came to my mind (and this is most likely a stupid question,
> > but still)… Is there any possibility of these changes having an
> > (positive) effect on the long-standing problem of Power Mac machines
> > with AGP graphics cards (which have to be limited to PCI transfers,
> > otherwise they'll hang, due to coherence issues)? If so, I have a G4
> > machine where I'd gladly test them.
> 
> These patches themselves are not going to affect that directly.
> But IFF the problem really is that the AGP needs to be treated as not
> cache coherent (I have no idea if that is true) the generic direct
> mapping code has full support for a per-device coherent flag, so
> support for a non-coherent AGP slot could be implemented relatively
> simply.

AGP is a gigantic nightmare :-) It's not just cache coherency issues
(some implementations are coherent, some aren't, Apple's is ... weird).

Apple has all sort of bugs, and Darwin source code only sheds light on
some of them. Some implementation can only read, not write I think, for
example. There are issues with transfers crossing some boundaries I
beleive, but it's all unclear.

Apple makes this work with a combination of hacks in the AGP "driver"
and the closed source GPU driver, which we don't see.

I have given up trying to make that stuff work reliably a decade ago :)

Cheers,
Ben.




[PATCH v3] kbuild: Add support for DT binding schema checks

2018-12-10 Thread Rob Herring
This adds the build infrastructure for checking DT binding schema
documents and validating dts files using the binding schema.

Check DT binding schema documents:
make dt_binding_check

Build dts files and check using DT binding schema:
make dtbs_check

Optionally, DT_SCHEMA_FILES can passed in with a schema file(s) to use
for validation. This makes it easier to find and fix errors generated by
a specific schema.

Currently, the validation targets are separate from a normal build to
avoid a hard dependency on the external DT schema project and because
there are lots of warnings generated.

Cc: Jonathan Corbet 
Cc: Mark Rutland 
Cc: Masahiro Yamada 
Cc: Michal Marek 
Cc: linux-...@vger.kernel.org
Cc: devicet...@vger.kernel.org
Cc: linux-kbu...@vger.kernel.org
Signed-off-by: Rob Herring 
---
v3:
- Fix error causing only 1st schema file to get used.
- Add a more useful error message when dtc is missing YAML support 
telling the user they need to install libyaml devel package.

 
 .gitignore   |  1 +
 Documentation/Makefile   |  2 +-
 Documentation/devicetree/bindings/.gitignore |  1 +
 Documentation/devicetree/bindings/Makefile   | 33 +
 Makefile | 11 --
 scripts/Makefile.lib | 37 ++--
 6 files changed, 80 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/.gitignore
 create mode 100644 Documentation/devicetree/bindings/Makefile

diff --git a/.gitignore b/.gitignore
index 97ba6b79834c..a20ac26aa2f5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -15,6 +15,7 @@
 *.bin
 *.bz2
 *.c.[012]*.*
+*.dt.yaml
 *.dtb
 *.dtb.S
 *.dwo
diff --git a/Documentation/Makefile b/Documentation/Makefile
index 2ca77ad0f238..9786957c6a35 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -2,7 +2,7 @@
 # Makefile for Sphinx documentation
 #
 
-subdir-y :=
+subdir-y := devicetree/bindings/
 
 # You can set these variables from the command line.
 SPHINXBUILD   = sphinx-build
diff --git a/Documentation/devicetree/bindings/.gitignore 
b/Documentation/devicetree/bindings/.gitignore
new file mode 100644
index ..d9194c02dd08
--- /dev/null
+++ b/Documentation/devicetree/bindings/.gitignore
@@ -0,0 +1 @@
+*.example.dts
diff --git a/Documentation/devicetree/bindings/Makefile 
b/Documentation/devicetree/bindings/Makefile
new file mode 100644
index ..43f8657ab070
--- /dev/null
+++ b/Documentation/devicetree/bindings/Makefile
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: GPL-2.0
+DT_DOC_CHECKER ?= dt-doc-validate
+DT_EXTRACT_EX ?= dt-extract-example
+DT_MK_SCHEMA ?= dt-mk-schema
+DT_MK_SCHEMA_FLAGS := $(if $(DT_SCHEMA_FILES), -u)
+
+quiet_cmd_chk_binding = CHKDT   $<
+  cmd_chk_binding = (set -e; \
+ $(DT_DOC_CHECKER) $< ; \
+ mkdir -p $(dir $@) ; \
+ $(DT_EXTRACT_EX) $< > $@ )
+
+$(obj)/%.example.dts: $(src)/%.yaml FORCE
+   $(call if_changed,chk_binding)
+
+DT_TMP_SCHEMA := .schema.yaml.tmp
+extra-y += $(DT_TMP_SCHEMA)
+
+quiet_cmd_mk_schema = SCHEMA  $@
+  cmd_mk_schema = mkdir -p $(obj); \
+  rm -f $@; \
+  $(DT_MK_SCHEMA) $(DT_MK_SCHEMA_FLAGS) -o $@ $(filter-out 
FORCE, $^)
+
+DT_DOCS = $(shell cd $(srctree)/$(src) && find * -name '*.yaml')
+DT_SCHEMA_FILES ?= $(addprefix $(src)/,$(DT_DOCS))
+
+extra-y += $(patsubst $(src)/%.yaml,%.example.dts, $(DT_SCHEMA_FILES))
+extra-y += $(patsubst $(src)/%.yaml,%.example.dtb, $(DT_SCHEMA_FILES))
+
+$(obj)/$(DT_TMP_SCHEMA): | $(addprefix $(obj)/,$(patsubst 
$(src)/%.yaml,%.example.dtb, $(DT_SCHEMA_FILES)))
+
+$(obj)/$(DT_TMP_SCHEMA): $(addprefix $(srctree)/, $(DT_SCHEMA_FILES)) FORCE
+   $(call if_changed,mk_schema)
diff --git a/Makefile b/Makefile
index 2f36db897895..ff59adf43300 100644
--- a/Makefile
+++ b/Makefile
@@ -1232,10 +1232,13 @@ ifneq ($(dtstree),)
 %.dtb: prepare3 scripts_dtc
$(Q)$(MAKE) $(build)=$(dtstree) $(dtstree)/$@
 
-PHONY += dtbs dtbs_install
+PHONY += dtbs dtbs_install dt_binding_check
 dtbs: prepare3 scripts_dtc
$(Q)$(MAKE) $(build)=$(dtstree)
 
+dtbs_check: prepare3 dt_binding_check
+   $(Q)$(MAKE) $(build)=$(dtstree) CHECK_DTBS=1
+
 dtbs_install:
$(Q)$(MAKE) $(dtbinst)=$(dtstree)
 
@@ -1249,6 +1252,9 @@ PHONY += scripts_dtc
 scripts_dtc: scripts_basic
$(Q)$(MAKE) $(build)=scripts/dtc
 
+dt_binding_check: scripts_dtc
+   $(Q)$(MAKE) $(build)=Documentation/devicetree/bindings
+
 # ---
 # Modules
 
@@ -1611,7 +1617,8 @@ clean: $(clean-dirs)
$(call cmd,rmfiles)
@find $(if $(KBUILD_EXTMOD), $(KBUILD_EXTMOD), .) $(RCS_FIND_IGNORE) \
\( -name '*.[aios]' -o -name '*.ko' -o -name '.*.cmd' \
-   -o -name '*.ko.*' -o -name '*.dtb' -o -name '*.dtb.S' \
+   -o -name '*.ko.*' \
+   

Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

2018-12-10 Thread Thiago Jung Bauermann


Hello Michael,

Michael Bringmann  writes:

> I have asked Scott Mayes to take a look at one of these crashes from
> the phyp side.  I will let you know if he finds anything notable.

Thanks! It might make sense to test whether booting with
cede_offline=off makes the bug go away.

One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
>From what I understand, it is a powerpc-specific CPU state and from the
perspective of the generic CPU hotplug state machine, inactive CPUs are
already fully offline. Which means that the locking performed by the
generic code state machine doesn't apply to transitioning CPUs from
INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
one CPU making that transition at the same time? That would cause two
CPUs to call RTAS stop-self.

I haven't checked whether this is really possible or not, though. It's
just a conjecture.

--
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

2018-12-10 Thread Michael Bringmann
I have asked Scott Mayes to take a look at one of these crashes from
the phyp side.  I will let you know if he finds anything notable.

Michael

On 12/07/2018 08:40 PM, Thiago Jung Bauermann wrote:
> 
> Gautham R Shenoy  writes:
>> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>>> Sure. I will test the patch and report back.
>>
>> I added the following debug patch on top of your patch, and after an
>> hour's run, the system crashed. Appending the log at the end.
> 
> Thank you very much for testing! Your debug patch was very helpful as
> well.
> 
>> I suppose we still need to increase the number of tries since we wait
>> only for 2.5ms looping before giving up.
> 
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
> 
> "Querying DEAD? cpu %i (%i) shows %i\n"
> 
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.
> 
> I think there is something else going on which we don't fully understand
> yet. From your other email:
> 
>> I agree that the Kernel has to respect RTAS's restriction. The PAPR
>> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>>
>> "The stop-self service needs to be serialized with calls to the
>>  stop-self, start-cpu, and set-power-level services. The OS must
>>  be able to call RTAS services on other processors while the
>>  processor is stopped or being stopped"
>>
>> Thus the onus is on the OS to ensure that there are no concurrent rtas
>> calls with "stop-self" token.
> 
> As you say perhaps there's another call to stop-self, start-cpu or
> set-power-level being made concurrently. I don't currently see how more
> than one stop-self or start-cpu call could be in flight at the same time
> given that there are a number of locks being grabbed during CPU hotplug
> and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
> grab any locks itself so it's a possibility.
> 
> As for set-power-level, it's only used in the case of PCI hotplug from
> what I can see, and that isn't part of the picture in this case, right?
> 
> We could address this problem directly by adding another lock separate
> from rtas.lock to serialize just these calls. The challenge is with
> stop-self, because the CPU calling it will never return to release the
> lock. Is it possible to grab a lock (or down a semaphore) in the CPU
> calling stop-self and then release the lock (or up the semaphore) in the
> CPU running pseries_cpu_die()?
> 
>>> There's also a race between the CPU driving the unplug and the CPU
>>> being unplugged which I think is not easy for the CPU being
>>> unplugged to win, which makes the busy loop in pseries_cpu_die() a
>>> bit fragile. I describe the race in the patch description.
>>>
>>> My solution to make the race less tight is to make the CPU driving
>>> the unplug to only start the busy loop only after the CPU being
>>> unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
>>> that it either is about to call RTAS or it already has.
>>
>> Ah, yes this is good optimization. Though, I think we ought to
>> unconditionally wait until the target CPU has woken up from CEDE and
>> changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
>> would have caught it in dlpar_offline_cpu() itself.
> 
> I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
> success when it had been given an invalid memory address to load from,
> so my confidence in the error reporting of hcalls is a bit shaken. :-)
> 
> In that case the CPU would wait forever for the CPU state to change. If
> you believe 100 ms is too short a timeout, we could make it 500 ms or
> even 1s. What do you think?
> 
>> cpu 112 (hwid 112) Ready to die...
>> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
>> cpu 113 (hwid 113) Ready to die...
>> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
>> cpu 114 (hwid 114) Ready to die...
>> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
>> cpu 115 (hwid 115) Ready to die...
>> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
>> cpu 116 (hwid 116) Ready to die...
>> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
>> cpu 117 (hwid 117) Ready to die...
>> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
>> cpu 118 (hwid 118) Ready to die...
>> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
>> cpu 119 (hwid 119) Ready to die...
>> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
>> cpu 104 (hwid 104) Ready to die...
>> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40
> 
> Interesting, so 1.2 ms can pass before the dying CPU actually gets close
> to making the stop-self call. And even in those cases the retry loop is
> succeeding on the first try! So this shows 

Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Rui Salvaterra
On Mon, 10 Dec 2018 at 19:33, Christoph Hellwig  wrote:
>
> On Mon, Dec 10, 2018 at 05:04:46PM +, Rui Salvaterra wrote:
> > Hi, Christoph and Ben,
> >
> > It just came to my mind (and this is most likely a stupid question,
> > but still)… Is there any possibility of these changes having an
> > (positive) effect on the long-standing problem of Power Mac machines
> > with AGP graphics cards (which have to be limited to PCI transfers,
> > otherwise they'll hang, due to coherence issues)? If so, I have a G4
> > machine where I'd gladly test them.
>
> These patches themselves are not going to affect that directly.
> But IFF the problem really is that the AGP needs to be treated as not
> cache coherent (I have no idea if that is true) the generic direct
> mapping code has full support for a per-device coherent flag, so
> support for a non-coherent AGP slot could be implemented relatively
> simply.

Thanks for the insight, Christoph. Well, the problem[1] is real, and
it's been known for a long time, though I can't be sure if it's *only*
a coherence issue. If someone who knows the hardware manages to cook
up a patch (as hacky is it may be), I'll definitely fire up old my G4
laptop to test it! :)

[1] https://bugs.freedesktop.org/show_bug.cgi?id=95017


[PATCH v1 2/2] powerpc/pseries: Add debugfs interface to retrieve VPHN info

2018-12-10 Thread Naveen N. Rao
Add debugfs interface to retrieve associativity information for lpar
vcpus (debugfs/vphn/lpar) and the hypervisor cpus (debugfs/vphn/hyp).
This information is useful to derive various metrics, including the vcpu
dispatch statistics in a SPLPAR environment.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/mm/numa.c | 105 +
 1 file changed, 105 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6677a578f18d..f0b0e87016e6 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int numa_enabled = 1;
 
@@ -1089,6 +1090,107 @@ static long hcall_vphn(unsigned long cpu, u64 flags, 
__be32 *associativity)
return rc;
 }
 
+#ifdef CONFIG_DEBUG_FS
+static ssize_t vphn_lpar_cpu_file_read(struct file *filp, char __user *buf,
+   size_t len, loff_t *pos)
+{
+   int cpu = (long)filp->private_data;
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   int hwcpu = get_hard_smp_processor_id(cpu);
+   long int rc;
+
+   if (len != sizeof(associativity))
+   return -EINVAL;
+
+   rc = hcall_vphn(hwcpu, 1, associativity);
+   if (rc)
+   return -EFAULT;
+
+   rc = copy_to_user(buf, , sizeof(associativity));
+   if (rc)
+   return -EFAULT;
+
+   return sizeof(associativity);
+}
+
+static ssize_t vphn_hyp_cpu_file_read(struct file *filp, char __user *buf,
+   size_t len, loff_t *pos)
+{
+   int cpu = (long)filp->private_data;
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long int rc;
+
+   if (len != sizeof(associativity))
+   return -EINVAL;
+
+   rc = hcall_vphn(cpu, 2, associativity);
+   if (rc)
+   return -EFAULT;
+
+   rc = copy_to_user(buf, , sizeof(associativity));
+   if (rc)
+   return -EFAULT;
+
+   return sizeof(associativity);
+}
+
+static const struct file_operations vphn_lpar_cpu_fops = {
+   .open   = simple_open,
+   .read   = vphn_lpar_cpu_file_read,
+   .llseek = no_llseek,
+};
+
+static const struct file_operations vphn_hyp_cpu_fops = {
+   .open   = simple_open,
+   .read   = vphn_hyp_cpu_file_read,
+   .llseek = no_llseek,
+};
+
+static int debug_init_vphn_entries(void)
+{
+   struct dentry *vphn_dir, *vphn_lpar_dir, *vphn_hyp_dir;
+   struct dentry *vphn_lpar_cpu_file, *vphn_hyp_cpu_file;
+   long cpu;
+   char name[10];
+
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
+   vphn_dir = debugfs_create_dir("vphn", powerpc_debugfs_root);
+   if (!vphn_dir) {
+   pr_warn("%s: can't create vphn debugfs root dir\n", __func__);
+   return -ENOMEM;
+   }
+
+   vphn_lpar_dir = debugfs_create_dir("lpar", vphn_dir);
+   vphn_hyp_dir = debugfs_create_dir("hyp", vphn_dir);
+   if (!vphn_lpar_dir || !vphn_hyp_dir) {
+   pr_warn("%s: can't create vphn dir\n", __func__);
+   goto err_remove_dir;
+   }
+
+   for_each_possible_cpu(cpu) {
+   sprintf(name, "cpu-%ld", cpu);
+   vphn_lpar_cpu_file = debugfs_create_file(name, 0400,
+   vphn_lpar_dir, (void *)cpu, 
_lpar_cpu_fops);
+   vphn_hyp_cpu_file = debugfs_create_file(name, 0400,
+   vphn_hyp_dir, (void *)cpu, _hyp_cpu_fops);
+   if (!vphn_lpar_cpu_file || !vphn_hyp_cpu_file) {
+   pr_warn("%s: can't create vphn cpu file\n", __func__);
+   goto err_remove_dir;
+   }
+   }
+
+   return 0;
+
+err_remove_dir:
+   debugfs_remove_recursive(vphn_dir);
+   return -ENOMEM;
+}
+#else
+static int debug_init_vphn_entries(void) { return 0; }
+#endif /* CONFIG_DEBUG_FS */
+
 /*
  * Change polling interval for associativity changes.
  */
@@ -1619,6 +1721,9 @@ static int topology_update_init(void)
if (!proc_create("powerpc/topology_updates", 0644, NULL, _ops))
return -ENOMEM;
 
+   if (!debug_init_vphn_entries())
+   return -ENOMEM;
+
topology_inited = 1;
return 0;
 }
-- 
2.19.2



[PATCH v1 1/2] powerpc/pseries: Generalize hcall_vphn()

2018-12-10 Thread Naveen N. Rao
H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return
different associativity information in each case. Generalize the
existing hcall_vphn() function to take flags as an argument and to
return the result. Update the only existing user to pass the proper
arguments.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/mm/numa.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd004295..6677a578f18d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1078,6 +1078,17 @@ static void reset_topology_timer(void);
 static int topology_timer_secs = 1;
 static int topology_inited;
 
+static long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
+{
+   long rc;
+   long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
+
+   rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
+   vphn_unpack_associativity(retbuf, associativity);
+
+   return rc;
+}
+
 /*
  * Change polling interval for associativity changes.
  */
@@ -1156,25 +1167,13 @@ static int update_cpu_associativity_changes_mask(void)
  * Retrieve the new associativity information for a virtual processor's
  * home node.
  */
-static long hcall_vphn(unsigned long cpu, __be32 *associativity)
-{
-   long rc;
-   long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
-   u64 flags = 1;
-   int hwcpu = get_hard_smp_processor_id(cpu);
-
-   rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, hwcpu);
-   vphn_unpack_associativity(retbuf, associativity);
-
-   return rc;
-}
-
 static long vphn_get_associativity(unsigned long cpu,
__be32 *associativity)
 {
long rc;
+   int hwcpu = get_hard_smp_processor_id(cpu);
 
-   rc = hcall_vphn(cpu, associativity);
+   rc = hcall_vphn(hwcpu, 1, associativity);
 
switch (rc) {
case H_FUNCTION:
-- 
2.19.2



[PATCH v1 0/2] Interface to expose VPHN information

2018-12-10 Thread Naveen N. Rao
This is an alternate way to provide statistics around vcpu dispatches in 
a SPLPAR environment. By obtaining access to the VPHN information, and 
in concert with the existing debugfs dtl interface, we can derive 
different statistics related to how vcpus are dispatched by the 
hypervisor.

- Naveen

Naveen N. Rao (2):
  powerpc/pseries: Generalize hcall_vphn()
  powerpc/pseries: Add debugfs interface to retrieve VPHN info

 arch/powerpc/mm/numa.c | 132 -
 1 file changed, 118 insertions(+), 14 deletions(-)

-- 
2.19.2



Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Rafael David Tinoco
On 12/10/18 1:05 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 10, 2018 at 02:35:55PM +, Robin Murphy wrote:
>> On 10/12/2018 14:21, Rafael David Tinoco wrote:
>>> On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the
>>> physical frame number might be so big that zsmalloc obj encoding (to
>>> location) will break, causing:
>>>
>>> BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc
>>> Read of size 4 at addr  by task mkfs.ext4/623
>>> CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 
>>> 4.19.0-rc8-00017-g8239bc6d3307-dirty #15
>>> Hardware name: Generic DT based system
>>> [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
>>> [] (show_stack) from [] (dump_stack+0xbc/0xe8)
>>> [] (dump_stack) from [] (kasan_report+0x248/0x390)
>>> [] (kasan_report) from [] (__asan_load4+0x78/0xb4)
>>> [] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc)
>>> [] (zs_map_object) from [] 
>>> (zram_bvec_rw.constprop.2+0x324/0x8e4 [zram])
>>> [] (zram_bvec_rw.constprop.2 [zram]) from [] 
>>> (zram_make_request+0x234/0x46c [zram])
>>> [] (zram_make_request [zram]) from [] 
>>> (generic_make_request+0x304/0x63c)
>>> [] (generic_make_request) from [] 
>>> (submit_bio+0x4c/0x1c8)
>>> [] (submit_bio) from [] 
>>> (submit_bh_wbc.constprop.15+0x238/0x26c)
>>> [] (submit_bh_wbc.constprop.15) from [] 
>>> (__block_write_full_page+0x524/0x76c)
>>> [] (__block_write_full_page) from [] 
>>> (block_write_full_page+0x1bc/0x1d4)
>>> [] (block_write_full_page) from [] 
>>> (blkdev_writepage+0x24/0x28)
>>> [] (blkdev_writepage) from [] (__writepage+0x44/0x78)
>>> [] (__writepage) from [] (write_cache_pages+0x3b8/0x800)
>>> [] (write_cache_pages) from [] 
>>> (generic_writepages+0x74/0xa0)
>>> [] (generic_writepages) from [] 
>>> (blkdev_writepages+0x18/0x1c)
>>> [] (blkdev_writepages) from [] 
>>> (do_writepages+0x68/0x134)
>>> [] (do_writepages) from [] 
>>> (__filemap_fdatawrite_range+0xb0/0xf4)
>>> [] (__filemap_fdatawrite_range) from [] 
>>> (file_write_and_wait_range+0x64/0xd0)
>>> [] (file_write_and_wait_range) from [] 
>>> (blkdev_fsync+0x54/0x84)
>>> [] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4)
>>> [] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80)
>>> [] (do_fsync) from [] (sys_fsync+0x1c/0x20)
>>> [] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c)
>>>
>>> when trying to decode (the pfn) and map the object.
>>>
>>> That happens because one architecture might not re-define
>>> MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For
>>> 32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will
>>> default to BITS_PER_LONG (32) in most cases, and, with PAE enabled,
>>> _PFN_BITS might be wrong: which may cause obj variable to overflow if
>>> frame number is in HIGHMEM and referencing a page above the 4GB
>>> watermark.
>>>
>>> commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if
>>> not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers
>>> and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't
>>> used, like in the example given above.
>>>
>>> Systems with potential for PAE exist for a long time and assuming
>>> BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better,
>>> however it is NOT a constant anymore for x86.
>>>
>>> SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every
>>> architecture using zsmalloc, together with a sanity check for
>>> MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.
>>>
>>> Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17
>>> Signed-off-by: Rafael David Tinoco 
>>> ---
>>>  arch/arm/include/asm/pgtable-2level-types.h |  2 ++
>>>  arch/arm/include/asm/pgtable-3level-types.h |  2 ++
>>>  arch/arm64/include/asm/pgtable-types.h  |  2 ++
>>>  arch/ia64/include/asm/page.h|  2 ++
>>>  arch/mips/include/asm/page.h|  2 ++
>>>  arch/powerpc/include/asm/mmu.h  |  2 ++
>>>  arch/s390/include/asm/page.h|  2 ++
>>>  arch/sh/include/asm/page.h  |  2 ++
>>>  arch/sparc/include/asm/page_32.h|  2 ++
>>>  arch/sparc/include/asm/page_64.h|  2 ++
>>>  arch/x86/include/asm/pgtable-2level_types.h |  2 ++
>>>  arch/x86/include/asm/pgtable-3level_types.h |  3 +-
>>>  arch/x86/include/asm/pgtable_64_types.h |  4 +--
>>>  mm/zsmalloc.c   | 35 +++--
>>>  14 files changed, 45 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/arch/arm/include/asm/pgtable-2level-types.h 
>>> b/arch/arm/include/asm/pgtable-2level-types.h
>>> index 66cb5b0e89c5..552dba411324 100644
>>> --- a/arch/arm/include/asm/pgtable-2level-types.h
>>> +++ b/arch/arm/include/asm/pgtable-2level-types.h
>>> @@ -64,4 +64,6 @@ typedef pteval_t pgprot_t;
>>>  #endif /* STRICT_MM_TYPECHECKS */
>>> +#define MAX_POSSIBLE_PHYSMEM_BITS 32
>>> +
>>>  #endif /* _ASM_PGTABLE_2LEVEL_TYPES_H */
>>> diff --git a/arch/arm/include/asm/pgtable-3level-types.h 
>>> 

Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Rafael David Tinoco
On 12/10/18 1:15 PM, Kirill A. Shutemov wrote:
> On Mon, Dec 10, 2018 at 12:21:05PM -0200, Rafael David Tinoco wrote:
>> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>> b/arch/x86/include/asm/pgtable_64_types.h
>> index 84bd9bdc1987..d808cfde3d19 100644
>> --- a/arch/x86/include/asm/pgtable_64_types.h
>> +++ b/arch/x86/include/asm/pgtable_64_types.h
>> @@ -64,8 +64,6 @@ extern unsigned int ptrs_per_p4d;
>>  #define P4D_SIZE(_AC(1, UL) << P4D_SHIFT)
>>  #define P4D_MASK(~(P4D_SIZE - 1))
>>  
>> -#define MAX_POSSIBLE_PHYSMEM_BITS   52
>> -
>>  #else /* CONFIG_X86_5LEVEL */
>>  
>>  /*
>> @@ -154,4 +152,6 @@ extern unsigned int ptrs_per_p4d;
>>  
>>  #define PGD_KERNEL_START((PAGE_SIZE / 2) / sizeof(pgd_t))
>>  
>> +#define MAX_POSSIBLE_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
>> +
> 
> ...
> 
>>  #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> index 0787d33b80d8..132c20b6fd4f 100644
>> --- a/mm/zsmalloc.c
>> +++ b/mm/zsmalloc.c
> 
> ...
> 
>> @@ -116,6 +100,25 @@
>>   */
>>  #define OBJ_ALLOCATED_TAG 1
>>  #define OBJ_TAG_BITS 1
>> +
>> +/*
>> + * MAX_POSSIBLE_PHYSMEM_BITS should be defined by all archs using zsmalloc:
>> + * Trying to guess it from MAX_PHYSMEM_BITS, or considering it 
>> BITS_PER_LONG,
>> + * proved to be wrong by not considering PAE capabilities, or using 
>> SPARSEMEM
>> + * only headers, leading to bad object encoding due to object index 
>> overflow.
>> + */
>> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
>> + #define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>> + #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using 
>> zsmalloc";
>> +#else
>> + #ifndef CONFIG_64BIT
>> +  #if (MAX_POSSIBLE_PHYSMEM_BITS >= (BITS_PER_LONG + PAGE_SHIFT - 
>> OBJ_TAG_BITS))
>> +   #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
>> +  #endif
>> + #endif
>> +#endif
>> +
>> +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>>  #define OBJ_INDEX_BITS  (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
>>  #define OBJ_INDEX_MASK  ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
> 
> Have you tested it with CONFIG_X86_5LEVEL=y?
> 
> ASAICS, the patch makes OBJ_INDEX_BITS and what depends from it dynamic --
> it depends what paging mode we are booting in. ZS_SIZE_CLASSES depends
> indirectly on OBJ_INDEX_BITS and I don't see how struct zs_pool definition
> can compile with dynamic ZS_SIZE_CLASSES.
> 
> Hm?
> 

You're right, terribly sorry. This was a last time change.

mm/zsmalloc.c:256:21: error: variably modified ‘size_class’ at file scope

I'll revisit the patch. Any other comments are welcome.

Thank you



[PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Rafael David Tinoco
On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the
physical frame number might be so big that zsmalloc obj encoding (to
location) will break, causing:

BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc
Read of size 4 at addr  by task mkfs.ext4/623
CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 
4.19.0-rc8-00017-g8239bc6d3307-dirty #15
Hardware name: Generic DT based system
[] (unwind_backtrace) from [] (show_stack+0x20/0x24)
[] (show_stack) from [] (dump_stack+0xbc/0xe8)
[] (dump_stack) from [] (kasan_report+0x248/0x390)
[] (kasan_report) from [] (__asan_load4+0x78/0xb4)
[] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc)
[] (zs_map_object) from [] 
(zram_bvec_rw.constprop.2+0x324/0x8e4 [zram])
[] (zram_bvec_rw.constprop.2 [zram]) from [] 
(zram_make_request+0x234/0x46c [zram])
[] (zram_make_request [zram]) from [] 
(generic_make_request+0x304/0x63c)
[] (generic_make_request) from [] (submit_bio+0x4c/0x1c8)
[] (submit_bio) from [] 
(submit_bh_wbc.constprop.15+0x238/0x26c)
[] (submit_bh_wbc.constprop.15) from [] 
(__block_write_full_page+0x524/0x76c)
[] (__block_write_full_page) from [] 
(block_write_full_page+0x1bc/0x1d4)
[] (block_write_full_page) from [] 
(blkdev_writepage+0x24/0x28)
[] (blkdev_writepage) from [] (__writepage+0x44/0x78)
[] (__writepage) from [] (write_cache_pages+0x3b8/0x800)
[] (write_cache_pages) from [] 
(generic_writepages+0x74/0xa0)
[] (generic_writepages) from [] 
(blkdev_writepages+0x18/0x1c)
[] (blkdev_writepages) from [] (do_writepages+0x68/0x134)
[] (do_writepages) from [] 
(__filemap_fdatawrite_range+0xb0/0xf4)
[] (__filemap_fdatawrite_range) from [] 
(file_write_and_wait_range+0x64/0xd0)
[] (file_write_and_wait_range) from [] 
(blkdev_fsync+0x54/0x84)
[] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4)
[] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80)
[] (do_fsync) from [] (sys_fsync+0x1c/0x20)
[] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c)

when trying to decode (the pfn) and map the object.

That happens because one architecture might not re-define
MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For
32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will
default to BITS_PER_LONG (32) in most cases, and, with PAE enabled,
_PFN_BITS might be wrong: which may cause obj variable to overflow if
frame number is in HIGHMEM and referencing a page above the 4GB
watermark.

commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if
not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers
and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't
used, like in the example given above.

Systems with potential for PAE exist for a long time and assuming
BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better,
however it is NOT a constant anymore for x86.

SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every
architecture using zsmalloc, together with a sanity check for
MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.

Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17
Signed-off-by: Rafael David Tinoco 
---
 arch/arm/include/asm/pgtable-2level-types.h |  2 ++
 arch/arm/include/asm/pgtable-3level-types.h |  2 ++
 arch/arm64/include/asm/pgtable-types.h  |  2 ++
 arch/ia64/include/asm/page.h|  2 ++
 arch/mips/include/asm/page.h|  2 ++
 arch/powerpc/include/asm/mmu.h  |  2 ++
 arch/s390/include/asm/page.h|  2 ++
 arch/sh/include/asm/page.h  |  2 ++
 arch/sparc/include/asm/page_32.h|  2 ++
 arch/sparc/include/asm/page_64.h|  2 ++
 arch/x86/include/asm/pgtable-2level_types.h |  2 ++
 arch/x86/include/asm/pgtable-3level_types.h |  3 +-
 arch/x86/include/asm/pgtable_64_types.h |  4 +--
 mm/zsmalloc.c   | 35 +++--
 14 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level-types.h 
b/arch/arm/include/asm/pgtable-2level-types.h
index 66cb5b0e89c5..552dba411324 100644
--- a/arch/arm/include/asm/pgtable-2level-types.h
+++ b/arch/arm/include/asm/pgtable-2level-types.h
@@ -64,4 +64,6 @@ typedef pteval_t pgprot_t;
 
 #endif /* STRICT_MM_TYPECHECKS */
 
+#define MAX_POSSIBLE_PHYSMEM_BITS 32
+
 #endif /* _ASM_PGTABLE_2LEVEL_TYPES_H */
diff --git a/arch/arm/include/asm/pgtable-3level-types.h 
b/arch/arm/include/asm/pgtable-3level-types.h
index 921aa30259c4..664c39e6517c 100644
--- a/arch/arm/include/asm/pgtable-3level-types.h
+++ b/arch/arm/include/asm/pgtable-3level-types.h
@@ -67,4 +67,6 @@ typedef pteval_t pgprot_t;
 
 #endif /* STRICT_MM_TYPECHECKS */
 
+#define MAX_POSSIBLE_PHYSMEM_BITS 36
+
 #endif /* _ASM_PGTABLE_3LEVEL_TYPES_H */
diff --git a/arch/arm64/include/asm/pgtable-types.h 
b/arch/arm64/include/asm/pgtable-types.h
index 345a072b5856..45c3834eb4c8 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -64,4 

[PATCH 3/6] powerpc/frace: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack

2018-12-10 Thread Steven Rostedt
From: "Steven Rostedt (VMware)" 

[
  Folks, I'm working on rewriting the function graph tracer. In order to
  do so, some changes need to be done that affect architecture specific
  code. I'm only able to compile test these changes. I would like to
  have folks check out my repo and give them a test.

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
  ftrace/core

  Head SHA1: 51584396cff54aaf57ed0bd353767d71429f77b4
]

The structure of the ret_stack array on the task struct is going to
change, and accessing it directly via the curr_ret_stack index will no
longer give the ret_stack entry that holds the return address. To access
that, architectures must now use ftrace_graph_get_ret_stack() to get the
associated ret_stack that matches the saved return address.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Steven Rostedt (VMware) 
---
 arch/powerpc/kernel/process.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 96f34730010f..ce393df243aa 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2061,9 +2061,10 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
int count = 0;
int firstframe = 1;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   int curr_frame = current->curr_ret_stack;
+   struct ftrace_ret_stack *ret_stack;
extern void return_to_handler(void);
unsigned long rth = (unsigned long)return_to_handler;
+   int curr_frame = 0;
 #endif
 
sp = (unsigned long) stack;
@@ -2089,9 +2090,13 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
printk("["REG"] ["REG"] %pS", sp, ip, (void *)ip);
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
if ((ip == rth) && curr_frame >= 0) {
-   pr_cont(" (%pS)",
-  (void 
*)current->ret_stack[curr_frame].ret);
-   curr_frame--;
+   ret_stack = ftrace_graph_get_ret_stack(current,
+ curr_frame++);
+   if (ret_stack)
+   pr_cont(" (%pS)",
+   (void *)ret_stack->ret);
+   else
+   curr_frame = -1;
}
 #endif
if (firstframe)
-- 
2.19.1




Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Christoph Hellwig
On Mon, Dec 10, 2018 at 05:04:46PM +, Rui Salvaterra wrote:
> Hi, Christoph and Ben,
> 
> It just came to my mind (and this is most likely a stupid question,
> but still)… Is there any possibility of these changes having an
> (positive) effect on the long-standing problem of Power Mac machines
> with AGP graphics cards (which have to be limited to PCI transfers,
> otherwise they'll hang, due to coherence issues)? If so, I have a G4
> machine where I'd gladly test them.

These patches themselves are not going to affect that directly.
But IFF the problem really is that the AGP needs to be treated as not
cache coherent (I have no idea if that is true) the generic direct
mapping code has full support for a per-device coherent flag, so
support for a non-coherent AGP slot could be implemented relatively
simply.


Re: [PATCH] lib: fix build failure in CONFIG_DEBUG_VIRTUAL test

2018-12-10 Thread Kees Cook
On Mon, Dec 10, 2018 at 12:08 AM Christophe Leroy
 wrote:
>
> On several arches, virt_to_phys() is in io.h
>
> Build fails without it:
>
>   CC  lib/test_debug_virtual.o
> lib/test_debug_virtual.c: In function 'test_debug_virtual_init':
> lib/test_debug_virtual.c:26:7: error: implicit declaration of function 
> 'virt_to_phys' [-Werror=implicit-function-declaration]
>   pa = virt_to_phys(va);
>^
>
> Fixes: e4dace361552 ("lib: add test module for CONFIG_DEBUG_VIRTUAL")
> CC: sta...@vger.kernel.org
> Signed-off-by: Christophe Leroy 

Reviewed-by: Kees Cook 

-Kees

> ---
>  lib/test_debug_virtual.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/lib/test_debug_virtual.c b/lib/test_debug_virtual.c
> index d5a06addeb27..bf864c73e462 100644
> --- a/lib/test_debug_virtual.c
> +++ b/lib/test_debug_virtual.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #ifdef CONFIG_MIPS
> --
> 2.13.3
>


-- 
Kees Cook


Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-10 Thread Michael Roth
Quoting Daniel Borkmann (2018-12-10 08:26:31)
> On 12/07/2018 04:36 PM, Michael Roth wrote:
> > Quoting Michael Ellerman (2018-12-07 06:31:13)
> >> Michael Roth  writes:
> >>
> >>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> >>> JIT allocations. At compile time it defaults to PAGE_SIZE * 4,
> >>> and is adjusted again at init time if MODULES_VADDR is defined.
> >>>
> >>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
> >>
> >> But maybe it should be, I don't know why we don't define it.
> >>
> >>> the compile-time default at boot-time, which is 0x9c40 when
> >>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> >>> value:
> >>>
> >>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
> >>>   -1673527296
> >>>
> >>> and can cause various unexpected failures throughout the network
> >>> stack. In one case `strace dhclient eth0` reported:
> >>>
> >>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, 
> >>> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
> >>>
> >>> and similar failures can be seen with tools like tcpdump. This doesn't
> >>> always reproduce however, and I'm not sure why. The more consistent
> >>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> >>> would time out on systemd/netplan configuring a virtio-net NIC with no
> >>> noticeable errors in the logs.
> >>>
> >>> Fix this by limiting the compile-time default for bpf_jit_limit to
> >>> INT_MAX.
> >>
> >> INT_MAX is a lot more than (4k * 4), so I guess I'm not clear on
> >> whether we should be using PAGE_SIZE here at all. I guess each BPF
> >> program uses at least one page is the thinking?
> > 
> > That seems to be the case, at least, the max number of minimum-sized
> > allocations would be less on ppc64 since the allocations are always at
> > least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
> > so it seemed consistent to do that here too.
> > 
> >>
> >> Thanks for tracking this down. For some reason none of my ~10 test boxes
> >> have hit this, perhaps I don't have new enough userspace?
> > 
> > I'm not too sure, I would've thought things like the dhclient case in
> > the commit log would fail every time, but sometimes I need to reboot the
> > guest before I start seeing the behavior. Maybe there's something special
> > about when JIT allocations are actually done that can affect
> > reproducibility?
> > 
> > In my case at least the virtio-net networking timeout was consistent
> > enough for a bisect, but maybe it depends on the specific network
> > configuration (single NIC, basic DHCP through netplan/systemd in my case).
> > 
> >>
> >> You don't mention why you needed to add BPF_MIN(), I assume because the
> >> kernel version of min() has gotten too complicated to work here?
> > 
> > I wasn't sure if it was safe here or not, so I tried looking at other
> > users and came across:
> > 
> > mm/vmalloc.c:777:#define VMAP_MIN(x, y)   ((x) < (y) ? (x) : 
> > (y)) /* can't use min() */
> > 
> > I'm not sure what the reasoning was (or whether it still applies), but I
> > figured it was safer to do the same here. Maybe Nick still recalls?
> > 
> >>
> >> Daniel I assume you'll merge this via your tree?
> >>
> >> cheers
> >>
> >>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
> >>> allocations")
> >>> Cc: linuxppc-...@ozlabs.org
> >>> Cc: Daniel Borkmann 
> >>> Cc: Sandipan Das 
> >>> Cc: Alexei Starovoitov 
> >>> Signed-off-by: Michael Roth 
> 
> Thanks for the reports / fixes and sorry for my late reply (bit too
> swamped last week), some more thoughts below.
> 
> >>>  kernel/bpf/core.c | 3 ++-
> >>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >>> index b1a3545d0ec8..55de4746cdfd 100644
> >>> --- a/kernel/bpf/core.c
> >>> +++ b/kernel/bpf/core.c
> >>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >>>  }
> >>>  
> >>>  #ifdef CONFIG_BPF_JIT
> >>> -# define BPF_JIT_LIMIT_DEFAULT   (PAGE_SIZE * 4)
> >>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> >>> +# define BPF_JIT_LIMIT_DEFAULT   BPF_MIN((PAGE_SIZE * 4), 
> >>> INT_MAX)
> >>>  
> >>>  /* All BPF JIT sysctl knobs here. */
> >>>  int bpf_jit_enable   __read_mostly = 
> >>> IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> 
> I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
> define also given for 4.21 arm64 will have its own dedicated area for
> JIT allocations where neither the above limit nor the MODULES_END/
> MODULES_VADDR one would fit and I don't want to make this even more
> ugly with adding further cases into the core. Would the below variant
> work for you?

Looks good to me. My one concern (which is probably a separate
issue) is that the INT_MAX limit is a bit more punishing for larger
page sizes since the minimum allocations seem to be 1 page. Are there
reasonable workloads that 

Re: [PATCH] ocxl: Clarify error path in setup_xsl_irq()

2018-12-10 Thread Frederic Barrat




Le 10/12/2018 à 16:18, Greg Kurz a écrit :

Implementing rollback with goto and labels is a common practice that
leads to prettier and more maintainable code. FWIW, this design pattern
is already being used in alloc_link() a few lines below in this file.

Do the same in setup_xsl_irq().

Signed-off-by: Greg Kurz 
---


This looks good. I don't have a fixed limit when I start using the "goto 
undo" pattern, so it's likely inconsistent in other places as well. 
Truth is I'm not too fussed either way.


Acked-by: Frederic Barrat 





  drivers/misc/ocxl/link.c |   23 ++-
  1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index eed92055184d..659977a17405 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -273,9 +273,9 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
spa->irq_name = kasprintf(GFP_KERNEL, "ocxl-xsl-%x-%x-%x",
link->domain, link->bus, link->dev);
if (!spa->irq_name) {
-   unmap_irq_registers(spa);
dev_err(>dev, "Can't allocate name for xsl interrupt\n");
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto err_xsl;
}
/*
 * At some point, we'll need to look into allowing a higher
@@ -283,11 +283,10 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
 */
spa->virq = irq_create_mapping(NULL, hwirq);
if (!spa->virq) {
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"irq_create_mapping failed for translation 
interrupt\n");
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_name;
}

dev_dbg(>dev, "hwirq %d mapped to virq %d\n", hwirq, spa->virq);
@@ -295,15 +294,21 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
rc = request_irq(spa->virq, xsl_fault_handler, 0, spa->irq_name,
link);
if (rc) {
-   irq_dispose_mapping(spa->virq);
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"request_irq failed for translation interrupt: %d\n",
rc);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_mapping;
}
return 0;
+
+err_mapping:
+   irq_dispose_mapping(spa->virq);
+err_name:
+   kfree(spa->irq_name);
+err_xsl:
+   unmap_irq_registers(spa);
+   return rc;
  }

  static void release_xsl_irq(struct link *link)





Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Rui Salvaterra
On Sat, 8 Dec 2018 at 17:17, Christoph Hellwig  wrote:
>
> On Sun, Dec 02, 2018 at 05:11:02PM +1100, Benjamin Herrenschmidt wrote:
> > Talking of which ... Christoph, not sure if we can do something about
> > this at the DMA API level or keep hacks but some adapters such as the
> > nVidia GPUs have a HW hack we can use to work around their limitations
> > in that case.
> >
> > They have a register that can program a fixed value for the top bits
> > that they don't support.
> >
> > This works fine for any linear mapping with an offset, provided they
> > can program the offset in that register and they have enough DMA range
> > to cover all memory from that offset.
> >
> > I can probably get the info about this from them so we can exploit it
> > in nouveau.
>
> I think we can expose the direct mapping offset if people care enough,
> we just have to be very careful designing the API.  I'll happily leave
> that to those that actually want to use it, but I'll gladly help
> reviewing it.

Hi, Christoph and Ben,

It just came to my mind (and this is most likely a stupid question,
but still)… Is there any possibility of these changes having an
(positive) effect on the long-standing problem of Power Mac machines
with AGP graphics cards (which have to be limited to PCI transfers,
otherwise they'll hang, due to coherence issues)? If so, I have a G4
machine where I'd gladly test them.

Thanks,

Rui


[PATCH] ocxl/afu_irq: Don't include

2018-12-10 Thread Greg Kurz
The AFU irq code doesn't need to reach out to the platform.

Signed-off-by: Greg Kurz 
---
 drivers/misc/ocxl/afu_irq.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
index e70cfa24577f..11ab996657a2 100644
--- a/drivers/misc/ocxl/afu_irq.c
+++ b/drivers/misc/ocxl/afu_irq.c
@@ -2,7 +2,6 @@
 // Copyright 2017 IBM Corp.
 #include 
 #include 
-#include 
 #include "ocxl_internal.h"
 #include "trace.h"
 



[PATCH] ocxl: Simplify free_spa()

2018-12-10 Thread Greg Kurz
The only users of free_spa() are alloc_link() and free_link(), and
in both cases:

- link->spa != NULL

- free_spa(link) is immediatly followed by kfree(link)

The check isn't needed, and it doesn't bring much to clear the link->spa
pointer. Drop both.

Signed-off-by: Greg Kurz 
---
 drivers/misc/ocxl/link.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 31695a078485..eed92055184d 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -352,11 +352,8 @@ static void free_spa(struct link *link)
pr_debug("Freeing SPA for %x:%x:%x\n", link->domain, link->bus,
link->dev);
 
-   if (spa && spa->spa_mem) {
-   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
-   kfree(spa);
-   link->spa = NULL;
-   }
+   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
+   kfree(spa);
 }
 
 static int alloc_link(struct pci_dev *dev, int PE_mask, struct link **out_link)



Re: [PATCH] ocxl: Simplify free_spa()

2018-12-10 Thread Frederic Barrat




Le 10/12/2018 à 16:15, Greg Kurz a écrit :

The only users of free_spa() are alloc_link() and free_link(), and
in both cases:

- link->spa != NULL

- free_spa(link) is immediatly followed by kfree(link)

The check isn't needed, and it doesn't bring much to clear the link->spa
pointer. Drop both.

Signed-off-by: Greg Kurz 
---


OK, that looks safe enough
Acked-by: Frederic Barrat 



  drivers/misc/ocxl/link.c |7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index 31695a078485..eed92055184d 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -352,11 +352,8 @@ static void free_spa(struct link *link)
pr_debug("Freeing SPA for %x:%x:%x\n", link->domain, link->bus,
link->dev);

-   if (spa && spa->spa_mem) {
-   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
-   kfree(spa);
-   link->spa = NULL;
-   }
+   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
+   kfree(spa);
  }

  static int alloc_link(struct pci_dev *dev, int PE_mask, struct link 
**out_link)





[PATCH] ocxl: Clarify error path in setup_xsl_irq()

2018-12-10 Thread Greg Kurz
Implementing rollback with goto and labels is a common practice that
leads to prettier and more maintainable code. FWIW, this design pattern
is already being used in alloc_link() a few lines below in this file.

Do the same in setup_xsl_irq().

Signed-off-by: Greg Kurz 
---
 drivers/misc/ocxl/link.c |   23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index eed92055184d..659977a17405 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -273,9 +273,9 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
spa->irq_name = kasprintf(GFP_KERNEL, "ocxl-xsl-%x-%x-%x",
link->domain, link->bus, link->dev);
if (!spa->irq_name) {
-   unmap_irq_registers(spa);
dev_err(>dev, "Can't allocate name for xsl interrupt\n");
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto err_xsl;
}
/*
 * At some point, we'll need to look into allowing a higher
@@ -283,11 +283,10 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
 */
spa->virq = irq_create_mapping(NULL, hwirq);
if (!spa->virq) {
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"irq_create_mapping failed for translation 
interrupt\n");
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_name;
}
 
dev_dbg(>dev, "hwirq %d mapped to virq %d\n", hwirq, spa->virq);
@@ -295,15 +294,21 @@ static int setup_xsl_irq(struct pci_dev *dev, struct link 
*link)
rc = request_irq(spa->virq, xsl_fault_handler, 0, spa->irq_name,
link);
if (rc) {
-   irq_dispose_mapping(spa->virq);
-   kfree(spa->irq_name);
-   unmap_irq_registers(spa);
dev_err(>dev,
"request_irq failed for translation interrupt: %d\n",
rc);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err_mapping;
}
return 0;
+
+err_mapping:
+   irq_dispose_mapping(spa->virq);
+err_name:
+   kfree(spa->irq_name);
+err_xsl:
+   unmap_irq_registers(spa);
+   return rc;
 }
 
 static void release_xsl_irq(struct link *link)



Re: [PATCH] ocxl/afu_irq: Don't include

2018-12-10 Thread Frederic Barrat




Le 10/12/2018 à 16:13, Greg Kurz a écrit :

The AFU irq code doesn't need to reach out to the platform.

Signed-off-by: Greg Kurz 
---


Acked-by: Frederic Barrat 




  drivers/misc/ocxl/afu_irq.c |1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
index e70cfa24577f..11ab996657a2 100644
--- a/drivers/misc/ocxl/afu_irq.c
+++ b/drivers/misc/ocxl/afu_irq.c
@@ -2,7 +2,6 @@
  // Copyright 2017 IBM Corp.
  #include 
  #include 
-#include 
  #include "ocxl_internal.h"
  #include "trace.h"






Re: [PATCH v2 01/34] kbuild: Add support for DT binding schema checks

2018-12-10 Thread Rob Herring
On Fri, Dec 7, 2018 at 10:47 PM Masahiro Yamada
 wrote:
>
> Hi Rob,
>
>
> On Tue, Dec 4, 2018 at 6:32 AM Rob Herring  wrote:
> >
> > This adds the build infrastructure for checking DT binding schema
> > documents and validating dts files using the binding schema.
> >
> > Check DT binding schema documents:
> > make dt_binding_check
> >
> > Build dts files and check using DT binding schema:
> > make dtbs_check
> >
> > Optionally, DT_SCHEMA_FILES can passed in with a schema file(s) to use
> > for validation. This makes it easier to find and fix errors generated by
> > a specific schema.
> >
> > Currently, the validation targets are separate from a normal build to
> > avoid a hard dependency on the external DT schema project and because
> > there are lots of warnings generated.
> >
> > Cc: Jonathan Corbet 
> > Cc: Mark Rutland 
> > Cc: Masahiro Yamada 
> > Cc: Michal Marek 
> > Cc: linux-...@vger.kernel.org
> > Cc: devicet...@vger.kernel.org
> > Cc: linux-kbu...@vger.kernel.org
> > Signed-off-by: Rob Herring 
> > ---
> >  .gitignore   |  1 +
> >  Documentation/Makefile   |  2 +-
> >  Documentation/devicetree/bindings/.gitignore |  1 +
> >  Documentation/devicetree/bindings/Makefile   | 33 
> >  Makefile | 11 +--
> >  scripts/Makefile.lib | 24 --
> >  6 files changed, 67 insertions(+), 5 deletions(-)
> >  create mode 100644 Documentation/devicetree/bindings/.gitignore
> >  create mode 100644 Documentation/devicetree/bindings/Makefile
> >
> > diff --git a/.gitignore b/.gitignore
> > index 97ba6b79834c..a20ac26aa2f5 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -15,6 +15,7 @@
> >  *.bin
> >  *.bz2
> >  *.c.[012]*.*
> > +*.dt.yaml
> >  *.dtb
> >  *.dtb.S
> >  *.dwo
> > diff --git a/Documentation/Makefile b/Documentation/Makefile
> > index 2ca77ad0f238..9786957c6a35 100644
> > --- a/Documentation/Makefile
> > +++ b/Documentation/Makefile
> > @@ -2,7 +2,7 @@
> >  # Makefile for Sphinx documentation
> >  #
> >
> > -subdir-y :=
> > +subdir-y := devicetree/bindings/
> >
> >  # You can set these variables from the command line.
> >  SPHINXBUILD   = sphinx-build
> > diff --git a/Documentation/devicetree/bindings/.gitignore 
> > b/Documentation/devicetree/bindings/.gitignore
> > new file mode 100644
> > index ..d9194c02dd08
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/.gitignore
> > @@ -0,0 +1 @@
> > +*.example.dts
> > diff --git a/Documentation/devicetree/bindings/Makefile 
> > b/Documentation/devicetree/bindings/Makefile
> > new file mode 100644
> > index ..ee0110dd8131
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/Makefile
> > @@ -0,0 +1,33 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +DT_DOC_CHECKER ?= dt-doc-validate
> > +DT_EXTRACT_EX ?= dt-extract-example
> > +DT_MK_SCHEMA ?= dt-mk-schema
> > +DT_MK_SCHEMA_FLAGS := $(if $(DT_SCHEMA_FILES), -u)
> > +
> > +quiet_cmd_chk_binding = CHKDT   $<
> > +  cmd_chk_binding = (set -e; \
> > + $(DT_DOC_CHECKER) $< ; \
> > + mkdir -p $(dir $@) ; \
> > + $(DT_EXTRACT_EX) $< > $@ )
> > +
> > +$(obj)/%.example.dts: $(src)/%.yaml FORCE
> > +   $(call if_changed,chk_binding)
> > +
> > +DT_TMP_SCHEMA := .schema.yaml.tmp
> > +extra-y += $(DT_TMP_SCHEMA)
> > +
> > +quiet_cmd_mk_schema = SCHEMA  $@
> > +  cmd_mk_schema = mkdir -p $(obj); \
> > +  rm -f $@; \
> > +  $(DT_MK_SCHEMA) $(DT_MK_SCHEMA_FLAGS) -o $@ $<
>
>
> I think '$<' is wrong.
>
> '$<' is replaced with the first prerequisite.

Indeed.

> You can easily check what is happening here.
>
> $ cat   Documentation/devicetree/bindings/..schema.yaml.tmp.cmd
> cmd_Documentation/devicetree/bindings/.schema.yaml.tmp := mkdir -p
> Documentation/devicetree/bindings; rm -f
> Documentation/devicetree/bindings/.schema.yaml.tmp; dt-mk-schema  -o
> Documentation/devicetree/bindings/.schema.yaml.tmp
> Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml
>
>
> So, the dt-validater will check only binding from ti,davinci.yaml,
> which is almost useless.
>
>
>
> If I understand it correctly,
> .schema.yaml.tmp should contain all binding yaml.
>
>
> I fixed it up like follows:
>
> diff --git a/Documentation/devicetree/bindings/Makefile
> b/Documentation/devicetree/bindings/Makefile
> index ee0110d..267458f 100644
> --- a/Documentation/devicetree/bindings/Makefile
> +++ b/Documentation/devicetree/bindings/Makefile
> @@ -19,7 +19,7 @@ extra-y += $(DT_TMP_SCHEMA)
>  quiet_cmd_mk_schema = SCHEMA  $@
>cmd_mk_schema = mkdir -p $(obj); \
>rm -f $@; \
> -  $(DT_MK_SCHEMA) $(DT_MK_SCHEMA_FLAGS) -o $@ $<
> +  $(DT_MK_SCHEMA) $(DT_MK_SCHEMA_FLAGS) -o $@
> $(filter-out FORCE, $^)

Thanks, I incorporated this fix.

>
>  DT_DOCS = $(shell cd 

Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Christian Zigotzky
Next step: 64ecd2c160bbef31465c4d34efc0f076a2aad4df (powerpc/dma: use 
phys_to_dma instead of get_dma_offset)


The P5020 board boots and the PASEMI onboard ethernet works.

-- Christian


On 09 December 2018 at 7:26PM, Christian Zigotzky wrote:
Next step: c1bfcad4b0cf38ce5b00f7ad880d3a13484c123a (dma-mapping, 
powerpc: simplify the arch dma_set_mask override)


Result: No problems with the PASEMI onboard ethernet and with booting 
the X5000 (P5020 board).


-- Christian


On 09 December 2018 at 3:20PM, Christian Zigotzky wrote:
Next step: 602307b034734ce77a05da4b99333a2eaf6b6482 (powerpc/fsl_pci: 
simplify fsl_pci_dma_set_mask)


git checkout 602307b034734ce77a05da4b99333a2eaf6b6482

The PASEMI onboard ethernet works and the X5000 boots.

-- Christian


On 08 December 2018 at 2:47PM, Christian Zigotzky wrote:
Next step: e15cd8173ef85e9cc3e2a9c7cc2982f5c1355615 (powerpc/dma: 
fix an off-by-one in dma_capable)


git checkout e15cd8173ef85e9cc3e2a9c7cc2982f5c1355615

The PASEMI onboard ethernet also works with this commit and the 
X5000 boots without any problems.


-- Christian


On 08 December 2018 at 11:29AM, Christian Zigotzky wrote:
Next step: 7ebc44c535f6bd726d553756d38b137acc718443 (powerpc/dma: 
remove max_direct_dma_addr)


git checkout 7ebc44c535f6bd726d553756d38b137acc718443

OK, the PASEMI onboard ethernet works and the P5020 board boots.

-- Christian


On 07 December 2018 at 7:33PM, Christian Zigotzky wrote:
Next step: 13c1fdec5682b6e13257277fa16aa31f342d167d (powerpc/dma: 
move pci_dma_dev_setup_swiotlb to fsl_pci.c)


git checkout 13c1fdec5682b6e13257277fa16aa31f342d167d

Result: The PASEMI onboard ethernet works and the P5020 board boots.

— Christian

















Re: [PATCH] ocxl: Fix endiannes bug in read_afu_name()

2018-12-10 Thread Frederic Barrat




Le 10/12/2018 à 16:10, Greg Kurz a écrit :

The double word returned by read_afu_info(OCXL_DVSEC_TEMPL_NAME) contains
four characters of the AFU name, read from the PCI config space, hence
with a little-endian ordering. When composing the string, a big-endian
system must swap the bytes so that the characters appear in the right
order.

Do this with le32_to_cpu().

Signed-off-by: Greg Kurz 
---


Thanks!

Acked-by: Frederic Barrat 



  drivers/misc/ocxl/config.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 57a6bb1fd3c9..b76198ba8630 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -318,7 +318,7 @@ static int read_afu_name(struct pci_dev *dev, struct 
ocxl_fn_config *fn,
if (rc)
return rc;
ptr = (u32 *) >name[i];
-   *ptr = val;
+   *ptr = le32_to_cpu(val);
}
afu->name[OCXL_AFU_NAME_SZ - 1] = '\0'; /* play safe */
return 0;





[PATCH] ocxl: Fix endiannes bug in read_afu_name()

2018-12-10 Thread Greg Kurz
The double word returned by read_afu_info(OCXL_DVSEC_TEMPL_NAME) contains
four characters of the AFU name, read from the PCI config space, hence
with a little-endian ordering. When composing the string, a big-endian
system must swap the bytes so that the characters appear in the right
order.

Do this with le32_to_cpu().

Signed-off-by: Greg Kurz 
---
 drivers/misc/ocxl/config.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 57a6bb1fd3c9..b76198ba8630 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -318,7 +318,7 @@ static int read_afu_name(struct pci_dev *dev, struct 
ocxl_fn_config *fn,
if (rc)
return rc;
ptr = (u32 *) >name[i];
-   *ptr = val;
+   *ptr = le32_to_cpu(val);
}
afu->name[OCXL_AFU_NAME_SZ - 1] = '\0'; /* play safe */
return 0;



Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Kirill A. Shutemov
On Mon, Dec 10, 2018 at 12:21:05PM -0200, Rafael David Tinoco wrote:
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 84bd9bdc1987..d808cfde3d19 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -64,8 +64,6 @@ extern unsigned int ptrs_per_p4d;
>  #define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK (~(P4D_SIZE - 1))
>  
> -#define MAX_POSSIBLE_PHYSMEM_BITS52
> -
>  #else /* CONFIG_X86_5LEVEL */
>  
>  /*
> @@ -154,4 +152,6 @@ extern unsigned int ptrs_per_p4d;
>  
>  #define PGD_KERNEL_START ((PAGE_SIZE / 2) / sizeof(pgd_t))
>  
> +#define MAX_POSSIBLE_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
> +

...

>  #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0787d33b80d8..132c20b6fd4f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c

...

> @@ -116,6 +100,25 @@
>   */
>  #define OBJ_ALLOCATED_TAG 1
>  #define OBJ_TAG_BITS 1
> +
> +/*
> + * MAX_POSSIBLE_PHYSMEM_BITS should be defined by all archs using zsmalloc:
> + * Trying to guess it from MAX_PHYSMEM_BITS, or considering it BITS_PER_LONG,
> + * proved to be wrong by not considering PAE capabilities, or using SPARSEMEM
> + * only headers, leading to bad object encoding due to object index overflow.
> + */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> + #define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
> + #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using zsmalloc";
> +#else
> + #ifndef CONFIG_64BIT
> +  #if (MAX_POSSIBLE_PHYSMEM_BITS >= (BITS_PER_LONG + PAGE_SHIFT - 
> OBJ_TAG_BITS))
> +   #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
> +  #endif
> + #endif
> +#endif
> +
> +#define _PFN_BITS (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>  #define OBJ_INDEX_BITS   (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
>  #define OBJ_INDEX_MASK   ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)

Have you tested it with CONFIG_X86_5LEVEL=y?

ASAICS, the patch makes OBJ_INDEX_BITS and what depends from it dynamic --
it depends what paging mode we are booting in. ZS_SIZE_CLASSES depends
indirectly on OBJ_INDEX_BITS and I don't see how struct zs_pool definition
can compile with dynamic ZS_SIZE_CLASSES.

Hm?

-- 
 Kirill A. Shutemov


Re: [PATCH v2.1 24/34] dt-bindings: arm: Convert Rockchip board/soc bindings to json-schema

2018-12-10 Thread Rob Herring
On Sun, Dec 9, 2018 at 4:14 PM Heiko Stuebner  wrote:
>
> Convert Rockchip SoC bindings to DT schema format using json-schema.
>
> Cc: Mark Rutland 
> Cc: Heiko Stuebner 
> Cc: devicet...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-rockc...@lists.infradead.org
> Signed-off-by: Rob Herring 
> [move to per-board entries and added recently added boards]
> Signed-off-by: Heiko Stuebner 
> ---
> Hi Rob,
>
> there are boards where the description adds much value and on others
> it is maybe less, but personally I'd like to keep things uniform,
> as that makes reading these things easier if the format stays the
> same all the time, so I've gone forward and just did the conversion
>
> make dtbs_check did not complain about the schema it seems but I
> did end up with an error later on:
>
> FATAL ERROR: Unknown output format "yaml"
> make[2]: *** [scripts/Makefile.lib:313: arch/arm/boot/dts/rk3036-evb.dt.yaml] 
> Fehler 1

You need libyaml and its headers installed so dtc can output yaml, but
that's not needed for checking the schema against the meta-schema.

> But I guess I did not mess up the schema yet.
>
> So does it look ok that way?

Yes, but one comment...

> +  - description: Firefly Firefly-RK3288
> +items:
> +  - const: firefly,firefly-rk3288
> +  - const: rockchip,rk3288
> +
> +  - description: Firefly Firefly-RK3288 (beta board)
> +items:
> +  - const: firefly,firefly-rk3288-beta
> +  - const: rockchip,rk3288
> +
> +  - description: Firefly Firefly-RK3288 Reload

Seems like combining these 3 (or first 2?) would make sense if this is
just revs of the same board.

But either way is fine.

> +items:
> +  - const: firefly,firefly-rk3288-reload
> +  - const: rockchip,rk3288


Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Russell King - ARM Linux
On Mon, Dec 10, 2018 at 02:35:55PM +, Robin Murphy wrote:
> On 10/12/2018 14:21, Rafael David Tinoco wrote:
> >On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the
> >physical frame number might be so big that zsmalloc obj encoding (to
> >location) will break, causing:
> >
> >BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc
> >Read of size 4 at addr  by task mkfs.ext4/623
> >CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 
> >4.19.0-rc8-00017-g8239bc6d3307-dirty #15
> >Hardware name: Generic DT based system
> >[] (unwind_backtrace) from [] (show_stack+0x20/0x24)
> >[] (show_stack) from [] (dump_stack+0xbc/0xe8)
> >[] (dump_stack) from [] (kasan_report+0x248/0x390)
> >[] (kasan_report) from [] (__asan_load4+0x78/0xb4)
> >[] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc)
> >[] (zs_map_object) from [] 
> >(zram_bvec_rw.constprop.2+0x324/0x8e4 [zram])
> >[] (zram_bvec_rw.constprop.2 [zram]) from [] 
> >(zram_make_request+0x234/0x46c [zram])
> >[] (zram_make_request [zram]) from [] 
> >(generic_make_request+0x304/0x63c)
> >[] (generic_make_request) from [] (submit_bio+0x4c/0x1c8)
> >[] (submit_bio) from [] 
> >(submit_bh_wbc.constprop.15+0x238/0x26c)
> >[] (submit_bh_wbc.constprop.15) from [] 
> >(__block_write_full_page+0x524/0x76c)
> >[] (__block_write_full_page) from [] 
> >(block_write_full_page+0x1bc/0x1d4)
> >[] (block_write_full_page) from [] 
> >(blkdev_writepage+0x24/0x28)
> >[] (blkdev_writepage) from [] (__writepage+0x44/0x78)
> >[] (__writepage) from [] (write_cache_pages+0x3b8/0x800)
> >[] (write_cache_pages) from [] 
> >(generic_writepages+0x74/0xa0)
> >[] (generic_writepages) from [] 
> >(blkdev_writepages+0x18/0x1c)
> >[] (blkdev_writepages) from [] (do_writepages+0x68/0x134)
> >[] (do_writepages) from [] 
> >(__filemap_fdatawrite_range+0xb0/0xf4)
> >[] (__filemap_fdatawrite_range) from [] 
> >(file_write_and_wait_range+0x64/0xd0)
> >[] (file_write_and_wait_range) from [] 
> >(blkdev_fsync+0x54/0x84)
> >[] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4)
> >[] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80)
> >[] (do_fsync) from [] (sys_fsync+0x1c/0x20)
> >[] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c)
> >
> >when trying to decode (the pfn) and map the object.
> >
> >That happens because one architecture might not re-define
> >MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For
> >32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will
> >default to BITS_PER_LONG (32) in most cases, and, with PAE enabled,
> >_PFN_BITS might be wrong: which may cause obj variable to overflow if
> >frame number is in HIGHMEM and referencing a page above the 4GB
> >watermark.
> >
> >commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if
> >not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers
> >and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't
> >used, like in the example given above.
> >
> >Systems with potential for PAE exist for a long time and assuming
> >BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better,
> >however it is NOT a constant anymore for x86.
> >
> >SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every
> >architecture using zsmalloc, together with a sanity check for
> >MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.
> >
> >Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17
> >Signed-off-by: Rafael David Tinoco 
> >---
> >  arch/arm/include/asm/pgtable-2level-types.h |  2 ++
> >  arch/arm/include/asm/pgtable-3level-types.h |  2 ++
> >  arch/arm64/include/asm/pgtable-types.h  |  2 ++
> >  arch/ia64/include/asm/page.h|  2 ++
> >  arch/mips/include/asm/page.h|  2 ++
> >  arch/powerpc/include/asm/mmu.h  |  2 ++
> >  arch/s390/include/asm/page.h|  2 ++
> >  arch/sh/include/asm/page.h  |  2 ++
> >  arch/sparc/include/asm/page_32.h|  2 ++
> >  arch/sparc/include/asm/page_64.h|  2 ++
> >  arch/x86/include/asm/pgtable-2level_types.h |  2 ++
> >  arch/x86/include/asm/pgtable-3level_types.h |  3 +-
> >  arch/x86/include/asm/pgtable_64_types.h |  4 +--
> >  mm/zsmalloc.c   | 35 +++--
> >  14 files changed, 45 insertions(+), 19 deletions(-)
> >
> >diff --git a/arch/arm/include/asm/pgtable-2level-types.h 
> >b/arch/arm/include/asm/pgtable-2level-types.h
> >index 66cb5b0e89c5..552dba411324 100644
> >--- a/arch/arm/include/asm/pgtable-2level-types.h
> >+++ b/arch/arm/include/asm/pgtable-2level-types.h
> >@@ -64,4 +64,6 @@ typedef pteval_t pgprot_t;
> >  #endif /* STRICT_MM_TYPECHECKS */
> >+#define MAX_POSSIBLE_PHYSMEM_BITS 32
> >+
> >  #endif /* _ASM_PGTABLE_2LEVEL_TYPES_H */
> >diff --git a/arch/arm/include/asm/pgtable-3level-types.h 
> >b/arch/arm/include/asm/pgtable-3level-types.h
> >index 921aa30259c4..664c39e6517c 100644
> >--- a/arch/arm/include/asm/pgtable-3level-types.h
> >+++ 

Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-10 Thread Daniel Borkmann
On 12/07/2018 04:36 PM, Michael Roth wrote:
> Quoting Michael Ellerman (2018-12-07 06:31:13)
>> Michael Roth  writes:
>>
>>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
>>> JIT allocations. At compile time it defaults to PAGE_SIZE * 4,
>>> and is adjusted again at init time if MODULES_VADDR is defined.
>>>
>>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
>>
>> But maybe it should be, I don't know why we don't define it.
>>
>>> the compile-time default at boot-time, which is 0x9c40 when
>>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
>>> value:
>>>
>>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
>>>   -1673527296
>>>
>>> and can cause various unexpected failures throughout the network
>>> stack. In one case `strace dhclient eth0` reported:
>>>
>>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 
>>> 16) = -1 ENOTSUPP (Unknown error 524)
>>>
>>> and similar failures can be seen with tools like tcpdump. This doesn't
>>> always reproduce however, and I'm not sure why. The more consistent
>>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
>>> would time out on systemd/netplan configuring a virtio-net NIC with no
>>> noticeable errors in the logs.
>>>
>>> Fix this by limiting the compile-time default for bpf_jit_limit to
>>> INT_MAX.
>>
>> INT_MAX is a lot more than (4k * 4), so I guess I'm not clear on
>> whether we should be using PAGE_SIZE here at all. I guess each BPF
>> program uses at least one page is the thinking?
> 
> That seems to be the case, at least, the max number of minimum-sized
> allocations would be less on ppc64 since the allocations are always at
> least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
> so it seemed consistent to do that here too.
> 
>>
>> Thanks for tracking this down. For some reason none of my ~10 test boxes
>> have hit this, perhaps I don't have new enough userspace?
> 
> I'm not too sure, I would've thought things like the dhclient case in
> the commit log would fail every time, but sometimes I need to reboot the
> guest before I start seeing the behavior. Maybe there's something special
> about when JIT allocations are actually done that can affect
> reproducibility?
> 
> In my case at least the virtio-net networking timeout was consistent
> enough for a bisect, but maybe it depends on the specific network
> configuration (single NIC, basic DHCP through netplan/systemd in my case).
> 
>>
>> You don't mention why you needed to add BPF_MIN(), I assume because the
>> kernel version of min() has gotten too complicated to work here?
> 
> I wasn't sure if it was safe here or not, so I tried looking at other
> users and came across:
> 
> mm/vmalloc.c:777:#define VMAP_MIN(x, y)   ((x) < (y) ? (x) : (y)) 
> /* can't use min() */
> 
> I'm not sure what the reasoning was (or whether it still applies), but I
> figured it was safer to do the same here. Maybe Nick still recalls?
> 
>>
>> Daniel I assume you'll merge this via your tree?
>>
>> cheers
>>
>>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
>>> allocations")
>>> Cc: linuxppc-...@ozlabs.org
>>> Cc: Daniel Borkmann 
>>> Cc: Sandipan Das 
>>> Cc: Alexei Starovoitov 
>>> Signed-off-by: Michael Roth 

Thanks for the reports / fixes and sorry for my late reply (bit too
swamped last week), some more thoughts below.

>>>  kernel/bpf/core.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index b1a3545d0ec8..55de4746cdfd 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>>>  }
>>>  
>>>  #ifdef CONFIG_BPF_JIT
>>> -# define BPF_JIT_LIMIT_DEFAULT   (PAGE_SIZE * 4)
>>> +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
>>> +# define BPF_JIT_LIMIT_DEFAULT   BPF_MIN((PAGE_SIZE * 4), INT_MAX)
>>>  
>>>  /* All BPF JIT sysctl knobs here. */
>>>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);

I would actually just like to get rid of the BPF_JIT_LIMIT_DEFAULT
define also given for 4.21 arm64 will have its own dedicated area for
JIT allocations where neither the above limit nor the MODULES_END/
MODULES_VADDR one would fit and I don't want to make this even more
ugly with adding further cases into the core. Would the below variant
work for you?

Thanks,
Daniel

>From da9daf462d41ce5506c6b6318a9fa3d6d8a64f6c Mon Sep 17 00:00:00 2001
From: Daniel Borkmann 
Date: Mon, 10 Dec 2018 14:30:27 +0100
Subject: [PATCH bpf] bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K

Michael and Sandipan report:

  Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
  JIT allocations. At compile time it defaults to PAGE_SIZE * 4,
  and is adjusted again at init time if MODULES_VADDR is defined.

  For ppc64 kernels, MODULES_VADDR isn't defined, so 

Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Robin Murphy

On 10/12/2018 14:21, Rafael David Tinoco wrote:

On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the
physical frame number might be so big that zsmalloc obj encoding (to
location) will break, causing:

BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc
Read of size 4 at addr  by task mkfs.ext4/623
CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 
4.19.0-rc8-00017-g8239bc6d3307-dirty #15
Hardware name: Generic DT based system
[] (unwind_backtrace) from [] (show_stack+0x20/0x24)
[] (show_stack) from [] (dump_stack+0xbc/0xe8)
[] (dump_stack) from [] (kasan_report+0x248/0x390)
[] (kasan_report) from [] (__asan_load4+0x78/0xb4)
[] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc)
[] (zs_map_object) from [] 
(zram_bvec_rw.constprop.2+0x324/0x8e4 [zram])
[] (zram_bvec_rw.constprop.2 [zram]) from [] 
(zram_make_request+0x234/0x46c [zram])
[] (zram_make_request [zram]) from [] 
(generic_make_request+0x304/0x63c)
[] (generic_make_request) from [] (submit_bio+0x4c/0x1c8)
[] (submit_bio) from [] 
(submit_bh_wbc.constprop.15+0x238/0x26c)
[] (submit_bh_wbc.constprop.15) from [] 
(__block_write_full_page+0x524/0x76c)
[] (__block_write_full_page) from [] 
(block_write_full_page+0x1bc/0x1d4)
[] (block_write_full_page) from [] 
(blkdev_writepage+0x24/0x28)
[] (blkdev_writepage) from [] (__writepage+0x44/0x78)
[] (__writepage) from [] (write_cache_pages+0x3b8/0x800)
[] (write_cache_pages) from [] 
(generic_writepages+0x74/0xa0)
[] (generic_writepages) from [] 
(blkdev_writepages+0x18/0x1c)
[] (blkdev_writepages) from [] (do_writepages+0x68/0x134)
[] (do_writepages) from [] 
(__filemap_fdatawrite_range+0xb0/0xf4)
[] (__filemap_fdatawrite_range) from [] 
(file_write_and_wait_range+0x64/0xd0)
[] (file_write_and_wait_range) from [] 
(blkdev_fsync+0x54/0x84)
[] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4)
[] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80)
[] (do_fsync) from [] (sys_fsync+0x1c/0x20)
[] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c)

when trying to decode (the pfn) and map the object.

That happens because one architecture might not re-define
MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For
32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will
default to BITS_PER_LONG (32) in most cases, and, with PAE enabled,
_PFN_BITS might be wrong: which may cause obj variable to overflow if
frame number is in HIGHMEM and referencing a page above the 4GB
watermark.

commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if
not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers
and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't
used, like in the example given above.

Systems with potential for PAE exist for a long time and assuming
BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better,
however it is NOT a constant anymore for x86.

SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every
architecture using zsmalloc, together with a sanity check for
MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.

Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17
Signed-off-by: Rafael David Tinoco 
---
  arch/arm/include/asm/pgtable-2level-types.h |  2 ++
  arch/arm/include/asm/pgtable-3level-types.h |  2 ++
  arch/arm64/include/asm/pgtable-types.h  |  2 ++
  arch/ia64/include/asm/page.h|  2 ++
  arch/mips/include/asm/page.h|  2 ++
  arch/powerpc/include/asm/mmu.h  |  2 ++
  arch/s390/include/asm/page.h|  2 ++
  arch/sh/include/asm/page.h  |  2 ++
  arch/sparc/include/asm/page_32.h|  2 ++
  arch/sparc/include/asm/page_64.h|  2 ++
  arch/x86/include/asm/pgtable-2level_types.h |  2 ++
  arch/x86/include/asm/pgtable-3level_types.h |  3 +-
  arch/x86/include/asm/pgtable_64_types.h |  4 +--
  mm/zsmalloc.c   | 35 +++--
  14 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level-types.h 
b/arch/arm/include/asm/pgtable-2level-types.h
index 66cb5b0e89c5..552dba411324 100644
--- a/arch/arm/include/asm/pgtable-2level-types.h
+++ b/arch/arm/include/asm/pgtable-2level-types.h
@@ -64,4 +64,6 @@ typedef pteval_t pgprot_t;
  
  #endif /* STRICT_MM_TYPECHECKS */
  
+#define MAX_POSSIBLE_PHYSMEM_BITS 32

+
  #endif/* _ASM_PGTABLE_2LEVEL_TYPES_H */
diff --git a/arch/arm/include/asm/pgtable-3level-types.h 
b/arch/arm/include/asm/pgtable-3level-types.h
index 921aa30259c4..664c39e6517c 100644
--- a/arch/arm/include/asm/pgtable-3level-types.h
+++ b/arch/arm/include/asm/pgtable-3level-types.h
@@ -67,4 +67,6 @@ typedef pteval_t pgprot_t;
  
  #endif	/* STRICT_MM_TYPECHECKS */
  
+#define MAX_POSSIBLE_PHYSMEM_BITS 36


Nit: with LPAE, physical addresses go up to 40 bits, not just 36.

Robin.


+
  #endif/* _ASM_PGTABLE_3LEVEL_TYPES_H */
diff --git a/arch/arm64/include/asm/pgtable-types.h 

Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

2018-12-10 Thread Catalin Marinas
Hi Doug,

On Fri, Dec 07, 2018 at 10:40:24AM -0800, Doug Anderson wrote:
> On Fri, Dec 7, 2018 at 9:42 AM Catalin Marinas  
> wrote:
> > On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> > > Douglas Anderson (4):
> > >   kgdb: Remove irq flags from roundup
> > >   kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> > >   kgdb: Don't round up a CPU that failed rounding up before
> > >   kdb: Don't back trace on a cpu that didn't round up
> >
> > FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
> > on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
> > disabled when they shouldn't and it trips over the BUG at
> > mm/vmalloc.c:1380 (called via do_fork -> copy_process).
> >
> > Now, I don't think these patches make things worse on arm64 since prior
> > to them the kgdb boot tests on arm64 were stuck in a loop (RUN
> > singlestep).
> 
> Thanks for the report!  ...actually, I'd never tried CONFIG_KGDB_TESTS
> before.  ...so I tried them now:
> 
> A) chromeos-4.19 tree on qcom-sdm845 without this series: booted up OK
> B) chromeos-4.19 tree on qcom-sdm845 with this series: booted up OK
> C) v4.20-rc5-90-g30002dd008ed on rockchip-rk3399 (kevin) with this
> series: booted up OK
> 
> Example output from B) above:
> 
> localhost ~ # dmesg | grep kgdbts
> [2.139814] KGDB: Registered I/O driver kgdbts
> [2.144582] kgdbts:RUN plant and detach test
> [2.165333] kgdbts:RUN sw breakpoint test
> [2.172990] kgdbts:RUN bad memory access test
> [2.178640] kgdbts:RUN singlestep test 1000 iterations
> [2.187765] kgdbts:RUN singlestep [0/1000]
> [2.559596] kgdbts:RUN singlestep [100/1000]
> [2.931419] kgdbts:RUN singlestep [200/1000]
> [3.303474] kgdbts:RUN singlestep [300/1000]
> [3.675121] kgdbts:RUN singlestep [400/1000]
> [4.046867] kgdbts:RUN singlestep [500/1000]
> [4.418920] kgdbts:RUN singlestep [600/1000]
> [4.790824] kgdbts:RUN singlestep [700/1000]
> [5.162479] kgdbts:RUN singlestep [800/1000]
> [5.534103] kgdbts:RUN singlestep [900/1000]
> [5.902299] kgdbts:RUN do_fork for 100 breakpoints
> [8.463900] KGDB: Unregistered I/O driver kgdbts, debugger disabled
> 
> ...so I guess I'm a little confused.  Either I have a different config
> than you do or something is special about your machine?

I tried it now on a Juno board both as a host and a guest and boots
fine. It must be something that only triggers ThunderX2. Ignore the
report for now, if I find anything interesting I'll let you know.

-- 
Catalin


Re: [PATCH v6] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

2018-12-10 Thread Dmitry V. Levin
On Mon, Dec 10, 2018 at 02:28:07PM +0100, Oleg Nesterov wrote:
> On 12/07, Dmitry V. Levin wrote:
> >
> > Please make either v5 or v6 edition of this fix, or any similar fix,
> > into v4.20.
> 
> IIUC, v5 above means
> 
>   [PATCH v5 23/25] powerpc/ptrace: replace ptrace_report_syscall() with a 
> tracehook call
> 
> you sent in another series...

They just happen to have the same v5 here and there.
In that series I included the most trivial variant of the change.

> >  long do_syscall_trace_enter(struct pt_regs *regs)
> >  {
> > +   struct thread_info *ti;
> > +   u32 cached_flags;
> > +
> > user_exit();
> >  
> > -   if (test_thread_flag(TIF_SYSCALL_EMU)) {
> > -   ptrace_report_syscall(regs);
> > -   /*
> > -* Returning -1 will skip the syscall execution. We want to
> > -* avoid clobbering any register also, thus, not 'gotoing'
> > -* skip label.
> > -*/
> > -   return -1;
> > -   }
> > +   ti = current_thread_info();
> > +   cached_flags = READ_ONCE(ti->flags) &
> > +  (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE |
> > +   _TIF_SYSCALL_TRACEPOINT);
> >  
> > -   /*
> > -* The tracer may decide to abort the syscall, if so tracehook
> > -* will return !0. Note that the tracer may also just change
> > -* regs->gpr[0] to an invalid syscall number, that is handled
> > -* below on the exit path.
> > -*/
> > -   if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> > -   tracehook_report_syscall_entry(regs))
> > -   goto skip;
> > +   if (cached_flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
> > +   int rc = tracehook_report_syscall_entry(regs);
> > +
> > +   if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) {
> > +   /*
> > +* A nonzero return code from
> > +* tracehook_report_syscall_entry() tells us
> > +* to prevent the syscall execution, but
> > +* we are not going to execute it anyway.
> > +*
> > +* Returning -1 will skip the syscall execution.
> > +* We want to avoid clobbering any register also,
> > +* thus, not 'gotoing' skip label.
> > +*/
> > +   return -1;
> > +   }
> > +
> > +   if (rc) {
> > +   /*
> > +* The tracer decided to abort the syscall.
> > +* Note that the tracer may also just change
> > +* regs->gpr[0] to an invalid syscall number,
> > +* that is handled below on the exit path.
> > +*/
> > +   goto skip;
> > +   }
> > +   }
> >  
> > /* Run seccomp after ptrace; allow it to set gpr[3]. */
> > if (do_seccomp(regs))
> > @@ -3293,7 +3309,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
> > if (regs->gpr[0] >= NR_syscalls)
> > goto skip;
> >  
> > -   if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> > +   if (unlikely(cached_flags & _TIF_SYSCALL_TRACEPOINT))
> 
> I will leave this to maintainers, but to me this change looks good and imo it
> also cleanups the code.
> 
> However I am not sure cached_flags should include _TIF_SYSCALL_TRACEPOINT. If
> nothing else, the caller can sleep in ptrace_stop() unpredictably long and
> TIF_SYSCALL_TRACEPOINT can be set/cleared meanwhile.

I agree, we shouldn't cache _TIF_SYSCALL_TRACEPOINT.


-- 
ldv


signature.asc
Description: PGP signature


Re: [PATCH v6] powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call

2018-12-10 Thread Oleg Nesterov
On 12/07, Dmitry V. Levin wrote:
>
> Please make either v5 or v6 edition of this fix, or any similar fix,
> into v4.20.

IIUC, v5 above means

[PATCH v5 23/25] powerpc/ptrace: replace ptrace_report_syscall() with a 
tracehook call

you sent in another series...

>  long do_syscall_trace_enter(struct pt_regs *regs)
>  {
> + struct thread_info *ti;
> + u32 cached_flags;
> +
>   user_exit();
>  
> - if (test_thread_flag(TIF_SYSCALL_EMU)) {
> - ptrace_report_syscall(regs);
> - /*
> -  * Returning -1 will skip the syscall execution. We want to
> -  * avoid clobbering any register also, thus, not 'gotoing'
> -  * skip label.
> -  */
> - return -1;
> - }
> + ti = current_thread_info();
> + cached_flags = READ_ONCE(ti->flags) &
> +(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE |
> + _TIF_SYSCALL_TRACEPOINT);
>  
> - /*
> -  * The tracer may decide to abort the syscall, if so tracehook
> -  * will return !0. Note that the tracer may also just change
> -  * regs->gpr[0] to an invalid syscall number, that is handled
> -  * below on the exit path.
> -  */
> - if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> - tracehook_report_syscall_entry(regs))
> - goto skip;
> + if (cached_flags & (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE)) {
> + int rc = tracehook_report_syscall_entry(regs);
> +
> + if (unlikely(cached_flags & _TIF_SYSCALL_EMU)) {
> + /*
> +  * A nonzero return code from
> +  * tracehook_report_syscall_entry() tells us
> +  * to prevent the syscall execution, but
> +  * we are not going to execute it anyway.
> +  *
> +  * Returning -1 will skip the syscall execution.
> +  * We want to avoid clobbering any register also,
> +  * thus, not 'gotoing' skip label.
> +  */
> + return -1;
> + }
> +
> + if (rc) {
> + /*
> +  * The tracer decided to abort the syscall.
> +  * Note that the tracer may also just change
> +  * regs->gpr[0] to an invalid syscall number,
> +  * that is handled below on the exit path.
> +  */
> + goto skip;
> + }
> + }
>  
>   /* Run seccomp after ptrace; allow it to set gpr[3]. */
>   if (do_seccomp(regs))
> @@ -3293,7 +3309,7 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>   if (regs->gpr[0] >= NR_syscalls)
>   goto skip;
>  
> - if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
> + if (unlikely(cached_flags & _TIF_SYSCALL_TRACEPOINT))

I will leave this to maintainers, but to me this change looks good and imo it
also cleanups the code.

However I am not sure cached_flags should include _TIF_SYSCALL_TRACEPOINT. If
nothing else, the caller can sleep in ptrace_stop() unpredictably long and
TIF_SYSCALL_TRACEPOINT can be set/cleared meanwhile.

Oleg.



[PATCH] powerpc/mm: define an empty slice_init_new_context_exec()

2018-12-10 Thread Christophe Leroy
Define slice_init_new_context_exec() at all time to avoid

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/slice.h | 14 +-
 arch/powerpc/mm/mmu_context_nohash.c |  2 --
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h
index a595461c9cb0..44816cbc4198 100644
--- a/arch/powerpc/include/asm/slice.h
+++ b/arch/powerpc/include/asm/slice.h
@@ -10,6 +10,10 @@
 #include 
 #endif
 
+#ifndef __ASSEMBLY__
+
+struct mm_struct;
+
 #ifdef CONFIG_PPC_MM_SLICES
 
 #ifdef CONFIG_HUGETLB_PAGE
@@ -18,10 +22,6 @@
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 
-#ifndef __ASSEMBLY__
-
-struct mm_struct;
-
 unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len,
  unsigned long flags, unsigned int psize,
  int topdown);
@@ -34,8 +34,12 @@ void slice_set_range_psize(struct mm_struct *mm, unsigned 
long start,
 void slice_init_new_context_exec(struct mm_struct *mm);
 void slice_setup_new_exec(void);
 
-#endif /* __ASSEMBLY__ */
+#else /* CONFIG_PPC_MM_SLICES */
+
+static inline void slice_init_new_context_exec(struct mm_struct *mm) {}
 
 #endif /* CONFIG_PPC_MM_SLICES */
 
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_POWERPC_SLICE_H */
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index 431ecf37f17c..22d71a58167f 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -372,7 +372,6 @@ int init_new_context(struct task_struct *t, struct 
mm_struct *mm)
 {
pr_hard("initing context for mm @%p\n", mm);
 
-#ifdef CONFIG_PPC_MM_SLICES
/*
 * We have MMU_NO_CONTEXT set to be ~0. Hence check
 * explicitly against context.id == 0. This ensures that we properly
@@ -382,7 +381,6 @@ int init_new_context(struct task_struct *t, struct 
mm_struct *mm)
 */
if (mm->context.id == 0)
slice_init_new_context_exec(mm);
-#endif
mm->context.id = MMU_NO_CONTEXT;
mm->context.active = 0;
pte_frag_set(>context, NULL);
-- 
2.13.3



Re: [PATCH] powerpc/ipic: Fix a bounds check in ipic_set_priority()

2018-12-10 Thread Christophe Leroy

Le 07/12/2018 à 03:07, Michael Ellerman a écrit :

Christophe LEROY  writes:

Le 05/12/2018 à 04:26, Michael Ellerman a écrit :

Hi Dan,

Thanks for the patch.

Dan Carpenter  writes:

The ipic_info[] array only has 95 elements so I have made the bounds
check smaller to prevent a read overflow.  It was Smatch that found
this issue:

  arch/powerpc/sysdev/ipic.c:784 ipic_set_priority()
  error: buffer overflow 'ipic_info' 95 <= 127

Signed-off-by: Dan Carpenter 
---
I wasn't able to find any callers of this code.  Maybe we removed the
last one in commit b9f0f1bb2bca ("[POWERPC] Adapt ipic driver to new
host_ops interface, add set_irq_type to set IRQ sense").  So perhaps we
should just remove it.  I'm not really comfortable doing that myself,
because I don't know the code well enough and can't build test
it properly.


Hah wow, last usage removed in 2006!

I don't see any mention of it since then, so I'll remove it. If it
breaks something we can put it back.

Can smatch help us find things like this that are defined non-static but
never used?



I think we have to do that carrefully. Some of those functions might be
used by out-of-tree boards.


We don't keep unused code around for out-of-tree boards.

Either the out-of-tree code should be merged upstream, or you can
maintain whatever extra functions you need as part of your out-of-tree
code base.


I'm thinking especially at ipic_get_mcp_status() and
ipic_set_mcp_status(). They are used in my 832x boards's machine check
handler to know when a machine check is a timeout from the 832x watchdog.


Thanks for pointing them out, I'll send a patch to remove them :)


Lol :)

If you are doing the housework, you can remove 
ipic_set_highest_priority() ipic_enable_mcp() and ipic_disable_mcp()





But seriously, why is your machine check code not in-tree, is there some
reason you can't merge it?


Maybe because you haven't merged it yet allthough I sent it more than 
three minutes ago. :)


Seriously, that was just left over with other priorities, and also 
because I'm not too happy about it because what I would really like is 
that it kills the userland app if any but don't crash the system when it 
is in interrupt or in idle.


But due to b96672dd840f ("powerpc: Machine check interrupt is a 
non-maskable interrupt"), die_will_crash() doesn't work anymore. So for 
the time being, the patch I sent is not killing anybody, just doing an 
Oops for notification (note that die_will_crash() is used in the Opal 
machine check handler, so it probably doesn't work anymore there either).


Christophe


[PATCH] powerpc/83xx: handle machine check caused by watchdog timer

2018-12-10 Thread Christophe Leroy
When the watchdog timer is set in interrupt mode, it causes a
machine check when it times out. The purpose of this mode is to
ease debugging, not to crash the kernel and reboot the machine.

This patch implements a special handling for that, in order to not
crash the kernel if the watchdog times out while in interrupt or
within the idle task.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/cputable.h |  1 +
 arch/powerpc/include/asm/reg.h  |  2 ++
 arch/powerpc/kernel/cputable.c  | 10 ++
 arch/powerpc/platforms/83xx/misc.c  | 16 
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index a0395ccbbe9e..d05f0c28e515 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -44,6 +44,7 @@ extern int machine_check_e500(struct pt_regs *regs);
 extern int machine_check_e200(struct pt_regs *regs);
 extern int machine_check_47x(struct pt_regs *regs);
 int machine_check_8xx(struct pt_regs *regs);
+int machine_check_83xx(struct pt_regs *regs);
 
 extern void cpu_down_flush_e500v2(void);
 extern void cpu_down_flush_e500mc(void);
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 0d2139a0d5b9..1c98ef1f2d5b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -769,6 +769,8 @@
 #define   SRR1_PROGTRAP0x0002 /* Trap */
 #define   SRR1_PROGADDR0x0001 /* SRR0 contains subsequent 
addr */
 
+#define   SRR1_MCE_MCP 0x0008 /* Machine check signal caused 
interrupt */
+
 #define SPRN_HSRR0 0x13A   /* Save/Restore Register 0 */
 #define SPRN_HSRR1 0x13B   /* Save/Restore Register 1 */
 #define   HSRR1_DENORM 0x0010 /* Denorm exception */
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 2da01340c84c..1eab54bc6ee9 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -1141,6 +1141,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
.machine_check  = machine_check_generic,
.platform   = "ppc603",
},
+#ifdef CONFIG_PPC_83xx
{   /* e300c1 (a 603e core, plus some) on 83xx */
.pvr_mask   = 0x7fff,
.pvr_value  = 0x0083,
@@ -1151,7 +1152,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
.icache_bsize   = 32,
.dcache_bsize   = 32,
.cpu_setup  = __setup_cpu_603,
-   .machine_check  = machine_check_generic,
+   .machine_check  = machine_check_83xx,
.platform   = "ppc603",
},
{   /* e300c2 (an e300c1 core, plus some, minus FPU) on 83xx */
@@ -1165,7 +1166,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
.icache_bsize   = 32,
.dcache_bsize   = 32,
.cpu_setup  = __setup_cpu_603,
-   .machine_check  = machine_check_generic,
+   .machine_check  = machine_check_83xx,
.platform   = "ppc603",
},
{   /* e300c3 (e300c1, plus one IU, half cache size) on 83xx */
@@ -1179,7 +1180,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
.icache_bsize   = 32,
.dcache_bsize   = 32,
.cpu_setup  = __setup_cpu_603,
-   .machine_check  = machine_check_generic,
+   .machine_check  = machine_check_83xx,
.num_pmcs   = 4,
.oprofile_cpu_type  = "ppc/e300",
.oprofile_type  = PPC_OPROFILE_FSL_EMB,
@@ -1196,12 +1197,13 @@ static struct cpu_spec __initdata cpu_specs[] = {
.icache_bsize   = 32,
.dcache_bsize   = 32,
.cpu_setup  = __setup_cpu_603,
-   .machine_check  = machine_check_generic,
+   .machine_check  = machine_check_83xx,
.num_pmcs   = 4,
.oprofile_cpu_type  = "ppc/e300",
.oprofile_type  = PPC_OPROFILE_FSL_EMB,
.platform   = "ppc603",
},
+#endif
{   /* default match, we assume split I/D cache & TB (non-601)... */
.pvr_mask   = 0x,
.pvr_value  = 0x,
diff --git a/arch/powerpc/platforms/83xx/misc.c 
b/arch/powerpc/platforms/83xx/misc.c
index d75c9816a5c9..3e591d159413 100644
--- a/arch/powerpc/platforms/83xx/misc.c
+++ b/arch/powerpc/platforms/83xx/misc.c
@@ -150,3 +150,19 @@ void __init mpc83xx_setup_arch(void)
 
mpc83xx_setup_pci();
 }

Re: [PATCH v2 26/34] dt-bindings: arm: Convert Renesas board/soc bindings to json-schema

2018-12-10 Thread Simon Horman
On Thu, Dec 06, 2018 at 01:38:42PM -0600, Rob Herring wrote:
> On Wed, Dec 5, 2018 at 1:44 PM Simon Horman  wrote:
> >
> > On Tue, Dec 04, 2018 at 09:08:57AM -0600, Rob Herring wrote:
> > > On Tue, Dec 4, 2018 at 8:57 AM Geert Uytterhoeven  
> > > wrote:
> > > >
> > > > Hi Simon,
> > > >
> > > > On Tue, Dec 4, 2018 at 3:48 PM Simon Horman  wrote:
> > > > > On Mon, Dec 03, 2018 at 03:32:15PM -0600, Rob Herring wrote:
> > > > > > Convert Renesas SoC bindings to DT schema format using json-schema.
> > > > > >
> > > > > > Cc: Simon Horman 
> > > > > > Cc: Magnus Damm 
> > > > > > Cc: Mark Rutland 
> > > > > > Cc: linux-renesas-...@vger.kernel.org
> > > > > > Cc: devicet...@vger.kernel.org
> > > > > > Signed-off-by: Rob Herring 
> > > > > > ---
> > > > > >  .../devicetree/bindings/arm/shmobile.txt  | 151 
> > > > > >  .../devicetree/bindings/arm/shmobile.yaml | 218 
> > > > > > ++
> > > > > >  2 files changed, 218 insertions(+), 151 deletions(-)
> > > > > >  delete mode 100644 
> > > > > > Documentation/devicetree/bindings/arm/shmobile.txt
> > > > > >  create mode 100644 
> > > > > > Documentation/devicetree/bindings/arm/shmobile.yaml
> > > > >
> > > > > Hi Rob,
> > > > >
> > > > > what is this based on? I get a conflict when applying the .txt change
> > > > > and if I knew the base for this patch it would be rather easy to work
> > > > > out what has changed.
> > >
> > > 4.20-rc2
> > >
> > > > >
> > > > > Also, should we do an s/shmobile.txt/shmobile.yaml/ in MAINTAINERS?
> > >
> > > Yes. Though it was pointed out that get_maintainers.pl can pull emails
> > > out of this file. We'd need to get that to work by default though.
> > >
> > > > Probably even s/shmobile.yaml/renesas.yaml/, while at it?
> > >
> > > Sure, if that's what you all want.
> >
> > How about this?
> 
> LGTM

Thanks.

As my tree is already closed for v4.21 I have applied this for v4.22.


Re: [PATCH v2.1 24/34] dt-bindings: arm: Convert Rockchip board/soc bindings to json-schema

2018-12-10 Thread Heiko Stuebner
Am Sonntag, 9. Dezember 2018, 23:14:05 CET schrieb Heiko Stuebner:
Forgot the

From: Rob Herring 

here, but if you're ok with how it looks I can apply it to my tree.


Heiko

> Convert Rockchip SoC bindings to DT schema format using json-schema.
> 
> Cc: Mark Rutland 
> Cc: Heiko Stuebner 
> Cc: devicet...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-rockc...@lists.infradead.org
> Signed-off-by: Rob Herring 
> [move to per-board entries and added recently added boards]
> Signed-off-by: Heiko Stuebner 
> ---
> Hi Rob,
> 
> there are boards where the description adds much value and on others
> it is maybe less, but personally I'd like to keep things uniform,
> as that makes reading these things easier if the format stays the
> same all the time, so I've gone forward and just did the conversion
> 
> make dtbs_check did not complain about the schema it seems but I
> did end up with an error later on:
> 
> FATAL ERROR: Unknown output format "yaml"
> make[2]: *** [scripts/Makefile.lib:313: arch/arm/boot/dts/rk3036-evb.dt.yaml] 
> Fehler 1
> 
> But I guess I did not mess up the schema yet.
> 
> So does it look ok that way?
> Heiko
> 
>  .../devicetree/bindings/arm/rockchip.txt  | 240 --
>  .../devicetree/bindings/arm/rockchip.yaml | 419 ++
>  2 files changed, 419 insertions(+), 240 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/arm/rockchip.txt
>  create mode 100644 Documentation/devicetree/bindings/arm/rockchip.yaml
> 
> diff --git a/Documentation/devicetree/bindings/arm/rockchip.txt 
> b/Documentation/devicetree/bindings/arm/rockchip.txt
> deleted file mode 100644
> index 0cc71236d639..
> --- a/Documentation/devicetree/bindings/arm/rockchip.txt
> +++ /dev/null
> @@ -1,240 +0,0 @@
> -Rockchip platforms device tree bindings
> 
> -
> -- 96boards RK3399 Ficus (ROCK960 Enterprise Edition)
> -Required root node properties:
> -  - compatible = "vamrs,ficus", "rockchip,rk3399";
> -
> -- 96boards RK3399 Rock960 (ROCK960 Consumer Edition)
> -Required root node properties:
> -  - compatible = "vamrs,rock960", "rockchip,rk3399";
> -
> -- Amarula Vyasa RK3288 board
> -Required root node properties:
> -  - compatible = "amarula,vyasa-rk3288", "rockchip,rk3288";
> -
> -- Asus Tinker board
> -Required root node properties:
> -  - compatible = "asus,rk3288-tinker", "rockchip,rk3288";
> -
> -- Asus Tinker board S
> -Required root node properties:
> -  - compatible = "asus,rk3288-tinker-s", "rockchip,rk3288";
> -
> -- Kylin RK3036 board:
> -Required root node properties:
> -  - compatible = "rockchip,kylin-rk3036", "rockchip,rk3036";
> -
> -- MarsBoard RK3066 board:
> -Required root node properties:
> -  - compatible = "haoyu,marsboard-rk3066", "rockchip,rk3066a";
> -
> -- bq Curie 2 tablet:
> -Required root node properties:
> -  - compatible = "mundoreader,bq-curie2", "rockchip,rk3066a";
> -
> -- ChipSPARK Rayeager PX2 board:
> -Required root node properties:
> -  - compatible = "chipspark,rayeager-px2", "rockchip,rk3066a";
> -
> -- Radxa Rock board:
> -Required root node properties:
> -  - compatible = "radxa,rock", "rockchip,rk3188";
> -
> -- Radxa Rock2 Square board:
> -Required root node properties:
> -  - compatible = "radxa,rock2-square", "rockchip,rk3288";
> -
> -- Rikomagic MK808 v1 board:
> -Required root node properties:
> -  - compatible = "rikomagic,mk808", "rockchip,rk3066a";
> -
> -- Firefly Firefly-RK3288 board:
> -Required root node properties:
> -  - compatible = "firefly,firefly-rk3288", "rockchip,rk3288";
> -or
> -  - compatible = "firefly,firefly-rk3288-beta", "rockchip,rk3288";
> -
> -- Firefly Firefly-RK3288 Reload board:
> -Required root node properties:
> -  - compatible = "firefly,firefly-rk3288-reload", "rockchip,rk3288";
> -
> -- Firefly Firefly-RK3399 board:
> -Required root node properties:
> -  - compatible = "firefly,firefly-rk3399", "rockchip,rk3399";
> -
> -- Firefly roc-rk3328-cc board:
> -Required root node properties:
> -  - compatible = "firefly,roc-rk3328-cc", "rockchip,rk3328";
> -
> -- Firefly ROC-RK3399-PC board:
> -Required root node properties:
> -  - compatible = "firefly,roc-rk3399-pc", "rockchip,rk3399";
> -
> -- ChipSPARK PopMetal-RK3288 board:
> -Required root node properties:
> -  - compatible = "chipspark,popmetal-rk3288", "rockchip,rk3288";
> -
> -- Netxeon R89 board:
> -Required root node properties:
> -  - compatible = "netxeon,r89", "rockchip,rk3288";
> -
> -- GeekBuying GeekBox:
> -Required root node properties:
> -  - compatible = "geekbuying,geekbox", "rockchip,rk3368";
> -
> -- Google Bob (Asus Chromebook Flip C101PA):
> -Required root node properties:
> - compatible = "google,bob-rev13", "google,bob-rev12",
> -  "google,bob-rev11", "google,bob-rev10",

[PATCH] lib: fix build failure in CONFIG_DEBUG_VIRTUAL test

2018-12-10 Thread Christophe Leroy
On several arches, virt_to_phys() is in io.h

Build fails without it:

  CC  lib/test_debug_virtual.o
lib/test_debug_virtual.c: In function 'test_debug_virtual_init':
lib/test_debug_virtual.c:26:7: error: implicit declaration of function 
'virt_to_phys' [-Werror=implicit-function-declaration]
  pa = virt_to_phys(va);
   ^

Fixes: e4dace361552 ("lib: add test module for CONFIG_DEBUG_VIRTUAL")
CC: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 lib/test_debug_virtual.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/test_debug_virtual.c b/lib/test_debug_virtual.c
index d5a06addeb27..bf864c73e462 100644
--- a/lib/test_debug_virtual.c
+++ b/lib/test_debug_virtual.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #ifdef CONFIG_MIPS
-- 
2.13.3



[PATCH v2] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-10 Thread Christophe Leroy
This patch implements CONFIG_DEBUG_VIRTUAL to warn about
incorrect use of virt_to_phys() and page_to_phys()

It also warns about DMA on stack when CONFIG_HAVE_ARCH_VMAP_STACK is
selected. It will help locate them before activating CONFIG_VMAP_STACK

Below is the result of test_debug_virtual:

[1.438746] WARNING: CPU: 0 PID: 1 at ./arch/powerpc/include/asm/io.h:808 
test_debug_virtual_init+0x3c/0xd4
[1.448156] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.20.0-rc5-00560-g6bfb52e23a00-dirty #532
[1.457259] NIP:  c066c550 LR: c0650ccc CTR: c066c514
[1.462257] REGS: c900bdb0 TRAP: 0700   Not tainted  
(4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[1.471184] MSR:  00029032   CR: 48000422  XER: 2000
[1.477811]
[1.477811] GPR00: c0650ccc c900be60 c60d  006000c0 c900 
9032 c7fa0020
[1.477811] GPR08: 2400 0001 0900  c07b5d04  
c00037d8 
[1.477811] GPR16:     c076 c074 
0092 c0685bb0
[1.477811] GPR24: c065042c c068a734 c0685b8c 0006  c076 
c075c3c0 
[1.512711] NIP [c066c550] test_debug_virtual_init+0x3c/0xd4
[1.518315] LR [c0650ccc] do_one_initcall+0x8c/0x1cc
[1.523163] Call Trace:
[1.525595] [c900be60] [c0567340] 0xc0567340 (unreliable)
[1.530954] [c900be90] [c0650ccc] do_one_initcall+0x8c/0x1cc
[1.536551] [c900bef0] [c0651000] kernel_init_freeable+0x1f4/0x2cc
[1.542658] [c900bf30] [c00037ec] kernel_init+0x14/0x110
[1.547913] [c900bf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[1.553971] Instruction dump:
[1.556909] 3ca50100 bfa10024 54a5000e 3fa0c076 7c0802a6 3d454000 813dc204 
554893be
[1.564566] 7d294010 7d294910 90010034 39290001 <0f09> 7c3e0b78 955e0008 
3fe0c062
[1.572425] ---[ end trace 6f6984225b280ad6 ]---
[1.577467] PA: 0x0900 for VA: 0xc900
[1.581799] PA: 0x061e8f50 for VA: 0xc61e8f50

Signed-off-by: Christophe Leroy 
---
 v2: Using asm/pgtable.h to avoid build failure on ppc64e.
 Added a verification that the object is not in stack to catch problems 
before activing VMAP_STACK.

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/io.h | 19 ++-
 arch/powerpc/mm/pgtable_32.c  |  2 +-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e312e92e3381..94b46624068d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,6 +128,7 @@ config PPC
#
# Please keep this list sorted alphabetically.
#
+   select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_DMA_SET_COHERENT_MASK
select ARCH_HAS_ELF_RANDOMIZE
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index e746becd9d6f..51e96a4413d2 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -29,12 +29,14 @@ extern struct pci_dev *isa_bridge_pcidev;
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64
 #include 
@@ -804,6 +806,11 @@ extern void __iounmap_at(void *ea, unsigned long size);
  */
 static inline unsigned long virt_to_phys(volatile void * address)
 {
+   if (IS_ENABLED(CONFIG_DEBUG_VIRTUAL) &&
+   !WARN_ON(IS_ENABLED(CONFIG_HAVE_ARCH_VMAP_STACK) && current->pid &&
+object_is_on_stack((const void*)address)))
+   WARN_ON(!virt_addr_valid(address));
+
return __pa((unsigned long)address);
 }
 
@@ -827,7 +834,17 @@ static inline void * phys_to_virt(unsigned long address)
 /*
  * Change "struct page" to physical address.
  */
-#define page_to_phys(page) ((phys_addr_t)page_to_pfn(page) << PAGE_SHIFT)
+static inline phys_addr_t page_to_phys(struct page *page)
+{
+   unsigned long pfn = page_to_pfn(page);
+
+   if (IS_ENABLED(CONFIG_DEBUG_VIRTUAL) &&
+   !WARN_ON(IS_ENABLED(CONFIG_HAVE_ARCH_VMAP_STACK) && current->pid &&
+object_is_on_stack(__va(PFN_PHYS(pfn)
+   WARN_ON(!pfn_valid(pfn));
+
+   return PFN_PHYS(pfn);
+}
 
 /*
  * 32 bits still uses virt_to_bus() for it's implementation of DMA
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 4fc77a99c9bf..68d204a45cd0 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -143,7 +143,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
pgprot_t prot, void *call
 * Don't allow anybody to remap normal RAM that we're using.
 * mem_init() sets high_memory so only do the check after that.
 */
-   if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
+   if (slab_is_available() && virt_addr_valid(p) &&
page_is_ram(__phys_to_pfn(p))) {
printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
   (unsigned long long)p,