Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8

2021-11-01 Thread Michal Suchánek
On Fri, Oct 29, 2021 at 02:33:12PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Nicholas!
> 
> On 10/29/21 02:41, Nicholas Piggin wrote:
> > Soft lockup should mean it's taking timer interrupts still, just not 
> > scheduling. Do you have the hard lockup detector enabled as well? Is
> > there anything stuck spinning on another CPU?
> 
> > Could you try a sysrq+w to get a trace of blocked tasks?
> 
> Not sure how to send a magic SysRq over the IPMI serial console. Any idea?

As on any serial console, sending a break should be equivalent to the magic
SysRq key combination.

https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/security-sysrq.html

With ipmitool, a break is sent by typing ~B during an active SOL session.

https://linux.die.net/man/1/ipmitool
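
For example, to request a blocked-task dump over SOL (the host name and
credentials are placeholders, and sysrq must be enabled on the target,
e.g. echo 1 > /proc/sys/kernel/sysrq):

ipmitool -I lanplus -H bmc.example.com -U admin -P secret sol activate

Inside the SOL session, type ~B at the start of a line to send a serial
break, then press 'w' within 5 seconds to get the blocked-task dump
(sysrq+w).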

Thanks

Michal


Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8

2021-11-01 Thread Michal Suchánek
Hello,

On Thu, Oct 28, 2021 at 04:15:19PM +0200, John Paul Adrian Glaubitz wrote:
> Hi!
> 
> On 10/28/21 16:05, John Paul Adrian Glaubitz wrote:
> > The following packages were being built at the same time:
> > 
> > - guest 1: virtuoso-opensource and openturns
> > - guest 2: llvm-toolchain-13
> > 
> > I really did a lot of testing today with no issues, and just after I sent my
> > report to oss-security saying that the machine seems to be stable again, the
> > issue showed up :(.
> 
> Do you know whether IPMI features any sort of monitoring for capturing the
> output of the serial console non-interactively? This way I would be able to
> capture the crash besides what I have seen above.

I am pretty sure you can run ipmitool under something like script(1) to
capture output indefinitely, and do the same inside screen on a remote
machine.
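
Something along these lines should work (host and credentials are
placeholders):

script -c "ipmitool -I lanplus -H bmc.example.com -U admin sol activate" sol.log

or, to keep it running unattended on a remote machine:

screen -dmS sol script -c "ipmitool -I lanplus -H bmc.example.com -U admin sol activate" sol.log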

Thanks

Michal


Re: [PATCH v4 2/2] powerpc/64: Option to use ELF V2 ABI for big-endian kernels

2021-06-11 Thread Michal Suchánek
On Fri, Jun 11, 2021 at 11:58:19AM +0200, Michal Suchánek wrote:
> On Fri, Jun 11, 2021 at 07:39:59PM +1000, Nicholas Piggin wrote:
> > Provide an option to build big-endian kernels using the ELFv2 ABI. This
> > works on GCC only so far, although it is rumored to work with clang
> > that's not been tested yet. A new module version check ensures the
> > module ELF ABI level matches the kernel build.
> > 
> > This can give big-endian kernels some useful advantages of the ELFv2 ABI
> > (e.g., less stack usage, -mprofile-kernel, better compatibility with eBPF
> > tools).
> > 
> > BE+ELFv2 is not officially supported by the GNU toolchain, but it works
> > fine in testing and has been used by some userspace for some time (e.g.,
> > Void Linux).
> > 
> > Tested-by: Michal Suchánek 
> > Reviewed-by: Segher Boessenkool 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/Kconfig| 22 ++
> >  arch/powerpc/Makefile   | 18 --
> >  arch/powerpc/boot/Makefile  |  4 +++-
> >  arch/powerpc/include/asm/module.h   | 24 
> >  arch/powerpc/kernel/vdso64/Makefile | 13 +
> >  drivers/crypto/vmx/Makefile |  8 ++--
> >  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
> >  7 files changed, 86 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 088dd2afcfe4..093f973a28b9 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -163,6 +163,7 @@ config PPC
> > select ARCH_WEAK_RELEASE_ACQUIRE
> > select BINFMT_ELF
> > select BUILDTIME_TABLE_SORT
> > +   select PPC64_BUILD_ELF_V2_ABI   if PPC64 && CPU_LITTLE_ENDIAN
> > select CLONE_BACKWARDS
> > select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
> > select DMA_OPS_BYPASS   if PPC64
> > @@ -561,6 +562,27 @@ config KEXEC_FILE
> >  config ARCH_HAS_KEXEC_PURGATORY
> > def_bool KEXEC_FILE
> >  
> > +config PPC64_BUILD_ELF_V2_ABI
> > +   bool
> > +
> > +config PPC64_BUILD_BIG_ENDIAN_ELF_V2_ABI
> > +   bool "Build big-endian kernel using ELF ABI V2 (EXPERIMENTAL)"
> > +   depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> > +   depends on CC_IS_GCC && LD_VERSION >= 22400
> > +   default n
> > +   select PPC64_BUILD_ELF_V2_ABI
> > +   help
> > + This builds the kernel image using the "Power Architecture 64-Bit ELF
> > + V2 ABI Specification", which has a reduced stack overhead and faster
> > + function calls. This internal kernel ABI option does not affect
> > +  userspace compatibility.
> > +
> > + The V2 ABI is standard for 64-bit little-endian, but for big-endian
> > + it is less well tested by kernel and toolchain. However some distros
> > + build userspace this way, and it can produce a functioning kernel.
> > +
> > + This requires GCC and binutils 2.24 or newer.
> > +
> >  config RELOCATABLE
> > bool "Build a relocatable kernel"
> > depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE))
> > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> > index 3212d076ac6a..b90b5cb799aa 100644
> > --- a/arch/powerpc/Makefile
> > +++ b/arch/powerpc/Makefile
> > @@ -91,10 +91,14 @@ endif
> >  
> >  ifdef CONFIG_PPC64
> >  ifndef CONFIG_CC_IS_CLANG
> > -cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
> > -cflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call 
> > cc-option,-mcall-aixdesc)
> > -aflags-$(CONFIG_CPU_BIG_ENDIAN)+= $(call cc-option,-mabi=elfv1)
> > -aflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mabi=elfv2
> > +ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> > +cflags-y   += $(call cc-option,-mabi=elfv2)
> > +aflags-y   += $(call cc-option,-mabi=elfv2)
> > +else
> > +cflags-y   += $(call cc-option,-mabi=elfv1)
> > +cflags-y   += $(call cc-option,-mcall-aixdesc)
> > +aflags-y   += $(call cc-option,-mabi=elfv1)
> > +endif
> >  endif
> >  endif
> >  
> > @@ -142,15 +146,17 @@ endif
> >  
> >  CFLAGS-$(CONFIG_PPC64) := $(call cc-option,-mtraceback=no)
> >  ifndef CONFIG_CC_IS_CLANG
> > -ifdef CONFIG_CPU_LITTLE_ENDIAN
> > -CFLAGS-$(CONFIG_PPC64) += $(call cc-option,-mabi=el

Re: [PATCH v4 2/2] powerpc/64: Option to use ELF V2 ABI for big-endian kernels

2021-06-11 Thread Michal Suchánek
On Fri, Jun 11, 2021 at 07:39:59PM +1000, Nicholas Piggin wrote:
> Provide an option to build big-endian kernels using the ELFv2 ABI. This
> works on GCC only so far, although it is rumored to work with clang
> that's not been tested yet. A new module version check ensures the
> module ELF ABI level matches the kernel build.
> 
> This can give big-endian kernels some useful advantages of the ELFv2 ABI
> (e.g., less stack usage, -mprofile-kernel, better compatibility with eBPF
> tools).
> 
> BE+ELFv2 is not officially supported by the GNU toolchain, but it works
> fine in testing and has been used by some userspace for some time (e.g.,
> Void Linux).
> 
> Tested-by: Michal Suchánek 
> Reviewed-by: Segher Boessenkool 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/Kconfig| 22 ++
>  arch/powerpc/Makefile   | 18 --
>  arch/powerpc/boot/Makefile  |  4 +++-
>  arch/powerpc/include/asm/module.h   | 24 
>  arch/powerpc/kernel/vdso64/Makefile | 13 +
>  drivers/crypto/vmx/Makefile |  8 ++--
>  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
>  7 files changed, 86 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 088dd2afcfe4..093f973a28b9 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -163,6 +163,7 @@ config PPC
>   select ARCH_WEAK_RELEASE_ACQUIRE
>   select BINFMT_ELF
>   select BUILDTIME_TABLE_SORT
> + select PPC64_BUILD_ELF_V2_ABI   if PPC64 && CPU_LITTLE_ENDIAN
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DMA_OPS_BYPASS   if PPC64
> @@ -561,6 +562,27 @@ config KEXEC_FILE
>  config ARCH_HAS_KEXEC_PURGATORY
>   def_bool KEXEC_FILE
>  
> +config PPC64_BUILD_ELF_V2_ABI
> + bool
> +
> +config PPC64_BUILD_BIG_ENDIAN_ELF_V2_ABI
> + bool "Build big-endian kernel using ELF ABI V2 (EXPERIMENTAL)"
> + depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> + depends on CC_IS_GCC && LD_VERSION >= 22400
> + default n
> + select PPC64_BUILD_ELF_V2_ABI
> + help
> +   This builds the kernel image using the "Power Architecture 64-Bit ELF
> +   V2 ABI Specification", which has a reduced stack overhead and faster
> +   function calls. This internal kernel ABI option does not affect
> +  userspace compatibility.
> +
> +   The V2 ABI is standard for 64-bit little-endian, but for big-endian
> +   it is less well tested by kernel and toolchain. However some distros
> +   build userspace this way, and it can produce a functioning kernel.
> +
> +   This requires GCC and binutils 2.24 or newer.
> +
>  config RELOCATABLE
>   bool "Build a relocatable kernel"
>   depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE))
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 3212d076ac6a..b90b5cb799aa 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -91,10 +91,14 @@ endif
>  
>  ifdef CONFIG_PPC64
>  ifndef CONFIG_CC_IS_CLANG
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call 
> cc-option,-mcall-aixdesc)
> -aflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -aflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mabi=elfv2
> +ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> +cflags-y += $(call cc-option,-mabi=elfv2)
> +aflags-y += $(call cc-option,-mabi=elfv2)
> +else
> +cflags-y += $(call cc-option,-mabi=elfv1)
> +cflags-y += $(call cc-option,-mcall-aixdesc)
> +aflags-y += $(call cc-option,-mabi=elfv1)
> +endif
>  endif
>  endif
>  
> @@ -142,15 +146,17 @@ endif
>  
>  CFLAGS-$(CONFIG_PPC64)   := $(call cc-option,-mtraceback=no)
>  ifndef CONFIG_CC_IS_CLANG
> -ifdef CONFIG_CPU_LITTLE_ENDIAN
> -CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2,$(call 
> cc-option,-mcall-aixdesc))
> +ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> +CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2)
>  AFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2)
>  else
> +# Keep these in synch with arch/powerpc/kernel/vdso64/Makefile
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv1)
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mcall-aixdesc)
>  AFLAGS-$(CONFIG_PPC64)   += $(ca

Re: [PATCH v3] powerpc/64: Option to use ELFv2 ABI for big-endian kernels

2021-05-14 Thread Michal Suchánek
On Wed, May 05, 2021 at 10:07:29PM +1000, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > On Mon, May 03, 2021 at 01:37:57PM +0200, Andreas Schwab wrote:
> >> Should this add a tag to the module vermagic?
> >
> > Would the modules link even if the vermagic was not changed?
> 
> Most modules will require some symbols from the kernel, and those will
> be dot symbols, which won't resolve.
> 
> But there are a few small modules that don't rely on any kernel symbols,
> which can load.
> 
> > I suppose something like this might do it.
> 
> It would, but I feel like we should be handling this at the ELF level.
> ie. we don't allow loading modules with a different ELF machine type, so
> neither should we allow loading a module with the wrong ELF ABI.
> 
> And you can build the kernel without MODVERSIONS, so relying on
> MODVERSIONS still leaves a small exposure (same kernel version
> with/without ELFv2).
> 
> I don't see an existing hook that would do what we want. There's
> elf_check_arch(), but that also applies to userspace binaries, which is
> not what we want.
> 
> Maybe something like below.

The below patch works for me.

Tested-by: Michal Suchánek 

Built a Hello World module for both the v1 and v2 ABIs, and kernels built
with the v1 and v2 ABIs each rejected the module with the other ABI.

[  100.602943] Module has invalid ELF structures
insmod: ERROR: could not insert module moin_v1.ko: Invalid module format
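
The patch below keys off the low two bits of e_flags in the ELF header;
those bits can also be read with readelf (module names as above; the
exact output formatting may vary with the binutils version, and a v1
module may also show 0x0 for "unspecified", which the check likewise
treats as v1):

$ readelf -h moin_v1.ko | grep Flags
  Flags:                             0x1, abiv1
$ readelf -h moin_v2.ko | grep Flags
  Flags:                             0x2, abiv2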

Thanks

Michal
> 
> cheers
> 
> 
> diff --git a/arch/powerpc/include/asm/module.h 
> b/arch/powerpc/include/asm/module.h
> index 857d9ff24295..d0e9368982d8 100644
> --- a/arch/powerpc/include/asm/module.h
> +++ b/arch/powerpc/include/asm/module.h
> @@ -83,5 +83,28 @@ static inline int module_finalize_ftrace(struct module 
> *mod, const Elf_Shdr *sec
>  }
>  #endif
>  
> +#ifdef CONFIG_PPC64
> +static inline bool elf_check_module_arch(Elf_Ehdr *hdr)
> +{
> + unsigned long flags;
> +
> + if (!elf_check_arch(hdr))
> + return false;
> +
> + flags = hdr->e_flags & 0x3;
> +
> +#ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> + if (flags == 2)
> + return true;
> +#else
> + if (flags < 2)
> + return true;
> +#endif
> + return false;
> +}
> +
> +#define elf_check_module_arch elf_check_module_arch
> +#endif /* CONFIG_PPC64 */
> +
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_POWERPC_MODULE_H */
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 9e09d11ffe5b..fdc042a84562 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -13,6 +13,11 @@
>   * must be implemented by each architecture.
>   */
>  
> +// Allow arch to optionally do additional checking of module ELF header
> +#ifndef elf_check_module_arch
> +#define elf_check_module_arch elf_check_arch
> +#endif
> +
>  /* Adjust arch-specific sections.  Return 0 on success.  */
>  int module_frob_arch_sections(Elf_Ehdr *hdr,
> Elf_Shdr *sechdrs,
> diff --git a/kernel/module.c b/kernel/module.c
> index b5dd92e35b02..c71889107226 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -2941,7 +2941,7 @@ static int elf_validity_check(struct load_info *info)
>  
>   if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0
>   || info->hdr->e_type != ET_REL
> - || !elf_check_arch(info->hdr)
> + || !elf_check_module_arch(info->hdr)
>   || info->hdr->e_shentsize != sizeof(Elf_Shdr))
>   return -ENOEXEC;
>  


Re: [PATCH] powerpc/perf: Simplify Makefile

2021-05-07 Thread Michal Suchánek
On Fri, May 07, 2021 at 02:01:09PM +, Christophe Leroy wrote:
> arch/powerpc/Kbuild descends into arch/powerpc/perf/ only when
> CONFIG_PERF_EVENTS is selected, so there is no need to take
> CONFIG_PERF_EVENTS into account in arch/powerpc/perf/Makefile.

So long as CONFIG_PERF_EVENTS stays boolean.
If it were tristate, the result would be less clear.
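
For illustration, a sketch of how the two forms would diverge if the
option were tristate (not actual tree content):

# Kbuild sketch: with a bool, =y is the only enabled value, so the two
# lines below behave the same.  With a tristate, =m would put the object
# on obj-m (built as a module) in the first form, while the second form
# always builds it into the kernel image.
obj-$(CONFIG_PERF_EVENTS) += callchain.o   # =y built-in, =m module
obj-y                     += callchain.o   # unconditionally built-in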

Reviewed-by: Michal Suchánek 

Thanks

Michal
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/perf/Makefile | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
> index c02854dea2b2..2f46e31c7612 100644
> --- a/arch/powerpc/perf/Makefile
> +++ b/arch/powerpc/perf/Makefile
> @@ -1,9 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
> -obj-$(CONFIG_PERF_EVENTS)+= callchain.o callchain_$(BITS).o perf_regs.o
> -ifdef CONFIG_COMPAT
> -obj-$(CONFIG_PERF_EVENTS)+= callchain_32.o
> -endif
> +obj-y+= callchain.o callchain_$(BITS).o 
> perf_regs.o
> +obj-$(CONFIG_COMPAT) += callchain_32.o
>  
>  obj-$(CONFIG_PPC_PERF_CTRS)  += core-book3s.o bhrb.o
>  obj64-$(CONFIG_PPC_PERF_CTRS)+= ppc970-pmu.o power5-pmu.o \
> -- 
> 2.25.0
> 


Re: [PATCH v3] powerpc/64: Option to use ELFv2 ABI for big-endian kernels

2021-05-05 Thread Michal Suchánek
On Wed, May 05, 2021 at 05:23:37PM +0200, Michal Suchánek wrote:
> Hello,
> 
> looks like the ABI flags are not correctly applied when cross-compiling.
> 
> While building natively success of BTFIDS depends on the kernel ABI but
> when cross-compiling success of BTFIDS depends on the default toolchain
> ABI.
> 
> It's a problem independent of this patch - the problem exists both
> before and after.

Actually, this is not the case. I have now retested with an LE toolchain:
BTFIDS fails on a BE v2 kernel with either toolchain, whereas earlier the
default LE toolchain did produce BTFIDS on a BE v1 kernel.

No idea what is going on, except for the general issue that success or
failure of BTFIDS when cross-compiling for ppc64 is not representative of
success or failure when building natively. I don't even want to know what
would happen if I tried to link a BPF program against the kernel code
using that info.

Thanks

Michal

> 
> Thanks
> 
> Michal
> 
> On Mon, May 03, 2021 at 09:07:13PM +1000, Nicholas Piggin wrote:
> > Provide an option to build big-endian kernels using the ELFv2 ABI. This
> > works on GCC only so far, although it is rumored to work with clang
> > that's not been tested yet.
> > 
> > This can give big-endian kernels some useful advantages of the ELFv2 ABI
> > (e.g., less stack usage, -mprofile-kernel, better compatibility with bpf
> > tools).
> > 
> > BE+ELFv2 is not officially supported by the GNU toolchain, but it works
> > fine in testing and has been used by some userspace for some time (e.g.,
> > Void Linux).
> > 
> > Tested-by: Michal Suchánek 
> > Reviewed-by: Segher Boessenkool 
> > Signed-off-by: Nicholas Piggin 
> > ---
> > 
> > I didn't add the -mprofile-kernel change but I think it would be a good
> > one that can be merged independently if it works.
> > 
> > Since v2:
> > - Rebased, tweaked changelog.
> > - Changed ELF_V2 to ELF_V2_ABI in config options, to be clearer.
> > 
> > Since v1:
> > - Improved the override flavour name suggested by Segher.
> > - Improved changelog wording.
> > 
> >  arch/powerpc/Kconfig| 22 ++
> >  arch/powerpc/Makefile   | 18 --
> >  arch/powerpc/boot/Makefile  |  4 +++-
> >  arch/powerpc/kernel/vdso64/Makefile | 13 +
> >  drivers/crypto/vmx/Makefile |  8 ++--
> >  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
> >  6 files changed, 62 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 1e6230bea09d..d3f78d3d574d 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -160,6 +160,7 @@ config PPC
> > select ARCH_WEAK_RELEASE_ACQUIRE
> > select BINFMT_ELF
> > select BUILDTIME_TABLE_SORT
> > +   select PPC64_BUILD_ELF_V2_ABI   if PPC64 && CPU_LITTLE_ENDIAN
> > select CLONE_BACKWARDS
> > select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
> > select DMA_OPS  if PPC64
> > @@ -568,6 +569,27 @@ config KEXEC_FILE
> >  config ARCH_HAS_KEXEC_PURGATORY
> > def_bool KEXEC_FILE
> >  
> > +config PPC64_BUILD_ELF_V2_ABI
> > +   bool
> > +
> > +config PPC64_BUILD_BIG_ENDIAN_ELF_V2_ABI
> > +   bool "Build big-endian kernel using ELF ABI V2 (EXPERIMENTAL)"
> > +   depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> > +   depends on CC_IS_GCC && LD_VERSION >= 22400
> > +   default n
> > +   select PPC64_BUILD_ELF_V2_ABI
> > +   help
> > + This builds the kernel image using the "Power Architecture 64-Bit ELF
> > + V2 ABI Specification", which has a reduced stack overhead and faster
> > + function calls. This internal kernel ABI option does not affect
> > +  userspace compatibility.
> > +
> > + The V2 ABI is standard for 64-bit little-endian, but for big-endian
> > + it is less well tested by kernel and toolchain. However some distros
> > + build userspace this way, and it can produce a functioning kernel.
> > +
> > + This requires GCC and binutils 2.24 or newer.
> > +
> >  config RELOCATABLE
> > bool "Build a relocatable kernel"
> > depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE))
> > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> > index 3212d076ac6a..b90b5cb799aa 100644
> > --- a/arch/powerpc/Makefile
> > +++ b/arch/powerpc/Makefile
> > @@ -91,10 +91,14 @@ endif
> >  
>

Re: [PATCH v3] powerpc/64: Option to use ELFv2 ABI for big-endian kernels

2021-05-05 Thread Michal Suchánek
Hello,

looks like the ABI flags are not correctly applied when cross-compiling.

While building natively success of BTFIDS depends on the kernel ABI but
when cross-compiling success of BTFIDS depends on the default toolchain
ABI.

It's a problem independent of this patch - the problem exists both
before and after.

Thanks

Michal

On Mon, May 03, 2021 at 09:07:13PM +1000, Nicholas Piggin wrote:
> Provide an option to build big-endian kernels using the ELFv2 ABI. This
> works on GCC only so far, although it is rumored to work with clang
> that's not been tested yet.
> 
> This can give big-endian kernels some useful advantages of the ELFv2 ABI
> (e.g., less stack usage, -mprofile-kernel, better compatibility with bpf
> tools).
> 
> BE+ELFv2 is not officially supported by the GNU toolchain, but it works
> fine in testing and has been used by some userspace for some time (e.g.,
> Void Linux).
> 
> Tested-by: Michal Suchánek 
> Reviewed-by: Segher Boessenkool 
> Signed-off-by: Nicholas Piggin 
> ---
> 
> I didn't add the -mprofile-kernel change but I think it would be a good
> one that can be merged independently if it works.
> 
> Since v2:
> - Rebased, tweaked changelog.
> - Changed ELF_V2 to ELF_V2_ABI in config options, to be clearer.
> 
> Since v1:
> - Improved the override flavour name suggested by Segher.
> - Improved changelog wording.
> 
>  arch/powerpc/Kconfig| 22 ++
>  arch/powerpc/Makefile   | 18 --
>  arch/powerpc/boot/Makefile  |  4 +++-
>  arch/powerpc/kernel/vdso64/Makefile | 13 +
>  drivers/crypto/vmx/Makefile |  8 ++--
>  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
>  6 files changed, 62 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 1e6230bea09d..d3f78d3d574d 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -160,6 +160,7 @@ config PPC
>   select ARCH_WEAK_RELEASE_ACQUIRE
>   select BINFMT_ELF
>   select BUILDTIME_TABLE_SORT
> + select PPC64_BUILD_ELF_V2_ABI   if PPC64 && CPU_LITTLE_ENDIAN
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DMA_OPS  if PPC64
> @@ -568,6 +569,27 @@ config KEXEC_FILE
>  config ARCH_HAS_KEXEC_PURGATORY
>   def_bool KEXEC_FILE
>  
> +config PPC64_BUILD_ELF_V2_ABI
> + bool
> +
> +config PPC64_BUILD_BIG_ENDIAN_ELF_V2_ABI
> + bool "Build big-endian kernel using ELF ABI V2 (EXPERIMENTAL)"
> + depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> + depends on CC_IS_GCC && LD_VERSION >= 22400
> + default n
> + select PPC64_BUILD_ELF_V2_ABI
> + help
> +   This builds the kernel image using the "Power Architecture 64-Bit ELF
> +   V2 ABI Specification", which has a reduced stack overhead and faster
> +   function calls. This internal kernel ABI option does not affect
> +  userspace compatibility.
> +
> +   The V2 ABI is standard for 64-bit little-endian, but for big-endian
> +   it is less well tested by kernel and toolchain. However some distros
> +   build userspace this way, and it can produce a functioning kernel.
> +
> +   This requires GCC and binutils 2.24 or newer.
> +
>  config RELOCATABLE
>   bool "Build a relocatable kernel"
>   depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE))
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 3212d076ac6a..b90b5cb799aa 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -91,10 +91,14 @@ endif
>  
>  ifdef CONFIG_PPC64
>  ifndef CONFIG_CC_IS_CLANG
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call 
> cc-option,-mcall-aixdesc)
> -aflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -aflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mabi=elfv2
> +ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> +cflags-y += $(call cc-option,-mabi=elfv2)
> +aflags-y += $(call cc-option,-mabi=elfv2)
> +else
> +cflags-y += $(call cc-option,-mabi=elfv1)
> +cflags-y += $(call cc-option,-mcall-aixdesc)
> +aflags-y += $(call cc-option,-mabi=elfv1)
> +endif
>  endif
>  endif
>  
> @@ -142,15 +146,17 @@ endif
>  
>  CFLAGS-$(CONFIG_PPC64)   := $(call cc-option,-mtraceback=no)
>  ifndef CONFIG_CC_IS_CLANG
> -ifdef CO

Re: [PATCH v3] powerpc/64: Option to use ELFv2 ABI for big-endian kernels

2021-05-05 Thread Michal Suchánek
On Wed, May 05, 2021 at 10:07:29PM +1000, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > On Mon, May 03, 2021 at 01:37:57PM +0200, Andreas Schwab wrote:
> >> Should this add a tag to the module vermagic?
> >
> > Would the modules link even if the vermagic was not changed?
> 
> Most modules will require some symbols from the kernel, and those will
> be dot symbols, which won't resolve.
> 
> But there are a few small modules that don't rely on any kernel symbols,
> which can load.
> 
> > I suppose something like this might do it.
> 
> It would, but I feel like we should be handling this at the ELF level.
> ie. we don't allow loading modules with a different ELF machine type, so
> neither should we allow loading a module with the wrong ELF ABI.
> 
> And you can build the kernel without MODVERSIONS, so relying on
> MODVERSIONS still leaves a small exposure (same kernel version
> with/without ELFv2).
> 
> I don't see an existing hook that would do what we want. There's
> elf_check_arch(), but that also applies to userspace binaries, which is
> not what we want.
> 
> Maybe something like below.
Yes, that looks better.

Thanks

Michal
> 
> cheers
> 
> 
> diff --git a/arch/powerpc/include/asm/module.h 
> b/arch/powerpc/include/asm/module.h
> index 857d9ff24295..d0e9368982d8 100644
> --- a/arch/powerpc/include/asm/module.h
> +++ b/arch/powerpc/include/asm/module.h
> @@ -83,5 +83,28 @@ static inline int module_finalize_ftrace(struct module 
> *mod, const Elf_Shdr *sec
>  }
>  #endif
>  
> +#ifdef CONFIG_PPC64
> +static inline bool elf_check_module_arch(Elf_Ehdr *hdr)
> +{
> + unsigned long flags;
> +
> + if (!elf_check_arch(hdr))
> + return false;
> +
> + flags = hdr->e_flags & 0x3;
> +
> +#ifdef CONFIG_PPC64_BUILD_ELF_V2_ABI
> + if (flags == 2)
> + return true;
> +#else
> + if (flags < 2)
> + return true;
> +#endif
> + return false;
> +}
> +
> +#define elf_check_module_arch elf_check_module_arch
> +#endif /* CONFIG_PPC64 */
> +
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_POWERPC_MODULE_H */
> diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> index 9e09d11ffe5b..fdc042a84562 100644
> --- a/include/linux/moduleloader.h
> +++ b/include/linux/moduleloader.h
> @@ -13,6 +13,11 @@
>   * must be implemented by each architecture.
>   */
>  
> +// Allow arch to optionally do additional checking of module ELF header
> +#ifndef elf_check_module_arch
> +#define elf_check_module_arch elf_check_arch
> +#endif
> +
>  /* Adjust arch-specific sections.  Return 0 on success.  */
>  int module_frob_arch_sections(Elf_Ehdr *hdr,
> Elf_Shdr *sechdrs,
> diff --git a/kernel/module.c b/kernel/module.c
> index b5dd92e35b02..c71889107226 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -2941,7 +2941,7 @@ static int elf_validity_check(struct load_info *info)
>  
>   if (memcmp(info->hdr->e_ident, ELFMAG, SELFMAG) != 0
>   || info->hdr->e_type != ET_REL
> - || !elf_check_arch(info->hdr)
> + || !elf_check_module_arch(info->hdr)
>   || info->hdr->e_shentsize != sizeof(Elf_Shdr))
>   return -ENOEXEC;
>  


Re: [PATCH v2] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2021-05-04 Thread Michal Suchánek
On Tue, May 04, 2021 at 11:11:25PM +0530, Naveen N. Rao wrote:
> Nicholas Piggin wrote:
> > Excerpts from Michal Suchánek's message of May 4, 2021 6:17 am:
> > > On Mon, May 03, 2021 at 11:34:25AM +0200, Michal Suchánek wrote:
> > > > On Mon, May 03, 2021 at 09:11:16AM +0200, Michal Suchánek wrote:
> > > > > On Mon, May 03, 2021 at 10:58:33AM +1000, Nicholas Piggin wrote:
> > > > > > Excerpts from Michal Suchánek's message of May 3, 2021 2:57 am:
> > > > > > > On Tue, Apr 28, 2020 at 09:25:17PM +1000, Nicholas Piggin wrote:
> > > > > > >> Provide an option to use ELFv2 ABI for big endian builds. This 
> > > > > > >> works on
> > > > > > >> GCC and clang (since 2014). It is less well tested and supported 
> > > > > > >> by the
> > > > > > >> GNU toolchain, but it can give some useful advantages of the 
> > > > > > >> ELFv2 ABI
> > > > > > >> for BE (e.g., less stack usage). Some distros even build BE ELFv2
> > > > > > >> userspace.
> > > > > > > 
> > > > > > > Fixes BTFID failure on BE for me and the ELF ABIv2 kernel boots.
> > > > > > 
> > > > > > What's the BTFID failure? Anything we can do to fix it on the v1 ABI or
> > > > > > at least make it depend on BUILD_ELF_V2?
> > > > > 
> > > > > Looks like symbols are prefixed with a dot in ABIv1 and BTFID tool is
> > > > > not aware of that. It can be disabled on ABIv1 easily.
> 
> Yes, I think BTF is generated by pahole, so we will need to add support for
> recognising dot symbols there.

There are symbols both with and without dot, and the dwarves
development is headed towards using the ones without dot. Not sure it's
the correct way to resolve it, though.
https://lore.kernel.org/lkml/8c3cbd22-eb26-ea8b-c8bb-35a629d6d...@kernel.org/

> 
> > > > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > > > index 678c13967580..e703c26e9b80 100644
> > > > > --- a/lib/Kconfig.debug
> > > > > +++ b/lib/Kconfig.debug
> > > > > @@ -305,6 +305,7 @@ config DEBUG_INFO_BTF
> > > > >   bool "Generate BTF typeinfo"
> > > > >   depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
> > > > >   depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
> > > > > + depends on !PPC64 || BUILD_ELF_V2
> > > > >   help
> > > > >   Generate deduplicated BTF type information from DWARF debug info.
> > > > >   Turning this on expects presence of pahole tool, which will convert
> > > > > 
> > > > > > > Tested-by: Michal Suchánek 
> > > > > > > 
> > > > > > > Also can we enable mprofile on BE now?
> > > > > > > 
> > > > > > > I don't see anything endian-specific in the mprofile code at a glance
> > > > > > > but don't have any idea how to test it.
> > > > > > 
> > > > > > AFAIK it's just a different ABI for the _mcount call so just running
> > > > > > some ftrace and ftrace with call graph should test it reasonably well.
> > > > 
> > > > It does not crash and burn but there are some regressions from LE to BE
> > > > on the ftrace kernel selftest:
> > > > 
> > > > @@ -16,10 +16,10 @@
> > > >  [n] event tracing - enable/disable with event level files  [PASS]
> > > >  [n] event tracing - restricts events based on pid notrace filtering
> > > > [PASS]
> > > >  [n] event tracing - restricts events based on pid  [PASS]
> > > > -[n] event tracing - enable/disable with subsystem level files  [PASS]
> > > > +[n] event tracing - enable/disable with subsystem level files  [FAIL]
> > > >  [n] event tracing - enable/disable with top level files[PASS]
> > > > -[n] Test trace_printk from module  [UNRESOLVED]
> > > > +[n] Test trace_printk from module  [FAIL]
> > > > -[n] ftrace - function graph filters with stack tracer  [PASS]
> > > > +[n] ftrace - function graph filters with stack tracer  [FAIL]
> > > >  [n] ftrace - function graph filters[PASS]
> > > >  [n] ftrace - function trace with cpumask   [PASS]
> > > >  [n] ftrace - te

Re: [PATCH] Raise the minimum GCC version to 5.2

2021-05-04 Thread Michal Suchánek
On Tue, May 04, 2021 at 02:09:24PM +0200, Miguel Ojeda wrote:
> On Tue, May 4, 2021 at 11:22 AM Michal Suchánek  wrote:
> >
> > Except it makes the question "Is this bug we see on this ancient
> > system still present in upstream?" needlessly more difficult to
> > answer.
> 
> Can you please provide some details? If you are talking about testing
> a new kernel image in the ancient system "as-is", why wouldn't you
> build it in a newer system? If you are talking about  particular
> problems about bisecting (kernel, compiler) pairs etc., details would
> also be welcome.

Yes, bisecting comes to mind. If you need to switch the userspace as
well, the bisection results are not that solid. You may not even be able
to bisect because the workload does not exist on a new system at all.
Crafting a minimal test case that can be forward-ported to a new system
is not always trivial - if you understood the problem to that extent you
might not even need to bisect it in the first place.

Thanks

Michal


Re: [PATCH] Raise the minimum GCC version to 5.2

2021-05-04 Thread Michal Suchánek
On Tue, May 04, 2021 at 10:38:32AM +0200, Miguel Ojeda wrote:
> On Tue, May 4, 2021 at 9:57 AM Ben Dooks  wrote:
> >
> > Some of us are a bit stuck as either customer refuses to upgrade
> > their build infrastructure or has paid for some old but safety
> > blessed version of gcc. These often lag years behind the recent
> > gcc releases :(
> 
> In those scenarios, why do you need to build mainline? Aren't your
> customers using longterm or frozen kernels? If they are paying for
> certified GCC images, aren't they already paying for supported kernel
> images from some vendor too?
> 
> I understand where you are coming from -- I have also dealt with
> projects/machines running ancient, unsupported software/toolchains for
> various reasons; but nobody expected upstream (and in particular the
> mainline kernel source) to support them. In the cases I experienced,
> those use cases require not touching anything at all, and when the
> time came of doing so, everything would be updated at once,
> re-certified/validated as needed and frozen again.

Except it makes the question "Is this bug we see on this ancient system
still present in upstream?" needlessly more difficult to answer.

Sure, throwing out old compiler versions that are known to cause
problems makes sense. Updating to the latest just because makes much
less sense.

One of the selling points of C in general and gcc in particular is
stability. If we need the latest compiler we might as well rewrite the
kernel in Rust, which has a required update cycle of a few months.

Because some mainline kernel features rely on bleeding-edge tools I end
up building mainline with current tools anyway, but if you do not need
BTF or whatever other latest gimmick, older toolchains should do.

Thanks

Michal


Re: [PATCH v2] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2021-05-03 Thread Michal Suchánek
On Mon, May 03, 2021 at 11:34:25AM +0200, Michal Suchánek wrote:
> On Mon, May 03, 2021 at 09:11:16AM +0200, Michal Suchánek wrote:
> > On Mon, May 03, 2021 at 10:58:33AM +1000, Nicholas Piggin wrote:
> > > Excerpts from Michal Suchánek's message of May 3, 2021 2:57 am:
> > > > On Tue, Apr 28, 2020 at 09:25:17PM +1000, Nicholas Piggin wrote:
> > > >> Provide an option to use ELFv2 ABI for big endian builds. This works on
> > > >> GCC and clang (since 2014). It is less well tested and supported by the
> > > >> GNU toolchain, but it can give some useful advantages of the ELFv2 ABI
> > > >> for BE (e.g., less stack usage). Some distros even build BE ELFv2
> > > >> userspace.
> > > > 
> > > > Fixes BTFID failure on BE for me and the ELF ABIv2 kernel boots.
> > > 
> > > What's the BTFID failure? Anything we can do to fix it on the v1 ABI or 
> > > at least make it depend on BUILD_ELF_V2?
> > 
> > Looks like symbols are prefixed with a dot in ABIv1 and BTFID tool is
> > not aware of that. It can be disabled on ABIv1 easily.
> > 
> > Thanks
> > 
> > Michal
> > 
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 678c13967580..e703c26e9b80 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -305,6 +305,7 @@ config DEBUG_INFO_BTF
> > bool "Generate BTF typeinfo"
> > depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
> > depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
> > +   depends on !PPC64 || BUILD_ELF_V2
> > help
> >   Generate deduplicated BTF type information from DWARF debug info.
> >   Turning this on expects presence of pahole tool, which will convert
> > 
> > > 
> > > > 
> > > > Tested-by: Michal Suchánek 
> > > > 
> > > > Also can we enable mprofile on BE now?
> > > > 
> > > > I don't see anything endian-specific in the mprofile code at a glance
> > > > but don't have any idea how to test it.
> > > 
> > > AFAIK it's just a different ABI for the _mcount call so just running
> > > some ftrace and ftrace with call graph should test it reasonably well.
> 
> It does not crash and burn but there are some regressions from LE to BE
> on the ftrace kernel selftest:
> 
> --- ftraceLE.txt  2021-05-03 11:19:14.83000 +0200
> +++ ftraceBE.txt  2021-05-03 11:27:24.77000 +0200
> @@ -7,8 +7,8 @@
>  [n] Change the ringbuffer size   [PASS]
>  [n] Snapshot and tracing setting [PASS]
>  [n] trace_pipe and trace_marker  [PASS]
> -[n] Test ftrace direct functions against tracers [UNRESOLVED]
> -[n] Test ftrace direct functions against kprobes [UNRESOLVED]
> +[n] Test ftrace direct functions against tracers [FAIL]
> +[n] Test ftrace direct functions against kprobes [FAIL]
>  [n] Generic dynamic event - add/remove kprobe events [PASS]
>  [n] Generic dynamic event - add/remove synthetic events  [PASS]
>  [n] Generic dynamic event - selective clear (compatibility)  [PASS]
> @@ -16,10 +16,10 @@
>  [n] event tracing - enable/disable with event level files[PASS]
>  [n] event tracing - restricts events based on pid notrace filtering  [PASS]
>  [n] event tracing - restricts events based on pid[PASS]
> -[n] event tracing - enable/disable with subsystem level files[PASS]
> +[n] event tracing - enable/disable with subsystem level files[FAIL]
>  [n] event tracing - enable/disable with top level files  [PASS]
> -[n] Test trace_printk from module[UNRESOLVED]
> -[n] ftrace - function graph filters with stack tracer[PASS]
> +[n] Test trace_printk from module[FAIL]
> +[n] ftrace - function graph filters with stack tracer[FAIL]
>  [n] ftrace - function graph filters  [PASS]
>  [n] ftrace - function trace with cpumask [PASS]
>  [n] ftrace - test for function event triggers[PASS]
> @@ -27,7 +27,7 @@
>  [n] ftrace - function pid notrace filters[PASS]
>  [n] ftrace - function pid filters[PASS]
>  [n] ftrace - stacktrace filter command   [PASS]
> -[n] ftrace - function trace on module[UNRESOLVED]
> +[n] ftrace - function trace on module[FAIL]
>  [n] ftrace - function profiler with function tracing [PASS]
>  [n] ftrace - function profiling  [PASS]
>  [n] ftrace - test reading of set_ftrace_filter   [PASS]
> @@ -44,10 +44,10 @@
>  [n] Kprobe event argument syntax [PASS]
>  [n] Kprobe dynamic event with arguments  [PASS]
>  [n] Kprobes event arguments with types   [PAS

Re: [PATCH v3] powerpc/64: Option to use ELFv2 ABI for big-endian kernels

2021-05-03 Thread Michal Suchánek
On Mon, May 03, 2021 at 01:37:57PM +0200, Andreas Schwab wrote:
> Should this add a tag to the module vermagic?

Would the modules link even if the vermagic was not changed?

I suppose something like this might do it.

Thanks

Michal

diff --git a/arch/powerpc/include/asm/vermagic.h b/arch/powerpc/include/asm/vermagic.h
index b054a8576e5d..3fdaacd7a743 100644
--- a/arch/powerpc/include/asm/vermagic.h
+++ b/arch/powerpc/include/asm/vermagic.h
@@ -14,7 +14,14 @@
 #define MODULE_ARCH_VERMAGIC_RELOCATABLE   ""
 #endif
 
+
+#ifdef CONFIG_PPC64_BUILD_BIG_ENDIAN_ELF_V2_ABI
+#define MODULE_ARCH_VERMAGIC_ELF_V2_ABI"abi-elfv2 "
+#else
+#define MODULE_ARCH_VERMAGIC_ELF_V2_ABI""
+#endif
+
 #define MODULE_ARCH_VERMAGIC \
-   MODULE_ARCH_VERMAGIC_FTRACE MODULE_ARCH_VERMAGIC_RELOCATABLE
+   MODULE_ARCH_VERMAGIC_FTRACE MODULE_ARCH_VERMAGIC_RELOCATABLE MODULE_ARCH_VERMAGIC_ELF_V2_ABI
 
 #endif /* _ASM_VERMAGIC_H */
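
With a change like this, the mismatch would at least become visible in
the module info, e.g. (illustrative output, kernel version made up):

$ modinfo -F vermagic mymod.ko
5.13.0-rc1 SMP mod_unload abi-elfv2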


Re: [PATCH v2] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2021-05-03 Thread Michal Suchánek
On Mon, May 03, 2021 at 09:11:16AM +0200, Michal Suchánek wrote:
> On Mon, May 03, 2021 at 10:58:33AM +1000, Nicholas Piggin wrote:
> > Excerpts from Michal Suchánek's message of May 3, 2021 2:57 am:
> > > On Tue, Apr 28, 2020 at 09:25:17PM +1000, Nicholas Piggin wrote:
> > >> Provide an option to use ELFv2 ABI for big endian builds. This works on
> > >> GCC and clang (since 2014). It is less well tested and supported by the
> > >> GNU toolchain, but it can give some useful advantages of the ELFv2 ABI
> > >> for BE (e.g., less stack usage). Some distros even build BE ELFv2
> > >> userspace.
> > > 
> > > Fixes BTFID failure on BE for me and the ELF ABIv2 kernel boots.
> > 
> > What's the BTFID failure? Anything we can do to fix it on the v1 ABI or 
> > at least make it depend on BUILD_ELF_V2?
> 
> Looks like symbols are prefixed with a dot in ABIv1 and BTFID tool is
> not aware of that. It can be disabled on ABIv1 easily.
> 
> Thanks
> 
> Michal
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 678c13967580..e703c26e9b80 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -305,6 +305,7 @@ config DEBUG_INFO_BTF
>   bool "Generate BTF typeinfo"
>   depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
>   depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
> + depends on !PPC64 || BUILD_ELF_V2
>   help
> Generate deduplicated BTF type information from DWARF debug info.
> Turning this on expects presence of pahole tool, which will convert
> 
> > 
> > > 
> > > Tested-by: Michal Suchánek 
> > > 
> > > Also can we enable mprofile on BE now?
> > > 
> > > I don't see anything endian-specific in the mprofile code at a glance
> > > but don't have any idea how to test it.
> > 
> > AFAIK it's just a different ABI for the _mcount call so just running
> > some ftrace and ftrace with call graph should test it reasonably well.

It does not crash and burn but there are some regressions from LE to BE
on the ftrace kernel selftest:

--- ftraceLE.txt2021-05-03 11:19:14.83000 +0200
+++ ftraceBE.txt2021-05-03 11:27:24.77000 +0200
@@ -7,8 +7,8 @@
 [n] Change the ringbuffer size [PASS]
 [n] Snapshot and tracing setting   [PASS]
 [n] trace_pipe and trace_marker[PASS]
-[n] Test ftrace direct functions against tracers   [UNRESOLVED]
-[n] Test ftrace direct functions against kprobes   [UNRESOLVED]
+[n] Test ftrace direct functions against tracers   [FAIL]
+[n] Test ftrace direct functions against kprobes   [FAIL]
 [n] Generic dynamic event - add/remove kprobe events   [PASS]
 [n] Generic dynamic event - add/remove synthetic events[PASS]
 [n] Generic dynamic event - selective clear (compatibility)[PASS]
@@ -16,10 +16,10 @@
 [n] event tracing - enable/disable with event level files  [PASS]
 [n] event tracing - restricts events based on pid notrace filtering[PASS]
 [n] event tracing - restricts events based on pid  [PASS]
-[n] event tracing - enable/disable with subsystem level files  [PASS]
+[n] event tracing - enable/disable with subsystem level files  [FAIL]
 [n] event tracing - enable/disable with top level files[PASS]
-[n] Test trace_printk from module  [UNRESOLVED]
-[n] ftrace - function graph filters with stack tracer  [PASS]
+[n] Test trace_printk from module  [FAIL]
+[n] ftrace - function graph filters with stack tracer  [FAIL]
 [n] ftrace - function graph filters[PASS]
 [n] ftrace - function trace with cpumask   [PASS]
 [n] ftrace - test for function event triggers  [PASS]
@@ -27,7 +27,7 @@
 [n] ftrace - function pid notrace filters  [PASS]
 [n] ftrace - function pid filters  [PASS]
 [n] ftrace - stacktrace filter command [PASS]
-[n] ftrace - function trace on module  [UNRESOLVED]
+[n] ftrace - function trace on module  [FAIL]
 [n] ftrace - function profiler with function tracing   [PASS]
 [n] ftrace - function profiling[PASS]
 [n] ftrace - test reading of set_ftrace_filter [PASS]
@@ -44,10 +44,10 @@
 [n] Kprobe event argument syntax   [PASS]
 [n] Kprobe dynamic event with arguments[PASS]
 [n] Kprobes event arguments with types [PASS]
-[n] Kprobe event user-memory access[UNSUPPORTED]
+[n] Kprobe event user-memory access[FAIL]
 [n] Kprobe event auto/manual naming[PASS]
 [n] Kprobe dynamic event with function tracer  [PASS]
-[n] Kprobe dynamic event - probing module  [UNRESOLVED]
+[n] Kprobe dynamic event - probing module  [FAIL]
 [n] Create/delete multiprobe on kprobe event   [PASS]
 [n] Kprobe event parser error log check[PASS]
 [n] Kretprobe dynamic event with arguments [PASS]
@@ -57,11 +57,11 @@
 [n] Kprobe e

Re: [PATCH v2] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2021-05-03 Thread Michal Suchánek
On Mon, May 03, 2021 at 10:58:33AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of May 3, 2021 2:57 am:
> > On Tue, Apr 28, 2020 at 09:25:17PM +1000, Nicholas Piggin wrote:
> >> Provide an option to use ELFv2 ABI for big endian builds. This works on
> >> GCC and clang (since 2014). It is less well tested and supported by the
> >> GNU toolchain, but it can give some useful advantages of the ELFv2 ABI
> >> for BE (e.g., less stack usage). Some distros even build BE ELFv2
> >> userspace.
> > 
> > Fixes BTFID failure on BE for me and the ELF ABIv2 kernel boots.
> 
> What's the BTFID failure? Anything we can do to fix it on the v1 ABI or 
> at least make it depend on BUILD_ELF_V2?

Looks like symbols are prefixed with a dot in ABIv1 and BTFID tool is
not aware of that. It can be disabled on ABIv1 easily.

Thanks

Michal

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 678c13967580..e703c26e9b80 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -305,6 +305,7 @@ config DEBUG_INFO_BTF
bool "Generate BTF typeinfo"
depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
+   depends on !PPC64 || BUILD_ELF_V2
help
  Generate deduplicated BTF type information from DWARF debug info.
  Turning this on expects presence of pahole tool, which will convert
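
For reference, the dot/non-dot duality that trips up BTFID looks like
this on an ELFv1 vmlinux (illustrative output, addresses made up):

$ nm vmlinux | grep -w schedule
c000000000e0a8c8 D schedule     # ELFv1 function descriptor in .opd
c0000000000fa470 T .schedule    # actual entry point, dot-prefixed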

> 
> > 
> > Tested-by: Michal Suchánek 
> > 
> > Also can we enable mprofile on BE now?
> > 
> > I don't see anything endian-specific in the mprofile code at a glance
> > but don't have any idea how to test it.
> 
> AFAIK it's just a different ABI for the _mcount call so just running
> some ftrace and ftrace with call graph should test it reasonably well.
> 
> > 
> > Thanks
> > 
> > Michal
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 6a4ad11f6349..75b3afbfc378 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -495,7 +495,7 @@ config LD_HEAD_STUB_CATCH
> >   If unsure, say "N".
> >  
> >  config MPROFILE_KERNEL
> > -   depends on PPC64 && CPU_LITTLE_ENDIAN && FUNCTION_TRACER
> > +   depends on PPC64 && BUILD_ELF_V2 && FUNCTION_TRACER
> > def_bool 
> > $(success,$(srctree)/arch/powerpc/tools/gcc-check-mprofile-kernel.sh $(CC) 
> > -I$(srctree)/include -D__KERNEL__)
> 
> Good idea. I can't remember if I did a grep for LITTLE_ENDIAN to check 
> for other such opportunities.
> 
> Thanks,
> Nick
> 
> >  
> >  config HOTPLUG_CPU
> >> 
> >> Reviewed-by: Segher Boessenkool 
> >> Signed-off-by: Nicholas Piggin 
> >> ---
> >> Since v1:
> >> - Improved the override flavour name suggested by Segher.
> >> - Improved changelog wording.
> >> 
> >> 
> >>  arch/powerpc/Kconfig| 19 +++
> >>  arch/powerpc/Makefile   | 15 ++-
> >>  arch/powerpc/boot/Makefile  |  4 
> >>  drivers/crypto/vmx/Makefile |  8 ++--
> >>  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
> >>  5 files changed, 45 insertions(+), 11 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> >> index 924c541a9260..d9d2abc06c2c 100644
> >> --- a/arch/powerpc/Kconfig
> >> +++ b/arch/powerpc/Kconfig
> >> @@ -147,6 +147,7 @@ config PPC
> >>select ARCH_WEAK_RELEASE_ACQUIRE
> >>select BINFMT_ELF
> >>select BUILDTIME_TABLE_SORT
> >> +  select BUILD_ELF_V2 if PPC64 && CPU_LITTLE_ENDIAN
> >>select CLONE_BACKWARDS
> >>select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
> >>select DYNAMIC_FTRACE   if FUNCTION_TRACER
> >> @@ -541,6 +542,24 @@ config KEXEC_FILE
> >>  config ARCH_HAS_KEXEC_PURGATORY
> >>def_bool KEXEC_FILE
> >>  
> >> +config BUILD_ELF_V2
> >> +  bool
> >> +
> >> +config BUILD_BIG_ENDIAN_ELF_V2
> >> +  bool "Build big-endian kernel using ELFv2 ABI (EXPERIMENTAL)"
> >> +  depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> >> +  default n
> >> +  select BUILD_ELF_V2
> >> +  help
> >> +This builds the kernel image using the ELFv2 ABI, which has a
> >> +reduced stack overhead and faster function calls. This does not
> >> +affect the userspace ABIs.
> >

Re: [PATCH v2] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2021-05-02 Thread Michal Suchánek
On Tue, Apr 28, 2020 at 09:25:17PM +1000, Nicholas Piggin wrote:
> Provide an option to use ELFv2 ABI for big endian builds. This works on
> GCC and clang (since 2014). It is less well tested and supported by the
> GNU toolchain, but it can give some useful advantages of the ELFv2 ABI
> for BE (e.g., less stack usage). Some distros even build BE ELFv2
> userspace.

Fixes BTFID failure on BE for me and the ELF ABIv2 kernel boots.

Tested-by: Michal Suchánek 

Also can we enable mprofile on BE now?

I don't see anything endian-specific in the mprofile code at a glance
but don't have any idea how to test it.

Thanks

Michal

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6a4ad11f6349..75b3afbfc378 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -495,7 +495,7 @@ config LD_HEAD_STUB_CATCH
  If unsure, say "N".
 
 config MPROFILE_KERNEL
-   depends on PPC64 && CPU_LITTLE_ENDIAN && FUNCTION_TRACER
+   depends on PPC64 && BUILD_ELF_V2 && FUNCTION_TRACER
	def_bool $(success,$(srctree)/arch/powerpc/tools/gcc-check-mprofile-kernel.sh $(CC) -I$(srctree)/include -D__KERNEL__)
 
 config HOTPLUG_CPU
> 
> Reviewed-by: Segher Boessenkool 
> Signed-off-by: Nicholas Piggin 
> ---
> Since v1:
> - Improved the override flavour name suggested by Segher.
> - Improved changelog wording.
> 
> 
>  arch/powerpc/Kconfig| 19 +++
>  arch/powerpc/Makefile   | 15 ++-
>  arch/powerpc/boot/Makefile  |  4 
>  drivers/crypto/vmx/Makefile |  8 ++--
>  drivers/crypto/vmx/ppc-xlate.pl | 10 ++
>  5 files changed, 45 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 924c541a9260..d9d2abc06c2c 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -147,6 +147,7 @@ config PPC
>   select ARCH_WEAK_RELEASE_ACQUIRE
>   select BINFMT_ELF
>   select BUILDTIME_TABLE_SORT
> + select BUILD_ELF_V2 if PPC64 && CPU_LITTLE_ENDIAN
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DYNAMIC_FTRACE   if FUNCTION_TRACER
> @@ -541,6 +542,24 @@ config KEXEC_FILE
>  config ARCH_HAS_KEXEC_PURGATORY
>   def_bool KEXEC_FILE
>  
> +config BUILD_ELF_V2
> + bool
> +
> +config BUILD_BIG_ENDIAN_ELF_V2
> + bool "Build big-endian kernel using ELFv2 ABI (EXPERIMENTAL)"
> + depends on PPC64 && CPU_BIG_ENDIAN && EXPERT
> + default n
> + select BUILD_ELF_V2
> + help
> +   This builds the kernel image using the ELFv2 ABI, which has a
> +   reduced stack overhead and faster function calls. This does not
> +   affect the userspace ABIs.
> +
> +   ELFv2 is the standard ABI for little-endian, but for big-endian
> +   this is an experimental option that is less tested (kernel and
> +   toolchain). This requires gcc 4.9 or newer and binutils 2.24 or
> +   newer.
> +
>  config RELOCATABLE
>   bool "Build a relocatable kernel"
>   depends on PPC64 || (FLATMEM && (44x || FSL_BOOKE))
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index f310c32e88a4..e306b39d847e 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -92,10 +92,14 @@ endif
>  
>  ifdef CONFIG_PPC64
>  ifndef CONFIG_CC_IS_CLANG
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -cflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call 
> cc-option,-mcall-aixdesc)
> -aflags-$(CONFIG_CPU_BIG_ENDIAN)  += $(call cc-option,-mabi=elfv1)
> -aflags-$(CONFIG_CPU_LITTLE_ENDIAN)   += -mabi=elfv2
> +ifdef CONFIG_BUILD_ELF_V2
> +cflags-y += $(call cc-option,-mabi=elfv2,$(call 
> cc-option,-mcall-aixdesc))
> +aflags-y += $(call cc-option,-mabi=elfv2)
> +else
> +cflags-y += $(call cc-option,-mabi=elfv1)
> +cflags-y += $(call cc-option,-mcall-aixdesc)
> +aflags-y += $(call cc-option,-mabi=elfv1)
> +endif
>  endif
>  endif
>  
> @@ -144,7 +148,7 @@ endif
>  
>  CFLAGS-$(CONFIG_PPC64)   := $(call cc-option,-mtraceback=no)
>  ifndef CONFIG_CC_IS_CLANG
> -ifdef CONFIG_CPU_LITTLE_ENDIAN
> +ifdef CONFIG_BUILD_ELF_V2
>  CFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2,$(call 
> cc-option,-mcall-aixdesc))
>  AFLAGS-$(CONFIG_PPC64)   += $(call cc-option,-mabi=elfv2)
>  else
> @@ -153,6 +157,7 @@ CFLAGS-$(CONFIG_PPC64)+= $(call 
> cc-option,-mcall-ai

Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-28 Thread Michal Suchánek
On Wed, Apr 28, 2021 at 11:28:48AM +0530, Gautham R Shenoy wrote:
> Hello Michal,
> 
> On Sun, Apr 25, 2021 at 01:07:14PM +0200, Michal Suchánek wrote:
> > On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek  [2021-04-23 20:42:16]:
> > > 
> > > > On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > * Michal Suchánek  [2021-04-23 19:45:05]:
> > > > > 
> > > > > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan 
> > > > > > wrote:
> > > > > > > * Michal Suchánek  [2021-04-23 09:35:51]:
> > > > > > > 
> > > > > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy 
> > > > > > > > wrote:
> > > > > > > > > From: "Gautham R. Shenoy" 
> > > > > > > > > 
> > > > > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the 
> > > > > > > > > latency values
> > > > > > > > > of the Extended CEDE states advertised by the platform
> > > > > > > > > 
> > > > > > > > > On some of the POWER9 LPARs, the older firmwares advertise a 
> > > > > > > > > very low
> > > > > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. 
> > > > > > > > > However the
> > > > > > > > Can you be more specific about 'older firmwares'?
> > > > > > > 
> > > > > > > Hi Michal,
> > > > > > > 
> > > > > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  
> > > > > > > The
> > > > > > > key idea behind the original patch was to make the H_CEDE latency 
> > > > > > > and
> > > > > > > hence target residency come from firmware instead of being 
> > > > > > > decided by
> > > > > > > the kernel.  The advantage is such that, different type of 
> > > > > > > systems in
> > > > > > > POWER10 generation can adjust this value and have an optimal 
> > > > > > > H_CEDE
> > > > > > > entry criteria which balances good single thread performance and
> > > > > > > wakeup latency.  Further we can have additional H_CEDE state to 
> > > > > > > feed
> > > > > > > into the cpuidle.  
> > > > > > 
> > > > > > So all POWER9 machines are affected by the firmware bug where 
> > > > > > firmware
> > > > > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > > > > causes the kernel to prefer CEDE1 too much when relying on the 
> > > > > > values
> > > > > > supplied by the firmware. It is not about 'older firmware'.
> > > > > 
> > > > > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > > > > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > > > > 
> > > > > The correct definition of the bug or miss-match in expectation is that
> > > > > firmware reports wakeup latency from a core/thread wakeup timing, but
> > > > > not end-to-end time from sending a wakeup event like an IPI using
> > > > > H_calls and receiving the events on the target.  Practically there are
> > > > > few extra micro-seconds needed after deciding to wakeup a target
> > > > > core/thread to getting the target to start executing instructions
> > > > > within the LPAR instance.
> > > > 
> > > > Thanks for the detailed explanation.
> > > > 
> > > > Maybe just adding a few microseconds to the reported time would be a
> > > > more reasonable workaround than using a blanket fixed value then.
> > > 
> > > Yes, that is an option.  But that may only reduce the difference
> > > between existing kernel and new kernel unless we make it the same
> > > number.  Further we are fixing this in P10 and hence we will have to
> > > add "if(P9) do the compensation" and otherwise take it as is.  That
> > > would not be elegant.  Given that

Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-25 Thread Michal Suchánek
On Sat, Apr 24, 2021 at 01:07:16PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek  [2021-04-23 20:42:16]:
> 
> > On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek  [2021-04-23 19:45:05]:
> > > 
> > > > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > > > * Michal Suchánek  [2021-04-23 09:35:51]:
> > > > > 
> > > > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > > > From: "Gautham R. Shenoy" 
> > > > > > > 
> > > > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency 
> > > > > > > values
> > > > > > > of the Extended CEDE states advertised by the platform
> > > > > > > 
> > > > > > > On some of the POWER9 LPARs, the older firmwares advertise a very 
> > > > > > > low
> > > > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However 
> > > > > > > the
> > > > > > Can you be more specific about 'older firmwares'?
> > > > > 
> > > > > Hi Michal,
> > > > > 
> > > > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > > > key idea behind the original patch was to make the H_CEDE latency and
> > > > > hence target residency come from firmware instead of being decided by
> > > > > the kernel.  The advantage is such that, different type of systems in
> > > > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > > > entry criteria which balances good single thread performance and
> > > > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > > > into the cpuidle.  
> > > > 
> > > > So all POWER9 machines are affected by the firmware bug where firmware
> > > > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > > > causes the kernel to prefer CEDE1 too much when relying on the values
> > > > supplied by the firmware. It is not about 'older firmware'.
> > > 
> > > Correct.  All POWER9 systems running Linux as guest LPARs will see
> > > extra usage of CEDE idle state, but not baremetal (PowerNV).
> > > 
> > > The correct definition of the bug or miss-match in expectation is that
> > > firmware reports wakeup latency from a core/thread wakeup timing, but
> > > not end-to-end time from sending a wakeup event like an IPI using
> > > H_calls and receiving the events on the target.  Practically there are
> > > few extra micro-seconds needed after deciding to wakeup a target
> > > core/thread to getting the target to start executing instructions
> > > within the LPAR instance.
> > 
> > Thanks for the detailed explanation.
> > 
> > Maybe just adding a few microseconds to the reported time would be a
> > more reasonable workaround than using a blanket fixed value then.
> 
> Yes, that is an option.  But that may only reduce the difference
> between existing kernel and new kernel unless we make it the same
> number.  Further we are fixing this in P10 and hence we will have to
> add "if(P9) do the compensation" and otherwise take it as is.  That
> would not be elegant.  Given that our goal for P9 platform is to not
> introduce changes in H_CEDE entry behaviour, we arrived at this
> approach (this small patch) and this also makes it easy to backport to
> various distro products.

I don't see how this is more elegant.

The current patch is

if (p9)
        use fixed value

the suggested patch is

if (p9)
        apply compensation

Either way it will add one branch for the affected platform.

But I understand that if you do not have confidence that the compensation
is the same in all cases, and do not have the opportunity to measure it,
it may be simpler to apply one very conservative adjustment.
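
For illustration, the two shapes could look roughly like this. This is
only a sketch: the constant names are made up, and only the
cpu_has_feature() test is taken from the real kernel API.

    /* Sketch only; POWER9_CEDE1_LATENCY_US and P9_WAKEUP_OVERHEAD_US
     * are hypothetical constants, not driver code. */
    static u64 fixup_cede0_latency(u64 fw_latency_us)
    {
            if (cpu_has_feature(CPU_FTR_ARCH_31))   /* POWER10 and later */
                    return fw_latency_us;           /* trust the firmware value */

            /* Option A (this patch): ignore firmware, use a fixed value. */
            return POWER9_CEDE1_LATENCY_US;

            /* Option B (suggested above): keep the firmware value but add
             * the measured wakeup overhead:
             *      return fw_latency_us + P9_WAKEUP_OVERHEAD_US;
             */
    }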

Thanks

Michal


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek  [2021-04-23 19:45:05]:
> 
> > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek  [2021-04-23 09:35:51]:
> > > 
> > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > From: "Gautham R. Shenoy" 
> > > > > 
> > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > of the Extended CEDE states advertised by the platform
> > > > > 
> > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > Can you be more specific about 'older firmwares'?
> > > 
> > > Hi Michal,
> > > 
> > > This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > key idea behind the original patch was to make the H_CEDE latency and
> > > hence target residency come from firmware instead of being decided by
> > > the kernel.  The advantage is such that, different type of systems in
> > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > entry criteria which balances good single thread performance and
> > > wakeup latency.  Further we can have additional H_CEDE state to feed
> > > into the cpuidle.  
> > 
> > So all POWER9 machines are affected by the firmware bug where firmware
> > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > causes the kernel to prefer CEDE1 too much when relying on the values
> > supplied by the firmware. It is not about 'older firmware'.
> 
> Correct.  All POWER9 systems running Linux as guest LPARs will see
> extra usage of CEDE idle state, but not baremetal (PowerNV).
> 
> The correct definition of the bug or miss-match in expectation is that
> firmware reports wakeup latency from a core/thread wakeup timing, but
> not end-to-end time from sending a wakeup event like an IPI using
> H_calls and receiving the events on the target.  Practically there are
> few extra micro-seconds needed after deciding to wakeup a target
> core/thread to getting the target to start executing instructions
> within the LPAR instance.

Thanks for the detailed explanation.

Maybe just adding a few microseconds to the reported time would be a
more reasonable workaround than using a blanket fixed value then.

> 
> > I still think it would be preferable to adjust the latency value
> > reported by the firmware to match reality over a kernel workaround.
> 
> Right, practically we can fix for future releases and as such we
> targeted this scheme from POWER10 but expected no harm on POWER9 which
> proved to be wrong.
> 
> We can possibly change this FW value for POWER9, but it is too
> expensive and not practical because many release streams exist for
> different platforms and further customers are at different streams as
> well.  We cannot force all of them to update because that blows up
> co-dependency matrix.

From the user point of view only a few firmware release streams exist but
what is packaged in such binaries might be another story.

Thanks

Michal


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek  [2021-04-23 09:35:51]:
> 
> > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy" 
> > > 
> > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > of the Extended CEDE states advertised by the platform
> > > 
> > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > Can you be more specific about 'older firmwares'?
> 
> Hi Michal,
> 
> This is POWER9 vs POWER10 difference, not really an obsolete FW.  The
> key idea behind the original patch was to make the H_CEDE latency and
> hence target residency come from firmware instead of being decided by
> the kernel.  The advantage is such that, different type of systems in
> POWER10 generation can adjust this value and have an optimal H_CEDE
> entry criteria which balances good single thread performance and
> wakeup latency.  Further we can have additional H_CEDE state to feed
> into the cpuidle.  

So all POWER9 machines are affected by the firmware bug where firmware
reports CEDE1 exit latency of 2us and the real latency is 5us which
causes the kernel to prefer CEDE1 too much when relying on the values
supplied by the firmware. It is not about 'older firmware'.

I still think it would be preferable to adjust the latency value
reported by the firmware to match reality over a kernel workaround.

Thanks

Michal


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform
> 
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
Can you be more specific about 'older firmwares'?

Also, while this is a performance regression on such firmwares, it
should be fixed by updating the firmware to the current version.

Having sub-optimal performance on obsolete firmware should not require a
kernel workaround, should it?

It's not like the kernel would crash on the affected firmware.

Thanks

Michal


Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Michal Suchánek
Hello,

On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
>  wrote:
> >
> > Lijun Pan [l...@linux.vnet.ibm.com] wrote:
> > >
> > >
> > > > On Apr 20, 2021, at 4:35 PM, Dany Madden  wrote:
> > > >
> > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > inoperable state.
> > > >
> > > > Instead, make the driver ignore this link down failure and continue to
> > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > recover.
> > >
> > > This v2 does not adddress the concerns mentioned in v1.
> > > And I think it is better to exit with error from do_reset, and schedule a 
> > > thorough
> > > do_hard_reset if the the adapter is already in unstable state.
> >
> > We had a FATAL error and when handling it, we failed to send a
> > link-down message to the VIOS. So what we need to try next is to
> > reset the connection with the VIOS. For this we must talk to the
> > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > does just that in ibmvnic_reset_crq().
> >
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult
> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.

This debate is not very constructive.

In the context of a driver that has separate do_reset and do_hard_reset
this fix picks the correct one, unless you can refute the arguments
provided.

Merging do_reset and do_hard_reset might be a good code cleanup which is
out of the scope of this fix.



Given that the vast majority of fixes to the vnic driver are related to
reset handling, it would improve stability and testability if every
reset took the same code path.

In the context of merging do_hard_reset and do_reset, the question is
what the intended distinction is, and what performance is gained by
having a 'lightweight' reset.

I don't have a vnic protocol manual at hand and I suspect I would not
get one even if I searched for one.

From reading through the fixes in the past, my understanding is that the
full reset is required when the backend changes, which then potentially
requires a different size/number of buffers.

What is the expected situation when reset is required without changing
the backend?

Is this so common that it warrants a separate 'lightweight' optimized
function?

Thanks

Michal


Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2021-04-15 Thread Michal Suchánek
Hello,

On Wed, Apr 14, 2021 at 11:08:19AM +0800, Pingfan Liu wrote:
> On Sat, Apr 10, 2021 at 12:33 AM Michal Suchánek  wrote:
> >
> > Hello,
> >
> > On Fri, Aug 28, 2020 at 04:10:09PM +0800, Pingfan Liu wrote:
> > > On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  
> > > wrote:
> > > >
> > > > Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > > > > A bug is observed on pseries by taking the following steps on rhel:
> > > > > -1. drmgr -c mem -r -q 5
> > > > > -2. echo c > /proc/sysrq-trigger
> > > > >
> > > > > And then, the failure looks like:
> > > > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > > > > kdump: saving vmcore-dmesg.txt
> > > > > kdump: saving vmcore-dmesg.txt complete
> > > > > kdump: saving vmcore
> > > > >   Checking for memory holes : [  0.0 %] / 
> > > > >   Checking for memory holes : 
> > > > > [100.0 %] |   Excluding unnecessary pages 
> > > > >   : [100.0 %] \   Copying data
> > > > >   : [  0.3 %] -  eta: 38s[   44.337636] 
> > > > > hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
> > > > > access=0x8004 current=makedumpfile
> > > > > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > > > psize=2 psize 2 pte=0xc0005504
> > > > > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
> > > > > access=0x8004 current=makedumpfile
> > > > > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > > > psize=2 psize 2 pte=0xc0005504
> > > > > [   44.337708] makedumpfile[469]: unhandled signal 7 at 
> > > > > 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > > > > [   44.338548] Core dump to |/bin/false pipe failed
> > > > > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
> > > > > $CORE_COLLECTOR /proc/vmcore 
> > > > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > > > > kdump: saving vmcore failed
> > > > >
> > > > > * Root cause *
> > > > >After analyzing, it turns out that in the current implementation,
> > > > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt 
> > > > > updating as
> > > > > the code __remove_memory() comes before drmem_update_dt().
> > > > > So in kdump kernel, when read_from_oldmem() resorts to
> > > > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > > > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, 
> > > > > as it
> > > > > can be observed "Bus error"
> > > > >
> > > > >  From a viewpoint of listener and publisher, the publisher notifies 
> > > > > the
> > > > > listener before data is ready.  This introduces a problem where udev
> > > > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > > > > updating. And in capture kernel, makedumpfile will access the memory 
> > > > > based
> > > > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> > > > >
> > > > > * Fix *
> > > > > This bug is introduced by commit 063b8b1251fd
> > > > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > > > > request"), which tried to combine all the dt updating into one.
> > > > >
> > > > > To fix this issue, meanwhile not to introduce a quadratic runtime
> > > > > complexity by the model:
> > > > >dlpar_memory_add_by_count
> > > > >  for_each_drmem_lmb <--
> > > > >dlpar_add_lmb
> > > > >  drmem_update_dt(_v1|_v2)
> > > > >for_each_drmem_lmb   <--
> > > > > The dt should still be only updated once, and just before the last 
> > > > > memory
> > > > > online/offline event is ejected to user space. Achieve this by 
> > > > > tracing the
> > > > > num of lmb added or removed.
> > > &

Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-12 Thread Michal Suchánek
On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> >
>
> From topology.c
>
>   SD_SHARE_PKG_RESOURCES - describes shared caches
>
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.

It does not seem to be the case according to the original description:

 When the scheduler tries to wakeup a task, it chooses between the
 waker-CPU and the wakee's previous-CPU. Suppose this choice is called
 the "target", then in the target's LLC domain, the scheduler
 
 a) tries to find an idle core in the LLC. This helps exploit the
This is the same as (b) Should this be SMT^^^ ?
SMT folding that the wakee task can benefit from. If an idle
core is found, the wakee is woken up on it.
 
 b) Failing to find an idle core, the scheduler tries to find an idle
CPU in the LLC. This helps minimise the wakeup latency for the
wakee since it gets to run on the CPU immediately.
 
 c) Failing this, it will wake it up on target CPU.
 
 Thus, with P9-sched topology, since the CACHE domain comprises of two
 SMT4 cores, there is a decent chance that we get an idle core, failing
 which there is a relatively higher probability of finding an idle CPU
 among the 8 threads in the domain.
 
 However, in P10-sched topology, since the SMT domain is the LLC and it
 contains only a single SMT4 core, the probability that we find that
 core to be idle is less. Furthermore, since there are only 4 CPUs to
 search for an idle CPU, there is lower probability that we can get an
 idle CPU to wake up the task on.

>
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

Not spreading load between SMT4 domains within MC is exactly what setting LLC
at MC level would address, wouldn't it?

As in on P10 we have two relevant levels but the topology as is describes only
one, and moving the LLC level lower gives two levels the scheduler looks at
again. Or am I missing something?
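
For reference, my understanding of what the RFC amounts to, recalled
from arch/powerpc/kernel/smp.c - treat this as a sketch, the names may
not match the actual patch:

    static int powerpc_shared_cache_flags(void)
    {
            return SD_SHARE_PKG_RESOURCES;
    }

    static struct sched_domain_topology_level powerpc_topology[] = {
            { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
            { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
            /* The RFC adds the flags callback to the MC line, so the
             * scheduler treats the whole MC group as one LLC: */
            { cpu_mc_mask, powerpc_shared_cache_flags, SD_INIT_NAME(MC) },
            { cpu_cpu_mask, SD_INIT_NAME(DIE) },
            { NULL, },
    };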

Thanks

Michal

> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re adding another layer  in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_banches.
> >
>
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.
>
> But still, your comment reminded me that different architectures have
> different requirements
>
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
>   spillover to L3 when selecting and idle CPU.
>
> X86 varies, it might want the Power10 approach for some families and prefer
>   L3 spilling over to a CPU on the same node in others.
>
> S390 cares about something called books and drawers although I've no
>   idea what it means as such and whether it has any preferences on
>   search order.
>
> ARM has similar requirements again according to "scheduler: expose the
>   topology of clusters and add cluster scheduler" and that one *does*
>   add another domain.
>
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
>
> The only commonality I thought might exist is that architectures may
> like to define what the first domain to search for an idle CPU and a
> second domain. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The

Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2021-04-09 Thread Michal Suchánek
Hello,

On Fri, Aug 28, 2020 at 04:10:09PM +0800, Pingfan Liu wrote:
> On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  wrote:
> >
> > Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > > A bug is observed on pseries by taking the following steps on rhel:
> > > -1. drmgr -c mem -r -q 5
> > > -2. echo c > /proc/sysrq-trigger
> > >
> > > And then, the failure looks like:
> > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > > kdump: saving vmcore-dmesg.txt
> > > kdump: saving vmcore-dmesg.txt complete
> > > kdump: saving vmcore
> > >   Checking for memory holes : [  0.0 %] / 
> > >   Checking for memory holes : [100.0 %] | 
> > >   Excluding unnecessary pages   : 
> > > [100.0 %] \   Copying data
> > >   : [  0.3 %] -  eta: 38s[   44.337636] hash-mmu: mm: Hashing 
> > > failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > psize=2 psize 2 pte=0xc0005504
> > > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
> > > access=0x8004 current=makedumpfile
> > > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > psize=2 psize 2 pte=0xc0005504
> > > [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 
> > > nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > > [   44.338548] Core dump to |/bin/false pipe failed
> > > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
> > > $CORE_COLLECTOR /proc/vmcore 
> > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > > kdump: saving vmcore failed
> > >
> > > * Root cause *
> > >After analyzing, it turns out that in the current implementation,
> > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt 
> > > updating as
> > > the code __remove_memory() comes before drmem_update_dt().
> > > So in kdump kernel, when read_from_oldmem() resorts to
> > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as 
> > > it
> > > can be observed "Bus error"
> > >
> > >  From a viewpoint of listener and publisher, the publisher notifies the
> > > listener before data is ready.  This introduces a problem where udev
> > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > > updating. And in capture kernel, makedumpfile will access the memory based
> > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> > >
> > > * Fix *
> > > This bug is introduced by commit 063b8b1251fd
> > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > > request"), which tried to combine all the dt updating into one.
> > >
> > > To fix this issue, meanwhile not to introduce a quadratic runtime
> > > complexity by the model:
> > >dlpar_memory_add_by_count
> > >  for_each_drmem_lmb <--
> > >dlpar_add_lmb
> > >  drmem_update_dt(_v1|_v2)
> > >for_each_drmem_lmb   <--
> > > The dt should still be only updated once, and just before the last memory
> > > online/offline event is ejected to user space. Achieve this by tracing the
> > > num of lmb added or removed.
> > >
> > > Signed-off-by: Pingfan Liu 
> > > Cc: Michael Ellerman 
> > > Cc: Hari Bathini 
> > > Cc: Nathan Lynch 
> > > Cc: Nathan Fontenot 
> > > Cc: Laurent Dufour 
> > > To: linuxppc-dev@lists.ozlabs.org
> > > Cc: ke...@lists.infradead.org
> > > ---
> > > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
> > >whether dt is updated successfully.
> > >Fix a condition boundary check bug
> > > v3 -> v4: resolve a quadratic runtime complexity issue.
> > >This series is applied on next-test branch
> > >   arch/powerpc/platforms/pseries/hotplug-memory.c | 102 
> > > +++-
> > >   1 file changed, 80 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> > > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > index 46cbcd1..1567d9f 100644
> > > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
> > >   return true;
> > >   }
> > >
> > > -static int dlpar_add_lmb(struct drmem_lmb *);
> > > +enum dt_update_status {
> > > + DT_NOUPDATE,
> > > + DT_TOUPDATE,
> > > + DT_UPDATED,
> > > +};
> > > +
> > > +/* "*dt_update" returns DT_UPDATED if updated */
> > > +static int dlpar_add_lmb(struct drmem_lmb *lmb,
> > > + enum dt_update_status *dt_update);
> > >
> > > -static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> > > +static int dlpar_remove_lmb(struct drmem_lm

Re: [PATCH] rpadlpar: fix potential drc_name corruption in store functions

2021-03-13 Thread Michal Suchánek
On Wed, Mar 10, 2021 at 04:30:21PM -0600, Tyrel Datwyler wrote:
> Both add_slot_store() and remove_slot_store() try to fix up the drc_name
> copied from the store buffer by placing a NULL terminator at nbytes + 1
> or in place of a '\n' if present. However, the static buffer that we
> copy the drc_name data into is not zeroed and can contain anything past
> the n-th byte. This is problematic if a '\n' byte appears in that buffer
> after nbytes and the string copied into the store buffer was not NULL
> terminated to start with as the strchr() search for a '\n' byte will mark
> this incorrectly as the end of the drc_name string resulting in a drc_name
> string that contains garbage data after the n-th byte. The following
> debugging shows an example of the drmgr utility writing "PHB 4543" to
> the add_slot sysfs attribute, but add_slot_store logging a corrupted
> string value.
> 
> [135823.702864] drmgr: drmgr: -c phb -a -s PHB 4543 -d 1
> [135823.702879] add_slot_store: drc_name = PHB 4543°|<82>!, rc = -19
> 
> Fix this by NULL terminating the string when we copy it into our static
> buffer by copying nbytes + 1 of data from the store buffer. The code has
Why is it OK to copy nbytes + 1 and why is it expected that the buffer
contains a nul after the content?

Isn't it much saner to just nul terminate the string after copying?

diff --git a/drivers/pci/hotplug/rpadlpar_sysfs.c 
b/drivers/pci/hotplug/rpadlpar_sysfs.c
index cdbfa5df3a51..cfbad67447da 100644
--- a/drivers/pci/hotplug/rpadlpar_sysfs.c
+++ b/drivers/pci/hotplug/rpadlpar_sysfs.c
@@ -35,11 +35,11 @@ static ssize_t add_slot_store(struct kobject *kobj, struct 
kobj_attribute *attr,
return 0;
 
memcpy(drc_name, buf, nbytes);
+   drc_name[nbytes] = '\0';
 
end = strchr(drc_name, '\n');
-   if (!end)
-   end = &drc_name[nbytes];
-   *end = '\0';
+   if (end)
+   *end = '\0';
 
rc = dlpar_add_slot(drc_name);
if (rc)
@@ -66,11 +66,11 @@ static ssize_t remove_slot_store(struct kobject *kobj,
return 0;
 
memcpy(drc_name, buf, nbytes);
+   drc_name[nbytes] = '\0';
 
end = strchr(drc_name, '\n');
-   if (!end)
-   end = &drc_name[nbytes];
-   *end = '\0';
+   if (end)
+   *end = '\0';
 
rc = dlpar_remove_slot(drc_name);
if (rc)
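
As a standalone sketch of the pattern (the helper name is made up;
MAX_DRC_NAME_LEN, memcpy() and strchr() are as used by the driver):

    static int copy_drc_name(char *dst, const char *buf, size_t nbytes)
    {
            char *end;

            if (nbytes >= MAX_DRC_NAME_LEN)
                    return -EINVAL;

            memcpy(dst, buf, nbytes);
            dst[nbytes] = '\0';     /* never trust the old buffer contents */

            end = strchr(dst, '\n');
            if (end)
                    *end = '\0';    /* strip a trailing newline if present */
            return 0;
    }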

Thanks

Michal

> already made sure that nbytes is not >= MAX_DRC_NAME_LEN and the store
> buffer is guaranteed to be zeroed beyond the nth-byte of data copied
> from the user. Further, since the string is now NULL terminated the code
> only needs to change '\n' to '\0' when present.
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/pci/hotplug/rpadlpar_sysfs.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/rpadlpar_sysfs.c 
> b/drivers/pci/hotplug/rpadlpar_sysfs.c
> index cdbfa5df3a51..375087921284 100644
> --- a/drivers/pci/hotplug/rpadlpar_sysfs.c
> +++ b/drivers/pci/hotplug/rpadlpar_sysfs.c
> @@ -34,12 +34,11 @@ static ssize_t add_slot_store(struct kobject *kobj, 
> struct kobj_attribute *attr,
>   if (nbytes >= MAX_DRC_NAME_LEN)
>   return 0;
>  
> - memcpy(drc_name, buf, nbytes);
> + memcpy(drc_name, buf, nbytes + 1);
>  
>   end = strchr(drc_name, '\n');
> - if (!end)
> - end = &drc_name[nbytes];
> - *end = '\0';
> + if (end)
> + *end = '\0';
>  
>   rc = dlpar_add_slot(drc_name);
>   if (rc)
> @@ -65,12 +64,11 @@ static ssize_t remove_slot_store(struct kobject *kobj,
>   if (nbytes >= MAX_DRC_NAME_LEN)
>   return 0;
>  
> - memcpy(drc_name, buf, nbytes);
> + memcpy(drc_name, buf, nbytes + 1);
>  
>   end = strchr(drc_name, '\n');
> - if (!end)
> - end = &drc_name[nbytes];
> - *end = '\0';
> + if (end)
> + *end = '\0';
>  
>   rc = dlpar_remove_slot(drc_name);
>   if (rc)
> -- 
> 2.27.0
> 


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Michal Suchánek
On Tue, Mar 09, 2021 at 07:46:11AM -0800, Davidlohr Bueso wrote:
> On Tue, 09 Mar 2021, Michal Suchánek wrote:
> 
> > On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> > > 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> > > busy-waiting pausing with a preferred SMT priority pattern, lowering
> > > the priority (reducing decode cycles) during the whole loop slowpath.
> > > 
> > > However, data shows that while this pattern works well with simple
> >  ^^
> > > spinlocks, queued spinlocks benefit more being kept in medium priority,
> > > with a cpu_relax() instead, being a low+medium combo on powerpc.
> > ...
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h 
> > > b/arch/powerpc/include/asm/barrier.h
> > > index aecfde829d5d..7ae29cfb06c0 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -80,22 +80,6 @@ do {   
> > > \
> > >   ___p1;  \
> > >  })
> > > 
> > > -#ifdef CONFIG_PPC64
> > Maybe it should be kept for the simple spinlock case then?
> 
> It is kept, note that simple spinlocks don't use smp_cond_load_relaxed,
> but instead deal with the priorities in arch_spin_lock(), so it will
> spin in low priority until it sees a chance to take the lock, where
> it switches back to medium.

Indeed, thanks for the clarification.
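
For reference, the shape of the simple-spinlock pattern described above,
using the spin_begin()/spin_end() helpers from
arch/powerpc/include/asm/processor.h ("cond" is a placeholder for the
lock-is-free test, so this is a sketch, not the actual lock code):

    spin_begin();           /* HMT_low(): drop SMT priority while polling */
    while (!cond)
            spin_cpu_relax();
    spin_end();             /* HMT_medium(): restore priority to take the lock */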

Michal


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Michal Suchánek
On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
  ^^
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.
...
> 
> diff --git a/arch/powerpc/include/asm/barrier.h 
> b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do {   
> \
>   ___p1;  \
>  })
>  
> -#ifdef CONFIG_PPC64
Maybe it should be kept for the simple spinlock case then?

Thanks

Michal
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({ \
> - typeof(ptr) __PTR = (ptr);  \
> - __unqual_scalar_typeof(*ptr) VAL;   \
> - VAL = READ_ONCE(*__PTR);\
> - if (unlikely(!(cond_expr))) {   \
> - spin_begin();   \
> - do {\
> - VAL = READ_ONCE(*__PTR);\
> - } while (!(cond_expr)); \
> - spin_end(); \
> - }   \
> - (typeof(*ptr))VAL;  \
> -})
> -#endif
> -
>  #ifdef CONFIG_PPC_BOOK3S_64
>  #define NOSPEC_BARRIER_SLOT   nop
>  #elif defined(CONFIG_PPC_FSL_BOOK3E)
> -- 
> 2.26.2
> 


Re: [PATCH kernel] powerpc/kuap: Restore AMR after replaying soft interrupts

2021-02-03 Thread Michal Suchánek
Hello,

On Tue, Feb 02, 2021 at 08:15:41PM +1100, Alexey Kardashevskiy wrote:
> Since de78a9c "powerpc: Add a framework for Kernel Userspace Access
> Protection", user access helpers call user_{read|write}_access_{begin|end}
> when user space access is allowed.
> 
> 890274c "powerpc/64s: Implement KUAP for Radix MMU" made the mentioned
> helpers program a AMR special register to allow such access for a short
> period of time, most of the time AMR is expected to block user memory
> access by the kernel.
> 
> Since the code accesses the user space memory, unsafe_get_user()
> calls might_fault() which calls arch_local_irq_restore() if either
> CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
> arch_local_irq_restore() then attempts to replay pending soft interrupts
> as KUAP regions have hardware interrupts enabled.
> If a pending interrupt happens to do user access (performance interrupts
> do that), it enables access for a short period of time so after returning
> from the replay, the user access state remains blocked and if a user page
> fault happens - "Bug: Read fault blocked by AMR!" appears and SIGSEGV is
> sent.
> 
> This saves/restores AMR when replaying interrupts.
> 
> This adds a check if AMR was not blocked when before replaying interrupts.
> 
> Found by syzkaller. The call stack for the bug is:
> 
> copy_from_user_nofault+0xf8/0x250
> perf_callchain_user_64+0x3d8/0x8d0
> perf_callchain_user+0x38/0x50
> get_perf_callchain+0x28c/0x300
> perf_callchain+0xb0/0x130
> perf_prepare_sample+0x364/0xbf0
> perf_event_output_forward+0xe0/0x280
> __perf_event_overflow+0xa4/0x240
> perf_swevent_hrtimer+0x1d4/0x1f0
> __hrtimer_run_queues+0x328/0x900
> hrtimer_interrupt+0x128/0x350
> timer_interrupt+0x180/0x600
> replay_soft_interrupts+0x21c/0x4f0
> arch_local_irq_restore+0x94/0x150
> lock_is_held_type+0x140/0x200
> ___might_sleep+0x220/0x330
> __might_fault+0x88/0x120
> do_strncpy_from_user+0x108/0x2b0
> strncpy_from_user+0x1d0/0x2a0
> getname_flags+0x88/0x2c0
> do_sys_openat2+0x2d4/0x5f0
> do_sys_open+0xcc/0x140
> system_call_exception+0x160/0x240
> system_call_common+0xf0/0x27c
> 
Can we get a Fixes tag?

Thanks

Michal
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Nicholas Piggin 
> ---
> Changes:
> v3:
> * do not block/unblock if AMR was blocked
> * reverted move of AMR_KUAP_***
> * added pr_warn
> 
> v2:
> * fixed compile on hash
> * moved get/set to arch_local_irq_restore
> * block KUAP before replaying
> 
> ---
> 
> This is an example:
> 
> [ cut here ]
> Bug: Read fault blocked by AMR!
> WARNING: CPU: 0 PID: 1603 at 
> /home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 
> __do_page_fau
> 
> Modules linked in:
> CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
> NIP:  c009ece8 LR: c009ece4 CTR: 
> REGS: cdc63560 TRAP: 0700   Not tainted  
> (5.10.0-rc6_v5.10-rc6_a+fstn1)
> MSR:  80021033   CR: 28002888  XER: 2004
> CFAR: c01fa928 IRQMASK: 1
> GPR00: c009ece4 cdc637f0 c2397600 001f
> GPR04: c20eb318  cdc63494 0027
> GPR08: c0007fe4de68 cdfe9180  0001
> GPR12: 2000 c30a  
> GPR16:    bfff
> GPR20:  c000134a4020 c19c2218 0fe0
> GPR24:   cd106200 4000
> GPR28:  0300 cdc63910 c1946730
> NIP [c009ece8] __do_page_fault+0xb38/0xde0
> LR [c009ece4] __do_page_fault+0xb34/0xde0
> Call Trace:
> [cdc637f0] [c009ece4] __do_page_fault+0xb34/0xde0 (unreliable)
> [cdc638a0] [c000c968] handle_page_fault+0x10/0x2c
> --- interrupt: 300 at strncpy_from_user+0x290/0x440
> LR = strncpy_from_user+0x284/0x440
> [cdc63ba0] [c0c3dcb0] strncpy_from_user+0x2f0/0x440 
> (unreliable)
> [cdc63c30] [c068b888] getname_flags+0x88/0x2c0
> [cdc63c90] [c0662a44] do_sys_openat2+0x2d4/0x5f0
> [cdc63d30] [c066560c] do_sys_open+0xcc/0x140
> [cdc63dc0] [c0045e10] system_call_exception+0x160/0x240
> [cdc63e20] [c000da60] system_call_common+0xf0/0x27c
> Instruction dump:
> 409c0048 3fe2ff5b 3bfff128 fac10060 fae10068 482f7a85 6000 3c62ff5b
> 7fe4fb78 3863f250 4815bbd9 6000 <0fe0> 3c62ff5b 3863f2b8 4815c8b5
> irq event stamp: 254
> hardirqs last  enabled at (253): [] 
> arch_local_irq_restore+0xa0/0x150
> hardirqs last disabled at (254): [] 
> data_access_common_virt+0x1b0/0x1d0
> softirqs last  enabled at (0): [] copy_process+0x78c/0x2120
> softirqs last disabled at (0): [<>] 0x0
> ---[ end trace ba98aec5151f3aeb ]---
> ---
>  arch/powerpc/kernel/irq.c | 27 

Re: [PATCH v3] [PATCH] powerpc/sstep: Check ISA 3.0 instruction validity before emulation

2021-01-20 Thread Michal Suchánek
On Wed, Jan 20, 2021 at 04:43:14PM +0530, Ananth N Mavinakayanahalli wrote:
> We currently unconditionally try to emulate newer instructions on older
> Power versions that could cause issues. Gate it.
Fixes: 350779a29f11 ("powerpc: Handle most loads and stores in instruction 
emulation code")

There are more that would apply but most of the checks land in code
added by the above.

Thanks

Michal
> 
> Signed-off-by: Ananth N Mavinakayanahalli 
> ---
> 
> [v3] Addressed Naveen's comments on scv and addpcis
> [v2] Fixed description
> 
>  arch/powerpc/lib/sstep.c |   46 
> --
>  1 file changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index bf7a7d62ae8b..5a425a4a1d88 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -1304,9 +1304,11 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   if ((word & 0xfe2) == 2)
>   op->type = SYSCALL;
>   else if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) &&
> - (word & 0xfe3) == 1)
> + (word & 0xfe3) == 1) {  /* scv */
>   op->type = SYSCALL_VECTORED_0;
> - else
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
> + } else
>   op->type = UNKNOWN;
>   return 0;
>  #endif
> @@ -1530,6 +1532,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   case 19:
>   if (((word >> 1) & 0x1f) == 2) {
>   /* addpcis */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   imm = (short) (word & 0xffc1);  /* d0 + d2 fields */
>   imm |= (word >> 15) & 0x3e; /* d1 field */
>   op->val = regs->nip + (imm << 16) + 4;
> @@ -2439,6 +2443,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   break;
>  
>   case 268:   /* lxvx */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->type = MKOP(LOAD_VSX, 0, 16);
>   op->element_size = 16;
> @@ -2448,6 +2454,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   case 269:   /* lxvl */
>   case 301: { /* lxvll */
>   int nb;
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->ea = ra ? regs->gpr[ra] : 0;
>   nb = regs->gpr[rb] & 0xff;
> @@ -2475,6 +2483,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   break;
>  
>   case 364:   /* lxvwsx */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->type = MKOP(LOAD_VSX, 0, 4);
>   op->element_size = 4;
> @@ -2482,6 +2492,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   break;
>  
>   case 396:   /* stxvx */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->type = MKOP(STORE_VSX, 0, 16);
>   op->element_size = 16;
> @@ -2491,6 +2503,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   case 397:   /* stxvl */
>   case 429: { /* stxvll */
>   int nb;
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->ea = ra ? regs->gpr[ra] : 0;
>   nb = regs->gpr[rb] & 0xff;
> @@ -2542,6 +2556,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   break;
>  
>   case 781:   /* lxsibzx */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -1;
>   op->reg = rd | ((word & 1) << 5);
>   op->type = MKOP(LOAD_VSX, 0, 1);
>   op->element_size = 8;
> @@ -2549,6 +2565,8 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   break;
>  
>   case 812:   /* lxvh8x */
> +

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2021-01-14 Thread Michal Suchánek
On Mon, Oct 19, 2020 at 02:50:51PM +1000, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
> > Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
> >> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> >>> > Michal Suchánek  writes:
> >>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> >>> > Hello,
> >>> >>> > 
> >>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> >>> > Reimplement book3s idle code in C").
> >>> >>> > 
> >>> >>> > The symptom is host locking up completely after some hours of KVM
> >>> >>> > workload with messages like
> >>> >>> > 
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 71
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 71
> >>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab 
> >>> >>> > cpu 47
> >>> >>> > 
> >>> >>> > printed before the host locks up.
> >>> >>> > 
> >>> >>> > The machines run sandboxed builds which is a mixed workload 
> >>> >>> > resulting in
> >>> >>> > IO/single core/multiple core load over time and there are periods of 
> >>> >>> > no
> >>> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >>> >>> > setup/teardown is somewhat exercised as well.
> >>> >>> > 
> >>> >>> > POWER9 with the new guest entry fast path does not seem to be 
> >>> >>> > affected.
> >>> >>> > 
> >>> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> >>> > after idle") which gives same idle code as 5.1.16 and the kernel 
> >>> >>> > seems
> >>> >>> > stable.
> >>> >>> > 
> >>> >>> > Config is attached.
> >>> >>> > 
> >>> >>> > I cannot easily revert this commit, especially if I want to use the 
> >>> >>> > same
> >>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are 
> >>> >>> > applicable
> >>> >>> > only to the new idle code.
> >>> >>> > 
> >>> >>> > Any idea what can be the problem?
> >>> >>> 
> >>> >>> So hwthread_state is never getting back to to HWTHREAD_IN_IDLE on
> >>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a 
> >>> >>> good
> >>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from 
> >>> >>> the
> >>> >>> BMC unfortunately.
> >>> >>
> >>> >> It may be possible to set up fadump with a later kernel version that
> >>> >> supports it on powernv and dump the whole kernel.
> >>> > 
> >>> > Your firmware won't support it AFAIK.
> >>> > 
> >>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> >>> > good chance it won't work :/
> >>> 
> >>> I haven't had any luck yet reproducing this still. Testing with sub 
> >>> cores of various different combinations, etc. I'll keep trying though.
> >> 
> >> Hello,
> >> 
> >> I tried running some KVM guests to simulate the workload and what I get
> >> is guests failing to start with an RCU stall. Tried both 5.3 and 5.9
> >> kernels and qemu 4.2.1 and 5.1.0
> >> 
> >> To start some guests I run
> >> 
> >> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel 
> >> kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults 
> >> -nographic -serial mon:telnet::444$i,server,wait & done
> >> 
> >> To simulate some workload I run
> >> 
> >> xz -zc9T0 < /dev/zero > /dev/null &
> >> while true; do
> >> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
> >> done &
> >> 
> >> on the host, and add to the ramdisk a job that runs the same thing. However,
> >> most guests never get to the point where the job is executed.
> >> 
> >> Any idea what might be the problem?
> > 
> > I would say try without pv queued spin locks (but if the same thing is 
> > happening with 5.3 then it must be something else I guess). 
> > 
> > I'll try to test a similar setup on a POWER8 here.
> 
> Couldn't reproduce the guest hang, they seem to run fine even with 
> queued spinlocks. Might have a different .config.
> 
> I might have got a lockup in the host (although different symptoms than 
> the original report). I'll look into that a bit further.

Hello,

any progress on this?

I considered reinstating the old assembly code for POWER[78] but even
the way it's called has changed slightly.

Thanks

Michal


Re: [PATCH] ibmvnic: fix: NULL pointer dereference.

2020-12-30 Thread Michal Suchánek
On Wed, Dec 30, 2020 at 03:23:14PM +0800, YANG LI wrote:
> The error is due to dereference a null pointer in function
> reset_one_sub_crq_queue():
> 
> if (!scrq) {
> netdev_dbg(adapter->netdev,
>"Invalid scrq reset. irq (%d) or msgs(%p).\n",
>   scrq->irq, scrq->msgs);
>   return -EINVAL;
> }
> 
> If the expression is true, scrq must be a null pointer and must not
> be dereferenced.
> 
> Signed-off-by: YANG LI 
> Reported-by: Abaci 
Fixes: 9281cf2d5840 ("ibmvnic: avoid memset null scrq msgs")
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index f302504..d7472be 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -2981,9 +2981,7 @@ static int reset_one_sub_crq_queue(struct 
> ibmvnic_adapter *adapter,
>   int rc;
>  
>   if (!scrq) {
> - netdev_dbg(adapter->netdev,
> -"Invalid scrq reset. irq (%d) or msgs (%p).\n",
> -scrq->irq, scrq->msgs);
> + netdev_dbg(adapter->netdev, "Invalid scrq reset.\n");
>   return -EINVAL;
>   }
>  
> -- 
> 1.8.3.1
> 


Re: [RFC PATCH] treewide: remove bzip2 compression support

2020-12-15 Thread Michal Suchánek
Hello,

On Tue, Dec 15, 2020 at 02:03:15PM -0500, Alex Xu (Hello71) wrote:
> bzip2 is either slower or larger than every other supported algorithm,
> according to benchmarks at [0]. It is far slower to decompress than any
> other algorithm, and still larger than lzma, xz, and zstd.
> 
> [0] https://lore.kernel.org/lkml/1588791882.08g1378g67.none@localhost/

Sounds cool. I wonder how many people will complain that their
distribution migrated to bzip2 but got stuck there and now new kernels
won't work there with some odd tool or another :p

> @@ -212,11 +209,6 @@ choice
> Compression speed is only relevant when building a kernel.
> Decompression speed is relevant at each boot.
>  
> -   If you have any problems with bzip2 or lzma compressed
> -   kernels, mail me (Alain Knaff) . (An older
> -   version of this functionality (bzip2 only), for 2.4, was
> -   supplied by Christian Ludwig)
> -
Shouldn't the LZMA part be preserved here?

Thanks

Michal


Re: Kernel panic from malloc() on SUSE 15.1?

2020-11-06 Thread Michal Suchánek
On Mon, Nov 02, 2020 at 12:14:27PM -0800, Carl Jacobsen wrote:
> I've got a SUSE 15.1 install (on ppc64le) that kernel panics on a very
> simple
> test program, built in a slightly unusual way.
> 
> I'm compiling on SUSE 12, using gcc 4.8.3. I'm linking to a static
> copy of libcrypto.a (from openssl-1.1.1g), built without threads.
> I have a 10 line C test program that compiles and runs fine on the
> SUSE 12 system. If I compile the same program on SUSE 15.1 (with
> gcc 7.4.1), it runs fine on SUSE 15.1.
> 
> But, if I run the version that I compiled on SUSE 12, on the SUSE 15.1
> system, the call to RAND_status() gets to a malloc() and then panics.
> (And, of course, if I just compile a call to malloc(), that runs fine
> on both systems.) Here's the test program, it's really just a call to
> RAND_status():
> 
> #include 
> #include 
> 
> int main(int argc, char **argv)
> {
> int has_enough_data = RAND_status();
> printf("The PRNG %s been seeded with enough data\n",
>has_enough_data ? "HAS" : "has NOT");
> return 0;
> }
> 
> openssl is configured/built with:
> ./config no-shared no-dso no-threads -fPIC -ggdb3 -debug -static
> make
> 
> and the test program is compiled with:
> gcc -ggdb3 -o rand_test rand_test.c libcrypto.a
> 
> The kernel on SUSE 12 is: 3.12.28-4-default
> And glibc is: 2.19
> 
> The kernel on SUSE 15.1 is: 4.12.14-197.18-default
> And glibc is: 2.26

SLE 12 SP5 has pretty much the same kernel as SLE 15 SP1 and pretty much
the same compiler as SLE 12 so it might be an interesting data point to
try there.

Also I saw you are using a very old VIOS (which should not make much of a
difference) but did not see what firmware version the machine has.

There have been cases of mysterious crashes solved by updating the
firmware.

Thanks

Michal


Re: [RFC PATCH 0/4] powerpc/papr_scm: Add support for reporting NVDIMM performance statistics

2020-10-21 Thread Michal Suchánek
Hello,

apparently this has not received any (public) comments.

Maybe resend without the RFC status?

Clearly the kernel interface must be defined first, and then ndctl can
follow and make use of it.

Thanks

Michal

On Mon, May 18, 2020 at 04:38:10PM +0530, Vaibhav Jain wrote:
> The patch-set proposes to add support for fetching and reporting
> performance statistics for PAPR compliant NVDIMMs as described in
> documentation for H_SCM_PERFORMANCE_STATS hcall Ref[1]. The patch-set
> also implements mechanisms to expose NVDIMM performance stats via
> sysfs and newly introduced PDSMs[2] for libndctl.
> 
> This patch-set combined with corresponding ndctl and libndctl changes
> proposed at Ref[3] should enable user to fetch PAPR compliant NVDIMMs
> using following command:
> 
>  # ndctl list -D --stats
> [
>   {
> "dev":"nmem0",
> "stats":{
>   "Controller Reset Count":2,
>   "Controller Reset Elapsed Time":603331,
>   "Power-on Seconds":603931,
>   "Life Remaining":"100%",
>   "Critical Resource Utilization":"0%",
>   "Host Load Count":5781028,
>   "Host Store Count":8966800,
>   "Host Load Duration":975895365,
>   "Host Store Duration":716230690,
>   "Media Read Count":0,
>   "Media Write Count":6313,
>   "Media Read Duration":0,
>   "Media Write Duration":9679615,
>   "Cache Read Hit Count":5781028,
>   "Cache Write Hit Count":8442479,
>   "Fast Write Count":8969912
> }
>   }
> ]
> 
> The patchset is dependent on existing patch-set "[PATCH v7 0/5]
> powerpc/papr_scm: Add support for reporting nvdimm health" available
> at Ref[2] that adds support for reporting PAPR compliant NVDIMMs in
> 'papr_scm' kernel module.
> 
> Structure of the patch-set
> ==
> 
> The patch-set starts with implementing functionality in papr_scm
> module to issue H_SCM_PERFORMANCE_STATS hcall, fetch & parse dimm
> performance stats and exposing them as a PAPR specific libnvdimm
> attribute named 'perf_stats'
> 
> Patch-2 introduces a new PDSM named FETCH_PERF_STATS that can be
> issued by libndctl asking papr_scm to issue the
> H_SCM_PERFORMANCE_STATS hcall using helpers introduced earlier and
> storing the results in a dimm specific perf-stats-buffer.
> 
> Patch-3 introduces a new PDSM named READ_PERF_STATS that can be
> issued by libndctl to read the perf-stats-buffer in an incremental
> manner to workaround the 256-bytes envelop limitation of libnvdimm.
> 
> Finally Patch-4 introduces a new PDSM named GET_PERF_STAT that can be
> issued by libndctl to read values of a specific NVDIMM performance
> stat like "Life Remaining".
> 
> References
> ==
> [1] Documentation/powerpc/papr_hcals.rst
> 
> [2] 
> https://lore.kernel.org/linux-nvdimm/20200508104922.72565-1-vaib...@linux.ibm.com/
> 
> [3] https://github.com/vaibhav92/ndctl/tree/papr_scm_stats_v1
> 
> Vaibhav Jain (4):
>   powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
>   powerpc/papr_scm: Add support for PAPR_SCM_PDSM_FETCH_PERF_STATS
>   powerpc/papr_scm: Implement support for PAPR_SCM_PDSM_READ_PERF_STATS
>   powerpc/papr_scm: Add support for PDSM GET_PERF_STAT
> 
>  Documentation/ABI/testing/sysfs-bus-papr-scm  |  27 ++
>  arch/powerpc/include/uapi/asm/papr_scm_pdsm.h |  60 +++
>  arch/powerpc/platforms/pseries/papr_scm.c | 391 ++
>  3 files changed, 478 insertions(+)
> 
> -- 
> 2.26.2
> 


Re: [PATCH v4 2/2] lkdtm/powerpc: Add SLB multihit test

2020-10-19 Thread Michal Suchánek
On Mon, Oct 19, 2020 at 09:59:57PM +1100, Michael Ellerman wrote:
> Hi Ganesh,
> 
> Some comments below ...
> 
> Ganesh Goudar  writes:
> > To check machine check handling, add support to inject slb
> > multihit errors.
> >
> > Cc: Kees Cook 
> > Reviewed-by: Michal Suchánek 
> > Co-developed-by: Mahesh Salgaonkar 
> > Signed-off-by: Mahesh Salgaonkar 
> > Signed-off-by: Ganesh Goudar 
> > ---
> >  drivers/misc/lkdtm/Makefile |   1 +
> >  drivers/misc/lkdtm/core.c   |   3 +
> >  drivers/misc/lkdtm/lkdtm.h  |   3 +
> >  drivers/misc/lkdtm/powerpc.c| 156 
> >  tools/testing/selftests/lkdtm/tests.txt |   1 +
> >  5 files changed, 164 insertions(+)
> >  create mode 100644 drivers/misc/lkdtm/powerpc.c
> >
> ..
> > diff --git a/drivers/misc/lkdtm/powerpc.c b/drivers/misc/lkdtm/powerpc.c
> > new file mode 100644
> > index ..f388b53dccba
> > --- /dev/null
> > +++ b/drivers/misc/lkdtm/powerpc.c
> > @@ -0,0 +1,156 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include "lkdtm.h"
> > +#include 
> > +#include 
> 
> Usual style is to include the linux headers first and then the local header.
> 
> > +
> > +/* Gets index for new slb entry */
> > +static inline unsigned long get_slb_index(void)
> > +{
> > +   unsigned long index;
> > +
> > +   index = get_paca()->stab_rr;
> > +
> > +   /*
> > +* simple round-robin replacement of slb starting at SLB_NUM_BOLTED.
> > +*/
> > +   if (index < (mmu_slb_size - 1))
> > +   index++;
> > +   else
> > +   index = SLB_NUM_BOLTED;
> > +   get_paca()->stab_rr = index;
> > +   return index;
> > +}
> 
> I'm not sure we need that really?
> 
> We can just always insert at SLB_MUM_BOLTED and SLB_NUM_BOLTED + 1.
> 
> Or we could allocate from the top down using mmu_slb_size - 1, and
> mmu_slb_size - 2.
> 
> 
> > +#define slb_esid_mask(ssize)   \
> > +   (((ssize) == MMU_SEGSIZE_256M) ? ESID_MASK : ESID_MASK_1T)
> > +
> > +/* Form the operand for slbmte */
> > +static inline unsigned long mk_esid_data(unsigned long ea, int ssize,
> > +unsigned long slot)
> > +{
> > +   return (ea & slb_esid_mask(ssize)) | SLB_ESID_V | slot;
> > +}
> > +
> > +#define slb_vsid_shift(ssize)  \
> > +   ((ssize) == MMU_SEGSIZE_256M ? SLB_VSID_SHIFT : SLB_VSID_SHIFT_1T)
> > +
> > +/* Form the operand for slbmte */
> > +static inline unsigned long mk_vsid_data(unsigned long ea, int ssize,
> > +unsigned long flags)
> > +{
> > +   return (get_kernel_vsid(ea, ssize) << slb_vsid_shift(ssize)) | flags |
> > +   ((unsigned long)ssize << SLB_VSID_SSIZE_SHIFT);
> > +}
> 
> I realise it's not much code, but I'd rather those were in a header,
> rather than copied from slb.c. That way they can never skew vs the
> versions in slb.c
> 
> Best place I think would be arch/powerpc/include/asm/book3s/64/mmu-hash.h
> 
> 
> > +
> > +/* Inserts new slb entry */
> 
> It inserts two.
> 
> > +static void insert_slb_entry(char *p, int ssize)
> > +{
> > +   unsigned long flags, entry;
> > +
> > +   flags = SLB_VSID_KERNEL | mmu_psize_defs[MMU_PAGE_64K].sllp;
> 
> That won't work if the kernel is built for 4K pages. Or at least it
> won't work the way we want it to.
> 
> You should use mmu_linear_psize.
> 
> But for vmalloc you should use mmu_vmalloc_psize, so it will need to be
> a parameter.
> 
> > +   preempt_disable();
> > +
> > +   entry = get_slb_index();
> > +   asm volatile("slbmte %0,%1" :
> > +   : "r" (mk_vsid_data((unsigned long)p, ssize, flags)),
> > + "r" (mk_esid_data((unsigned long)p, ssize, entry))
> > +   : "memory");
> > +
> > +   entry = get_slb_index();
> > +   asm volatile("slbmte %0,%1" :
> > +   : "r" (mk_vsid_data((unsigned long)p, ssize, flags)),
> > + "r" (mk_esid_data((unsigned long)p, ssize, entry))
> > +   : "memory");
> > +   preempt_enable();
> > +   /*
> > +* This triggers exception, If handled correctly we must recover
> > +* from this error.
> > +*/
> > +   p[0] = '!';
> 
> That doesn

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-16 Thread Michal Suchánek
On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> > Michal Suchánek  writes:
> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> > Hello,
> >>> > 
> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> > Reimplement book3s idle code in C").
> >>> > 
> >>> > The symptom is host locking up completely after some hours of KVM
> >>> > workload with messages like
> >>> > 
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 
> >>> > printed before the host locks up.
> >>> > 
> >>> > The machines run sandboxed builds which is a mixed workload resulting in
> >>> > IO/single core/multiple core load over time and there are periods of no
> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >>> > setup/teardown is somewhat exercised as well.
> >>> > 
> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
> >>> > 
> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> > after idle") which gives the same idle code as 5.1.16 and the kernel seems
> >>> > stable.
> >>> > 
> >>> > Config is attached.
> >>> > 
> >>> > I cannot easily revert this commit, especially if I want to use the same
> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> >>> > only to the new idle code.
> >>> > 
> >>> > Any idea what can be the problem?
> >>> 
> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
> >>> BMC unfortunately.
> >>
> >> It may be possible to set up fadump with a later kernel version that
> >> supports it on powernv and dump the whole kernel.
> > 
> > Your firmware won't support it AFAIK.
> > 
> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> > good chance it won't work :/
> 
> I still haven't had any luck reproducing this. Testing with various
> sub-core combinations, etc. I'll keep trying though.

Hello,

I tried running some KVM guests to simulate the workload, and what I get
is guests failing to start with an rcu stall. I tried both the 5.3 and 5.9
kernels, and qemu 4.2.1 and 5.1.0.

To start some guests I run

for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm \
    -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic \
    -serial mon:telnet::444$i,server,wait & done

To simulate some workload I run

xz -zc9T0 < /dev/zero > /dev/null &
while true; do
killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
done &

on the host, and add a job to the ramdisk that executes this. However, most
guests never get to the point where the job is executed.

Any idea what might be the problem?

In the past I was able to boot guests quite reliably.

This is the boot log of one of the VMs

Trying ::1...
Connected to localhost.
Escape character is '^]'.


SLOF **
QEMU Starting
 Build Date = Jul 17 2020 11:15:24
 FW Version = git-e18ddad8516ff2cf
 Press "s" to enter Open Firmware.

Populating /vdevice methods
Populating /vdevice/vty@7100
Populating /vdevice/nvram@7101
Populating /pci@8002000
No NVRAM common partition, re-initializing...
Scanning USB 
Using default console: /vdevice/vty@7100
Detected RAM kernel at 40 (27c8620 bytes) 
 
  Welcome to Open Firmware

Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

2020-10-15 Thread Michal Suchánek
Hello,

On Thu, Feb 06, 2020 at 12:25:18AM -0300, Leonardo Bras wrote:
> On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
> > gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > -   local_irq_enable();
> > +   end_lockless_pgtbl_walk(IRQS_ENABLED);
> > ret = nr;
> > }
> >  
> 
> Just noticed IRQS_ENABLED is not available on other archs than ppc64.
> I will fix this for v7.
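For reference, a portable shape for that fix (IRQS_ENABLED is ppc64-only)
would be the usual save/restore pattern; the walker helper names here are
hypothetical:

unsigned long irq_mask;

irq_mask = begin_lockless_pgtbl_walk();  /* hypothetical: does local_irq_save() */
gup_pgd_range(addr, end, gup_flags, pages, &nr);
end_lockless_pgtbl_walk(irq_mask);       /* hypothetical: does local_irq_restore() */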

Has there been a v7?

I cannot find it.

Thanks

Michal


Re: [PATCH] powerpc/perf: fix Threshold Event CounterMultiplier width for P10

2020-10-13 Thread Michal Suchánek
On Tue, Oct 13, 2020 at 06:27:05PM +0530, Madhavan Srinivasan wrote:
> 
> On 10/12/20 4:59 PM, Michal Suchánek wrote:
> > Hello,
> > 
> > On Mon, Oct 12, 2020 at 04:01:28PM +0530, Madhavan Srinivasan wrote:
> > > Power9 and isa v3.1 has 7bit mantissa field for Threshold Event Counter
> >^^^ Shouldn't this be 3.0?
> 
> My bad, what I meant was
> 
> Power9, ISA v3.0 and ISA v3.1 define a 7 bit mantissa field for Threshold
> Event Counter Multiplier(TECM).
I am really confused.

The following text and the code suggest that the mantissa is 8 bits on
POWER10 and ISA v3.1.

Thanks

Michal
> 
> Maddy
> 
> > 
> > > Multiplier (TECM). TECM is part of Monitor Mode Control Register A (MMCRA).
> > > This field along with Threshold Event Counter Exponent (TECE) is used to
> > > get the threshold counter value. In Power10, the width of the TECM field is
> > > increased to 8 bits. The patch fixes the current code by modifying the
> > > MMCRA[TECM] extraction macro to handle this change.
> > > 
> > > Fixes: 170a315f41c64 ('powerpc/perf: Support to export MMCRA[TEC*] field to userspace')
> > > Signed-off-by: Madhavan Srinivasan 
> > > ---
> > >   arch/powerpc/perf/isa207-common.c | 3 +++
> > >   arch/powerpc/perf/isa207-common.h | 4 
> > >   2 files changed, 7 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/perf/isa207-common.c 
> > > b/arch/powerpc/perf/isa207-common.c
> > > index 964437adec18..5fe129f02290 100644
> > > --- a/arch/powerpc/perf/isa207-common.c
> > > +++ b/arch/powerpc/perf/isa207-common.c
> > > @@ -247,6 +247,9 @@ void isa207_get_mem_weight(u64 *weight)
> > >   u64 sier = mfspr(SPRN_SIER);
> > >   u64 val = (sier & ISA207_SIER_TYPE_MASK) >> 
> > > ISA207_SIER_TYPE_SHIFT;
> > > + if (cpu_has_feature(CPU_FTR_ARCH_31))
> > > + mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
> > > +
> > >   if (val == 0 || val == 7)
> > >   *weight = 0;
> > >   else
> > > diff --git a/arch/powerpc/perf/isa207-common.h 
> > > b/arch/powerpc/perf/isa207-common.h
> > > index 044de65e96b9..71380e854f48 100644
> > > --- a/arch/powerpc/perf/isa207-common.h
> > > +++ b/arch/powerpc/perf/isa207-common.h
> > > @@ -219,6 +219,10 @@
> > >   #define MMCRA_THR_CTR_EXP(v)(((v) >> 
> > > MMCRA_THR_CTR_EXP_SHIFT) &\
> > >   MMCRA_THR_CTR_EXP_MASK)
> > > +#define P10_MMCRA_THR_CTR_MANT_MASK  0xFFul
> > > +#define P10_MMCRA_THR_CTR_MANT(v)(((v) >> 
> > > MMCRA_THR_CTR_MANT_SHIFT) &\
> > > + P10_MMCRA_THR_CTR_MANT_MASK)
> > > +
> > >   /* MMCRA Threshold Compare bit constant for power9 */
> > >   #define p9_MMCRA_THR_CMP_SHIFT  45
> > > -- 
> > > 2.26.2
> > > 


Re: [PATCH] powerpc/perf: fix Threshold Event CounterMultiplier width for P10

2020-10-12 Thread Michal Suchánek
Hello,

On Mon, Oct 12, 2020 at 04:01:28PM +0530, Madhavan Srinivasan wrote:
> Power9 and isa v3.1 has 7bit mantissa field for Threshold Event Counter
  ^^^ Shouldn't this be 3.0?

> Multiplier (TECM). TECM is part of Monitor Mode Control Register A (MMCRA).
> This field along with Threshold Event Counter Exponent (TECE) is used to
> get the threshold counter value. In Power10, the width of the TECM field is
> increased to 8 bits. The patch fixes the current code by modifying the
> MMCRA[TECM] extraction macro to handle this change.
> 
> Fixes: 170a315f41c64 ('powerpc/perf: Support to export MMCRA[TEC*] field to userspace')
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/perf/isa207-common.c | 3 +++
>  arch/powerpc/perf/isa207-common.h | 4 
>  2 files changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/perf/isa207-common.c 
> b/arch/powerpc/perf/isa207-common.c
> index 964437adec18..5fe129f02290 100644
> --- a/arch/powerpc/perf/isa207-common.c
> +++ b/arch/powerpc/perf/isa207-common.c
> @@ -247,6 +247,9 @@ void isa207_get_mem_weight(u64 *weight)
>   u64 sier = mfspr(SPRN_SIER);
>   u64 val = (sier & ISA207_SIER_TYPE_MASK) >> ISA207_SIER_TYPE_SHIFT;
>  
> + if (cpu_has_feature(CPU_FTR_ARCH_31))
> + mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
> +
>   if (val == 0 || val == 7)
>   *weight = 0;
>   else
> diff --git a/arch/powerpc/perf/isa207-common.h 
> b/arch/powerpc/perf/isa207-common.h
> index 044de65e96b9..71380e854f48 100644
> --- a/arch/powerpc/perf/isa207-common.h
> +++ b/arch/powerpc/perf/isa207-common.h
> @@ -219,6 +219,10 @@
>  #define MMCRA_THR_CTR_EXP(v) (((v) >> MMCRA_THR_CTR_EXP_SHIFT) &\
>   MMCRA_THR_CTR_EXP_MASK)
>  
> +#define P10_MMCRA_THR_CTR_MANT_MASK  0xFFul
> +#define P10_MMCRA_THR_CTR_MANT(v)(((v) >> MMCRA_THR_CTR_MANT_SHIFT) &\
> + P10_MMCRA_THR_CTR_MANT_MASK)
> +
>  /* MMCRA Threshold Compare bit constant for power9 */
>  #define p9_MMCRA_THR_CMP_SHIFT   45
>  
> -- 
> 2.26.2
> 


Re: [PATCH v2 2/3] lkdtm/powerpc: Add SLB multihit test

2020-09-29 Thread Michal Suchánek
Hello,

On Fri, Sep 25, 2020 at 12:57:33PM -0700, Kees Cook wrote:
> On Fri, Sep 25, 2020 at 04:01:22PM +0530, Ganesh Goudar wrote:
> > Add support to inject slb multihit errors, to test machine
> > check handling.
> 
> Thank you for more tests in here!
Thanks for working on integrating this.
> 
> > 
> > Based on work by Mahesh Salgaonkar and Michal Suchánek.
> > 
> > Cc: Mahesh Salgaonkar 
> > Cc: Michal Suchánek 
> 
> Should these be Co-developed-by: with S-o-b?

I don't think I wrote any of this code. I packaged it for SUSE and maybe
changed some constants based on test result discussion.

I compared this code to my saved snapshots of past versions of the test
module and this covers all the test cases I have. The only difference is that
the development modules have verbose prints showing what's going on.

It is true that without the verbose prints some explanatory comments
could be helpful.

Reviewed-by: Michal Suchánek 
> 
> > Signed-off-by: Ganesh Goudar 
> > ---
> >  drivers/misc/lkdtm/Makefile  |   4 ++
> >  drivers/misc/lkdtm/core.c|   3 +
> >  drivers/misc/lkdtm/lkdtm.h   |   3 +
> >  drivers/misc/lkdtm/powerpc.c | 132 +++
> >  4 files changed, 142 insertions(+)
> >  create mode 100644 drivers/misc/lkdtm/powerpc.c
> > 
> > diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
> > index c70b3822013f..6a82f407fbcd 100644
> > --- a/drivers/misc/lkdtm/Makefile
> > +++ b/drivers/misc/lkdtm/Makefile
> > @@ -11,6 +11,10 @@ lkdtm-$(CONFIG_LKDTM)+= usercopy.o
> >  lkdtm-$(CONFIG_LKDTM)  += stackleak.o
> >  lkdtm-$(CONFIG_LKDTM)  += cfi.o
> >  
> > +ifeq ($(CONFIG_PPC64),y)
> > +lkdtm-$(CONFIG_LKDTM)  += powerpc.o
> > +endif
> 
> This can just be:
> 
> lkdtm-$(CONFIG_PPC64) += powerpc.o
> 
> > +
> >  KASAN_SANITIZE_stackleak.o := n
> >  KCOV_INSTRUMENT_rodata.o   := n
> >  
> > diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
> > index a5e344df9166..8d5db42baa90 100644
> > --- a/drivers/misc/lkdtm/core.c
> > +++ b/drivers/misc/lkdtm/core.c
> > @@ -178,6 +178,9 @@ static const struct crashtype crashtypes[] = {
> >  #ifdef CONFIG_X86_32
> > CRASHTYPE(DOUBLE_FAULT),
> >  #endif
> > +#ifdef CONFIG_PPC64
> > +   CRASHTYPE(PPC_SLB_MULTIHIT),
> > +#endif
> >  };
> >  
> >  
> > diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
> > index 8878538b2c13..b305bd511ee5 100644
> > --- a/drivers/misc/lkdtm/lkdtm.h
> > +++ b/drivers/misc/lkdtm/lkdtm.h
> > @@ -104,4 +104,7 @@ void lkdtm_STACKLEAK_ERASING(void);
> >  /* cfi.c */
> >  void lkdtm_CFI_FORWARD_PROTO(void);
> >  
> > +/* powerpc.c */
> > +void lkdtm_PPC_SLB_MULTIHIT(void);
> > +
> >  #endif
> > diff --git a/drivers/misc/lkdtm/powerpc.c b/drivers/misc/lkdtm/powerpc.c
> > new file mode 100644
> > index ..d6db18444757
> > --- /dev/null
> > +++ b/drivers/misc/lkdtm/powerpc.c
> > @@ -0,0 +1,132 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> 
> Please #include "lkdtm.h" here to get the correct pr_fmt heading (and
> any future header adjustments).
> 
> > +#include 
> > +#include 
> > +
> > +static inline unsigned long get_slb_index(void)
> > +{
> > +   unsigned long index;
> > +
> > +   index = get_paca()->stab_rr;
> > +
> > +   /*
> > +* simple round-robin replacement of slb starting at SLB_NUM_BOLTED.
> > +*/
> > +   if (index < (mmu_slb_size - 1))
> > +   index++;
> > +   else
> > +   index = SLB_NUM_BOLTED;
> > +   get_paca()->stab_rr = index;
> > +   return index;
> > +}
> > +
> > +#define slb_esid_mask(ssize)   \
> > +   (((ssize) == MMU_SEGSIZE_256M) ? ESID_MASK : ESID_MASK_1T)
> > +
> > +static inline unsigned long mk_esid_data(unsigned long ea, int ssize,
> > +unsigned long slot)
> > +{
> > +   return (ea & slb_esid_mask(ssize)) | SLB_ESID_V | slot;
> > +}
> > +
> > +#define slb_vsid_shift(ssize)  \
> > +   ((ssize) == MMU_SEGSIZE_256M ? SLB_VSID_SHIFT : SLB_VSID_SHIFT_1T)
> > +
> > +static inline unsigned long mk_vsid_data(unsigned long ea, int ssize,
> > +unsigned long flags)
> > +{
> > +   return (get_kernel_vsid(ea, ssize) << slb_vsid_shift(ssize)) | flags |
> > > +   ((unsigned long)ssize << SLB_VSID_SSIZE_SHIFT);

Re: [PATCH 0/3] powerpc/mce: Fix mce handler and add selftest

2020-09-17 Thread Michal Suchánek
Hello,

On Wed, Sep 16, 2020 at 10:52:25PM +0530, Ganesh Goudar wrote:
> This patch series fixes mce handling for pseries, provides debugfs
> interface for mce injection and adds selftest to test mce handling
> on pseries/powernv machines running in hash mmu mode.
> The debugfs interface and selftest are added only for slb multihit
> injection. We can add other tests in the future if possible.
> 
> Ganesh Goudar (3):
>   powerpc/mce: remove nmi_enter/exit from real mode handler
>   powerpc/mce: Add debugfs interface to inject MCE
>   selftest/powerpc: Add slb multihit selftest

Is the below logic sound? It does not agree with what is added here:

void machine_check_exception(struct pt_regs *regs)
{
int recover = 0;

/*
 * BOOK3S_64 does not call this handler as a non-maskable interrupt
 * (it uses its own early real-mode handler to handle the MCE proper
 * and then raises irq_work to call this handler when interrupts are
 * enabled).
 *
 * This is silly. The BOOK3S_64 should just call a different function
 * rather than expecting semantics to magically change. Something
 * like 'non_nmi_machine_check_exception()', perhaps?
 */
const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64);

if (nmi) nmi_enter();

Thanks

Michal


Re: [PATCH 2/3] powerpc/mce: Add debugfs interface to inject MCE

2020-09-17 Thread Michal Suchánek
Hello,

On Wed, Sep 16, 2020 at 10:52:27PM +0530, Ganesh Goudar wrote:
> To test machine check handling, add debugfs interface to inject
> slb multihit errors.
> 
> To inject slb multihit:
>  #echo 1 > /sys/kernel/debug/powerpc/mce_error_inject/inject_slb_multihit
> 
> Signed-off-by: Ganesh Goudar 
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/Kconfig.debug |   9 ++
>  arch/powerpc/sysdev/Makefile   |   2 +
>  arch/powerpc/sysdev/mce_error_inject.c | 148 +
>  3 files changed, 159 insertions(+)
>  create mode 100644 arch/powerpc/sysdev/mce_error_inject.c
> 
> diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
> index b88900f4832f..61db133f2f0d 100644
> --- a/arch/powerpc/Kconfig.debug
> +++ b/arch/powerpc/Kconfig.debug
> @@ -398,3 +398,12 @@ config KASAN_SHADOW_OFFSET
>   hex
>   depends on KASAN
>   default 0xe000
> +
> +config MCE_ERROR_INJECT
> + bool "Enable MCE error injection through debugfs"
> + depends on DEBUG_FS
> + default y
> + help
> +   This option creates an mce_error_inject directory in the
> +   powerpc debugfs directory that allows limited injection of
> +   Machine Check Errors (MCEs).
> diff --git a/arch/powerpc/sysdev/Makefile b/arch/powerpc/sysdev/Makefile
> index 026b3f01a991..7fc10b77 100644
> --- a/arch/powerpc/sysdev/Makefile
> +++ b/arch/powerpc/sysdev/Makefile
> @@ -52,3 +52,5 @@ obj-$(CONFIG_PPC_XICS)  += xics/
>  obj-$(CONFIG_PPC_XIVE)   += xive/
>  
>  obj-$(CONFIG_GE_FPGA)+= ge/
> +
> +obj-$(CONFIG_MCE_ERROR_INJECT)   += mce_error_inject.o
> diff --git a/arch/powerpc/sysdev/mce_error_inject.c 
> b/arch/powerpc/sysdev/mce_error_inject.c
> new file mode 100644
> index ..ca4726bfa2d9
> --- /dev/null
> +++ b/arch/powerpc/sysdev/mce_error_inject.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Machine Check Exception injection code
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static inline unsigned long get_slb_index(void)
> +{
> + unsigned long index;
> +
> + index = get_paca()->stab_rr;
> +
> + /*
> +  * simple round-robin replacement of slb starting at SLB_NUM_BOLTED.
> +  */
> + if (index < (mmu_slb_size - 1))
> + index++;
> + else
> + index = SLB_NUM_BOLTED;
> + get_paca()->stab_rr = index;
> + return index;
> +}
> +
> +#define slb_esid_mask(ssize) \
> + (((ssize) == MMU_SEGSIZE_256M) ? ESID_MASK : ESID_MASK_1T)
> +
> +static inline unsigned long mk_esid_data(unsigned long ea, int ssize,
> +  unsigned long slot)
> +{
> + return (ea & slb_esid_mask(ssize)) | SLB_ESID_V | slot;
> +}
> +
> +#define slb_vsid_shift(ssize)\
> + ((ssize) == MMU_SEGSIZE_256M ? SLB_VSID_SHIFT : SLB_VSID_SHIFT_1T)
> +
> +static inline unsigned long mk_vsid_data(unsigned long ea, int ssize,
> +  unsigned long flags)
> +{
> + return (get_kernel_vsid(ea, ssize) << slb_vsid_shift(ssize)) | flags |
> + ((unsigned long)ssize << SLB_VSID_SSIZE_SHIFT);
> +}
> +
> +static void insert_slb_entry(char *p, int ssize)
> +{
> + unsigned long flags, entry;
> + struct paca_struct *paca;
> +
> + flags = SLB_VSID_KERNEL | mmu_psize_defs[MMU_PAGE_64K].sllp;
> +
> + preempt_disable();
> +
> + paca = get_paca();
This seems unused?
> +
> + entry = get_slb_index();
> + asm volatile("slbmte %0,%1" :
> + : "r" (mk_vsid_data((unsigned long)p, ssize, flags)),
> +   "r" (mk_esid_data((unsigned long)p, ssize, entry))
> + : "memory");
> +
> + entry = get_slb_index();
> + asm volatile("slbmte %0,%1" :
> + : "r" (mk_vsid_data((unsigned long)p, ssize, flags)),
> +   "r" (mk_esid_data((unsigned long)p, ssize, entry))
> + : "memory");
> + preempt_enable();
> + p[0] = '!';
> +}
> +
> +static void inject_vmalloc_slb_multihit(void)
> +{
> + char *p;
> +
> + p = vmalloc(2048);
> + if (!p)
> + return;
> +
> + insert_slb_entry(p, MMU_SEGSIZE_1T);
> + vfree(p);
> +}
> +
> +static void inject_kmalloc_slb_multihit(void)
> +{
> + char *p;
> +
> + p = kmalloc(2048, GFP_KERNEL);
> + if (!p)
> + return;
> +
> + insert_slb_entry(p, MMU_SEGSIZE_1T);
> + kfree(p);
> +}
> +
> +static ssize_t inject_slb_multihit(const char __user *u_buf, size_t count)
> +{
> + char buf[32];
> + size_t buf_size;
> +
> + buf_size = min(count, (sizeof(buf) - 1));
> + if (copy_from_user(buf, u_buf, buf_size))
> + return -EFAULT;
> + buf[buf_size] = '\0';
> +
> + if (buf[0] != '1')
> + return -EINVAL;
> +
> + inject_vmalloc_slb_multihit();
> + inject_kmalloc_slb_multihit();
This is mis

Re: [PATCH 1/3] powerpc/mce: remove nmi_enter/exit from real mode handler

2020-09-17 Thread Michal Suchánek
Hello,

On Wed, Sep 16, 2020 at 10:52:26PM +0530, Ganesh Goudar wrote:
> Use of nmi_enter/exit in the real mode handler causes the kernel to panic
> and reboot on injecting an SLB multihit on a pseries machine running in
> hash MMU mode, as these calls try to access memory outside the RMO region
> in the real mode handler, where translation is disabled.
> 
> Add a check to not use these calls on pseries machines running in hash
> MMU mode.
> 
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI 
> accounting")
> Signed-off-by: Ganesh Goudar 
> ---
>  arch/powerpc/kernel/mce.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index ada59f6c4298..1d42fe0f5f9c 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -591,10 +591,15 @@ EXPORT_SYMBOL_GPL(machine_check_print_event_info);
>  long notrace machine_check_early(struct pt_regs *regs)
>  {
>   long handled = 0;
> - bool nested = in_nmi();
> + bool nested;
> + bool is_pseries_hpt_guest;
>   u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
>  
>   this_cpu_set_ftrace_enabled(0);
> + is_pseries_hpt_guest = machine_is(pseries) &&
> +mmu_has_feature(MMU_FTR_HPTE_TABLE);
> + /* Do not use nmi_enter/exit for pseries hpte guest */
> + nested = is_pseries_hpt_guest ? true : in_nmi();
As pointed out already in another comment, nesting is supported natively
since 69ea03b56ed2c7189ccd0b5910ad39f3cad1df21. You can simply do
nmi_enter and nmi_exit unconditionally - or only based on
is_pseries_hpt_guest.

The other question is what is the value of calling nmi_enter here at
all. It crashes in one case, we simply skip it for that case, and we are
good. Maybe we could skip it altogether?
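A sketch of the conditional variant (hypothetical, untested):

/* pseries HPT guests must not run the NMI accounting here: it touches
 * memory outside the RMO region while translation is off.
 */
if (!is_pseries_hpt_guest)
        nmi_enter();

/* ... existing machine check handling ... */

if (!is_pseries_hpt_guest)
        nmi_exit();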

Thanks

Michal


Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"

2020-09-16 Thread Michal Suchánek
On Tue, Sep 15, 2020 at 08:16:42PM +0200, pet...@infradead.org wrote:
> On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> > This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> > 
> > This commit causes the kernel to oops and reboot when injecting an SLB
> > multihit, which causes an MCE.
> > 
> > Before this commit a SLB multihit was corrected by the kernel and the
> > system continued to operate normally.
> > 
> > cc: sta...@vger.kernel.org
> > Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI 
> > accounting")
> > Signed-off-by: Michal Suchanek 
> 
> Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
> nmi_enter() supports nesting natively.

And this patch was merged in parallel with this native nesting support
and conflicted with it - hence the explicit nesting in the hunk that did
not conflict.

Either way the bug is present on kernels both with and without
69ea03b56ed2. So besides the conflict 69ea03b56ed2 does not affect this
problem.

Thanks

Michal


Injecting SLB miltihit crashes kernel 5.9.0-rc5

2020-09-15 Thread Michal Suchánek
Hello,

Using the SLB multihit injection test module (which I did not write, so I
do not want to post it here) to verify updates on my 5.3 frankenkernel,
I found that the kernel crashes with Oops: kernel bad access.

I tested on the latest upstream kernel build that I have at hand and the
result is the same (minus the message - nothing was logged and the kernel
simply rebooted).

Since the whole effort to write a real mode MCE handler was supposed to
prevent this, maybe the SLB injection module should be added to the
kernel selftests?

Thanks

Michal


Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2020-09-14 Thread Michal Suchánek
On Fri, Sep 11, 2020 at 11:23:57PM +1000, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > Hello,
> >
> > does this logic apply to "Unrecoverable System Reset" as well?
> 
> Which logic do you mean?
> 
> We do call die() before checking MSR_RI in system_reset_exception():
> 
>   /*
>* No debugger or crash dump registered, print logs then
>* panic.
>*/
>   die("System Reset", regs, SIGABRT);
>   
>   mdelay(2*MSEC_PER_SEC); /* Wait a little while for others to print */
>   add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
>   nmi_panic(regs, "System Reset");
>   
>   out:
>   #ifdef CONFIG_PPC_BOOK3S_64
>   BUG_ON(get_paca()->in_nmi == 0);
>   if (get_paca()->in_nmi > 1)
>   die("Unrecoverable nested System Reset", regs, SIGABRT);
>   #endif
>   /* Must die if the interrupt is not recoverable */
>   if (!(regs->msr & MSR_RI))
>   die("Unrecoverable System Reset", regs, SIGABRT);
> 
> 
> So you should see the output from die("System Reset", ...) even if
> MSR[RI] was clear when you took the system reset.

Indeed, I replied to the wrong patch. I was looking at daf00ae71dad
("powerpc/traps: restore recoverability of machine_check interrupts")
which has a very similar commit message.

Sorry about the confusion.

Thanks

Michal

> 
> cheers
> 
> > On Tue, Jan 22, 2019 at 02:11:24PM +, Christophe Leroy wrote:
> >> Looks like book3s/32 doesn't set RI on machine check, so
> >> checking RI before calling die() will always be fatal
> >> although this is not an issue in most cases.
> >> 
> >> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
> >> interrupt")
> >> Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of 
> >> machine_check interrupts")
> >> Signed-off-by: Christophe Leroy 
> >> Cc: sta...@vger.kernel.org
> >> ---
> >>  arch/powerpc/kernel/traps.c | 8 
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >> index 64936b60d521..c740f8bfccc9 100644
> >> --- a/arch/powerpc/kernel/traps.c
> >> +++ b/arch/powerpc/kernel/traps.c
> >> @@ -763,15 +763,15 @@ void machine_check_exception(struct pt_regs *regs)
> >>if (check_io_access(regs))
> >>goto bail;
> >>  
> >> -  /* Must die if the interrupt is not recoverable */
> >> -  if (!(regs->msr & MSR_RI))
> >> -  nmi_panic(regs, "Unrecoverable Machine check");
> >> -
> >>if (!nested)
> >>nmi_exit();
> >>  
> >>die("Machine check", regs, SIGBUS);
> >>  
> >> +  /* Must die if the interrupt is not recoverable */
> >> +  if (!(regs->msr & MSR_RI))
> >> +  nmi_panic(regs, "Unrecoverable Machine check");
> >> +
> >>return;
> >>  
> >>  bail:
> >> -- 
> >> 2.13.3
> >> 


Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2020-09-11 Thread Michal Suchánek
Hello,

does this logic apply to "Unrecoverable System Reset" as well?

Thanks

Michal

On Tue, Jan 22, 2019 at 02:11:24PM +, Christophe Leroy wrote:
> Looks like book3s/32 doesn't set RI on machine check, so
> checking RI before calling die() will always be fatal
> although this is not an issue in most cases.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
> interrupt")
> Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check 
> interrupts")
> Signed-off-by: Christophe Leroy 
> Cc: sta...@vger.kernel.org
> ---
>  arch/powerpc/kernel/traps.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 64936b60d521..c740f8bfccc9 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -763,15 +763,15 @@ void machine_check_exception(struct pt_regs *regs)
>   if (check_io_access(regs))
>   goto bail;
>  
> - /* Must die if the interrupt is not recoverable */
> - if (!(regs->msr & MSR_RI))
> - nmi_panic(regs, "Unrecoverable Machine check");
> -
>   if (!nested)
>   nmi_exit();
>  
>   die("Machine check", regs, SIGBUS);
>  
> + /* Must die if the interrupt is not recoverable */
> + if (!(regs->msr & MSR_RI))
> + nmi_panic(regs, "Unrecoverable Machine check");
> +
>   return;
>  
>  bail:
> -- 
> 2.13.3
> 


Re: [PATCH] KVM: PPC: Book3S HV: Do not allocate HPT for a nested guest

2020-09-11 Thread Michal Suchánek
On Fri, Sep 11, 2020 at 10:01:33AM +0200, Greg Kurz wrote:
> On Fri, 11 Sep 2020 09:45:36 +0200
> Greg Kurz  wrote:
> 
> > On Fri, 11 Sep 2020 01:16:07 -0300
> > Fabiano Rosas  wrote:
> > 
> > > The current nested KVM code does not support HPT guests. This is
> > > informed/enforced in some ways:
> > > 
> > > - Hosts < P9 will not be able to enable the nested HV feature;
> > > 
> > > - The nested hypervisor MMU capabilities will not contain
> > >   KVM_CAP_PPC_MMU_HASH_V3;
> > > 
> > > - QEMU reflects the MMU capabilities in the
> > >   'ibm,arch-vec-5-platform-support' device-tree property;
> > > 
> > > - The nested guest, at 'prom_parse_mmu_model' ignores the
> > >   'disable_radix' kernel command line option if HPT is not supported;
> > > 
> > > - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.
> > > 
> > > There is, however, still a way to start an HPT guest by using
> > > max-cpu-compat=power8 in the QEMU machine options. This leads to the
> > > guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
> > > ioctl.
> > > 
> > > With the guest set to hash, the nested hypervisor goes through the
> > > entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
> > > crashes when it tries to execute an hypervisor-privileged (mtspr
> > > HDEC) instruction at __kvmppc_vcore_entry:
> > > 
> > > root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...
> > > 
> > > 
> > > [  538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
> > > [  538.543355] NIP:  c0080753f388 LR: c0080753f368 CTR: 
> > > c01e5ec0
> > > [  538.543417] REGS: c013e91e33b0 TRAP: 0700   Not tainted  
> > > (5.9.0-rc4)
> > > [  538.543470] MSR:  82843033   CR: 
> > > 22422882  XER: 2004
> > > [  538.543546] CFAR: c0080753f4b0 IRQMASK: 3
> > >GPR00: c008075397a0 c013e91e3640 c0080755e600 
> > > 8000
> > >GPR04:  c013eab19800 c01394de 
> > > 0043a054db72
> > >GPR08: 003b1652   
> > > c008075502e0
> > >GPR12: c01e5ec0 c007ffa74200 c013eab19800 
> > > 0008
> > >GPR16:  c0139676c6c0 c1d23948 
> > > c013e91e38b8
> > >GPR20: 0053  0001 
> > > 
> > >GPR24: 0001 0001  
> > > 0001
> > >GPR28: 0001 0053 c013eab19800 
> > > 0001
> > > [  538.544067] NIP [c0080753f388] __kvmppc_vcore_entry+0x90/0x104 
> > > [kvm_hv]
> > > [  538.544121] LR [c0080753f368] __kvmppc_vcore_entry+0x70/0x104 
> > > [kvm_hv]
> > > [  538.544173] Call Trace:
> > > [  538.544196] [c013e91e3640] [c013e91e3680] 0xc013e91e3680 
> > > (unreliable)
> > > [  538.544260] [c013e91e3820] [c008075397a0] 
> > > kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
> > > [  538.544325] [c013e91e39e0] [c0080753d99c] 
> > > kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
> > > [  538.544394] [c013e91e3ad0] [c008072da4fc] 
> > > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > > [  538.544472] [c013e91e3af0] [c008072d61b8] 
> > > kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
> > > [  538.544539] [c013e91e3b80] [c008072c7450] 
> > > kvm_vcpu_ioctl+0x298/0x778 [kvm]
> > > [  538.544605] [c013e91e3ce0] [c04b8c2c] sys_ioctl+0x1dc/0xc90
> > > [  538.544662] [c013e91e3dc0] [c002f9a4] 
> > > system_call_exception+0xe4/0x1c0
> > > [  538.544726] [c013e91e3e20] [c000d140] 
> > > system_call_common+0xf0/0x27c
> > > [  538.544787] Instruction dump:
> > > [  538.544821] f86d1098 6000 6000 4899 e8ad0fe8 e8c500a0 
> > > e9264140 75290002
> > > [  538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 <7d164ba6> 7d083a14 
> > > f90d10a0 480104fd
> > > [  538.544953] ---[ end trace 74423e2b948c2e0c ]---
> > > 
> > > This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
> > > the nested hypervisor, causing QEMU to abort.
> > > 
> > > Reported-by: Satheesh Rajendran 
> > > Signed-off-by: Fabiano Rosas 
> > > ---
> > 
> > LGTM
> > 
> > Reviewed-by: Greg Kurz 
> > 
> > >  arch/powerpc/kvm/book3s_hv.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index 4ba06a2a306c..764b6239ef72 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -5250,6 +5250,12 @@ static long kvm_arch_vm_ioctl_hv(struct file *filp,
> > >   case KVM_PPC_ALLOCATE_HTAB: {
> > >   u32 htab_order;
> > >  
> > > + /* If we're a nested hypervisor, we currently only support 
> > > radix */
> > > + if (kvmhv_on_pseries()) {
> > > + r = -EOPNOTSUPP;
> 
> According to PO

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-09-07 Thread Michal Suchánek
On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> > Michal Suchánek  writes:
> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> >>> > Hello,
> >>> > 
> >>> > on POWER8, KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> >>> > Reimplement book3s idle code in C").
> >>> > 
> >>> > The symptom is host locking up completely after some hours of KVM
> >>> > workload with messages like
> >>> > 
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> >>> > 
> >>> > printed before the host locks up.
> >>> > 
> >>> > The machines run sandboxed builds which is a mixed workload resulting in
> >>> > IO/single core/multiple core load over time and there are periods of no
> >>> > activity and no VMs running as well. The VMs are short-lived so VM
> >>> > setup/teardown is somewhat exercised as well.
> >>> > 
> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
> >>> > 
> >>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> >>> > after idle") which gives the same idle code as 5.1.16 and the kernel seems
> >>> > stable.
> >>> > 
> >>> > Config is attached.
> >>> > 
> >>> > I cannot easily revert this commit, especially if I want to use the same
> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> >>> > only to the new idle code.
> >>> > 
> >>> > Any idea what can be the problem?
> >>> 
> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
> >>> BMC unfortunately.
> >>
> >> It may be possible to set up fadump with a later kernel version that
> >> supports it on powernv and dump the whole kernel.
> > 
> > Your firmware won't support it AFAIK.
> > 
> > You could try kdump, but if we have CPUs stuck in KVM then there's a
> > good chance it won't work :/
> 
> I still haven't had any luck reproducing this. Testing with various
> sub-core combinations, etc. I'll keep trying though.
> 
> I don't know if there's much we can add to debug it. Can we run pdbg
> on the BMCs on these things?

I suppose it depends on the machine type?

Thanks

Michal


Re: [PATCH v11] Fixup for "powerpc/vdso: Provide __kernel_clock_gettime64() on vdso32"

2020-09-01 Thread Michal Suchánek
Hello,

can you add a Fixes: tag?
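For reference, the conventional form is the abbreviated 12-character SHA plus
the subject of the commit being fixed, e.g. (hash elided here, as I don't have
it at hand):

    Fixes: <12-char sha> ("powerpc/vdso: Provide __kernel_clock_gettime64() on vdso32")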

Thanks

Michal

On Tue, Sep 01, 2020 at 05:28:57AM +, Christophe Leroy wrote:
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/vdso/gettimeofday.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
> b/arch/powerpc/include/asm/vdso/gettimeofday.h
> index 59a609a48b63..8da84722729b 100644
> --- a/arch/powerpc/include/asm/vdso/gettimeofday.h
> +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
> @@ -186,6 +186,8 @@ int __c_kernel_clock_getres(clockid_t clock_id, struct 
> __kernel_timespec *res,
>  #else
>  int __c_kernel_clock_gettime(clockid_t clock, struct old_timespec32 *ts,
>const struct vdso_data *vd);
> +int __c_kernel_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts,
> +const struct vdso_data *vd);
>  int __c_kernel_clock_getres(clockid_t clock_id, struct old_timespec32 *res,
>   const struct vdso_data *vd);
>  #endif
> -- 
> 2.25.0
> 


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8, KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds which is a mixed workload resulting in
> > IO/single core/multiple core load over time and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives the same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately.
It may be possible to set up fadump with a later kernel version that
supports it on powernv and dump the whole kernel.
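For reference, that would mean booting the host with firmware-assisted dump
enabled on the kernel command line, e.g. (the reservation size is only a
placeholder):

    fadump=on crashkernel=4G

and reading the dump from /proc/vmcore after the next lockup.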

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8, KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds which is a mixed workload resulting in
> > IO/single core/multiple core load over time and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives the same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately. Do the messages always come in pairs of CPUs?
> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?

The guests are spawned on demand - there are like 20-30 'slots'
configured where a VM may be running or it may be idle with no VM
spawned when there are no jobs available.

Thanks

Michal


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-08-07 Thread Michal Suchánek
On Fri, Aug 07, 2020 at 08:58:09AM +0200, David Hildenbrand wrote:
> On 07.08.20 06:32, Andrew Morton wrote:
> > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju 
> >  wrote:
> > 
> >>> The memory hotplug changes that somehow because you can hotremove numa
> >>> nodes and therefore make the nodemask sparse but that is not a common
> >>> case. I am not sure what would happen if a completely new node was added
> >>> and its corresponding node was already used by the renumbered one
> >>> though. It would likely conflate the two I am afraid. But I am not sure
> >>> this is really possible with x86 and a lack of a bug report would
> >>> suggest that nobody is doing that at least.
> >>>
> >>
> >> JFYI,
> >> Satheesh, copied in this mail chain, had opened a bug a year ago on a crash
> >> with vcpu hotplug on a memoryless node.
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> > 
> > So...  do we merge this patch or not?  Seems that the overall view is
> > "risky but nobody is likely to do anything better any time soon"?
> 
> I recall the issue Michal saw was "fix powerpc" vs. "break other
> architectures". @Michal how should we proceed? At least x86-64 won't be
> affected IIUC.
There is a patch to introduce the node remapping on ppc as well which
should eliminate the empty node 0.

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/2020073916.243569-1-aneesh.ku...@linux.ibm.com/

Thanks

Michal


Re: [PATCH v3 4/6] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-23 Thread Michal Suchánek
On Mon, Jul 06, 2020 at 02:35:38PM +1000, Nicholas Piggin wrote:
> These have shown significantly improved performance and fairness when
> spinlock contention is moderate to high on very large systems.
> 
>  [ Numbers hopefully forthcoming after more testing, but initial
>results look good ]
> 
> Thanks to the fast path, single threaded performance is not noticably
> hurt.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/Kconfig  | 13 
>  arch/powerpc/include/asm/Kbuild   |  2 ++
>  arch/powerpc/include/asm/qspinlock.h  | 25 +++
>  arch/powerpc/include/asm/spinlock.h   |  5 +
>  arch/powerpc/include/asm/spinlock_types.h |  5 +
>  arch/powerpc/lib/Makefile |  3 +++
>  include/asm-generic/qspinlock.h   |  2 ++
>  7 files changed, 55 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/qspinlock.h
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 24ac85c868db..17663ea57697 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -146,6 +146,8 @@ config PPC
>   select ARCH_SUPPORTS_ATOMIC_RMW
>   select ARCH_USE_BUILTIN_BSWAP
>   select ARCH_USE_CMPXCHG_LOCKREF if PPC64
> + select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
> + select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
>   select ARCH_WANT_IPC_PARSE_VERSION
>   select ARCH_WEAK_RELEASE_ACQUIRE
>   select BINFMT_ELF
> @@ -492,6 +494,17 @@ config HOTPLUG_CPU
>  
> Say N if you are unsure.
>  
> +config PPC_QUEUED_SPINLOCKS
> + bool "Queued spinlocks"
> + depends on SMP
> + default "y" if PPC_BOOK3S_64
> + help
> +   Say Y here to use queued spinlocks, which are more complex
> +   but give better salability and fairness on large SMP and NUMA
   ^ +c?
Thanks

Michal
> +   systems.
> +
> +   If unsure, say "Y" if you have lots of cores, otherwise "N".
> +
>  config ARCH_CPU_PROBE_RELEASE
>   def_bool y
>   depends on HOTPLUG_CPU
> diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
> index dadbcf3a0b1e..1dd8b6adff5e 100644
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -6,5 +6,7 @@ generated-y += syscall_table_spu.h
>  generic-y += export.h
>  generic-y += local64.h
>  generic-y += mcs_spinlock.h
> +generic-y += qrwlock.h
> +generic-y += qspinlock.h
>  generic-y += vtime.h
>  generic-y += early_ioremap.h
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> new file mode 100644
> index ..c49e33e24edd
> --- /dev/null
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_QSPINLOCK_H
> +#define _ASM_POWERPC_QSPINLOCK_H
> +
> +#include 
> +
> +#define _Q_PENDING_LOOPS (1 << 9) /* not tuned */
> +
> +#define smp_mb__after_spinlock()   smp_mb()
> +
> +static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
> +{
> + /*
> +  * This barrier was added to simple spinlocks by commit 51d7d5205d338,
> +  * but it should now be possible to remove it, as arm64 has done with
> +  * commit c6f5d02b6a0f.
> +  */
> + smp_mb();
> + return atomic_read(&lock->val);
> +}
> +#define queued_spin_is_locked queued_spin_is_locked
> +
> +#include 
> +
> +#endif /* _ASM_POWERPC_QSPINLOCK_H */
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 21357fe05fe0..434615f1d761 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -3,7 +3,12 @@
>  #define __ASM_SPINLOCK_H
>  #ifdef __KERNEL__
>  
> +#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
> +#include 
> +#include 
> +#else
>  #include 
> +#endif
>  
>  #endif /* __KERNEL__ */
>  #endif /* __ASM_SPINLOCK_H */
> diff --git a/arch/powerpc/include/asm/spinlock_types.h 
> b/arch/powerpc/include/asm/spinlock_types.h
> index 3906f52dae65..c5d742f18021 100644
> --- a/arch/powerpc/include/asm/spinlock_types.h
> +++ b/arch/powerpc/include/asm/spinlock_types.h
> @@ -6,6 +6,11 @@
>  # error "please don't include this file directly"
>  #endif
>  
> +#ifdef CONFIG_PPC_QUEUED_SPINLOCKS
> +#include 
> +#include 
> +#else
>  #include 
> +#endif
>  
>  #endif
> diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
> index 5e994cda8e40..d66a645503eb 100644
> --- a/arch/powerpc/lib/Makefile
> +++ b/arch/powerpc/lib/Makefile
> @@ -41,7 +41,10 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o 
> copypage_power7.o \
>  obj64-y  += copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
>  memcpy_64.o memcpy_mcsafe_64.o
>  
> +ifndef CONFIG_PPC_QUEUED_SPINLOCKS
>  obj64-$(CONFIG_SMP)  += locks.o
> +endif
> +
>  obj64-$(CONFIG_ALTIVEC)  += vmx-helper.o
>  obj64-$(CONFIG_KPROBES_SANITY_TEST)  += test_emulate_step.o \
>

Re: [PATCH] powerpc/fault: kernel can extend a user process's stack

2020-07-20 Thread Michal Suchánek
Hello,

On Wed, Dec 11, 2019 at 08:37:21PM +1100, Daniel Axtens wrote:
> > Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.")
> 
> Wow, that's pretty ancient! I'm also not sure it's right - in that same
> patch, arch/ppc64/mm/fault.c contains:
> 
> ^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 213)   if (address + 2048 < uregs->gpr[1]
> ^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 214)   && (!user_mode(regs) || !store_updates_sp(regs)))
> ^1da177e4c3f4 (Linus Torvalds 2005-04-16 15:20:36 -0700 215)   goto bad_area;
> 
> Which is the same as the new arch/powerpc/mm/fault.c code:
> 
> 14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 234)   if (address + 2048 < uregs->gpr[1]
> 14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 235)   && (!user_mode(regs) || !store_updates_sp(regs)))
> 14cf11af6cf60 (Paul Mackerras 2005-09-26 16:04:21 +1000 236)   goto bad_area;
> 
> So either they're both right or they're both wrong, either way I'm not
> sure how this patch is to blame.

Is there any progress on resolving this?

I did not notice any follow-up patch, nor this one being merged or refuted.

Thanks

Michal

> 
> I guess we should also cc stable@...
> 
> Regards,
> Daniel
> 
> >> Reported-by: Tom Lane 
> >> Cc: Daniel Black 
> >> Signed-off-by: Daniel Axtens 
> >> ---
> >>  arch/powerpc/mm/fault.c | 10 ++
> >>  1 file changed, 10 insertions(+)
> >> 
> >> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> >> index b5047f9b5dec..00183731ea22 100644
> >> --- a/arch/powerpc/mm/fault.c
> >> +++ b/arch/powerpc/mm/fault.c
> >> @@ -287,7 +287,17 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
> >> unsigned long address,
> >>if (!res)
> >>return !store_updates_sp(inst);
> >>*must_retry = true;
> >> +  } else if ((flags & FAULT_FLAG_WRITE) &&
> >> + !(flags & FAULT_FLAG_USER)) {
> >> +  /*
> >> +   * the kernel can also attempt to write beyond the end
> >> +   * of a process's stack - for example setting up a
> >> +   * signal frame. We assume this is valid, subject to
> >> +   * the checks in expand_stack() later.
> >> +   */
> >> +  return false;
> >>}
> >> +
> >>return true;
> >>}
> >>return false;
> >> -- 
> >> 2.20.1
> >> 


Re: [PATCH v3] powerpc/pseries: detect secure and trusted boot state of the system.

2020-07-17 Thread Michal Suchánek
On Fri, Jul 17, 2020 at 03:58:01PM +1000, Daniel Axtens wrote:
> Michal Suchánek  writes:
> 
> > On Wed, Jul 15, 2020 at 07:52:01AM -0400, Nayna Jain wrote:
> >> The device-tree property to check secure and trusted boot state is
> >> different for guests(pseries) compared to baremetal(powernv).
> >> 
> >> This patch updates the existing is_ppc_secureboot_enabled() and
> >> is_ppc_trustedboot_enabled() functions to add support for pseries.
> >> 
> >> The secureboot and trustedboot state are exposed via device-tree property:
> >> /proc/device-tree/ibm,secure-boot and /proc/device-tree/ibm,trusted-boot
> >> 
> >> The values of ibm,secure-boot under pseries are interpreted as:
> >   ^^^
> >> 
> >> 0 - Disabled
> >> 1 - Enabled in Log-only mode. This patch interprets this value as
> >> disabled, since audit mode is currently not supported for Linux.
> >> 2 - Enabled and enforced.
> >> 3-9 - Enabled and enforcing; requirements are at the discretion of the
> >> operating system.
> >> 
> >> The values of ibm,trusted-boot under pseries are interpreted as:
> >^^^
> > These two should be different I suppose?
> 
> I'm not quite sure what you mean? They'll be documented in a future
> revision of the PAPR, once I get my act together and submit the
> relevant internal paperwork.

Never mind, one talks about secure boot, the other about trusted boot.

Thanks

Michal


Re: [PATCH net-next] ibmvnic: Increase driver logging

2020-07-16 Thread Michal Suchánek
On Thu, Jul 16, 2020 at 10:59:58AM -0500, Thomas Falcon wrote:
> 
> On 7/15/20 8:29 PM, David Miller wrote:
> > From: Jakub Kicinski 
> > Date: Wed, 15 Jul 2020 17:06:32 -0700
> > 
> > > On Wed, 15 Jul 2020 18:51:55 -0500 Thomas Falcon wrote:
> > > > free_netdev(netdev);
> > > > dev_set_drvdata(&dev->dev, NULL);
> > > > +   netdev_info(netdev, "VNIC client device has been successfully removed.\n");
> > > A step too far, perhaps.
> > > 
> > > In general this patch looks a little questionable IMHO, this amount of
> > > logging output is not commonly seen in drivers. All the info
> > > messages are just static text, not even carrying any extra information.
> > > In an era of ftrace, and bpftrace, do we really need this?
> > Agreed, this is too much.  This is debugging, and thus suitable for tracing
> > facilities, at best.
> 
> Thanks for your feedback. I see now that I was overly aggressive with this
> patch to be sure, but it would help with narrowing down problems at a first
> glance, should they arise. The driver in its current state logs very little
> of what it is doing without the use of additional debugging or tracing
> facilities. Would it be worth it to pursue a less aggressive version or
> would that be dead on arrival? What are acceptable driver operations to log
> at this level?

Also, would it be advisable to add the messages as pr_debug() so they can be
enabled on demand?
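A sketch of that, assuming CONFIG_DYNAMIC_DEBUG is enabled: the message
becomes a debug print that stays silent until requested at runtime:

    netdev_dbg(netdev, "VNIC client device has been successfully removed.\n");

which can then be switched on with something like:

    echo 'file ibmvnic.c +p' > /sys/kernel/debug/dynamic_debug/control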

Thanks

Michal


Re: [PATCH v3] powerpc/pseries: detect secure and trusted boot state of the system.

2020-07-16 Thread Michal Suchánek
On Wed, Jul 15, 2020 at 07:52:01AM -0400, Nayna Jain wrote:
> The device-tree property to check secure and trusted boot state is
> different for guests(pseries) compared to baremetal(powernv).
> 
> This patch updates the existing is_ppc_secureboot_enabled() and
> is_ppc_trustedboot_enabled() functions to add support for pseries.
> 
> The secureboot and trustedboot state are exposed via device-tree property:
> /proc/device-tree/ibm,secure-boot and /proc/device-tree/ibm,trusted-boot
> 
> The values of ibm,secure-boot under pseries are interpreted as:
  ^^^
> 
> 0 - Disabled
> 1 - Enabled in Log-only mode. This patch interprets this value as
> disabled, since audit mode is currently not supported for Linux.
> 2 - Enabled and enforced.
> 3-9 - Enabled and enforcing; requirements are at the discretion of the
> operating system.
> 
> The values of ibm,trusted-boot under pseries are interpreted as:
   ^^^
These two should be different I suppose?

Thanks

Michal
> 0 - Disabled
> 1 - Enabled
> 
> Signed-off-by: Nayna Jain 
> Reviewed-by: Daniel Axtens 
> ---
> v3:
> * fixed double check. Thanks Daniel for noticing it.
> * updated patch description.
> 
> v2:
> * included Michael Ellerman's feedback.
> * added Daniel Axtens's Reviewed-by.
> 
>  arch/powerpc/kernel/secure_boot.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/secure_boot.c 
> b/arch/powerpc/kernel/secure_boot.c
> index 4b982324d368..118bcb5f79c4 100644
> --- a/arch/powerpc/kernel/secure_boot.c
> +++ b/arch/powerpc/kernel/secure_boot.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static struct device_node *get_ppc_fw_sb_node(void)
>  {
> @@ -23,12 +24,19 @@ bool is_ppc_secureboot_enabled(void)
>  {
>   struct device_node *node;
>   bool enabled = false;
> + u32 secureboot;
>  
>   node = get_ppc_fw_sb_node();
>   enabled = of_property_read_bool(node, "os-secureboot-enforcing");
> -
>   of_node_put(node);
>  
> + if (enabled)
> + goto out;
> +
> + if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot))
> + enabled = (secureboot > 1);
> +
> +out:
>   pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
>  
>   return enabled;
> @@ -38,12 +46,19 @@ bool is_ppc_trustedboot_enabled(void)
>  {
>   struct device_node *node;
>   bool enabled = false;
> + u32 trustedboot;
>  
>   node = get_ppc_fw_sb_node();
>   enabled = of_property_read_bool(node, "trusted-enabled");
> -
>   of_node_put(node);
>  
> + if (enabled)
> + goto out;
> +
> + if (!of_property_read_u32(of_root, "ibm,trusted-boot", &trustedboot))
> + enabled = (trustedboot > 0);
> +
> +out:
>   pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
>  
>   return enabled;
> -- 
> 2.26.2
> 


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Suchánek
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand  [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >  * Michal Hocko  [2020-07-01 10:42:00]:
> > 
> > >
> > >>
> > >> 2. Also, the existence of the dummy node leads to inconsistent
> > >> information: the number of online nodes is inconsistent with the
> > >> information in the device-tree and resource dump.
> > >>
> > >> 3. When the dummy node is present, single-node non-NUMA systems end up
> > >> showing up as NUMA systems and numa_balancing gets enabled. This will
> > >> mean we take the hit from the unnecessary NUMA hinting faults.
> > >
> > > I have to say that I dislike the node online/offline state and directly
> > > exporting that to the userspace. Users should only care whether the node
> > > has memory/cpus. Numa nodes can be online without any memory. Just
> > > offline all the present memory blocks but do not physically hot remove
> > > them and you are in the same situation. If users are confused by an
> > > output of tools like numactl -H then those could be updated and hide
> > > nodes without any memory&cpus.
> > >
> > > The autonuma problem sounds interesting but again this patch doesn't
> > > really solve the underlying problem because I strongly suspect that the
> > > problem is still there when a numa node gets all its memory offline as
> > > mentioned above.
> 
> I would really appreciate feedback on these two as well.
> 
> > > While I completely agree that making node 0 special is wrong, I still
> > > have a hard time reviewing this very simple-looking patch because all the
> > > numa initialization is so spread around that this might just blow up
> > > at unexpected places. IIRC we have discussed testing in the previous
> > > version and David has provided a way to emulate these configurations
> > > on x86. Did you manage to use those instructions for additional testing
> > > on other than ppc architectures?
> > >
> > 
> >  I have tried all the steps that David mentioned and reported back at
> >  https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> > 
> >  As a summary, David's steps are still not creating a 
> >  memoryless/cpuless on
> >  x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless 
> > >> at
> > >> boot.  The code in question tries to handle a cpuless/memoryless node 0 
> > >> at
> > >> boot.
> > > 
> > > I was just correcting your statement, because it was wrong.
> > > 
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > > 
> > 
> > Yep, looks like it.
> > 
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
> This begs a question whether ppc can do the same thing?
Or x86 could stop doing it so that you can see which node you are
running on?

What's the point of this indirection other than another way of avoiding
an empty node 0?

Thanks

Michal


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-30 Thread Michal Suchánek
On Mon, Jun 29, 2020 at 06:50:15PM -0700, Dan Williams wrote:
> On Mon, Jun 29, 2020 at 1:41 PM Aneesh Kumar K.V
>  wrote:
> >
> > Michal Suchánek  writes:
> >
> > > Hello,
> > >
> > > On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> > >> nvdimm expect the flush routines to just mark the cache clean. The 
> > >> barrier
> > >> that mark the store globally visible is done in nvdimm_flush().
> > >>
> > >> Update the papr_scm driver to a simplified nvdim_flush callback that do
> > >> only the required barrier.
> > >>
> > >> Signed-off-by: Aneesh Kumar K.V 
> > >> ---
> > >>  arch/powerpc/lib/pmem.c   |  6 --
> > >>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
> > >>  2 files changed, 13 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> > >> index 5a61aaeb6930..21210fa676e5 100644
> > >> --- a/arch/powerpc/lib/pmem.c
> > >> +++ b/arch/powerpc/lib/pmem.c
> > >> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long 
> > >> start, unsigned long stop)
> > >>
> > >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> > >>  asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): 
> > >> "memory");
> > >> -
> > >> -
> > >> -asm volatile(PPC_PHWSYNC ::: "memory");
> > >>  }
> > >>
> > >>  static inline void __flush_pmem_range(unsigned long start, unsigned 
> > >> long stop)
> > >> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long 
> > >> start, unsigned long stop)
> > >>
> > >>  for (i = 0; i < size >> shift; i++, addr += bytes)
> > >>  asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): 
> > >> "memory");
> > >> -
> > >> -
> > >> -asm volatile(PPC_PHWSYNC ::: "memory");
> > >>  }
> > >>
> > >>  static inline void clean_pmem_range(unsigned long start, unsigned long 
> > >> stop)
> > >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> > >> b/arch/powerpc/platforms/pseries/papr_scm.c
> > >> index 9c569078a09f..9a9a0766f8b6 100644
> > >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> > >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> > >> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct 
> > >> nvdimm_bus_descriptor *nd_desc,
> > >>
> > >>  return 0;
> > >>  }
> > >> +/*
> > >> + * We have made sure the pmem writes are done such that before calling 
> > >> this
> > >> + * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. 
> > >> Here
> > >> + * we just need to add the necessary barrier to make sure the above 
> > >> flushes
> > >> + * are have updated persistent storage before any data access or data 
> > >> transfer
> > >> + * caused by subsequent instructions is initiated.
> > >> + */
> > >> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio 
> > >> *bio)
> > >> +{
> > >> +arch_pmem_flush_barrier();
> > >> +return 0;
> > >> +}
> > >>
> > >>  static ssize_t flags_show(struct device *dev,
> > >>struct device_attribute *attr, char *buf)
> > >> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv 
> > >> *p)
> > >>  ndr_desc.mapping = &mapping;
> > >>  ndr_desc.num_mappings = 1;
> > >>  ndr_desc.nd_set = &p->nd_set;
> > >> +ndr_desc.flush = papr_scm_flush_sync;
> > >
> > > AFAICT currently the only device that implements flush is virtio_pmem.
> > > How does the nfit driver get away without implementing flush?
> >
> > generic_nvdimm_flush does the required barrier for nfit. The reason for
> > adding ndr_desc.flush call back for papr_scm was to avoid the usage
> > of iomem based deep flushing (ndr_region_data.flush_wpq) which is not
> > supported by papr_scm.
> >
> > BTW we do return NULL for ndrd_get_flush_wpq() on power. So the upstream
> > code also does the same thing, but in a different way.
> >
> >
> > > Also the flush takes arguments that are completely unused but a user of
> > > the pmem region must assume they are used, and call flush() on the
> > > region rather than arch_pmem_flush_barrier() directly.
> >
> > The bio argument can help a pmem driver to do range based flushing in
> > case of pmem_make_request. If bio is null then we must assume a full
> > device flush.
> 
> The bio argument isn't for range based flushing, it is for flush
> operations that need to complete asynchronously.
How does the block layer determine that the pmem device needs
asynchronous flushing?

The flush() callback was designed for that purpose, with the bio
argument, and only virtio_pmem, which is flushed asynchronously, used
it. Now that papr_scm reuses it for a different purpose, how do you
tell?
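
For reference, my reading of the dispatch in drivers/nvdimm/region_devs.c
is roughly the following (a sketch from memory, details may differ):

int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
{
	int rc = 0;

	if (!nd_region->flush)
		rc = generic_nvdimm_flush(nd_region);
	else if (nd_region->flush(nd_region, bio))
		rc = -EIO;

	return rc;
}

So the core cannot distinguish a barrier-only callback from a truly
asynchronous one here; AFAICT only the ND_REGION_ASYNC region flag set
by virtio_pmem carries that information.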

Thanks

Michal


Re: [PATCH v6 6/8] powerpc/pmem: Avoid the barrier in flush routines

2020-06-29 Thread Michal Suchánek
Hello,

On Mon, Jun 29, 2020 at 07:27:20PM +0530, Aneesh Kumar K.V wrote:
> nvdimm expect the flush routines to just mark the cache clean. The barrier
> that mark the store globally visible is done in nvdimm_flush().
> 
> Update the papr_scm driver to a simplified nvdim_flush callback that do
> only the required barrier.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/lib/pmem.c   |  6 --
>  arch/powerpc/platforms/pseries/papr_scm.c | 13 +
>  2 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
> index 5a61aaeb6930..21210fa676e5 100644
> --- a/arch/powerpc/lib/pmem.c
> +++ b/arch/powerpc/lib/pmem.c
> @@ -19,9 +19,6 @@ static inline void __clean_pmem_range(unsigned long start, 
> unsigned long stop)
>  
>   for (i = 0; i < size >> shift; i++, addr += bytes)
>   asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
> -
> -
> - asm volatile(PPC_PHWSYNC ::: "memory");
>  }
>  
>  static inline void __flush_pmem_range(unsigned long start, unsigned long 
> stop)
> @@ -34,9 +31,6 @@ static inline void __flush_pmem_range(unsigned long start, 
> unsigned long stop)
>  
>   for (i = 0; i < size >> shift; i++, addr += bytes)
>   asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
> -
> -
> - asm volatile(PPC_PHWSYNC ::: "memory");
>  }
>  
>  static inline void clean_pmem_range(unsigned long start, unsigned long stop)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 9c569078a09f..9a9a0766f8b6 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -630,6 +630,18 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor 
> *nd_desc,
>  
>   return 0;
>  }
> +/*
> + * We have made sure the pmem writes are done such that before calling this
> + * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. Here
> + * we just need to add the necessary barrier to make sure the above flushes
> + * are have updated persistent storage before any data access or data 
> transfer
> + * caused by subsequent instructions is initiated.
> + */
> +static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio *bio)
> +{
> + arch_pmem_flush_barrier();
> + return 0;
> +}
>  
>  static ssize_t flags_show(struct device *dev,
> struct device_attribute *attr, char *buf)
> @@ -743,6 +755,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>   ndr_desc.mapping = &mapping;
>   ndr_desc.num_mappings = 1;
>   ndr_desc.nd_set = &p->nd_set;
> + ndr_desc.flush = papr_scm_flush_sync;

AFAICT currently the only device that implements flush is virtio_pmem.
How does the nfit driver get away without implementing flush?
Also the flush callback takes arguments that are completely unused, but
a user of the pmem region must assume they are used, and call flush()
on the region rather than arch_pmem_flush_barrier() directly. This may
not work well with md as discussed with an earlier iteration of the
patchset.
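
The closest I can find is the fallback in drivers/nvdimm/region_devs.c:
when no callback is set the core calls generic_nvdimm_flush(), which
does the barriers itself, presumably covering nfit. Roughly (a sketch
from memory, with the flush hint slot selection simplified):

int generic_nvdimm_flush(struct nd_region *nd_region)
{
	struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
	int i, idx;

	/* rotate between flush hint slots to spread out the writes */
	idx = this_cpu_add_return(flush_idx, hash_32(current->pid, 8));

	wmb();	/* order prior pmem stores before the hint writes */
	for (i = 0; i < nd_region->ndr_mappings; i++)
		if (ndrd_get_flush_wpq(ndrd, i, 0))
			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
	wmb();

	return 0;
}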

Thanks

Michal


Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-06-26 Thread Michal Suchánek
On Fri, May 22, 2020 at 09:01:17AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 May 2020, Aneesh Kumar K.V wrote:
> 
> > On 5/22/20 3:01 PM, Michal Suchánek wrote:
> > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Thu, 21 May 2020, Dan Williams wrote:
> > > > 
> > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
> > > > >  wrote:
> > > > > > 
> > > > > > > Moving on to the patch itself--Aneesh, have you audited other
> > > > > > > persistent
> > > > > > > memory users in the kernel?  For example, 
> > > > > > > drivers/md/dm-writecache.c
> > > > > > > does
> > > > > > > this:
> > > > > > > 
> > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, 
> > > > > > > bool
> > > > > > > wait_for_ios)
> > > > > > > {
> > > > > > >if (WC_MODE_PMEM(wc))
> > > > > > >wmb(); <==
> > > > > > >   else
> > > > > > >   ssd_commit_flushed(wc, wait_for_ios);
> > > > > > > }
> > > > > > > 
> > > > > > > I believe you'll need to make modifications there.
> > > > > > > 
> > > > > > 
> > > > > > Correct. Thanks for catching that.
> > > > > > 
> > > > > > 
> > > > > > I don't understand dm much, wondering how this will work with
> > > > > > non-synchronous DAX device?
> > > > > 
> > > > > That's a good point. DM-writecache needs to be cognizant of things
> > > > > like virtio-pmem that violate the rule that persisent memory writes
> > > > > can be flushed by CPU functions rather than calling back into the
> > > > > driver. It seems we need to always make the flush case a dax_operation
> > > > > callback to account for this.
> > > > 
> > > > dm-writecache is normally sitting on the top of dm-linear, so it would
> > > > need to pass the wmb() call through the dm core and dm-linear target ...
> > > > that would slow it down ... I remember that you already did it this way
> > > > some times ago and then removed it.
> > > > 
> > > > What's the exact problem with POWER? Could the POWER system have two 
> > > > types
> > > > of persistent memory that need two different ways of flushing?
> > > 
> > > As far as I understand the discussion so far
> > > 
> > >   - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency
> > >   - on POWER $newhardware uses $newinstruction to ensure pmem consistency
> > > (compatible with $oldinstruction on $oldhardware)
> > 
> > Correct.
> > 
> > >   - on some platforms instead of barrier instruction a callback into the
> > > driver is issued to ensure consistency 
> > 
> > This is virtio-pmem only at this point IIUC.
> > 
> > -aneesh
> 
> And does the virtio-pmem driver track which pages are dirty? Or does it 
> need to specify the range of pages to flush in the flush function?
> 
> > > None of this is reflected by the dm driver.
> 
> We could make a new dax method:
> void *(dax_get_flush_function)(void);
> 
> This would return a pointer to "wmb()" on x86 and something else on Power.
> 
> The method "dax_get_flush_function" would be called only once when 
> initializing the writecache driver (because the call would be slow because 
> it would have to go through the DM stack) and then, the returned function 
> would be called each time we need write ordering. The returned function 
> would do just "sfence; ret".

Hello,

as far as I understand the code, virtio_pmem has a flush function
defined which indeed can make use of the region properties, such as the
memory range. If such a function exists you need the equivalent of
sync(), that is a call into the device in question. If it does not,
calling arch_pmem_flush_barrier() instead of wmb() should suffice.

I am not aware of an interface to determine if the flush function exists
for a particular region.
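
To illustrate, dm-writecache would then need something like this
(a hypothetical sketch: has_region_flush() and region_flush() do not
exist, which is exactly the interface that seems to be missing):

static void writecache_commit_flushed(struct dm_writecache *wc,
				      bool wait_for_ios)
{
	if (WC_MODE_PMEM(wc)) {
		if (has_region_flush(wc))	/* hypothetical predicate */
			region_flush(wc);	/* sync()-like call into the driver */
		else
			arch_pmem_flush_barrier();	/* instead of wmb() */
	} else {
		ssd_commit_flushed(wc, wait_for_ios);
	}
}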

Thanks

Michal


Re: ppc64le and 32-bit LE userland compatibility

2020-06-02 Thread Michal Suchánek
On Tue, Jun 02, 2020 at 05:40:39PM +0200, Daniel Kolesa wrote:
> 
> 
> On Tue, Jun 2, 2020, at 17:27, Michal Suchánek wrote:
> > On Tue, Jun 02, 2020 at 05:13:25PM +0200, Daniel Kolesa wrote:
> > > On Tue, Jun 2, 2020, at 16:23, Michal Suchánek wrote:
> > > > On Tue, Jun 02, 2020 at 01:40:23PM +, Joseph Myers wrote:
> > > > > On Tue, 2 Jun 2020, Daniel Kolesa wrote:
> > > > > 
> > > > > > not be limited to being just userspace under ppc64le, but should be 
> > > > > > runnable on a native kernel as well, which should not be limited to 
> > > > > > any 
> > > > > > particular baseline other than just PowerPC.
> > > > > 
> > > > > This is a fairly unusual approach to bringing up a new ABI.  Since 
> > > > > new 
> > > > > ABIs are more likely to be used on new systems rather than switching 
> > > > > ABI 
> > > > > on an existing installation, and since it can take quite some time 
> > > > > for all 
> > > > > the software support for a new ABI to become widely available in 
> > > > > distributions, people developing new ABIs are likely to think about 
> > > > > what 
> > > > > new systems are going to be relevant in a few years' time when 
> > > > > working out 
> > > > > the minimum hardware requirements for the new ABI.  (The POWER8 
> > > > > minimum 
> > > > > for powerpc64le fits in with that, for example.)
> > > > That means that you cannot run ppc64le on FSL embedded CPUs (which lack
> > > > the vector instructions in LE mode). Which may be fine with you but
> > > > other people may want to support these. Can't really say if that's good
> > > > idea or not but I don't foresee them going away in a few years, either.
> > > 
> > > well, ppc64le already cannot be run on those, as far as I know (I don't 
> > > think it's possible to build ppc64le userland without VSX in any 
> > > configuration)
> > 
> > What hardware are you targeting then? I did not notice anything
> > specific mentioned in the thread.
> > 
> > Naturally on POWER the first cpu that has LE support is POWER8 so you
> > can count on all other POWER8 features to be present. With other
> > architecture variants the situation is different.
> 
> This is not true; nearly every 32-bit PowerPC CPU has LE support (all the way 
> back to 6xx), these would be the native-hardware targets for the port (would 
> need kernel support implemented, but it's technically possible).
I find dealing with memory management issues on 32bit architectures a
pain. There is never enough address space.
> 
> As far as 64-bit CPUs go, POWER7 is the first one that could in practice run 
> the current ppc64le configuration, but in glibc it's limited to POWER8 and in 
> gcc the default for powerpc64le is also POWER8 (however, it is perfectly 
> possible to configure gcc for POWER7 and use musl libc with it).
That's interesting. I guess I was tricked by the glibc limitation.

Thanks

Michal


Re: ppc64le and 32-bit LE userland compatibility

2020-06-02 Thread Michal Suchánek
On Tue, Jun 02, 2020 at 05:13:25PM +0200, Daniel Kolesa wrote:
> On Tue, Jun 2, 2020, at 16:23, Michal Suchánek wrote:
> > On Tue, Jun 02, 2020 at 01:40:23PM +, Joseph Myers wrote:
> > > On Tue, 2 Jun 2020, Daniel Kolesa wrote:
> > > 
> > > > not be limited to being just userspace under ppc64le, but should be 
> > > > runnable on a native kernel as well, which should not be limited to any 
> > > > particular baseline other than just PowerPC.
> > > 
> > > This is a fairly unusual approach to bringing up a new ABI.  Since new 
> > > ABIs are more likely to be used on new systems rather than switching ABI 
> > > on an existing installation, and since it can take quite some time for 
> > > all 
> > > the software support for a new ABI to become widely available in 
> > > distributions, people developing new ABIs are likely to think about what 
> > > new systems are going to be relevant in a few years' time when working 
> > > out 
> > > the minimum hardware requirements for the new ABI.  (The POWER8 minimum 
> > > for powerpc64le fits in with that, for example.)
> > That means that you cannot run ppc64le on FSL embedded CPUs (which lack
> > the vector instructions in LE mode). Which may be fine with you but
> > other people may want to support these. Can't really say if that's good
> > idea or not but I don't foresee them going away in a few years, either.
> 
> well, ppc64le already cannot be run on those, as far as I know (I don't think 
> it's possible to build ppc64le userland without VSX in any configuration)

What hardware are you targeting then? I did not notice anything
specific mentioned in the thread.

Naturally on POWER the first cpu that has LE support is POWER8 so you
can count on all other POWER8 features to be present. With other
architecture variants the situation is different.

Thanks

Michal


Re: ppc64le and 32-bit LE userland compatibility

2020-06-02 Thread Michal Suchánek
On Tue, Jun 02, 2020 at 01:40:23PM +, Joseph Myers wrote:
> On Tue, 2 Jun 2020, Daniel Kolesa wrote:
> 
> > not be limited to being just userspace under ppc64le, but should be 
> > runnable on a native kernel as well, which should not be limited to any 
> > particular baseline other than just PowerPC.
> 
> This is a fairly unusual approach to bringing up a new ABI.  Since new 
> ABIs are more likely to be used on new systems rather than switching ABI 
> on an existing installation, and since it can take quite some time for all 
> the software support for a new ABI to become widely available in 
> distributions, people developing new ABIs are likely to think about what 
> new systems are going to be relevant in a few years' time when working out 
> the minimum hardware requirements for the new ABI.  (The POWER8 minimum 
> for powerpc64le fits in with that, for example.)
That means that you cannot run ppc64le on FSL embedded CPUs (which lack
the vector instructions in LE mode). Which may be fine with you but
other people may want to support these. Can't really say if that's a
good idea or not but I don't foresee them going away in a few years,
either.

Thanks

Michal


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Michal Suchánek
On Mon, Jun 01, 2020 at 05:31:50PM +0530, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > > > instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications 
> > > > > > > are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used 
> > > > > > > to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow 
> > > > > > > the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > > > > > DEFINE_SPINLOCK(mmlist_lock);
> > > > > > > static unsigned long default_dump_filter = 
> > > > > > > MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro 
> > > > that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless 
> > > > application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on 
> > > > requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We 
> > > could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this propery and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure application would want to add that much details/knowledge
> about a platform in their code. I was expecting application to do
> 
> #ifdef __ppc64__
> prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0));
> #endif
> a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
> is not useful. Do you see a value in making all these device specific rather
> than a conditional on  __ppc64__?
If the vpmem devices continue to work with the old instruction on
POWER10 then it makes sense to make this per-device.

Also adding a message to the kernel log in case the application does
not do the prctl would be helpful for people migrating old code to
POWER10.
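
Something along these lines in the mmap(2) path would do (a sketch on
top of this patch; the message wording and exact placement are just an
example):

	if ((flags & MAP_SYNC) && !map_sync_enabled(current->mm)) {
		pr_warn_once("%s (%d): MAP_SYNC denied, see PR_SET_MAP_SYNC_ENABLE\n",
			     current->comm, task_pid_nr(current));
		return -EOPNOTSUPP;
	}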

Thanks

Michal


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Michal Suchánek
Adding Jan

On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> With POWER10, architecture is adding new pmem flush and sync instructions.
> The kernel should prevent the usage of MAP_SYNC if applications are not using
> the new instructions on newer hardware.
> 
> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> the usage of MAP_SYNC. The kernel config option is added to allow the user
> to control whether MAP_SYNC should be enabled by default or not.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  include/linux/sched/coredump.h | 13 ++---
>  include/uapi/linux/prctl.h |  3 +++
>  kernel/fork.c  |  8 +++-
>  kernel/sys.c   | 18 ++
>  mm/Kconfig |  3 +++
>  mm/mmap.c  |  4 
>  6 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index ecdc6542070f..9ba6b3d5f991 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -72,9 +72,16 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_DISABLE_THP  24  /* disable THP for all VMAs */
>  #define MMF_OOM_VICTIM   25  /* mm is the oom victim */
>  #define MMF_OOM_REAP_QUEUED  26  /* mm was queued for oom_reaper */
> -#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_MAP_SYNC 27  /* disable THP for all VMAs */
> +#define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
> +#define MMF_DISABLE_MAP_SYNC_MASK(1 << MMF_DISABLE_MAP_SYNC)
>  
> -#define MMF_INIT_MASK(MMF_DUMPABLE_MASK | 
> MMF_DUMP_FILTER_MASK |\
> -  MMF_DISABLE_THP_MASK)
> +#define MMF_INIT_MASK(MMF_DUMPABLE_MASK | 
> MMF_DUMP_FILTER_MASK | \
> + MMF_DISABLE_THP_MASK | MMF_DISABLE_MAP_SYNC_MASK)
> +
> +static inline bool map_sync_enabled(struct mm_struct *mm)
> +{
> + return !(mm->flags & MMF_DISABLE_MAP_SYNC_MASK);
> +}
>  
>  #endif /* _LINUX_SCHED_COREDUMP_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..ee4cde32d5cf 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER57
>  #define PR_GET_IO_FLUSHER58
>  
> +#define PR_SET_MAP_SYNC_ENABLE   59
> +#define PR_GET_MAP_SYNC_ENABLE   60
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8c700f881d92..d5a9a363e81e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
>  
>  static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>  
> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> +#else
> +unsigned long default_map_sync_mask = 0;
> +#endif
> +
>  static int __init coredump_filter_setup(char *s)
>  {
>   default_dump_filter =
> @@ -1039,7 +1045,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   mm->flags = current->mm->flags & MMF_INIT_MASK;
>   mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
>   } else {
> - mm->flags = default_dump_filter;
> + mm->flags = default_dump_filter | default_map_sync_mask;
>   mm->def_flags = 0;
>   }
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..f6127cf4128b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2450,6 +2450,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> arg2, unsigned long, arg3,
>   clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>   up_write(&me->mm->mmap_sem);
>   break;
> +
> + case PR_GET_MAP_SYNC_ENABLE:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = !test_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + break;
> + case PR_SET_MAP_SYNC_ENABLE:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (down_write_killable(&me->mm->mmap_sem))
> + return -EINTR;
> + if (arg2)
> + clear_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + else
> + set_bit(MMF_DISABLE_MAP_SYNC, &me->mm->flags);
> + up_write(&me->mm->mmap_sem);
> + break;
> +
>   case PR_MPX_ENABLE_MANAGEMENT:
>   case PR_MPX_DISABLE_MANAGEMENT:
>   /* No longer implemented: */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c35..38fd7cfbfca8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -867,4 +867,7 @@ config ARCH_HAS_HUGEPD
>  config MAPPING_DIRTY_HELPERS
>  bool
>  
> +config A

Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier

2020-05-22 Thread Michal Suchánek
On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote:
> 
> 
> On Thu, 21 May 2020, Dan Williams wrote:
> 
> > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V
> >  wrote:
> > >
> > > > Moving on to the patch itself--Aneesh, have you audited other persistent
> > > > memory users in the kernel?  For example, drivers/md/dm-writecache.c 
> > > > does
> > > > this:
> > > >
> > > > static void writecache_commit_flushed(struct dm_writecache *wc, bool 
> > > > wait_for_ios)
> > > > {
> > > >   if (WC_MODE_PMEM(wc))
> > > >   wmb(); <==
> > > >  else
> > > >  ssd_commit_flushed(wc, wait_for_ios);
> > > > }
> > > >
> > > > I believe you'll need to make modifications there.
> > > >
> > >
> > > Correct. Thanks for catching that.
> > >
> > >
> > > I don't understand dm much, wondering how this will work with
> > > non-synchronous DAX device?
> > 
> > That's a good point. DM-writecache needs to be cognizant of things
> > like virtio-pmem that violate the rule that persisent memory writes
> > can be flushed by CPU functions rather than calling back into the
> > driver. It seems we need to always make the flush case a dax_operation
> > callback to account for this.
> 
> dm-writecache is normally sitting on the top of dm-linear, so it would 
> need to pass the wmb() call through the dm core and dm-linear target ... 
> that would slow it down ... I remember that you already did it this way 
> some times ago and then removed it.
> 
> What's the exact problem with POWER? Could the POWER system have two types 
> of persistent memory that need two different ways of flushing?

As far as I understand the discussion so far

 - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency
 - on POWER $newhardware uses $newinstruction to ensure pmem consistency
   (compatible with $oldinstruction on $oldhardware)
 - on some platforms instead of barrier instruction a callback into the
   driver is issued to ensure consistency

None of this is reflected by the dm driver.

Thanks

Michal


Re: crash in cpuidle_enter_state with 5.7-rc1

2020-04-21 Thread Michal Suchánek
On Tue, Apr 21, 2020 at 10:21:52PM +1000, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > On Mon, Apr 20, 2020 at 08:50:30AM +0200, Michal Suchánek wrote:
> >> On Mon, Apr 20, 2020 at 04:15:39PM +1000, Michael Ellerman wrote:
> >> > Michal Suchánek  writes:
> > ...
> >> > 
> >> > 
> >> > And I've just hit it with your config on a machine here, but the crash
> >> > is different:
> >> That does not look like it.
> >> You don't have this part in the stack trace:
> >> > [1.234899] [c7597420] [] 0x0
> >> > [1.234908] [c7597720] [0a6d] 0xa6d
> >> > [1.234919] [c7597a20] [] 0x0
> >> > [1.234931] [c7597d20] [0004] 0x4
> >> which is somewhat random but at least one such line is always present in
> >> the traces I get. Also I always get a crash in cpuidle_enter_state
> > ..
> >> > I'm going to guess it's STRICT_KERNEL_RWX that's at fault.
> >> I can try without that as well.
> >
> > Can't reproduce without STRICT_KERNEL_RWX either.
> 
> I've reproduced something similar all the way back to v5.5, though it
> seems harder to hit - sometimes 5 boots will succeed before one fails.
I only tried 3 times because I do not have automation in place to
capture these early crashes. I suppose I could tell the kernel to not
reboot on panic and try rebooting several times.
> 
> Are you testing on top of PowerVM or KVM?
PowerVM.

Thanks

Michal


Re: crash in cpuidle_enter_state with 5.7-rc1

2020-04-20 Thread Michal Suchánek
On Mon, Apr 20, 2020 at 08:50:30AM +0200, Michal Suchánek wrote:
> Hello,
> 
> On Mon, Apr 20, 2020 at 04:15:39PM +1000, Michael Ellerman wrote:
> > Michal Suchánek  writes:
...
> > 
> > 
> > And I've just hit it with your config on a machine here, but the crash
> > is different:
> That does not look like it.
> You don't have this part in the stack trace:
> > [1.234899] [c7597420] [] 0x0
> > [1.234908] [c7597720] [0a6d] 0xa6d
> > [1.234919] [c7597a20] [] 0x0
> > [1.234931] [c7597d20] [0004] 0x4
> which is somewhat random but at least one such line is always present in
> the traces I get. Also I always get a crash in cpuidle_enter_state
..
> > I'm going to guess it's STRICT_KERNEL_RWX that's at fault.
> I can try without that as well.
Can't reproduce without STRICT_KERNEL_RWX either.

Thanks

Michal


Re: crash in cpuidle_enter_state with 5.7-rc1

2020-04-19 Thread Michal Suchánek
Hello,

On Mon, Apr 20, 2020 at 04:15:39PM +1000, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > Hello,
> >
> > I observe crash in cpuidle_enter_state in early boot on POWER9 pSeries
> > machine with 5.7-rc1 kernel. The crash is not 100% reliable. Sometimes
> > the machine boots.
> >
> > Attaching config, dmesg, and sample crash message. The stack below
> > cpuidle_enter_state appears random - different in each crash.
> >
> > Any idea what could cause this?
> 
> Nothing immediately springs to mind.
> 
> > Preparing to boot Linux version 5.7.0-rc1-1.g8f6a41f-default 
> > (geeko@buildhost) (gcc version 9.3.1 20200406 [revision 
> > 6db837a5288ee3ca5ec504fbd5a765817e556ac2] (SUSE Linux), GNU ld (GNU 
> > Binutils; openSUSE Tumbleweed) 2.34.0.20200325-1) #1 SMP Fri Apr 17 
> > 10:39:25 UTC 2020 (8f6a41f)
> > Detected machine type: 0101
> > command line: BOOT_IMAGE=/boot/vmlinux-5.7.0-rc1-1.g8f6a41f-default 
> > root=UUID=04f3f652-7c85-470b-9d5f-490601f371f8 mitigations=auto quiet 
> > crashkernel=242M
> > Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
> > Calling ibm,client-architecture-support... done
> > memory layout at init:
> >   memory_limit :  (16 MB aligned)
> >   alloc_bottom : 0e68
> >   alloc_top: 2000
> >   alloc_top_hi : 2000
> >   rmo_top  : 2000
> >   ram_top  : 2000
> > instantiating rtas at 0x1ecb... done
> > prom_hold_cpus: skipped
> > copying OF device tree...
> > Building dt strings...
> > Building dt structure...
> > Device tree strings 0x0e69 -> 0x0e691886
> > Device tree struct  0x0e6a -> 0x0e6b
> > Quiescing Open Firmware ...
> > Booting Linux via __start() @ 0x0a6e ...
> > [1.234639] BUG: Unable to handle kernel data access on read at 
> > 0xc26970e0
> > [1.234654] Faulting instruction address: 0xc00088dc
> > [1.234665] Oops: Kernel access of bad area, sig: 11 [#1]
> > [1.234675] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > [1.234686] Modules linked in:
> > [1.234698] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 
> > 5.7.0-rc1-1.g8f6a41f-default #1 openSUSE Tumbleweed (unreleased)
> > [1.234714] NIP:  c00088dc LR: c0aad890 CTR: 
> > c00087a0
> > [1.234727] REGS: c7596e90 TRAP: 0300   Not tainted  
> > (5.7.0-rc1-1.g8f6a41f-default)
> > [1.234742] MSR:  80001033   CR: 2822  
> > XER: 
> 
> MMU was on when we faulted (IR & DR), so it's not real mode weirdness.
> 
> > [1.234760] CFAR: c00087fc DAR: c26970e0 DSISR: 4000 
> > IRQMASK: 0
> > [1.234760] GPR00: c0aa9384 c7597120 c269f000 
> > 
> > [1.234760] GPR04: c25d2778  c800 
> > c26d69c0
> > [1.234760] GPR08: 000e98056e8436d9 0300  
> > 
> > [1.234760] GPR12: 80001033 c0001ec79c00  
> > 1ef3d880
> > [1.234760] GPR16:   c0058990 
> > 
> > [1.234760] GPR20: c25d2778 c003ffa967c8 0001 
> > 0008
> > [1.234760] GPR24: c7538100   
> > 4995ca42
> > [1.234760] GPR28: c25d2778 c003ffa967c8 04945388 
> > 
> > [1.234869] NIP [c00088dc] data_access_common_virt+0x13c/0x170
> > [1.234882] LR [c0aad890] snooze_loop+0x70/0x220
> > [1.234888] Call Trace:
> > [1.234899] [c7597420] [] 0x0
> > [1.234908] [c7597720] [0a6d] 0xa6d
> > [1.234919] [c7597a20] [] 0x0
> > [1.234931] [c7597d20] [0004] 0x4
> > [1.234943] [c7597d50] [c0aa9384] 
> > cpuidle_enter_state+0xa4/0x590
> > [1.234954] Freeing unused kernel memory: 5312K
> > [1.234958] [c7597dd0] [c0aa990c] cpuidle_enter+0x4c/0x70
> > [1.234965] [c7597e10] [c019635c] call_cpuidle+0x4c/0x90
> > [1.234969] [c7597e30] [c0196978] do_idle+0x308/0x420
> > [1.234973] [c7597ed0] [c0196cd8] 
> > cpu_startup_entry+0x38/0x40
> > [1.234977] [c759

Re: CVE-2020-11669: Linux kernel 4.10 to 5.1: powerpc: guest can cause DoS on POWER9 KVM hosts

2020-04-15 Thread Michal Suchánek
On Wed, Apr 15, 2020 at 10:52:53PM +1000, Andrew Donnellan wrote:
> The Linux kernel for powerpc from v4.10 to v5.1 has a bug where the
> Authority Mask Register (AMR), Authority Mask Override Register (AMOR) and
> User Authority Mask Override Register (UAMOR) are not correctly saved and
> restored when the CPU is going into/coming out of idle state.
> 
> On POWER9 CPUs, this means that a CPU may return from idle with the AMR
> value of another thread on the same core.
> 
> This allows a trivial Denial of Service attack against KVM hosts, by booting
> a guest kernel which makes use of the AMR, such as a v5.2 or later kernel
> with Kernel Userspace Access Prevention (KUAP) enabled.
> 
> The guest kernel will set the AMR to prevent userspace access, then the
> thread will go idle. At a later point, the hardware thread that the guest
> was using may come out of idle and start executing in the host, without
> restoring the host AMR value. The host kernel can get caught in a page fault
> loop, as the AMR is unexpectedly causing memory accesses to fail in the
> host, and the host is eventually rendered unusable.

Hello,

shouldn't the kernel restore the host registers when leaving the guest?

I recall some code exists for handling the *AM*R when leaving the
guest. Can the KVM guest enter idle without exiting to the host?

Thanks

Michal


Re: [PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-14 Thread Michal Suchánek
On Tue, Apr 14, 2020 at 12:24:36PM -0400, Waiman Long wrote:
> On 4/14/20 2:08 AM, Christophe Leroy wrote:
> >
> >
> > Le 14/04/2020 à 00:28, Waiman Long a écrit :
> >> Since kfree_sensitive() will do an implicit memzero_explicit(), there
> >> is no need to call memzero_explicit() before it. Eliminate those
> >> memzero_explicit() and simplify the call sites. For better correctness,
> >> the setting of keylen is also moved down after the key pointer check.
> >>
> >> Signed-off-by: Waiman Long 
> >> ---
> >>   .../allwinner/sun8i-ce/sun8i-ce-cipher.c  | 19 +-
> >>   .../allwinner/sun8i-ss/sun8i-ss-cipher.c  | 20 +--
> >>   drivers/crypto/amlogic/amlogic-gxl-cipher.c   | 12 +++
> >>   drivers/crypto/inside-secure/safexcel_hash.c  |  3 +--
> >>   4 files changed, 14 insertions(+), 40 deletions(-)
> >>
> >> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
> >> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
> >> index aa4e8fdc2b32..8358fac98719 100644
> >> --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
> >> +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
> >> @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm)
> >>   {
> >>   struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
> >>   -    if (op->key) {
> >> -    memzero_explicit(op->key, op->keylen);
> >> -    kfree(op->key);
> >> -    }
> >> +    kfree_sensitive(op->key);
> >>   crypto_free_sync_skcipher(op->fallback_tfm);
> >>   pm_runtime_put_sync_suspend(op->ce->dev);
> >>   }
> >> @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher
> >> *tfm, const u8 *key,
> >>   dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
> >>   return -EINVAL;
> >>   }
> >> -    if (op->key) {
> >> -    memzero_explicit(op->key, op->keylen);
> >> -    kfree(op->key);
> >> -    }
> >> -    op->keylen = keylen;
> >> +    kfree_sensitive(op->key);
> >>   op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
> >>   if (!op->key)
> >>   return -ENOMEM;
> >> +    op->keylen = keylen;
> >
> > Does it matter at all to ensure op->keylen is not set when of->key is
> > NULL ? I'm not sure.
> >
> > But if it does, then op->keylen should be set to 0 when freeing op->key. 
> 
> My thinking is that if memory allocation fails, we just don't touch
> anything and return an error code. I will not explicitly set keylen to 0
> in this case unless it is specified in the API documentation.
You already freed the key by now so not touching anything is not
possible. The key is set to NULL on allocation failure so setting keylen
to 0 should be redundant. However, setting keylen to 0 is consistent with
not having a key, and it avoids the possibility of leaking the length
later should that ever cause any problem.
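
I.e. something like this against the sun8i-ce hunk above (sketch):

	kfree_sensitive(op->key);
	op->keylen = 0;		/* never describe a key that is already gone */
	op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
	if (!op->key)
		return -ENOMEM;
	op->keylen = keylen;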

Thanks

Michal


Re: [PATCH] powerpcs: perf: consolidate perf_callchain_user_64 and perf_callchain_user_32

2020-04-09 Thread Michal Suchánek
On Tue, Apr 07, 2020 at 07:21:06AM +0200, Christophe Leroy wrote:
> 
> 
> Le 06/04/2020 à 23:00, Michal Suchanek a écrit :
> > perf_callchain_user_64 and perf_callchain_user_32 are nearly identical.
> > Consolidate into one function with thin wrappers.
> > 
> > Suggested-by: Nicholas Piggin 
> > Signed-off-by: Michal Suchanek 
> > ---
> >   arch/powerpc/perf/callchain.h| 24 +++-
> >   arch/powerpc/perf/callchain_32.c | 21 ++---
> >   arch/powerpc/perf/callchain_64.c | 14 --
> >   3 files changed, 29 insertions(+), 30 deletions(-)
> > 
> > diff --git a/arch/powerpc/perf/callchain.h b/arch/powerpc/perf/callchain.h
> > index 7a2cb9e1181a..7540bb71cb60 100644
> > --- a/arch/powerpc/perf/callchain.h
> > +++ b/arch/powerpc/perf/callchain.h
> > @@ -2,7 +2,7 @@
> >   #ifndef _POWERPC_PERF_CALLCHAIN_H
> >   #define _POWERPC_PERF_CALLCHAIN_H
> > -int read_user_stack_slow(void __user *ptr, void *buf, int nb);
> > +int read_user_stack_slow(const void __user *ptr, void *buf, int nb);
> 
> Does the constification of ptr has to be in this patch ?
It was in the original patch. The code is touched anyway.
> Wouldn't it be better to have it as a separate patch ?
Don't care much either way. Can resend it as separate patches.
> 
> >   void perf_callchain_user_64(struct perf_callchain_entry_ctx *entry,
> > struct pt_regs *regs);
> >   void perf_callchain_user_32(struct perf_callchain_entry_ctx *entry,
> > @@ -16,4 +16,26 @@ static inline bool invalid_user_sp(unsigned long sp)
> > return (!sp || (sp & mask) || (sp > top));
> >   }
> > +/*
> > + * On 32-bit we just access the address and let hash_page create a
> > + * HPTE if necessary, so there is no need to fall back to reading
> > + * the page tables.  Since this is called at interrupt level,
> > + * do_page_fault() won't treat a DSI as a page fault.
> > + */
> > +static inline int __read_user_stack(const void __user *ptr, void *ret,
> > +   size_t size)
> > +{
> > +   int rc;
> > +
> > +   if ((unsigned long)ptr > TASK_SIZE - size ||
> > +   ((unsigned long)ptr & (size - 1)))
> > +   return -EFAULT;
> > +   rc = probe_user_read(ret, ptr, size);
> > +
> > +   if (rc && IS_ENABLED(CONFIG_PPC64))
> 
> gcc is probably smart enough to deal with it efficiently, but it would
> be more correct to test rc after checking CONFIG_PPC64.
IS_ENABLED(CONFIG_PPC64) is a compile-time constant so that part of the
check should be compiled out in any case.
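
In other words, on PPC32 the compiler effectively sees the following
after constant folding (a sketch based on the hunk above):

static inline int __read_user_stack(const void __user *ptr, void *ret,
				    size_t size)
{
	if ((unsigned long)ptr > TASK_SIZE - size ||
	    ((unsigned long)ptr & (size - 1)))
		return -EFAULT;
	/* rc && IS_ENABLED(CONFIG_PPC64) folds to rc && 0: dead code */
	return probe_user_read(ret, ptr, size);
}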

Thanks

Michal


Re: [PATCH v12 5/8] powerpc/64: make buildable without CONFIG_COMPAT

2020-04-07 Thread Michal Suchánek
On Tue, Apr 07, 2020 at 07:50:30AM +0200, Christophe Leroy wrote:
> 
> 
> Le 20/03/2020 à 11:20, Michal Suchanek a écrit :
> > There are numerous references to 32bit functions in generic and 64bit
> > code so ifdef them out.
> > 
> > Signed-off-by: Michal Suchanek 
> > ---
> > v2:
> > - fix 32bit ifdef condition in signal.c
> > - simplify the compat ifdef condition in vdso.c - 64bit is redundant
> > - simplify the compat ifdef condition in callchain.c - 64bit is redundant
> > v3:
> > - use IS_ENABLED and maybe_unused where possible
> > - do not ifdef declarations
> > - clean up Makefile
> > v4:
> > - further makefile cleanup
> > - simplify is_32bit_task conditions
> > - avoid ifdef in condition by using return
> > v5:
> > - avoid unreachable code on 32bit
> > - make is_current_64bit constant on !COMPAT
> > - add stub perf_callchain_user_32 to avoid some ifdefs
> > v6:
> > - consolidate current_is_64bit
> > v7:
> > - remove leftover perf_callchain_user_32 stub from previous series version
> > v8:
> > - fix build again - too trigger-happy with stub removal
> > - remove a vdso.c hunk that causes warning according to kbuild test robot
> > v9:
> > - removed current_is_64bit in previous patch
> > v10:
> > - rebase on top of 70ed86f4de5bd
> > ---
> >   arch/powerpc/include/asm/thread_info.h | 4 ++--
> >   arch/powerpc/kernel/Makefile   | 6 +++---
> >   arch/powerpc/kernel/entry_64.S | 2 ++
> >   arch/powerpc/kernel/signal.c   | 3 +--
> >   arch/powerpc/kernel/syscall_64.c   | 6 ++
> >   arch/powerpc/kernel/vdso.c | 3 ++-
> >   arch/powerpc/perf/callchain.c  | 8 +++-
> >   7 files changed, 19 insertions(+), 13 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/arch/powerpc/kernel/syscall_64.c 
> > b/arch/powerpc/kernel/syscall_64.c
> > index 87d95b455b83..2dcbfe38f5ac 100644
> > --- a/arch/powerpc/kernel/syscall_64.c
> > +++ b/arch/powerpc/kernel/syscall_64.c
> > @@ -24,7 +24,6 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> >long r6, long r7, long r8,
> >unsigned long r0, struct pt_regs *regs)
> >   {
> > -   unsigned long ti_flags;
> > syscall_fn f;
> > if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
> > @@ -68,8 +67,7 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> > local_irq_enable();
> > -   ti_flags = current_thread_info()->flags;
> > -   if (unlikely(ti_flags & _TIF_SYSCALL_DOTRACE)) {
> > +   if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
> > /*
> >  * We use the return value of do_syscall_trace_enter() as the
> >  * syscall number. If the syscall was rejected for any reason
> > @@ -94,7 +92,7 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> > /* May be faster to do array_index_nospec? */
> > barrier_nospec();
> > -   if (unlikely(ti_flags & _TIF_32BIT)) {
> > +   if (unlikely(is_32bit_task())) {
> 
> is_compat() should be used here instead, because we dont want to use
is_compat_task()
> compat_sys_call_table() on PPC32.
> 
> > f = (void *)compat_sys_call_table[r0];
> > r3 &= 0xULL;
> 
That only applies once you use this for 32bit as well. Right now it's
64bit only so the two are the same.
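
For reference, the definition this series ends up with is something
like (sketch):

#ifdef CONFIG_COMPAT
#define is_32bit_task()	(test_thread_flag(TIF_32BIT))
#else
#define is_32bit_task()	(IS_ENABLED(CONFIG_PPC32))
#endif

and on a 64-bit kernel is_compat_task() also reduces to
test_thread_flag(TIF_32BIT), so the two only diverge once PPC32 starts
sharing this syscall path.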

Thanks

Michal


Re: [PATCH v11 3/8] powerpc/perf: consolidate read_user_stack_32

2020-04-06 Thread Michal Suchánek
On Fri, Apr 03, 2020 at 05:13:25PM +1000, Nicholas Piggin wrote:
> Michal Suchánek's on March 25, 2020 5:38 am:
> > On Tue, Mar 24, 2020 at 06:48:20PM +1000, Nicholas Piggin wrote:
> >> Michal Suchanek's on March 19, 2020 10:19 pm:
> >> > There are two almost identical copies for 32bit and 64bit.
> >> > 
> >> > The function is used only in 32bit code which will be split out in next
> >> > patch so consolidate to one function.
> >> > 
> >> > Signed-off-by: Michal Suchanek 
> >> > Reviewed-by: Christophe Leroy 
> >> > ---
> >> > v6:  new patch
> >> > v8:  move the consolidated function out of the ifdef block.
> >> > v11: rebase on top of def0bfdbd603
> >> > ---
> >> >  arch/powerpc/perf/callchain.c | 48 +--
> >> >  1 file changed, 24 insertions(+), 24 deletions(-)
> >> > 
> >> > diff --git a/arch/powerpc/perf/callchain.c 
> >> > b/arch/powerpc/perf/callchain.c
> >> > index cbc251981209..c9a78c6e4361 100644
> >> > --- a/arch/powerpc/perf/callchain.c
> >> > +++ b/arch/powerpc/perf/callchain.c
> >> > @@ -161,18 +161,6 @@ static int read_user_stack_64(unsigned long __user 
> >> > *ptr, unsigned long *ret)
> >> >  return read_user_stack_slow(ptr, ret, 8);
> >> >  }
> >> >  
> >> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int 
> >> > *ret)
> >> > -{
> >> > -if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> >> > -((unsigned long)ptr & 3))
> >> > -return -EFAULT;
> >> > -
> >> > -if (!probe_user_read(ret, ptr, sizeof(*ret)))
> >> > -return 0;
> >> > -
> >> > -return read_user_stack_slow(ptr, ret, 4);
> >> > -}
> >> > -
> >> >  static inline int valid_user_sp(unsigned long sp, int is_64)
> >> >  {
> >> >  if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x1UL) 
> >> > - 32)
> >> > @@ -277,19 +265,9 @@ static void perf_callchain_user_64(struct 
> >> > perf_callchain_entry_ctx *entry,
> >> >  }
> >> >  
> >> >  #else  /* CONFIG_PPC64 */
> >> > -/*
> >> > - * On 32-bit we just access the address and let hash_page create a
> >> > - * HPTE if necessary, so there is no need to fall back to reading
> >> > - * the page tables.  Since this is called at interrupt level,
> >> > - * do_page_fault() won't treat a DSI as a page fault.
> >> > - */
> >> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int 
> >> > *ret)
> >> > +static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
> >> >  {
> >> > -if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> >> > -((unsigned long)ptr & 3))
> >> > -return -EFAULT;
> >> > -
> >> > -return probe_user_read(ret, ptr, sizeof(*ret));
> >> > +return 0;
> >> >  }
> >> >  
> >> >  static inline void perf_callchain_user_64(struct 
> >> > perf_callchain_entry_ctx *entry,
> >> > @@ -312,6 +290,28 @@ static inline int valid_user_sp(unsigned long sp, 
> >> > int is_64)
> >> >  
> >> >  #endif /* CONFIG_PPC64 */
> >> >  
> >> > +/*
> >> > + * On 32-bit we just access the address and let hash_page create a
> >> > + * HPTE if necessary, so there is no need to fall back to reading
> >> > + * the page tables.  Since this is called at interrupt level,
> >> > + * do_page_fault() won't treat a DSI as a page fault.
> >> > + */
> >> 
> >> The comment is actually probably better to stay in the 32-bit
> >> read_user_stack_slow implementation. Is that function defined
> >> on 32-bit purely so that you can use IS_ENABLED()? In that case
> > It documents the IS_ENABLED() and that's where it is. The 32bit
> > definition is only a technical detail.
> 
> Sorry for the late reply, busy trying to fix bugs in the C rewrite
> series. I don't think it is the right place, it should be in the
> ppc32 implementation detail.
Which does not exist anymore after the 32bit and 64bit parts are split.
> ppc64 has an equivalent comment at the top of its read_user_stack functions.
> 
> >> I would prefer to put a BUG() there which makes it self documenting.
> > Which will cause checkpatch complaints about introducing new BUG() which
> > is frowned on.
> 
> It's fine in this case, that warning is about not introducing
> runtime bugs, but this wouldn't be.
> 
> But... I actually don't like adding read_user_stack_slow on 32-bit
> and especially not just to make IS_ENABLED work.
That's to avoid breaking the build at this point. Later the function is removed.
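
For completeness, with the BUG() you suggest the transitional 32-bit
stub would read (sketch; it is removed later in the series either way):

static int read_user_stack_slow(const void __user *ptr, void *buf, int nb)
{
	/* unreachable: the only caller guards with IS_ENABLED(CONFIG_PPC64) */
	BUG();
	return -EFAULT;
}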
> 
> IMO this would be better if you really want to consolidate it
> 
> ---
> 
> diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
> index cbc251981209..ca3a599b3f54 100644
> --- a/arch/powerpc/perf/callchain.c
> +++ b/arch/powerpc/perf/callchain.c
> @@ -108,7 +108,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
>   * interrupt context, so if the access faults, we read the page tables
>   * to find which page (if any) is mapped and access it directly.
>   */
> -static int read_user_st

Re: [PATCH v11 3/8] powerpc/perf: consolidate read_user_stack_32

2020-04-03 Thread Michal Suchánek
On Fri, Apr 03, 2020 at 09:26:27PM +1000, Nicholas Piggin wrote:
> Michal Suchánek's on April 3, 2020 8:52 pm:
> > Hello,
> > 
> > there are 3 variants of the function
> > 
> > read_user_stack_64
> > 
> > 32bit read_user_stack_32
> > 64bit read_user_stack_32
> 
> Right.
> 
> > On Fri, Apr 03, 2020 at 05:13:25PM +1000, Nicholas Piggin wrote:
> [...]
> >>  #endif /* CONFIG_PPC64 */
> >>  
> >> +static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
> >> +{
> >> +  return __read_user_stack(ptr, ret, sizeof(*ret));
> > Does not work for 64bit read_user_stack_32 ^ this should be 4.
> > 
> > Other than that it should preserve the existing logic just fine.
> 
> sizeof(int) == 4 on 64bit so it should work.
> 
Right, the type is different for the 32bit and 64bit versions.
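
I.e. both wrappers collapse onto __read_user_stack() with the access
width taken from the pointee type (a sketch of the consolidated
version):

static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
{
	return __read_user_stack(ptr, ret, sizeof(*ret)); /* 4 on ppc32 and ppc64 */
}

static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
{
	return __read_user_stack(ptr, ret, sizeof(*ret)); /* 8 on ppc64 */
}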

Thanks

Michal


Re: [PATCH v11 3/8] powerpc/perf: consolidate read_user_stack_32

2020-04-03 Thread Michal Suchánek
Hello,

there are 3 variants of the function

read_user_stack_64

32bit read_user_stack_32
64bit read_user_stack_32

On Fri, Apr 03, 2020 at 05:13:25PM +1000, Nicholas Piggin wrote:
> Michal Suchánek's on March 25, 2020 5:38 am:
> > On Tue, Mar 24, 2020 at 06:48:20PM +1000, Nicholas Piggin wrote:
> >> Michal Suchanek's on March 19, 2020 10:19 pm:
> >> > There are two almost identical copies for 32bit and 64bit.
> >> > 
> >> > The function is used only in 32bit code which will be split out in next
> >> > patch so consolidate to one function.
> >> > 
> >> > Signed-off-by: Michal Suchanek 
> >> > Reviewed-by: Christophe Leroy 
> >> > ---
> >> > v6:  new patch
> >> > v8:  move the consolidated function out of the ifdef block.
> >> > v11: rebase on top of def0bfdbd603
> >> > ---
> >> >  arch/powerpc/perf/callchain.c | 48 +--
> >> >  1 file changed, 24 insertions(+), 24 deletions(-)
> >> > 
> >> > diff --git a/arch/powerpc/perf/callchain.c 
> >> > b/arch/powerpc/perf/callchain.c
> >> > index cbc251981209..c9a78c6e4361 100644
> >> > --- a/arch/powerpc/perf/callchain.c
> >> > +++ b/arch/powerpc/perf/callchain.c
> >> > @@ -161,18 +161,6 @@ static int read_user_stack_64(unsigned long __user 
> >> > *ptr, unsigned long *ret)
> >> >  return read_user_stack_slow(ptr, ret, 8);
> >> >  }
> >> >  
> >> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int 
> >> > *ret)
> >> > -{
> >> > -if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> >> > -((unsigned long)ptr & 3))
> >> > -return -EFAULT;
> >> > -
> >> > -if (!probe_user_read(ret, ptr, sizeof(*ret)))
> >> > -return 0;
> >> > -
> >> > -return read_user_stack_slow(ptr, ret, 4);
> >> > -}
> >> > -
> >> >  static inline int valid_user_sp(unsigned long sp, int is_64)
> >> >  {
> >> >  if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x1UL) 
> >> > - 32)
> >> > @@ -277,19 +265,9 @@ static void perf_callchain_user_64(struct 
> >> > perf_callchain_entry_ctx *entry,
> >> >  }
> >> >  
> >> >  #else  /* CONFIG_PPC64 */
> >> > -/*
> >> > - * On 32-bit we just access the address and let hash_page create a
> >> > - * HPTE if necessary, so there is no need to fall back to reading
> >> > - * the page tables.  Since this is called at interrupt level,
> >> > - * do_page_fault() won't treat a DSI as a page fault.
> >> > - */
> >> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int 
> >> > *ret)
> >> > +static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
> >> >  {
> >> > -if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> >> > -((unsigned long)ptr & 3))
> >> > -return -EFAULT;
> >> > -
> >> > -return probe_user_read(ret, ptr, sizeof(*ret));
> >> > +return 0;
> >> >  }
> >> >  
> >> >  static inline void perf_callchain_user_64(struct 
> >> > perf_callchain_entry_ctx *entry,
> >> > @@ -312,6 +290,28 @@ static inline int valid_user_sp(unsigned long sp, 
> >> > int is_64)
> >> >  
> >> >  #endif /* CONFIG_PPC64 */
> >> >  
> >> > +/*
> >> > + * On 32-bit we just access the address and let hash_page create a
> >> > + * HPTE if necessary, so there is no need to fall back to reading
> >> > + * the page tables.  Since this is called at interrupt level,
> >> > + * do_page_fault() won't treat a DSI as a page fault.
> >> > + */
> >> 
> >> The comment is actually probably better to stay in the 32-bit
> >> read_user_stack_slow implementation. Is that function defined
> >> on 32-bit purely so that you can use IS_ENABLED()? In that case
> > It documents the IS_ENABLED() and that's where it is. The 32bit
> > definition is only a technical detail.
> 
> Sorry for the late reply, busy trying to fix bugs in the C rewrite
> series. I don't think it is the right place, it should be in the
> ppc32 implementation detail. ppc64 has an equivalent comment at the
> top of its read_user_stack functions.
> 
> >> I would prefer to put a BUG() there which makes it self documenting.
> > Which will cause checkpatch complaints about introducing new BUG() which
> > is frowned on.
> 
> It's fine in this case, that warning is about not introducing
> runtime bugs, but this wouldn't be.
> 
> But... I actually don't like adding read_user_stack_slow on 32-bit
> and especially not just to make IS_ENABLED work.
> 
> IMO this would be better if you really want to consolidate it
> 
> ---
> 
> diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
> index cbc251981209..ca3a599b3f54 100644
> --- a/arch/powerpc/perf/callchain.c
> +++ b/arch/powerpc/perf/callchain.c
> @@ -108,7 +108,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
>   * interrupt context, so if the access faults, we read the page tables
>   * to find which page (if any) is mapped and access it directly.
>   */
> -static int read_user_stack_slow(void __user *ptr

Re: [PATCH] powerpc/64: Fix section mismatch warnings.

2020-03-26 Thread Michal Suchánek
On Thu, Mar 26, 2020 at 11:22:03PM +0530, Naveen N. Rao wrote:
> Michal Suchanek wrote:
> > Fixes the following warnings:
> > 
> > WARNING: vmlinux.o(.text+0x2d24): Section mismatch in reference from the 
> > variable __boot_from_prom to the function .init.text:prom_init()
> > The function __boot_from_prom() references
> > the function __init prom_init().
> > This is often because __boot_from_prom lacks a __init
> > annotation or the annotation of prom_init is wrong.
> > 
> > WARNING: vmlinux.o(.text+0x2fd0): Section mismatch in reference from the 
> > variable start_here_common to the function .init.text:start_kernel()
> > The function start_here_common() references
> > the function __init start_kernel().
> > This is often because start_here_common lacks a __init
> > annotation or the annotation of start_kernel is wrong.
> > 
> > Signed-off-by: Michal Suchanek 
> > ---
> >  arch/powerpc/kernel/head_64.S | 4 
> >  1 file changed, 4 insertions(+)
> 
> Michael committed a similar patch just earlier today:
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=6eeb9b3b9ce588f14a697737a30d0702b5a20293

Missed it because it did not reach master yet.

Thanks

Michal


Re: [PATCH v11 3/8] powerpc/perf: consolidate read_user_stack_32

2020-03-24 Thread Michal Suchánek
On Tue, Mar 24, 2020 at 06:48:20PM +1000, Nicholas Piggin wrote:
> Michal Suchanek's on March 19, 2020 10:19 pm:
> > There are two almost identical copies for 32bit and 64bit.
> > 
> > The function is used only in 32bit code which will be split out in next
> > patch so consolidate to one function.
> > 
> > Signed-off-by: Michal Suchanek 
> > Reviewed-by: Christophe Leroy 
> > ---
> > v6:  new patch
> > v8:  move the consolidated function out of the ifdef block.
> > v11: rebase on top of def0bfdbd603
> > ---
> >  arch/powerpc/perf/callchain.c | 48 +--
> >  1 file changed, 24 insertions(+), 24 deletions(-)
> > 
> > diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
> > index cbc251981209..c9a78c6e4361 100644
> > --- a/arch/powerpc/perf/callchain.c
> > +++ b/arch/powerpc/perf/callchain.c
> > @@ -161,18 +161,6 @@ static int read_user_stack_64(unsigned long __user 
> > *ptr, unsigned long *ret)
> > return read_user_stack_slow(ptr, ret, 8);
> >  }
> >  
> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
> > -{
> > -   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> > -   ((unsigned long)ptr & 3))
> > -   return -EFAULT;
> > -
> > -   if (!probe_user_read(ret, ptr, sizeof(*ret)))
> > -   return 0;
> > -
> > -   return read_user_stack_slow(ptr, ret, 4);
> > -}
> > -
> >  static inline int valid_user_sp(unsigned long sp, int is_64)
> >  {
> > if (!sp || (sp & 7) || sp > (is_64 ? TASK_SIZE : 0x1UL) - 32)
> > @@ -277,19 +265,9 @@ static void perf_callchain_user_64(struct 
> > perf_callchain_entry_ctx *entry,
> >  }
> >  
> >  #else  /* CONFIG_PPC64 */
> > -/*
> > - * On 32-bit we just access the address and let hash_page create a
> > - * HPTE if necessary, so there is no need to fall back to reading
> > - * the page tables.  Since this is called at interrupt level,
> > - * do_page_fault() won't treat a DSI as a page fault.
> > - */
> > -static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
> > +static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
> >  {
> > -   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
> > -   ((unsigned long)ptr & 3))
> > -   return -EFAULT;
> > -
> > -   return probe_user_read(ret, ptr, sizeof(*ret));
> > +   return 0;
> >  }
> >  
> >  static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx 
> > *entry,
> > @@ -312,6 +290,28 @@ static inline int valid_user_sp(unsigned long sp, int 
> > is_64)
> >  
> >  #endif /* CONFIG_PPC64 */
> >  
> > +/*
> > + * On 32-bit we just access the address and let hash_page create a
> > + * HPTE if necessary, so there is no need to fall back to reading
> > + * the page tables.  Since this is called at interrupt level,
> > + * do_page_fault() won't treat a DSI as a page fault.
> > + */
> 
> The comment is actually probably better to stay in the 32-bit
> read_user_stack_slow implementation. Is that function defined
> on 32-bit purely so that you can use IS_ENABLED()? In that case
It documents the IS_ENABLED() and that's where it is. The 32bit
definition is only a technical detail.
> I would prefer to put a BUG() there which makes it self documenting.
Which will cause checkpatch complaints about introducing a new BUG(),
which is frowned upon.
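
For illustration, a minimal sketch of the consolidated shape the diff
hunks imply; the exact body is my assumption, since the quoted patch
does not show the merged function:

/*
 * Sketch only: the comment about hash_page creating the HPTE lives
 * next to the IS_ENABLED() test it documents. On 32-bit the slow
 * path is a stub, so the fast path is all there is.
 */
static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
{
	int rc;

	if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
	    ((unsigned long)ptr & 3))
		return -EFAULT;

	rc = probe_user_read(ret, ptr, sizeof(*ret));

	/* Only 64-bit has a real slow path to fall back to. */
	if (IS_ENABLED(CONFIG_PPC64) && rc)
		rc = read_user_stack_slow(ptr, ret, 4);

	return rc;
}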

Thanks

Michal


Re: [PATCH v11 5/8] powerpc/64: make buildable without CONFIG_COMPAT

2020-03-24 Thread Michal Suchánek
On Tue, Mar 24, 2020 at 06:54:20PM +1000, Nicholas Piggin wrote:
> Michal Suchanek's on March 19, 2020 10:19 pm:
> > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> > index 4b0152108f61..a264989626fd 100644
> > --- a/arch/powerpc/kernel/signal.c
> > +++ b/arch/powerpc/kernel/signal.c
> > @@ -247,7 +247,6 @@ static void do_signal(struct task_struct *tsk)
> > sigset_t *oldset = sigmask_to_save();
> > struct ksignal ksig = { .sig = 0 };
> > int ret;
> > -   int is32 = is_32bit_task();
> >  
> > BUG_ON(tsk != current);
> >  
> > @@ -277,7 +276,7 @@ static void do_signal(struct task_struct *tsk)
> >  
> > rseq_signal_deliver(&ksig, tsk->thread.regs);
> >  
> > -   if (is32) {
> > +   if (is_32bit_task()) {
> > if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> > ret = handle_rt_signal32(&ksig, oldset, tsk);
> > else
> 
> Unnecessary?
> 
> > diff --git a/arch/powerpc/kernel/syscall_64.c 
> > b/arch/powerpc/kernel/syscall_64.c
> > index 87d95b455b83..2dcbfe38f5ac 100644
> > --- a/arch/powerpc/kernel/syscall_64.c
> > +++ b/arch/powerpc/kernel/syscall_64.c
> > @@ -24,7 +24,6 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> >long r6, long r7, long r8,
> >unsigned long r0, struct pt_regs *regs)
> >  {
> > -   unsigned long ti_flags;
> > syscall_fn f;
> >  
> > if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
> > @@ -68,8 +67,7 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> >  
> > local_irq_enable();
> >  
> > -   ti_flags = current_thread_info()->flags;
> > -   if (unlikely(ti_flags & _TIF_SYSCALL_DOTRACE)) {
> > +   if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
> > /*
> >  * We use the return value of do_syscall_trace_enter() as the
> >  * syscall number. If the syscall was rejected for any reason
> > @@ -94,7 +92,7 @@ notrace long system_call_exception(long r3, long r4, long 
> > r5,
> > /* May be faster to do array_index_nospec? */
> > barrier_nospec();
> >  
> > -   if (unlikely(ti_flags & _TIF_32BIT)) {
> > +   if (unlikely(is_32bit_task())) {
> 
> Problem is, does this allow the load of ti_flags to be used for both
> tests, or does test_bit make it re-load?
> 
> This could maybe be fixed by testing if(IS_ENABLED(CONFIG_COMPAT) &&
Both points were already discussed here:

https://lore.kernel.org/linuxppc-dev/13fa324dc879a7f325290bf2e131b87eb491cd7b.1573576649.git.msucha...@suse.de/
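
To make the codegen concern concrete, a hypothetical sketch with names
taken from the quoted hunk; whether the compiler really re-loads the
flags word is exactly the open question:

	/*
	 * Caching the flags word lets one load feed both tests,
	 * whereas calling is_32bit_task() separately may force a
	 * re-load. The IS_ENABLED(CONFIG_COMPAT) guard makes the
	 * branch disappear on !COMPAT builds in either variant.
	 */
	unsigned long ti_flags = current_thread_info()->flags;

	if (unlikely(ti_flags & _TIF_SYSCALL_DOTRACE))
		r0 = do_syscall_trace_enter(regs);

	/* ... */

	if (IS_ENABLED(CONFIG_COMPAT) && unlikely(ti_flags & _TIF_32BIT)) {
		/* compat syscall dispatch */
	}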

Thanks

Michal


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 06:31:57PM +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> > On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > > While at it also simplify the existing perf patterns.
> > > > > And still missed fixes from parse-maintainers.pl.
> > > > 
> > > > Oh, that script UX is truly ingenious.
> > > 
> > > You have at least two options, their combinations, etc:
> > >  - complain to the author :-)
> > >  - send a patch :-)
> > 
> > Recently:
> > 
> > https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/
> 
> But why?
> 
> Shouldn't we rather run MAINTAINERS clean up once and require people to use
> parse-maintainers.pl for good?

That cleanup did not happen yet, and I am not volunteering for one.
The difference between MAINTAINERS and MAINTAINERS.new is:

 MAINTAINERS | 5510 +--
 1 file changed, 2755 insertions(+), 2755 deletions(-)

Thanks

Michal


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 07:42:03AM -0700, Joe Perches wrote:
> On Fri, 2020-03-20 at 14:42 +0200, Andy Shevchenko wrote:
> > On Fri, Mar 20, 2020 at 12:23:38PM +0100, Michal Suchánek wrote:
> > > On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> > > > On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > > > > While at it also simplify the existing perf patterns.
> > > > And still missed fixes from parse-maintainers.pl.
> > > 
> > > Oh, that script UX is truly ingenious.
> > 
> > You have at least two options, their combinations, etc:
> >  - complain to the author :-)
> >  - send a patch :-)
> 
> Recently:
> 
> https://lore.kernel.org/lkml/4d5291fa3fb4962b1fa55e8fd9ef421ef0c1b1e5.ca...@perches.com/

Can we expect that reordering is taken care of in that discussion then?

Thanks

Michal


Re: [PATCH v12 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-20 Thread Michal Suchánek
On Fri, Mar 20, 2020 at 12:33:50PM +0200, Andy Shevchenko wrote:
> On Fri, Mar 20, 2020 at 11:20:19AM +0100, Michal Suchanek wrote:
> > While at it also simplify the existing perf patterns.
> > 
> 
> And still missed fixes from parse-maintainers.pl.

Oh, that script UX is truly ingenious. It provides no output and quietly
creates MAINTAINERS.new which is, of course, not included in the patch.

Thanks

Michal

> 
> I see it like below in the linux-next (after the script)
> 
> PERFORMANCE EVENTS SUBSYSTEM
> M:  Peter Zijlstra 
> M:  Ingo Molnar 
> M:  Arnaldo Carvalho de Melo 
> R:  Mark Rutland 
> R:  Alexander Shishkin 
> R:  Jiri Olsa 
> R:  Namhyung Kim 
> L:  linux-ker...@vger.kernel.org
> S:  Supported
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
> perf/core
> F:  arch/*/events/*
> F:  arch/*/events/*/*
> F:  arch/*/include/asm/perf_event.h
> F:  arch/*/kernel/*/*/perf_event*.c
> F:  arch/*/kernel/*/perf_event*.c
> F:  arch/*/kernel/perf_callchain.c
> F:  arch/*/kernel/perf_event*.c
> F:  include/linux/perf_event.h
> F:  include/uapi/linux/perf_event.h
> F:  kernel/events/*
> F:  tools/perf/
> 
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -13080,7 +13080,7 @@ R:  Namhyung Kim 
> >  L: linux-ker...@vger.kernel.org
> >  T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
> >  S: Supported
> > -F: kernel/events/*
> > +F: kernel/events/
> >  F: include/linux/perf_event.h
> >  F: include/uapi/linux/perf_event.h
> >  F: arch/*/kernel/perf_event*.c
> > @@ -13088,8 +13088,8 @@ F:  arch/*/kernel/*/perf_event*.c
> >  F: arch/*/kernel/*/*/perf_event*.c
> >  F: arch/*/include/asm/perf_event.h
> >  F: arch/*/kernel/perf_callchain.c
> > -F: arch/*/events/*
> > -F: arch/*/events/*/*
> > +F: arch/*/events/
> > +F: arch/*/perf/
> >  F: tools/perf/
> >  
> >  PERFORMANCE EVENTS SUBSYSTEM ARM64 PMU EVENTS
> 
> -- 
> With Best Regards,
> Andy Shevchenko
> 
> 


Re: [PATCH -v2] treewide: Rename "unencrypted" to "decrypted"

2020-03-19 Thread Michal Suchánek
On Thu, Mar 19, 2020 at 06:25:49PM +0100, Thomas Gleixner wrote:
> Borislav Petkov  writes:
> 
> > On Thu, Mar 19, 2020 at 11:06:15AM +, Robin Murphy wrote:
> >> Let me add another vote from a native English speaker that "unencrypted" is
> >> the appropriate term to imply the *absence* of encryption, whereas
> >> "decrypted" implies the *reversal* of applied encryption.
Even as a non-native speaker I can clearly see the distinction.
> >> 
> >> Naming things is famously hard, for good reason - names are *important* for
> >> understanding. Just because a decision was already made one way doesn't 
> >> mean
> >> that that decision was necessarily right. Churning one area to be
> >> consistently inaccurate just because it's less work than churning another
> >> area to be consistently accurate isn't really the best excuse.
> >
> > Well, the reason we chose "decrypted" vs something else is so to be as
> > different from "encrypted" as possible. If we called it "unencrypted"
> > you'd have stuff like:
> >
> >if (force_dma_unencrypted(dev))
> > set_memory_encrypted((unsigned long)cpu_addr, 1 << 
> > page_order);

If you want something with a high edit distance from 'encrypted' that
means the opposite, there is already 'cleartext', which was designed for
this exact purpose.

Thanks

Michal


Re: [PATCH v11 4/8] powerpc/perf: consolidate valid_user_sp

2020-03-19 Thread Michal Suchánek
On Thu, Mar 19, 2020 at 03:16:03PM +0100, Christophe Leroy wrote:
> 
> 
> Le 19/03/2020 à 14:35, Andy Shevchenko a écrit :
> > On Thu, Mar 19, 2020 at 1:54 PM Michal Suchanek  wrote:
> > > 
> > > Merge the 32bit and 64bit version.
> > > 
> > > Halve the check constants on 32bit.
> > > 
> > > Use STACK_TOP since it is defined.
> > > 
> > > Passing is_64 is now redundant since is_32bit_task() is used to
> > > determine which callchain variant should be used. Use STACK_TOP and
> > > is_32bit_task() directly.
> > > 
> > > This removes a page from the valid 32bit area on 64bit:
> > >   #define TASK_SIZE_USER32 (0x0001UL - (1 * PAGE_SIZE))
> > >   #define STACK_TOP_USER32 TASK_SIZE_USER32
> > 
> > ...
> > 
> > > +static inline int valid_user_sp(unsigned long sp)
> > > +{
> > > +   bool is_64 = !is_32bit_task();
> > > +
> > > +   if (!sp || (sp & (is_64 ? 7 : 3)) || sp > STACK_TOP - (is_64 ? 32 
> > > : 16))
> > > +   return 0;
> > > +   return 1;
> > > +}
> > 
> > Other possibility:
> 
> I prefer this one.
> 
> > 
> >unsigned long align = is_32bit_task() ? 3 : 7;
> 
> I would call it mask instead of align
> 
> >unsigned long top = STACK_TOP - (is_32bit_task() ? 16 : 32);
> > 
> >return !(!sp || (sp & align) || sp > top);
And we can avoid the inversion here, as well as in !valid_user_sp(sp), by
changing to invalid_user_sp.
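
A sketch of that inversion-free variant, combining Andy's version with
the "mask" naming (the bool return type is my assumption):

static inline bool invalid_user_sp(unsigned long sp)
{
	unsigned long mask = is_32bit_task() ? 3 : 7;
	unsigned long top = STACK_TOP - (is_32bit_task() ? 16 : 32);

	return !sp || (sp & mask) || sp > top;
}

Callers then test if (invalid_user_sp(sp)) directly, with no double
negation at either call site.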

Thanks

Michal


Re: [PATCH v11 4/8] powerpc/perf: consolidate valid_user_sp

2020-03-19 Thread Michal Suchánek
On Thu, Mar 19, 2020 at 03:35:03PM +0200, Andy Shevchenko wrote:
> On Thu, Mar 19, 2020 at 1:54 PM Michal Suchanek  wrote:
> >
> > Merge the 32bit and 64bit version.
> >
> > Halve the check constants on 32bit.
> >
> > Use STACK_TOP since it is defined.
> >
> > Passing is_64 is now redundant since is_32bit_task() is used to
> > determine which callchain variant should be used. Use STACK_TOP and
> > is_32bit_task() directly.
> >
> > This removes a page from the valid 32bit area on 64bit:
> >  #define TASK_SIZE_USER32 (0x0001UL - (1 * PAGE_SIZE))
> >  #define STACK_TOP_USER32 TASK_SIZE_USER32
> 
> ...
> 
> > +static inline int valid_user_sp(unsigned long sp)
> > +{
> > +   bool is_64 = !is_32bit_task();
> > +
> > +   if (!sp || (sp & (is_64 ? 7 : 3)) || sp > STACK_TOP - (is_64 ? 32 : 
> > 16))
> > +   return 0;
> > +   return 1;
> > +}
> 
> Perhaps better to read
> 
>   if (!sp)
> return 0;
> 
>   if (is_32bit_task()) {
> if (sp & 0x03)
>   return 0;
> if (sp > STACK_TOP - 16)
>   return 0;
>   } else {
> ...
>   }
> 
>   return 1;
> 
> Other possibility:
> 
>   unsigned long align = is_32bit_task() ? 3 : 7;
>   unsigned long top = STACK_TOP - (is_32bit_task() ? 16 : 32);
> 
>   return !(!sp || (sp & align) || sp > top);
Sounds reasonable.

Thanks

Michal
> 
> -- 
> With Best Regards,
> Andy Shevchenko


Re: [PATCH v11 0/8] Disable compat cruft on ppc64le v11

2020-03-19 Thread Michal Suchánek
On Thu, Mar 19, 2020 at 01:36:56PM +0100, Christophe Leroy wrote:
> You sent it twice ? Any difference between the two dispatch ?
Some headers were broken the first time around.

Thanks

Michal
> 
> Christophe
> 
> Le 19/03/2020 à 13:19, Michal Suchanek a écrit :
> > Less code means less bugs so add a knob to skip the compat stuff.
> > 
> > Changes in v2: saner CONFIG_COMPAT ifdefs
> > Changes in v3:
> >   - change llseek to 32bit instead of building it unconditionally in fs
> >   - cleanup the makefile conditionals
> >   - remove some ifdefs or convert to IS_DEFINED where possible
> > Changes in v4:
> >   - cleanup is_32bit_task and current_is_64bit
> >   - more makefile cleanup
> > Changes in v5:
> >   - more current_is_64bit cleanup
> >   - split off callchain.c 32bit and 64bit parts
> > Changes in v6:
> >   - cleanup makefile after split
> >   - consolidate read_user_stack_32
> >   - fix some checkpatch warnings
> > Changes in v7:
> >   - add back __ARCH_WANT_SYS_LLSEEK to fix build with llseek
> >   - remove leftover hunk
> >   - add review tags
> > Changes in v8:
> >   - consolidate valid_user_sp to fix it in the split callchain.c
> >   - fix build errors/warnings with PPC64 !COMPAT and PPC32
> > Changes in v9:
> >   - remove current_is_64bit()
> > Changes in v10:
> >   - rebase, sent together with the syscall cleanup
> > Changes in v11:
> >   - rebase
> >   - add MAINTAINERS pattern for ppc perf
> > 
> > Michal Suchanek (8):
> >powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
> >powerpc: move common register copy functions from signal_32.c to
> >  signal.c
> >powerpc/perf: consolidate read_user_stack_32
> >powerpc/perf: consolidate valid_user_sp
> >powerpc/64: make buildable without CONFIG_COMPAT
> >powerpc/64: Make COMPAT user-selectable disabled on littleendian by
> >  default.
> >powerpc/perf: split callchain.c by bitness
> >MAINTAINERS: perf: Add pattern that matches ppc perf to the perf
> >  entry.
> > 
> >   MAINTAINERS|   2 +
> >   arch/powerpc/Kconfig   |   5 +-
> >   arch/powerpc/include/asm/thread_info.h |   4 +-
> >   arch/powerpc/include/asm/unistd.h  |   1 +
> >   arch/powerpc/kernel/Makefile   |   6 +-
> >   arch/powerpc/kernel/entry_64.S |   2 +
> >   arch/powerpc/kernel/signal.c   | 144 +-
> >   arch/powerpc/kernel/signal_32.c| 140 --
> >   arch/powerpc/kernel/syscall_64.c   |   6 +-
> >   arch/powerpc/kernel/vdso.c |   3 +-
> >   arch/powerpc/perf/Makefile |   5 +-
> >   arch/powerpc/perf/callchain.c  | 356 +
> >   arch/powerpc/perf/callchain.h  |  20 ++
> >   arch/powerpc/perf/callchain_32.c   | 196 ++
> >   arch/powerpc/perf/callchain_64.c   | 174 
> >   fs/read_write.c|   3 +-
> >   16 files changed, 556 insertions(+), 511 deletions(-)
> >   create mode 100644 arch/powerpc/perf/callchain.h
> >   create mode 100644 arch/powerpc/perf/callchain_32.c
> >   create mode 100644 arch/powerpc/perf/callchain_64.c
> > 


Re: [PATCH v11 8/8] MAINTAINERS: perf: Add pattern that matches ppc perf to the perf entry.

2020-03-19 Thread Michal Suchánek
On Thu, Mar 19, 2020 at 03:37:03PM +0200, Andy Shevchenko wrote:
> On Thu, Mar 19, 2020 at 2:21 PM Michal Suchanek  wrote:
> >
> > Signed-off-by: Michal Suchanek 
> > ---
> > v10: new patch
> > ---
> >  MAINTAINERS | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index bc8dbe4fe4c9..329bf4a31412 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -13088,6 +13088,8 @@ F:  arch/*/kernel/*/perf_event*.c
> >  F: arch/*/kernel/*/*/perf_event*.c
> >  F: arch/*/include/asm/perf_event.h
> >  F: arch/*/kernel/perf_callchain.c
> > +F: arch/*/perf/*
> > +F: arch/*/perf/*/*
> >  F: arch/*/events/*
> >  F: arch/*/events/*/*
> >  F: tools/perf/
> 
> Had you run parse-maintainers.pl?
Did not know it existed. The output is:

scripts/parse-maintainers.pl 
Odd non-pattern line '
Documentation/devicetree/bindings/media/ti,cal.yaml
' for 'TI VPE/CAL DRIVERS' at scripts/parse-maintainers.pl line 147,
<$file> line 16756.

Thanks

Michal


Re: [PATCH v11 0/8] Disable compat cruft on ppc64le v11

2020-03-19 Thread Michal Suchánek
Lost some headers so will resend.

On Thu, Mar 19, 2020 at 12:52:20PM +0100, Michal Suchanek wrote:
> Less code means less bugs so add a knob to skip the compat stuff.
> 
> Changes in v2: saner CONFIG_COMPAT ifdefs
> Changes in v3:
>  - change llseek to 32bit instead of building it unconditionally in fs
>  - cleanup the makefile conditionals
>  - remove some ifdefs or convert to IS_DEFINED where possible
> Changes in v4:
>  - cleanup is_32bit_task and current_is_64bit
>  - more makefile cleanup
> Changes in v5:
>  - more current_is_64bit cleanup
>  - split off callchain.c 32bit and 64bit parts
> Changes in v6:
>  - cleanup makefile after split
>  - consolidate read_user_stack_32
>  - fix some checkpatch warnings
> Changes in v7:
>  - add back __ARCH_WANT_SYS_LLSEEK to fix build with llseek
>  - remove leftover hunk
>  - add review tags
> Changes in v8:
>  - consolidate valid_user_sp to fix it in the split callchain.c
>  - fix build errors/warnings with PPC64 !COMPAT and PPC32
> Changes in v9:
>  - remove current_is_64bit()
> Changes in v10:
>  - rebase, sent together with the syscall cleanup
> Changes in v11:
>  - rebase
>  - add MAINTAINERS pattern for ppc perf
> 
> Michal Suchanek (8):
>   powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro
>   powerpc: move common register copy functions from signal_32.c to
> signal.c
>   powerpc/perf: consolidate read_user_stack_32
>   powerpc/perf: consolidate valid_user_sp
>   powerpc/64: make buildable without CONFIG_COMPAT
>   powerpc/64: Make COMPAT user-selectable disabled on littleendian by
> default.
>   powerpc/perf: split callchain.c by bitness
>   MAINTAINERS: perf: Add pattern that matches ppc perf to the perf
> entry.
> 
>  MAINTAINERS|   2 +
>  arch/powerpc/Kconfig   |   5 +-
>  arch/powerpc/include/asm/thread_info.h |   4 +-
>  arch/powerpc/include/asm/unistd.h  |   1 +
>  arch/powerpc/kernel/Makefile   |   6 +-
>  arch/powerpc/kernel/entry_64.S |   2 +
>  arch/powerpc/kernel/signal.c   | 144 +-
>  arch/powerpc/kernel/signal_32.c| 140 --
>  arch/powerpc/kernel/syscall_64.c   |   6 +-
>  arch/powerpc/kernel/vdso.c |   3 +-
>  arch/powerpc/perf/Makefile |   5 +-
>  arch/powerpc/perf/callchain.c  | 356 +
>  arch/powerpc/perf/callchain.h  |  20 ++
>  arch/powerpc/perf/callchain_32.c   | 196 ++
>  arch/powerpc/perf/callchain_64.c   | 174 
>  fs/read_write.c|   3 +-
>  16 files changed, 556 insertions(+), 511 deletions(-)
>  create mode 100644 arch/powerpc/perf/callchain.h
>  create mode 100644 arch/powerpc/perf/callchain_32.c
>  create mode 100644 arch/powerpc/perf/callchain_64.c
> 
> -- 
> 2.23.0
> 


Re: [RFC PATCH v1] pseries/drmem: don't cache node id in drmem_lmb struct

2020-03-11 Thread Michal Suchánek
On Wed, Mar 11, 2020 at 06:08:15PM -0500, Scott Cheloha wrote:
> At memory hot-remove time we can retrieve an LMB's nid from its
> corresponding memory_block.  There is no need to store the nid
> in multiple locations.
> 
> Signed-off-by: Scott Cheloha 
> ---
> The linear search in powerpc's memory_add_physaddr_to_nid() has become a
> bottleneck at boot on systems with many LMBs.
> 
> As described in this patch here:
> 
> https://lore.kernel.org/linuxppc-dev/20200221172901.1596249-2-chel...@linux.ibm.com/
> 
> the linear search seriously cripples drmem_init().
> 
> The obvious solution (shown in that patch) is to just make the search
> in memory_add_physaddr_to_nid() faster.  An XArray seems well-suited
> to the task of mapping an address range to an LMB object.
> 
> The less obvious approach is to just call memory_add_physaddr_to_nid()
> in fewer places.
> 
> I'm not sure which approach is correct, hence the RFC.

You basically revert the commit below, which will likely cause the very
error that was fixed there:

commit b2d3b5ee66f2a04a918cc043cec0c9ed3de58f40
Author: Nathan Fontenot 
Date:   Tue Oct 2 10:35:59 2018 -0500

powerpc/pseries: Track LMB nid instead of using device tree

When removing memory we need to remove the memory from the node
it was added to instead of looking up the node it should be in
in the device tree.

During testing we have seen scenarios where the affinity for a
LMB changes due to a partition migration or PRRN event. In these
cases the node the LMB exists in may not match the node the device
tree indicates it belongs in. This can lead to a system crash
when trying to DLPAR remove the LMB after a migration or PRRN
event. The current code looks up the node in the device tree to
remove the LMB from, the crash occurs when we try to offline this
node and it does not have any data, i.e. node_data[nid] == NULL.

Thanks

Michal


Re: [PATCH rebased 1/2] powerpc: reserve memory for capture kernel after hugepages init

2020-03-05 Thread Michal Suchánek
Hello,

This seems to quite reliably cause a crash with a 1GB kdump reservation.

Thanks

Michal

On Tue, Feb 18, 2020 at 05:28:34PM +0100, Michal Suchanek wrote:
> From: Hari Bathini 
> 
> Sometimes, memory reservation for KDump/FADump can overlap with memory
> marked for hugepages. This overlap leads to error, hang in KDump case
> and copy error reported by f/w in case of FADump, while trying to
> capture dump. Report error while setting up memory for the capture
> kernel instead of running into issues while capturing dump, by moving
> KDump/FADump reservation below MMU early init and failing gracefully
> when hugepages memory overlaps with capture kernel memory.
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/kernel/prom.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 6620f37abe73..0f14dc9c4dab 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -735,14 +735,6 @@ void __init early_init_devtree(void *params)
>   if (PHYSICAL_START > MEMORY_START)
>   memblock_reserve(MEMORY_START, 0x8000);
>   reserve_kdump_trampoline();
> -#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
> - /*
> -  * If we fail to reserve memory for firmware-assisted dump then
> -  * fallback to kexec based kdump.
> -  */
> - if (fadump_reserve_mem() == 0)
> -#endif
> - reserve_crashkernel();
>   early_reserve_mem();
>  
>   /* Ensure that total memory size is page-aligned. */
> @@ -781,6 +773,14 @@ void __init early_init_devtree(void *params)
>  #endif
>  
>   mmu_early_init_devtree();
> +#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP)
> + /*
> +  * If we fail to reserve memory for firmware-assisted dump then
> +  * fallback to kexec based kdump.
> +  */
> + if (fadump_reserve_mem() == 0)
> +#endif
> + reserve_crashkernel();
>  
>  #ifdef CONFIG_PPC_POWERNV
>   /* Scan and build the list of machine check recoverable ranges */
> -- 
> 2.23.0
> 


Re: [PATCH 2/2] powerpc: avoid adjusting memory_limit for capture kernel memory reservation

2020-02-18 Thread Michal Suchánek
On Fri, Jun 28, 2019 at 12:51:19AM +0530, Hari Bathini wrote:
> Currently, if memory_limit is specified and it overlaps with memory to
> be reserved for capture kernel, memory_limit is adjusted to accommodate
> capture kernel. With memory reservation for capture kernel moved later
> (after enforcing memory limit), this adjustment no longer holds water.
> So, avoid adjusting memory_limit and error out instead.

The adjustment of memory limit does not look quite sound
 - There is no code to undo the adjustment in case the reservation fails
 - I don't think the reservation is still forced to the end of memory,
   causing the kernel to use memory it was supposed not to
 - The CMA reservation again causes the reserved memory to be used
 - Finally, the CMA reservation makes this obsolete because the reserved
   memory can be used by the system

> 
> Signed-off-by: Hari Bathini 
Reviewed-by: Michal Suchanek 
> ---
>  arch/powerpc/kernel/fadump.c|   16 
>  arch/powerpc/kernel/machine_kexec.c |   22 +++---
>  2 files changed, 11 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 4eab972..a784695 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -476,22 +476,6 @@ int __init fadump_reserve_mem(void)
>  #endif
>   }
>  
> - /*
> -  * Calculate the memory boundary.
> -  * If memory_limit is less than actual memory boundary then reserve
> -  * the memory for fadump beyond the memory_limit and adjust the
> -  * memory_limit accordingly, so that the running kernel can run with
> -  * specified memory_limit.
> -  */
> - if (memory_limit && memory_limit < memblock_end_of_DRAM()) {
> - size = get_fadump_area_size();
> - if ((memory_limit + size) < memblock_end_of_DRAM())
> - memory_limit += size;
> - else
> - memory_limit = memblock_end_of_DRAM();
> - printk(KERN_INFO "Adjusted memory_limit for firmware-assisted"
> - " dump, now %#016llx\n", memory_limit);
> - }
>   if (memory_limit)
>   memory_boundary = memory_limit;
>   else
> diff --git a/arch/powerpc/kernel/machine_kexec.c 
> b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..fc5533b 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -125,10 +125,8 @@ void __init reserve_crashkernel(void)
>   crashk_res.end = crash_base + crash_size - 1;
>   }
>  
> - if (crashk_res.end == crashk_res.start) {
> - crashk_res.start = crashk_res.end = 0;
> - return;
> - }
> + if (crashk_res.end == crashk_res.start)
> + goto error_out;
>  
>   /* We might have got these values via the command line or the
>* device tree, either way sanitise them now. */
> @@ -170,15 +168,13 @@ void __init reserve_crashkernel(void)
>   if (overlaps_crashkernel(__pa(_stext), _end - _stext)) {
>   printk(KERN_WARNING
>   "Crash kernel can not overlap current kernel\n");
> - crashk_res.start = crashk_res.end = 0;
> - return;
> + goto error_out;
>   }
>  
>   /* Crash kernel trumps memory limit */
>   if (memory_limit && memory_limit <= crashk_res.end) {
> - memory_limit = crashk_res.end + 1;
> - printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
> -memory_limit);
> + pr_err("Crash kernel size can't exceed memory_limit\n");
> + goto error_out;
>   }
>  
>   printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
> @@ -190,9 +186,13 @@ void __init reserve_crashkernel(void)
>   if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
>   memblock_reserve(crashk_res.start, crash_size)) {
>   pr_err("Failed to reserve memory for crashkernel!\n");
> - crashk_res.start = crashk_res.end = 0;
> - return;
> + goto error_out;
>   }
> +
> + return;
> +error_out:
> + crashk_res.start = crashk_res.end = 0;
> + return;
>  }
>  
>  int overlaps_crashkernel(unsigned long start, unsigned long size)
> 


Re: [PATCH] powerpc/process: Remove unneccessary #ifdef CONFIG_PPC64 in copy_thread_tls()

2020-01-29 Thread Michal Suchánek
On Wed, Jan 29, 2020 at 07:50:07PM +, Christophe Leroy wrote:
> is_32bit_task() exists on both PPC64 and PPC32, no need of an ifdefery.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/process.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index fad50db9dcf2..e730b8e522b0 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1634,11 +1634,9 @@ int copy_thread_tls(unsigned long clone_flags, 
> unsigned long usp,
>   p->thread.regs = childregs;
>   childregs->gpr[3] = 0;  /* Result from fork() */
>   if (clone_flags & CLONE_SETTLS) {
> -#ifdef CONFIG_PPC64
>   if (!is_32bit_task())
>   childregs->gpr[13] = tls;
>   else
> -#endif
>   childregs->gpr[2] = tls;
>   }
>  

Reviewed-by: Michal Suchanek 

Thanks

Michal


Re: [PATCH] powerpc/64: system call implement the bulk of the logic in C fix

2020-01-28 Thread Michal Suchánek
On Tue, Jan 28, 2020 at 10:41:02AM +1000, Nicholas Piggin wrote:
> Michal Suchánek's on January 28, 2020 4:08 am:
> > On Tue, Jan 28, 2020 at 12:17:12AM +1000, Nicholas Piggin wrote:
> >> This incremental patch fixes several soft-mask debug and unsafe
> >> smp_processor_id messages due to tracing and false positives in
> >> "unreconciled" code.
> >> 
> >> It also fixes a bug with syscall tracing functions that set registers
> >> (e.g., PTRACE_SETREG) not setting GPRs properly.
> >> 
> >> There was a bug reported with the TM selftests, I haven't been able
> >> to reproduce that one.
> >> 
> >> I can squash this into the main patch and resend the series if it
> >> helps but the incremental helps to see the bug fixes.
> > 
> > There are some whitespace differences between this and the series I have
> > applied locally. What does it apply to?
> > 
> > Is there some revision of the patchset I missed?
> 
> No I may have just missed some of your whitespace cleanups, or maybe I got
> some that Michael made which you don't have in his next-test branch.

Looks like the latter. I will pick patches from next-test.

Thanks

Michal


Re: [PATCH] powerpc/64: system call implement the bulk of the logic in C fix

2020-01-27 Thread Michal Suchánek
On Tue, Jan 28, 2020 at 12:17:12AM +1000, Nicholas Piggin wrote:
> This incremental patch fixes several soft-mask debug and unsafe
> smp_processor_id messages due to tracing and false positives in
> "unreconciled" code.
> 
> It also fixes a bug with syscall tracing functions that set registers
> (e.g., PTRACE_SETREG) not setting GPRs properly.
> 
> There was a bug reported with the TM selftests, I haven't been able
> to reproduce that one.
> 
> I can squash this into the main patch and resend the series if it
> helps but the incremental helps to see the bug fixes.

There are some whitespace differences between this and the series I have
applied locally. What does it apply to?

Is there some revision of the patchset I missed?

Thanks

Michal
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/cputime.h | 39 +-
>  arch/powerpc/kernel/syscall_64.c   | 26 ++--
>  2 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cputime.h 
> b/arch/powerpc/include/asm/cputime.h
> index c43614cffaac..6639a6847cc0 100644
> --- a/arch/powerpc/include/asm/cputime.h
> +++ b/arch/powerpc/include/asm/cputime.h
> @@ -44,6 +44,28 @@ static inline unsigned long cputime_to_usecs(const 
> cputime_t ct)
>  #ifdef CONFIG_PPC64
>  #define get_accounting(tsk)  (&get_paca()->accounting)
>  static inline void arch_vtime_task_switch(struct task_struct *tsk) { }
> +
> +/*
> + * account_cpu_user_entry/exit runs "unreconciled", so can't trace,
> + * can't use use get_paca()
> + */
> +static notrace inline void account_cpu_user_entry(void)
> +{
> + unsigned long tb = mftb();
> + struct cpu_accounting_data *acct = &local_paca->accounting;
> +
> + acct->utime += (tb - acct->starttime_user);
> + acct->starttime = tb;
> +}
> +static notrace inline void account_cpu_user_exit(void)
> +{
> + unsigned long tb = mftb();
> + struct cpu_accounting_data *acct = &local_paca->accounting;
> +
> + acct->stime += (tb - acct->starttime);
> + acct->starttime_user = tb;
> +}
> +
>  #else
>  #define get_accounting(tsk)  (&task_thread_info(tsk)->accounting)
>  /*
> @@ -60,23 +82,6 @@ static inline void arch_vtime_task_switch(struct 
> task_struct *prev)
>  }
>  #endif
>  
> -static inline void account_cpu_user_entry(void)
> -{
> - unsigned long tb = mftb();
> - struct cpu_accounting_data *acct = get_accounting(current);
> -
> - acct->utime += (tb - acct->starttime_user);
> - acct->starttime = tb;
> -}
> -static inline void account_cpu_user_exit(void)
> -{
> - unsigned long tb = mftb();
> - struct cpu_accounting_data *acct = get_accounting(current);
> -
> - acct->stime += (tb - acct->starttime);
> - acct->starttime_user = tb;
> -}
> -
>  #endif /* __KERNEL__ */
>  #else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>  static inline void account_cpu_user_entry(void)
> diff --git a/arch/powerpc/kernel/syscall_64.c 
> b/arch/powerpc/kernel/syscall_64.c
> index 529393a1ff1e..cfe458adde07 100644
> --- a/arch/powerpc/kernel/syscall_64.c
> +++ b/arch/powerpc/kernel/syscall_64.c
> @@ -19,7 +19,8 @@ extern void __noreturn tabort_syscall(void);
>  
>  typedef long (*syscall_fn)(long, long, long, long, long, long);
>  
> -long system_call_exception(long r3, long r4, long r5, long r6, long r7, long 
> r8,
> +/* Has to run notrace because it is entered "unreconciled" */
> +notrace long system_call_exception(long r3, long r4, long r5, long r6, long 
> r7, long r8,
>  unsigned long r0, struct pt_regs *regs)
>  {
>   unsigned long ti_flags;
> @@ -36,7 +37,7 @@ long system_call_exception(long r3, long r4, long r5, long 
> r6, long r7, long r8,
>  #ifdef CONFIG_PPC_SPLPAR
>   if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) &&
>   firmware_has_feature(FW_FEATURE_SPLPAR)) {
> - struct lppaca *lp = get_lppaca();
> + struct lppaca *lp = local_paca->lppaca_ptr;
>  
>   if (unlikely(local_paca->dtl_ridx != be64_to_cpu(lp->dtl_idx)))
>   accumulate_stolen_time();
> @@ -71,13 +72,22 @@ long system_call_exception(long r3, long r4, long r5, 
> long r6, long r7, long r8,
>* We use the return value of do_syscall_trace_enter() as the
>* syscall number. If the syscall was rejected for any reason
>* do_syscall_trace_enter() returns an invalid syscall number
> -  * and the test below against NR_syscalls will fail.
> +  * and the test against NR_syscalls will fail and the return
> +  * value to be used is in regs->gpr[3].
>*/
>   r0 = do_syscall_trace_enter(regs);
> - }
> -
> - if (unlikely(r0 >= NR_syscalls))
> + if (unlikely(r0 >= NR_syscalls))
> + return regs->gpr[3];
> + r3 = regs->gpr[3];
> + r4 = regs->gpr[4];
> + r5 = regs->gpr[5];
> + r6 = regs

Re: [PATCH] powerpc: drmem: avoid NULL pointer dereference when drmem is unavailable

2020-01-23 Thread Michal Suchánek
On Thu, Jan 23, 2020 at 09:56:10AM -0600, Nathan Lynch wrote:
> Hello and thanks for the patch.
> 
> Libor Pechacek  writes:
> > In KVM guests drmem structure is only zero initialized. Trying to
> > manipulate DLPAR parameters results in a crash in this environment.
> 
> I think this statement needs qualification. Unless I'm mistaken, this
> happens only when you boot a guest without any hotpluggable memory
> configured, and then try to add or remove memory.
> 
> 
> > diff --git a/arch/powerpc/include/asm/drmem.h 
> > b/arch/powerpc/include/asm/drmem.h
> > index 3d76e1c388c2..28c3d936fdf3 100644
> > --- a/arch/powerpc/include/asm/drmem.h
> > +++ b/arch/powerpc/include/asm/drmem.h
> > @@ -27,12 +27,12 @@ struct drmem_lmb_info {
> >  extern struct drmem_lmb_info *drmem_info;
> >  
> >  #define for_each_drmem_lmb_in_range(lmb, start, end)   \
> > -   for ((lmb) = (start); (lmb) <= (end); (lmb)++)
> > +   for ((lmb) = (start); (lmb) < (end); (lmb)++)
> >  
> >  #define for_each_drmem_lmb(lmb)\
> > for_each_drmem_lmb_in_range((lmb),  \
> > &drmem_info->lmbs[0],   \
> > -   &drmem_info->lmbs[drmem_info->n_lmbs - 1])
> > +   &drmem_info->lmbs[drmem_info->n_lmbs])
> >  
> >  /*
> >   * The of_drconf_cell_v1 struct defines the layout of the LMB data
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index c126b94d1943..4ea6af002e27 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -236,9 +236,9 @@ static int get_lmb_range(u32 drc_index, int n_lmbs,
> > if (!start)
> > return -EINVAL;
> >  
> > -   end = &start[n_lmbs - 1];
> > +   end = &start[n_lmbs];
> >  
> > -   last_lmb = &drmem_info->lmbs[drmem_info->n_lmbs - 1];
> > +   last_lmb = &drmem_info->lmbs[drmem_info->n_lmbs];
> > if (end > last_lmb)
> > return -EINVAL;
> 
> Is this not undefined behavior? I'd rather do this in a way that does
> not involve forming out-of-bounds pointers. Even if it's safe, naming
> that pointer "last_lmb" now actively hinders understanding of the code;
> it should be named "limit" or something.

Indeed, the name might be misleading now.

However, the loop differs from anything else we have in the kernel.

The standard explicitly allows the pointer to point just after the last
element to allow expressing the iteration limit without danger of
overflow.
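
For illustration, the idiom the fixed macros rely on, using the "limit"
naming suggested above (process() is a made-up stand-in for the loop
body):

	struct drmem_lmb *lmb;
	struct drmem_lmb *limit = &drmem_info->lmbs[drmem_info->n_lmbs];

	/*
	 * "limit" may legally point one past the last element and is
	 * never dereferenced; with a zero-initialized drmem_info,
	 * lmb == limit from the start and the body never runs.
	 */
	for (lmb = &drmem_info->lmbs[0]; lmb < limit; lmb++)
		process(lmb);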

Thanks

Michal


Re: [PATCH] powerpc: drmem: avoid NULL pointer dereference when drmem is unavailable

2020-01-22 Thread Michal Suchánek
On Thu, Jan 16, 2020 at 11:27:58AM +0100, Libor Pechacek wrote:
> In KVM guests drmem structure is only zero initialized. Trying to
> manipulate DLPAR parameters results in a crash in this environment.
> 
> $ echo "memory add count 1" > /sys/kernel/dlpar
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: af_packet(E) rfkill(E) nvram(E) vmx_crypto(E)
> gf128mul(E) e1000(E) virtio_balloon(E) rtc_generic(E) crct10dif_vpmsum(E)
> btrfs(E) blake2b_generic(E) libcrc32c(E) xor(E) raid6_pq(E) virtio_rng(E)
> virtio_blk(E) ohci_pci(E) ehci_pci(E) ohci_hcd(E) ehci_hcd(E)
> crc32c_vpmsum(E) usbcore(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E)
> dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E)
> scsi_mod(E)
> CPU: 1 PID: 4114 Comm: bash Kdump: loaded Tainted: GE 
> 5.5.0-rc6-2-default #1
> NIP:  c00ff294 LR: c00ff248 CTR: 
> REGS: c000fb9d3880 TRAP: 0300   Tainted: GE  
> (5.5.0-rc6-2-default)
> MSR:  80009033   CR: 28242428  XER: 2000
> CFAR: c09a6c10 DAR: 0010 DSISR: 4000 IRQMASK: 0
> GPR00: c00ff248 c000fb9d3b10 c1682e00 0033
> GPR04: c000ff30bf90 c000ff394800 5110 ffe8
> GPR08:   fe1c 
> GPR12: 2200 c0003fffee00  00011cbc37c0
> GPR16: 00011cb27ed0  00011cb6dd10 
> GPR20: 00011cb7db28 01003ce035f0 00011cbc7828 00011cbc6c70
> GPR24: 01003cf01210  c000ffade4e0 c2d7216b
> GPR28: 0001 c2d78560  c15458d0
> NIP [c00ff294] dlpar_memory+0x6e4/0xd00
> LR [c00ff248] dlpar_memory+0x698/0xd00
> Call Trace:
> [c000fb9d3b10] [c00ff248] dlpar_memory+0x698/0xd00 (unreliable)
> [c000fb9d3ba0] [c00f5990] handle_dlpar_errorlog+0xc0/0x190
> [c000fb9d3c10] [c00f5c58] dlpar_store+0x198/0x4a0
> [c000fb9d3cd0] [c0c4cb00] kobj_attr_store+0x30/0x50
> [c000fb9d3cf0] [c05a37b4] sysfs_kf_write+0x64/0x90
> [c000fb9d3d10] [c05a2c90] kernfs_fop_write+0x1b0/0x290
> [c000fb9d3d60] [c04a2bec] __vfs_write+0x3c/0x70
> [c000fb9d3d80] [c04a6560] vfs_write+0xd0/0x260
> [c000fb9d3dd0] [c04a69ac] ksys_write+0xdc/0x130
> [c000fb9d3e20] [c000b478] system_call+0x5c/0x68
> Instruction dump:
> ebc9 1ce70018 38e7ffe8 7cfe3a14 7fbe3840 419dff14 fb610068 7fc9f378
> 3900 480c 6000 4195fef4 <81490010> 39290018 38c80001 7ea93840
> ---[ end trace cc2dd8152608c295 ]---
> 
> Taking closer look at the code, I can see that for_each_drmem_lmb is a
> macro expanding into `for (lmb = &drmem_info->lmbs[0]; lmb <=
> &drmem_info->lmbs[drmem_info->n_lmbs - 1]; lmb++)`. When drmem_info->lmbs
> is NULL, the loop would iterate through the whole address range if it
> weren't stopped by the NULL pointer dereference on the next line.
> 
> This patch aligns for_each_drmem_lmb and for_each_drmem_lmb_in_range macro
> behavior with the common C semantics, where the end marker does not belong
> to the scanned range, and alters get_lmb_range() semantics. As a side
> effect, the wraparound observed in the crash is prevented.
> 
> Fixes: 6c6ea53725b3 ("powerpc/mm: Separate ibm, dynamic-memory data from DT 
> format")
> Cc: Michal Suchanek 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Libor Pechacek 

Reviewed-by: Michal Suchanek 
> ---
>  arch/powerpc/include/asm/drmem.h| 4 ++--
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/drmem.h 
> b/arch/powerpc/include/asm/drmem.h
> index 3d76e1c388c2..28c3d936fdf3 100644
> --- a/arch/powerpc/include/asm/drmem.h
> +++ b/arch/powerpc/include/asm/drmem.h
> @@ -27,12 +27,12 @@ struct drmem_lmb_info {
>  extern struct drmem_lmb_info *drmem_info;
>  
>  #define for_each_drmem_lmb_in_range(lmb, start, end) \
> - for ((lmb) = (start); (lmb) <= (end); (lmb)++)
> + for ((lmb) = (start); (lmb) < (end); (lmb)++)
>  
>  #define for_each_drmem_lmb(lmb)  \
>   for_each_drmem_lmb_in_range((lmb),  \
>   &drmem_info->lmbs[0],   \
> - &drmem_info->lmbs[drmem_info->n_lmbs - 1])
> + &drmem_info->lmbs[drmem_info->n_lmbs])
>  
>  /*
>   * The of_drconf_cell_v1 struct defines the layout of the LMB data
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index c126b94d1943..4ea6af002e27 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -236,9 +2
