Re: [PATCH v2 08/12] powerpc/pseries: limit machine check stack to 4GB

2020-03-26 Thread Mahesh J Salgaonkar
On 2020-03-25 20:34:06 Wed, Nicholas Piggin wrote:
> This allows rtas_args to be put on the machine check stack, which
> avoids a lot of complications with re-entrancy deadlocks.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/setup_64.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 3bf03666ee09..ca1041f8a578 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -695,6 +695,9 @@ void __init exc_lvl_early_init(void)
>  void __init emergency_stack_init(void)
>  {
>   u64 limit;
> +#ifdef CONFIG_PPC_BOOK3S_64
> + u64 mce_limit;
> +#endif
>   unsigned int i;
> 
>   /*
> @@ -713,6 +716,16 @@ void __init emergency_stack_init(void)
>*/
>   limit = min(ppc64_bolted_size(), ppc64_rma_size);
> 
> + /*
> +  * Machine check on pseries calls rtas, but can't use the static
> +  * rtas_args due to a machine check hitting while the lock is held.
> +  * rtas args have to be under 4GB, so the machine check stack is
> +  * limited to 4GB so args can be put on stack.
> +  */
> + mce_limit = limit;
> + if (firmware_has_feature(FW_FEATURE_LPAR) && mce_limit > 4UL*1024*1024*1024)
> + mce_limit = 4UL*1024*1024*1024;
> +

Don't you need this under the CONFIG_PPC_BOOK3S_64 #ifdef as well?

Rest looks good.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.



Re: [PATCH] x86: Alias memset to __builtin_memset.

2020-03-26 Thread Michael Ellerman
Clement Courbet  writes:
> I discussed with the original authors who added freestanding to our
> build. It turns out that it was added globally just to work around
> powerpc not compiling under clang, but they felt the fix was
> appropriate globally.
>
> Now Nick has dug up https://lkml.org/lkml/2019/8/29/1300, which
> advises against freestanding. Also, I've did some research and
> discovered that the original reason for using freestanding for
> powerpc has been fixed here:
> https://lore.kernel.org/linuxppc-dev/20191119045712.39633-3-natechancel...@gmail.com/
>
> I'm going to remove -ffreestanding from downstream, so we don't really need
> this anymore. Sorry for wasting people's time.
>
> I wonder if the freestanding fix from the aforementioned patch is really 
> needed
> though. I think that clang is actually right to point out the issue.
> I don't see any reason why setjmp()/longjmp() are declared as taking longs
> rather than ints. The implementation looks like it only ever propagates the
> value (in longjmp) or sets it to 1 (in setjmp), and we only ever call longjmp
> with integer parameters. But I'm not a PowerPC expert, so I might
> be misreading the code.
>
>
> So it seems that we could just remove freestanding altogether and rewrite the
> code to:
>
> diff --git a/arch/powerpc/include/asm/setjmp.h 
> b/arch/powerpc/include/asm/setjmp.h
> index 279d03a1eec6..7941ae68fe21 100644
> --- a/arch/powerpc/include/asm/setjmp.h
> +++ b/arch/powerpc/include/asm/setjmp.h
> @@ -12,7 +12,9 @@
>
>  #define JMP_BUF_LEN	23
> -extern long setjmp(long *);
> -extern void longjmp(long *, long);
> +typedef long * jmp_buf;
> +
> +extern int setjmp(jmp_buf);
> +extern void longjmp(jmp_buf, int);
>
> I'm happy to send a patch for this, and get rid of more -ffreestanding.
> Opinions ?

If it works then it looks like a much better fix than using -ffreestanding.

Please submit a patch with a change log etc. and I'd be happy to merge
it.

cheers


Re: [PATCH v2 05/12] powerpc/pseries/ras: FWNMI_VALID off by one

2020-03-26 Thread Mahesh J Salgaonkar
On 2020-03-25 20:34:03 Wed, Nicholas Piggin wrote:
> This was discovered while developing qemu fwnmi sreset support. This
> off-by-one bug means the last 16 bytes of the rtas area cannot
> be used for a 16-byte save area.
> 
> It's not a serious bug, and the QEMU implementation has to retain a
> workaround for old kernels, but it's good to tighten it.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/pseries/ras.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c 
> b/arch/powerpc/platforms/pseries/ras.c
> index c74d5e740922..9a37bda47468 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -395,10 +395,11 @@ static irqreturn_t ras_error_interrupt(int irq, void 
> *dev_id)
>  /*
>   * Some versions of FWNMI place the buffer inside the 4kB page starting at
>   * 0x7000. Other versions place it inside the rtas buffer. We check both.
> + * Minimum size of the buffer is 16 bytes.

Acked-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

>   */
>  #define VALID_FWNMI_BUFFER(A) \
> - ((((A) >= 0x7000) && ((A) < 0x7ff0)) || \
> - (((A) >= rtas.base) && ((A) < (rtas.base + rtas.size - 16))))
> + ((((A) >= 0x7000) && ((A) <= 0x8000 - 16)) || \
> + (((A) >= rtas.base) && ((A) <= (rtas.base + rtas.size - 16))))
> 
>  static inline struct rtas_error_log *fwnmi_get_errlog(void)
>  {
> -- 
> 2.23.0
> 

-- 
Mahesh J Salgaonkar



Re: [PATCH v2 04/12] powerpc/pseries/ras: avoid calling rtas_token in NMI paths

2020-03-26 Thread Mahesh J Salgaonkar
On 2020-03-25 20:34:02 Wed, Nicholas Piggin wrote:
> In the interest of reducing code and possible failures in the
> machine check and system reset paths, grab the "ibm,nmi-interlock"
> token at init time.
> 
> Signed-off-by: Nicholas Piggin 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.
> ---
>  arch/powerpc/include/asm/firmware.h|  1 +
>  arch/powerpc/platforms/pseries/ras.c   |  2 +-
>  arch/powerpc/platforms/pseries/setup.c | 13 ++---
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/firmware.h 
> b/arch/powerpc/include/asm/firmware.h
> index ca33f4ef6cb4..6003c2e533a0 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -128,6 +128,7 @@ extern void machine_check_fwnmi(void);
> 
>  /* This is true if we are using the firmware NMI handler (typically LPAR) */
>  extern int fwnmi_active;
> +extern int ibm_nmi_interlock_token;
> 
>  extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup;
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c 
> b/arch/powerpc/platforms/pseries/ras.c
> index 1d7f973c647b..c74d5e740922 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -458,7 +458,7 @@ static struct rtas_error_log *fwnmi_get_errinfo(struct 
> pt_regs *regs)
>   */
>  static void fwnmi_release_errinfo(void)
>  {
> - int ret = rtas_call(rtas_token("ibm,nmi-interlock"), 0, 1, NULL);
> + int ret = rtas_call(ibm_nmi_interlock_token, 0, 1, NULL);
>   if (ret != 0)
>   printk(KERN_ERR "FWNMI: nmi-interlock failed: %d\n", ret);
>  }
> diff --git a/arch/powerpc/platforms/pseries/setup.c 
> b/arch/powerpc/platforms/pseries/setup.c
> index 17d17f064a2d..c31acd7ce0c0 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -83,6 +83,7 @@ unsigned long CMO_PageSize = (ASM_CONST(1) << 
> IOMMU_PAGE_SHIFT_4K);
>  EXPORT_SYMBOL(CMO_PageSize);
> 
>  int fwnmi_active;  /* TRUE if an FWNMI handler is present */
> +int ibm_nmi_interlock_token;
> 
>  static void pSeries_show_cpuinfo(struct seq_file *m)
>  {
> @@ -113,9 +114,14 @@ static void __init fwnmi_init(void)
>   struct slb_entry *slb_ptr;
>   size_t size;
>  #endif
> + int ibm_nmi_register_token;
> 
> - int ibm_nmi_register = rtas_token("ibm,nmi-register");
> - if (ibm_nmi_register == RTAS_UNKNOWN_SERVICE)
> + ibm_nmi_register_token = rtas_token("ibm,nmi-register");
> + if (ibm_nmi_register_token == RTAS_UNKNOWN_SERVICE)
> + return;
> +
> + ibm_nmi_interlock_token = rtas_token("ibm,nmi-interlock");
> + if (WARN_ON(ibm_nmi_interlock_token == RTAS_UNKNOWN_SERVICE))
>   return;
> 
>   /* If the kernel's not linked at zero we point the firmware at low
> @@ -123,7 +129,8 @@ static void __init fwnmi_init(void)
>   system_reset_addr  = __pa(system_reset_fwnmi) - PHYSICAL_START;
>   machine_check_addr = __pa(machine_check_fwnmi) - PHYSICAL_START;
> 
> - if (0 == rtas_call(ibm_nmi_register, 2, 1, NULL, system_reset_addr,
> + if (0 == rtas_call(ibm_nmi_register_token, 2, 1, NULL,
> + system_reset_addr,
>   machine_check_addr))
>   fwnmi_active = 1;
> 
> -- 
> 2.23.0
> 

-- 
Mahesh J Salgaonkar



Re: [PATCH v2 1/1] ppc/crash: Skip spinlocks during crash

2020-03-26 Thread Michael Ellerman
Hi Leonardo,

Leonardo Bras  writes:
> During a crash, there is a chance that the cpus that handle the NMI IPI
> are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> will cause a deadlock. (rtas_lock and printk's logbuf_lock as of today)

Please give us more detail on how those locks are causing you trouble, a
stack trace would be good if you have it.

> This is a problem if the system has kdump set up: if it crashes for any
> reason, the kdump may not be saved for crash analysis.
>
> Skip spinlocks after the NMI IPI is sent to all other cpus.

We don't want to add overhead to all spinlocks for the life of the
system, just to handle this one case.

There's already a flag that is set when the system is crashing,
"oops_in_progress", maybe we need to use that somewhere to skip a lock
or do an early return.
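Something like the following, as a rough sketch only (it reuses the loop
from your patch, with the existing oops_in_progress flag from printk.h
standing in for a new one):

	static inline void arch_spin_lock(arch_spinlock_t *lock)
	{
		while (1) {
			if (likely(__arch_spin_trylock(lock) == 0))
				break;
			do {
				/* Give up spinning if an oops/crash is in flight. */
				if (unlikely(oops_in_progress))
					return;
				HMT_low();
				if (is_shared_processor())
					splpar_spin_yield(lock);
			} while (unlikely(lock->slock != 0));
			HMT_medium();
		}
	}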

cheers

> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 860228e917dc..a6381d110795 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,6 +111,8 @@ static inline void splpar_spin_yield(arch_spinlock_t 
> *lock) {};
>  static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
>  #endif
>  
> +extern bool crash_skip_spinlock __read_mostly;
> +
>  static inline bool is_shared_processor(void)
>  {
>  #ifdef CONFIG_PPC_SPLPAR
> @@ -142,6 +144,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>   if (likely(__arch_spin_trylock(lock) == 0))
>   break;
>   do {
> + if (unlikely(crash_skip_spinlock))
> + return;
>   HMT_low();
>   if (is_shared_processor())
>   splpar_spin_yield(lock);
> @@ -161,6 +165,8 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
> long flags)
>   local_save_flags(flags_dis);
>   local_irq_restore(flags);
>   do {
> + if (unlikely(crash_skip_spinlock))
> + return;
>   HMT_low();
>   if (is_shared_processor())
>   splpar_spin_yield(lock);
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..ae081f0f2472 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -66,6 +66,9 @@ static int handle_fault(struct pt_regs *regs)
>  
>  #ifdef CONFIG_SMP
>  
> +bool crash_skip_spinlock;
> +EXPORT_SYMBOL(crash_skip_spinlock);
> +
>  static atomic_t cpus_in_crash;
>  void crash_ipi_callback(struct pt_regs *regs)
>  {
> @@ -129,6 +132,7 @@ static void crash_kexec_prepare_cpus(int cpu)
>   /* Would it be better to replace the trap vector here? */
>  
>   if (atomic_read(&cpus_in_crash) >= ncpus) {
> + crash_skip_spinlock = true;
>   printk(KERN_EMERG "IPI complete\n");
>   return;
>   }
> -- 
> 2.24.1


Re: [PATCH v2 01/12] powerpc/64s/exceptions: Fix in_mce accounting in unrecoverable path

2020-03-26 Thread Mahesh J Salgaonkar
On 2020-03-25 20:33:59 Wed, Nicholas Piggin wrote:
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 6a936c9199d6..67cbcb2d0c7f 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1335,6 +1335,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
>   andcr10,r10,r3
>   mtmsrd  r10
> 
> + lhz r12,PACA_IN_MCE(r13)
> + subir12,r12,1
> + sth r12,PACA_IN_MCE(r13)
> +

Acked-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.



[powerpc:merge] BUILD SUCCESS c6624071c338732402e8c726df6a4074473eaa0e

2020-03-26 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: c6624071c338732402e8c726df6a4074473eaa0e  Automatic merge of branches 'master', 'next' and 'fixes' into merge

elapsed time: 660m

configs tested: 152
configs skipped: 0

The following configs have been built successfully.
More configs may be tested in the coming days.

arm  allmodconfig
arm   allnoconfig
arm  allyesconfig
arm64allmodconfig
arm64 allnoconfig
arm64allyesconfig
arm at91_dt_defconfig
arm   efm32_defconfig
arm  exynos_defconfig
armmulti_v5_defconfig
armmulti_v7_defconfig
armshmobile_defconfig
arm   sunxi_defconfig
arm64   defconfig
sparcallyesconfig
um   x86_64_defconfig
xtensa  iss_defconfig
i386  allnoconfig
i386 allyesconfig
i386 alldefconfig
i386defconfig
ia64 allmodconfig
ia64defconfig
ia64  allnoconfig
ia64 allyesconfig
ia64 alldefconfig
c6x  allyesconfig
c6xevmc6678_defconfig
nios2 10m50_defconfig
nios2 3c120_defconfig
openriscor1ksim_defconfig
openrisc simple_smp_defconfig
xtensa   common_defconfig
alpha   defconfig
cskydefconfig
nds32 allnoconfig
nds32   defconfig
h8300 edosk2674_defconfig
h8300h8300h-sim_defconfig
h8300   h8s-sim_defconfig
m68k allmodconfig
m68k   m5475evb_defconfig
m68k  multi_defconfig
m68k   sun3_defconfig
arc defconfig
arc  allyesconfig
powerpc defconfig
powerpc   ppc64_defconfig
powerpc  rhel-kconfig
microblaze  mmu_defconfig
microblazenommu_defconfig
powerpc   allnoconfig
mips   32r2_defconfig
mips 64r6el_defconfig
mips allmodconfig
mips  allnoconfig
mips allyesconfig
mips  fuloong2e_defconfig
mips  malta_kvm_defconfig
pariscallnoconfig
pariscgeneric-64bit_defconfig
pariscgeneric-32bit_defconfig
parisc   allyesconfig
i386 randconfig-a002-20200326
i386 randconfig-a001-20200326
x86_64   randconfig-a002-20200326
x86_64   randconfig-a001-20200326
i386 randconfig-a003-20200326
x86_64   randconfig-a003-20200326
mips randconfig-a001-20200326
nds32randconfig-a001-20200326
m68k randconfig-a001-20200326
parisc   randconfig-a001-20200326
alpharandconfig-a001-20200326
riscvrandconfig-a001-20200326
c6x  randconfig-a001-20200326
h8300randconfig-a001-20200326
microblaze   randconfig-a001-20200326
nios2randconfig-a001-20200326
sparc64  randconfig-a001-20200326
s390 randconfig-a001-20200326
csky randconfig-a001-20200326
xtensa   randconfig-a001-20200326
openrisc randconfig-a001-20200326
sh   randconfig-a001-20200326
x86_64   randconfig-c001-20200326
x86_64   randconfig-c002-20200326
x86_64   randconfig-c003-20200326
i386 randconfig-c001-20200326
i386 randconfig-c002-20200326
i386 randconfig-c003-20200326
x86_64   randconfig-e001-20200326
x86_64   randconfig-e002-20200326
x86_64   randconfig-e003-20200326
i386 randconfig-e001-20200326
i386 randconfig-e002-20200326
i386 randconfig-e003-20200326
x86_64   randconfig-f001-20200326
x86_64   randconfig-f002-20200326
x86_64   randconfig-f003-20200326
i386

[powerpc:next] BUILD SUCCESS 7074695ac6fb965d478f373b95bc5c636e9f21b0

2020-03-26 Thread kbuild test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: 7074695ac6fb965d478f373b95bc5c636e9f21b0  powerpc/prom_init: Remove leftover comment

elapsed time: 483m

configs tested: 152
configs skipped: 0

The following configs have been built successfully.
More configs may be tested in the coming days.

arm  allmodconfig
arm   allnoconfig
arm  allyesconfig
arm64allmodconfig
arm64 allnoconfig
arm64allyesconfig
arm at91_dt_defconfig
arm   efm32_defconfig
arm  exynos_defconfig
armmulti_v5_defconfig
armmulti_v7_defconfig
armshmobile_defconfig
arm   sunxi_defconfig
arm64   defconfig
sparcallyesconfig
powerpc   ppc64_defconfig
um   x86_64_defconfig
xtensa  iss_defconfig
i386 alldefconfig
i386  allnoconfig
i386 allyesconfig
i386defconfig
ia64defconfig
ia64 allmodconfig
ia64  allnoconfig
ia64 allyesconfig
ia64 alldefconfig
c6x  allyesconfig
c6xevmc6678_defconfig
nios2 10m50_defconfig
nios2 3c120_defconfig
openriscor1ksim_defconfig
openrisc simple_smp_defconfig
xtensa   common_defconfig
nds32 allnoconfig
cskydefconfig
alpha   defconfig
nds32   defconfig
h8300   h8s-sim_defconfig
h8300 edosk2674_defconfig
m68k   m5475evb_defconfig
m68k allmodconfig
h8300h8300h-sim_defconfig
m68k   sun3_defconfig
m68k  multi_defconfig
powerpc  rhel-kconfig
arc defconfig
arc  allyesconfig
powerpc defconfig
microblaze  mmu_defconfig
microblazenommu_defconfig
powerpc   allnoconfig
mips   32r2_defconfig
mips 64r6el_defconfig
mips allmodconfig
mips  allnoconfig
mips allyesconfig
mips  fuloong2e_defconfig
mips  malta_kvm_defconfig
pariscallnoconfig
parisc   allyesconfig
pariscgeneric-32bit_defconfig
pariscgeneric-64bit_defconfig
x86_64   randconfig-a001-20200326
x86_64   randconfig-a002-20200326
x86_64   randconfig-a003-20200326
i386 randconfig-a001-20200326
i386 randconfig-a002-20200326
i386 randconfig-a003-20200326
alpharandconfig-a001-20200326
m68k randconfig-a001-20200326
mips randconfig-a001-20200326
nds32randconfig-a001-20200326
parisc   randconfig-a001-20200326
riscvrandconfig-a001-20200326
c6x  randconfig-a001-20200326
h8300randconfig-a001-20200326
microblaze   randconfig-a001-20200326
nios2randconfig-a001-20200326
sparc64  randconfig-a001-20200326
s390 randconfig-a001-20200326
csky randconfig-a001-20200326
xtensa   randconfig-a001-20200326
openrisc randconfig-a001-20200326
sh   randconfig-a001-20200326
x86_64   randconfig-c001-20200326
x86_64   randconfig-c002-20200326
x86_64   randconfig-c003-20200326
i386 randconfig-c001-20200326
i386 randconfig-c002-20200326
i386 randconfig-c003-20200326
x86_64   randconfig-e001-20200326
x86_64   randconfig-e002-20200326
x86_64   randconfig-e003-20200326
i386 randconfig-e001-20200326
i386 randconfig-e002-20200326
i386 randconfig-e003-20200326
x86_64   randconfig-f001-20200326
x86_64   randconfig-f002-20200326
x86_64   randconfig-f003-20200326
i386 randconfig-f001-20200326
i386

Re: [Intel-gfx] [PATCH v7 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

2020-03-26 Thread James Morris
On Sun, 1 Mar 2020, Serge Hallyn wrote:

> Thanks, this looks good to me, in keeping with the CAP_SYSLOG break.
> 
> Acked-by: Serge E. Hallyn 
> 
> for the set.
> 
> James/Ingo/Peter, if noone has remaining objections, whose branch
> should these go in through?
> 
> thanks,
> -serge
> 
> On Tue, Feb 25, 2020 at 12:55:54PM +0300, Alexey Budankov wrote:
> > 
> > Hi,
> > 
> > Is there anything else I could do in order to move the changes forward,
> > or is something still missing from this patch set?
> > Could you please share your thoughts?

Alexey,

It seems some of the previous Acks are not included in this patchset, e.g. 
https://lkml.org/lkml/2020/1/22/655

Every patch needs a Reviewed-by or Acked-by from maintainers of the code 
being changed.

You have enough from the security folk, but I can't see any included from 
the perf folk.


-- 
James Morris




Re: [PATCH 1/1] ppc/crash: Skip spinlocks during crash

2020-03-26 Thread Leonardo Bras
Oops, forgot to EXPORT_SYMBOL.
arch_spin_lock*() is used by modules.

Sending v2.

On Thu, 2020-03-26 at 19:28 -0300, Leonardo Bras wrote:
> During a crash, there is a chance that the cpus that handle the NMI IPI
> are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> will cause a deadlock. (rtas_lock and printk's logbuf_lock as of today)
> 
> This is a problem if the system has kdump set up: if it crashes for any
> reason, the kdump may not be saved for crash analysis.
> 
> Skip spinlocks after the NMI IPI is sent to all other cpus.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/include/asm/spinlock.h | 6 ++
>  arch/powerpc/kexec/crash.c  | 3 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 860228e917dc..a6381d110795 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,6 +111,8 @@ static inline void splpar_spin_yield(arch_spinlock_t 
> *lock) {};
>  static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
>  #endif
>  
> +extern bool crash_skip_spinlock __read_mostly;
> +
>  static inline bool is_shared_processor(void)
>  {
>  #ifdef CONFIG_PPC_SPLPAR
> @@ -142,6 +144,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>   if (likely(__arch_spin_trylock(lock) == 0))
>   break;
>   do {
> + if (unlikely(crash_skip_spinlock))
> + return;
>   HMT_low();
>   if (is_shared_processor())
>   splpar_spin_yield(lock);
> @@ -161,6 +165,8 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
> long flags)
>   local_save_flags(flags_dis);
>   local_irq_restore(flags);
>   do {
> + if (unlikely(crash_skip_spinlock))
> + return;
>   HMT_low();
>   if (is_shared_processor())
>   splpar_spin_yield(lock);
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..8a522380027d 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -66,6 +66,8 @@ static int handle_fault(struct pt_regs *regs)
>  
>  #ifdef CONFIG_SMP
>  
> +bool crash_skip_spinlock;
> +
>  static atomic_t cpus_in_crash;
>  void crash_ipi_callback(struct pt_regs *regs)
>  {
> @@ -129,6 +131,7 @@ static void crash_kexec_prepare_cpus(int cpu)
>   /* Would it be better to replace the trap vector here? */
>  
>   if (atomic_read(&cpus_in_crash) >= ncpus) {
> + crash_skip_spinlock = true;
>   printk(KERN_EMERG "IPI complete\n");
>   return;
>   }




[PATCH v2 1/1] ppc/crash: Skip spinlocks during crash

2020-03-26 Thread Leonardo Bras
During a crash, there is a chance that the cpus that handle the NMI IPI
are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
will cause a deadlock. (rtas_lock and printk's logbuf_lock as of today)

This is a problem if the system has kdump set up: if it crashes for any
reason, the kdump may not be saved for crash analysis.

Skip spinlocks after the NMI IPI is sent to all other cpus.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/spinlock.h | 6 ++
 arch/powerpc/kexec/crash.c  | 4 
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 860228e917dc..a6381d110795 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,6 +111,8 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) 
{};
 static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
+extern bool crash_skip_spinlock __read_mostly;
+
 static inline bool is_shared_processor(void)
 {
 #ifdef CONFIG_PPC_SPLPAR
@@ -142,6 +144,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
if (likely(__arch_spin_trylock(lock) == 0))
break;
do {
+   if (unlikely(crash_skip_spinlock))
+   return;
HMT_low();
if (is_shared_processor())
splpar_spin_yield(lock);
@@ -161,6 +165,8 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_save_flags(flags_dis);
local_irq_restore(flags);
do {
+   if (unlikely(crash_skip_spinlock))
+   return;
HMT_low();
if (is_shared_processor())
splpar_spin_yield(lock);
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..ae081f0f2472 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -66,6 +66,9 @@ static int handle_fault(struct pt_regs *regs)
 
 #ifdef CONFIG_SMP
 
+bool crash_skip_spinlock;
+EXPORT_SYMBOL(crash_skip_spinlock);
+
 static atomic_t cpus_in_crash;
 void crash_ipi_callback(struct pt_regs *regs)
 {
@@ -129,6 +132,7 @@ static void crash_kexec_prepare_cpus(int cpu)
/* Would it be better to replace the trap vector here? */
 
if (atomic_read(&cpus_in_crash) >= ncpus) {
+   crash_skip_spinlock = true;
printk(KERN_EMERG "IPI complete\n");
return;
}
-- 
2.24.1



Re: [PATCH 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-03-26 Thread Dave Hansen
On 3/26/20 2:56 PM, Mike Kravetz wrote:
> Perhaps it would be best to check hugepages_supported() when parsing
> hugetlb command line options.  If not enabled, throw an error.  This
> will be much easier to do after moving all command line parsing to
> arch independent code.

Yeah, that sounds sane.

> Is that a sufficient way to address this concern?  I think it is a good
> change in any case.

(Thanks to Kirill for pointing this out.)

So, it turns out the x86 huge page enumeration is totally buggered.
X86_FEATURE_PSE is actually meaningless on 64-bit (and 32-bit PAE).  All
CPUs architecturally support 2MB pages regardless of X86_FEATURE_PSE and
the state of CR4.PSE.

So, on x86_64 at least, hugepages_supported() should *always* return 1.

1GB page support can continue to be dependent on X86_FEATURE_GBPAGES.
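In code, the end state could be as simple as this sketch (the exact
header file and the 32-bit condition here are assumptions on my part,
not a tested patch):

	/* arch/x86/include/asm/hugetlb.h, sketch */
	#ifdef CONFIG_X86_64
	/* 2MB pages are always architecturally supported on 64-bit. */
	#define hugepages_supported() 1
	#else
	/* 32-bit non-PAE still depends on PSE for large pages. */
	#define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
	#endif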


[PATCH 1/1] ppc/crash: Skip spinlocks during crash

2020-03-26 Thread Leonardo Bras
During a crash, there is a chance that the cpus that handle the NMI IPI
are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
will cause a deadlock. (rtas_lock and printk's logbuf_lock as of today)

This is a problem if the system has kdump set up: if it crashes for any
reason, the kdump may not be saved for crash analysis.

Skip spinlocks after the NMI IPI is sent to all other cpus.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/spinlock.h | 6 ++
 arch/powerpc/kexec/crash.c  | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 860228e917dc..a6381d110795 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,6 +111,8 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) 
{};
 static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 #endif
 
+extern bool crash_skip_spinlock __read_mostly;
+
 static inline bool is_shared_processor(void)
 {
 #ifdef CONFIG_PPC_SPLPAR
@@ -142,6 +144,8 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
if (likely(__arch_spin_trylock(lock) == 0))
break;
do {
+   if (unlikely(crash_skip_spinlock))
+   return;
HMT_low();
if (is_shared_processor())
splpar_spin_yield(lock);
@@ -161,6 +165,8 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned 
long flags)
local_save_flags(flags_dis);
local_irq_restore(flags);
do {
+   if (unlikely(crash_skip_spinlock))
+   return;
HMT_low();
if (is_shared_processor())
splpar_spin_yield(lock);
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..8a522380027d 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -66,6 +66,8 @@ static int handle_fault(struct pt_regs *regs)
 
 #ifdef CONFIG_SMP
 
+bool crash_skip_spinlock;
+
 static atomic_t cpus_in_crash;
 void crash_ipi_callback(struct pt_regs *regs)
 {
@@ -129,6 +131,7 @@ static void crash_kexec_prepare_cpus(int cpu)
/* Would it be better to replace the trap vector here? */
 
if (atomic_read(&cpus_in_crash) >= ncpus) {
+   crash_skip_spinlock = true;
printk(KERN_EMERG "IPI complete\n");
return;
}
-- 
2.24.1



Re: [PATCH v2] powerpc/boot: Delete unneeded .globl _zimage_start

2020-03-26 Thread Segher Boessenkool
On Wed, Mar 25, 2020 at 09:42:57AM -0700, Fangrui Song wrote:
> .globl sets the symbol binding to STB_GLOBAL while .weak sets the
> binding to STB_WEAK. GNU as has let .weak override .globl since binutils-gdb
> commit 5ca547dc2399a0a5d9f20626d4bf5547c3ccfddd (1996). Clang's integrated
> assembler lets the last directive win, but it may error in the future.

GNU AS works for more than just ELF.  The way the assembler language
is defined, it is not .weak overriding .globl -- instead, .weak sets a
symbol attribute on an existing symbol (but it creates one if there is
none yet).

Clang is buggy if it does not allow valid (and perfectly normal)
assembler code like this.
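For example, the sequence this patch is about (abbreviated from
arch/powerpc/boot/crt0.S) is perfectly well-defined assembler:

	.weak	_zimage_start
	.globl	_zimage_start	/* same symbol: GNU as keeps it STB_WEAK */
_zimage_start:
	b	_zimage_start_lib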


Segher


Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Wei Yang
On Thu, Mar 26, 2020 at 07:02:35PM +0530, Aneesh Kumar K.V wrote:
>Fixes the below crash
>
>BUG: Kernel NULL pointer dereference on read at 0x
>Faulting instruction address: 0xc0c3447c
>Oops: Kernel access of bad area, sig: 11 [#1]
>LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
>...
>NIP [c0c3447c] vmemmap_populated+0x98/0xc0
>LR [c0088354] vmemmap_free+0x144/0x320
>Call Trace:
> section_deactivate+0x220/0x240
> __remove_pages+0x118/0x170
> arch_remove_memory+0x3c/0x150
> memunmap_pages+0x1cc/0x2f0
> devm_action_release+0x30/0x50
> release_nodes+0x2f8/0x3e0
> device_release_driver_internal+0x168/0x270
> unbind_store+0x130/0x170
> drv_attr_store+0x44/0x60
> sysfs_kf_write+0x68/0x80
> kernfs_fop_write+0x100/0x290
> __vfs_write+0x3c/0x70
> vfs_write+0xcc/0x240
> ksys_write+0x7c/0x140
> system_call+0x5c/0x68
>
>The crash is due to a NULL pointer dereference at
>
>test_bit(idx, ms->usage->subsection_map) in pfn_section_valid(),
>because ms->usage = NULL.
>
>With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
>SPARSEMEM|!VMEMMAP case"), section_mem_map is set to NULL after
>depopulate_section_memmap(). This was done so that pfn_to_page() can work
>correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that
>config pfn_to_page does
>
>   __section_mem_map_addr(__sec) + __pfn;
>where
>
>static inline struct page *__section_mem_map_addr(struct mem_section *section)
>{
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
>}
>
>Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
>to check the pfn validity (pfn_valid()). Since section_deactivate releases
>mem_section->usage if a section is fully deactivated, a pfn_valid() check after
>a subsection deactivation causes a kernel crash.
>
>static inline int pfn_valid(unsigned long pfn)
>{
>...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
>}
>
>where
>
>static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>{
>   int idx = subsection_map_index(pfn);
>
>   return test_bit(idx, ms->usage->subsection_map);
>}
>
>Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
>For architectures like ppc64 where large pages are used for vmemmap mapping
>(16MB), a specific vmemmap mapping can cover multiple sections. Hence before
>a vmemmap mapping page can be freed, the kernel needs to make sure there are
>no valid sections within that mapping. Clearing the section valid bit before
>depopulate_section_memmap enables this.
>
>Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP 
>case")
>Reported-by: Sachin Sant 
>Tested-by: Sachin Sant 
>Cc: Baoquan He 
>Cc: Michael Ellerman 
>Cc: Dan Williams 
>Cc: Pankaj Gupta 
>Cc: David Hildenbrand 
>Cc: Michal Hocko 
>Cc: Wei Yang 
>Cc: Oscar Salvador 
>Cc: Mike Rapoport 
>Cc: 
>Signed-off-by: Aneesh Kumar K.V 
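For reference, the fix described above amounts to something like this in
section_deactivate() (a sketch of the idea, not the exact committed diff):

	if (!PageReserved(virt_to_page(ms->usage))) {
		kfree(ms->usage);
		ms->usage = NULL;
	}
	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
	/*
	 * Mark the section invalid so that valid_section() returns false
	 * before the memmap is torn down, so pfn_valid() can no longer
	 * dereference the freed ms->usage subsection_map.
	 */
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;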

Reviewed-by: Wei Yang 

-- 
Wei Yang
Help you, Help me


Re: [RFC PATCH 1/1] ppc/smp: Replace unnecessary 'while' by 'if'

2020-03-26 Thread Leonardo Bras
On Fri, 2020-03-27 at 08:40 +1100, Paul Mackerras wrote:
> On Thu, Mar 26, 2020 at 05:37:52PM -0300, Leonardo Bras wrote:
> > spin_until_cond() will wait until nmi_ipi_busy == false, and
> > nmi_ipi_lock_start() does not seem to change nmi_ipi_busy, so there is
> > no way this while will ever repeat.
> > 
> > Replace this 'while' by an 'if', so it does not look like it can repeat.
> 
> Nack, it can repeat.  The scenario is that cpu A is in this code,
> inside spin_until_cond(); cpu B has previously set nmi_ipi_busy, and
> cpu C is also waiting for nmi_ipi_busy to be cleared, like cpu A.
> When cpu B clears nmi_ipi_busy, both cpu A and cpu C will see that and
> will race inside nmi_ipi_lock_start().  One of them, say cpu C, will
> take the lock and proceed to set nmi_ipi_busy and then call
> nmi_ipi_unlock().  Then the other cpu (cpu A) will then take the lock
> and return from nmi_ipi_lock_start() and find nmi_ipi_busy == true.
> At that point it needs to go through the while loop body once more.
> 
> Paul.

Ok, got it.

Thanks for explaining Paul!




Re: [PATCH 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-03-26 Thread Mike Kravetz
On 3/18/20 4:36 PM, Dave Hansen wrote:
> On 3/18/20 3:52 PM, Mike Kravetz wrote:
>> Sounds good.  I'll incorporate those changes into a v2, unless someone
>> else has a different opinion.
>>
>> BTW, this patch should not really change the way the code works today.
>> It is mostly a movement of code.  Unless I am missing something, the
>> existing code will always allow setup of PMD_SIZE hugetlb pages.
> 
> Hah, I totally skipped over the old code in the diff.
> 
> It looks like we'll disable hugetlbfs *entirely* if PSE isn't supported.
>  I think this is actually wrong, but nobody ever noticed.  I think you'd
> have to be running as a guest under a hypervisor that's lying about PSE
> not being supported *and* care about 1GB pages.  Nobody does that.

Actually, !PSE will disable hugetlbfs a little later in the boot process.
You are talking about hugepages_supported(), correct?

I think something really bad could happen in this situation (!PSE and
X86_FEATURE_GBPAGES).  When parsing 'hugepages=' for gigantic pages we
immediately allocate from bootmem.  This happens before later checks in
hugetlb_init for hugepages_supported().  So, I think we would end up
allocating GB pages from bootmem and not be able to use or free them. :(

Perhaps it would be best to check hugepages_supported() when parsing
hugetlb command line options.  If not enabled, throw an error.  This
will be much easier to do after moving all command line parsing to
arch independent code.

Is that a sufficient way to address this concern?  I think it is a good
change in any case.
-- 
Mike Kravetz


Re: [RFC PATCH 1/1] ppc/smp: Replace unnecessary 'while' by 'if'

2020-03-26 Thread Paul Mackerras
On Thu, Mar 26, 2020 at 05:37:52PM -0300, Leonardo Bras wrote:
> spin_until_cond() will wait until nmi_ipi_busy == false, and
> nmi_ipi_lock_start() does not seem to change nmi_ipi_busy, so there is
> no way this while will ever repeat.
> 
> Replace this 'while' by an 'if', so it does not look like it can repeat.

Nack, it can repeat.  The scenario is that cpu A is in this code,
inside spin_until_cond(); cpu B has previously set nmi_ipi_busy, and
cpu C is also waiting for nmi_ipi_busy to be cleared, like cpu A.
When cpu B clears nmi_ipi_busy, both cpu A and cpu C will see that and
will race inside nmi_ipi_lock_start().  One of them, say cpu C, will
take the lock and proceed to set nmi_ipi_busy and then call
nmi_ipi_unlock().  Then the other cpu (cpu A) will then take the lock
and return from nmi_ipi_lock_start() and find nmi_ipi_busy == true.
At that point it needs to go through the while loop body once more.
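Mapping that onto the code (the loop is quoted from __smp_send_nmi_ipi()
in arch/powerpc/kernel/smp.c; the comments are just to illustrate the
scenario):

	nmi_ipi_lock_start();
	while (nmi_ipi_busy) {			/* cpu B owns the NMI IPI state */
		nmi_ipi_unlock_end();
		spin_until_cond(!nmi_ipi_busy);	/* cpu A and cpu C both wake here */
		nmi_ipi_lock_start();		/* cpu C may take the lock first and
						 * set nmi_ipi_busy again, so cpu A
						 * must re-test: hence the 'while' */
	}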

Paul.


[RFC PATCH 1/1] ppc/smp: Replace unnecessary 'while' by 'if'

2020-03-26 Thread Leonardo Bras
spin_until_cond() will wait until nmi_ipi_busy == false, and
nmi_ipi_lock_start() does not seem to change nmi_ipi_busy, so there is
no way this while will ever repeat.

Replace this 'while' by an 'if', so it does not look like it can repeat.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..7c904d6fb4d2 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -473,7 +473,7 @@ static int __smp_send_nmi_ipi(int cpu, void (*fn)(struct 
pt_regs *),
return 0;
 
nmi_ipi_lock_start();
-   while (nmi_ipi_busy) {
+   if (nmi_ipi_busy) {
nmi_ipi_unlock_end();
spin_until_cond(!nmi_ipi_busy);
nmi_ipi_lock_start();
-- 
2.24.1



Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

2020-03-26 Thread Kim Phillips



On 3/26/20 5:19 AM, maddy wrote:
> 
> 
> On 3/18/20 11:05 PM, Kim Phillips wrote:
>> Hi Maddy,
>>
>> On 3/17/20 1:50 AM, maddy wrote:
>>> On 3/13/20 4:08 AM, Kim Phillips wrote:
 On 3/11/20 11:00 AM, Ravi Bangoria wrote:
> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
 On 3/2/20 2:21 PM, Stephane Eranian wrote:
> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra  
> wrote:
>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>> Modern processors export such hazard data in Performance
>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>> AMD[3] provides similar information.
>>>
>>> Implementation detail:
>>>
>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>> If it's set, kernel converts arch specific hazard information
>>> into generic format:
>>>
>>>   struct perf_pipeline_haz_data {
>>>  /* Instruction/Opcode type: Load, Store, Branch  */
>>>  __u8    itype;
>>>  /* Instruction Cache source */
>>>  __u8    icache;
>>>  /* Instruction suffered hazard in pipeline stage */
>>>  __u8    hazard_stage;
>>>  /* Hazard reason */
>>>  __u8    hazard_reason;
>>>  /* Instruction suffered stall in pipeline stage */
>>>  __u8    stall_stage;
>>>  /* Stall reason */
>>>  __u8    stall_reason;
>>>  __u16   pad;
>>>   };
>> Kim, does this format indeed work for AMD IBS?
 It's not really 1:1, we don't have these separations of stages
 and reasons, for example: we have missed in L2 cache, for example.
 So IBS output is flatter, with more cycle latency figures than
 IBM's AFAICT.
>>> AMD IBS captures pipeline latency data in the case of Fetch sampling, like
>>> the Fetch latency, tag-to-retire latency, completion-to-retire latency,
>>> and so on. Yes, Ops sampling does provide more data on load/store-centric
>>> information. But it also captures more detailed data for Branch
>>> instructions. And we also looked at ARM SPE, which likewise captures more
>>> detailed pipeline data and latency information.
>>>
> Personally, I don't like the term hazard. This is too IBM Power
> specific. We need to find a better term, maybe stall or penalty.
 Right, IBS doesn't have a filter to only count stalled or otherwise
 bad events.  IBS' PPR descriptions has one occurrence of the
 word stall, and no penalty.  The way I read IBS is it's just
 reporting more sample data than just the precise IP: things like
 hits, misses, cycle latencies, addresses, types, etc., so words
 like 'extended', or the 'auxiliary' already used today even
 are more appropriate for IBS, although I'm the last person to
 bikeshed.
>>> We are thinking of using "pipeline" word instead of Hazard.
>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
> NP. We thought "pipeline" is a generic hw term, so we proposed the
> word "pipeline". We are open to any term that is generic enough.
>
>> I realize there are a couple of core pipeline-specific pieces
>> of information coming out of it, but the vast majority
>> are addresses, latencies of various components in the memory
>> hierarchy, and various component hit/miss bits.
> Yes, we should capture core pipeline specific details. For example,
> IBS generates Branch unit information (IbsOpData1) and Icache-related
> data (IbsFetchCtl), which is something that shouldn't be extended as
> part of perf-mem, IMO.
 Sure, IBS Op-side output is more 'perf mem' friendly, and so it
 should populate perf_mem_data_src fields, just like POWER9 can:

 union perf_mem_data_src {
 ...
   __u64   mem_rsvd:24,
   mem_snoopx:2,   /* snoop mode, ext */
   mem_remote:1,   /* remote */
   mem_lvl_num:4,  /* memory hierarchy level number */
   mem_dtlb:7, /* tlb access */
   mem_lock:2, /* lock instr */
   mem_snoop:5,    /* snoop mode */
   mem_lvl:14, /* memory hierarchy level */
   mem_op:5;   /* type of opcode */


 E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
 mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
 'mem_lock', and the Reload Bus Source Encoding bits can

Re: [PATCH] powerpc/64: Fix section mismatch warnings.

2020-03-26 Thread Michal Suchánek
On Thu, Mar 26, 2020 at 11:22:03PM +0530, Naveen N. Rao wrote:
> Michal Suchanek wrote:
> > Fixes the following warnings:
> > 
> > WARNING: vmlinux.o(.text+0x2d24): Section mismatch in reference from the 
> > variable __boot_from_prom to the function .init.text:prom_init()
> > The function __boot_from_prom() references
> > the function __init prom_init().
> > This is often because __boot_from_prom lacks a __init
> > annotation or the annotation of prom_init is wrong.
> > 
> > WARNING: vmlinux.o(.text+0x2fd0): Section mismatch in reference from the 
> > variable start_here_common to the function .init.text:start_kernel()
> > The function start_here_common() references
> > the function __init start_kernel().
> > This is often because start_here_common lacks a __init
> > annotation or the annotation of start_kernel is wrong.
> > 
> > Signed-off-by: Michal Suchanek 
> > ---
> >  arch/powerpc/kernel/head_64.S | 4 
> >  1 file changed, 4 insertions(+)
> 
> Michael committed a similar patch just earlier today:
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=6eeb9b3b9ce588f14a697737a30d0702b5a20293

Missed it because it did not reach master yet.

Thanks

Michal


[PATCH] Fix 4 "[v3, 28/32] powerpc/64s: interrupt implement exit logic in C"

2020-03-26 Thread Nicholas Piggin
The return-to-kernel path has to replay any soft-pending interrupts if it is
returning to a context that had interrupts soft-enabled. It has to do this
carefully and avoid plain enabling interrupts if this is an irq context,
which can cause multiple nesting of interrupts on the stack, and other
unexpected issues.

The code which avoided this case got the soft-mask state wrong, and
marked interrupts as enabled before going around again to retry. This
seems to be mostly harmless, except with PREEMPT=y, where it calls
preempt_schedule_irq with irqs apparently enabled and runs into a BUG
in kernel/sched/core.c.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/syscall_64.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
index 049608d811c7..cf06eb443a80 100644
--- a/arch/powerpc/kernel/syscall_64.c
+++ b/arch/powerpc/kernel/syscall_64.c
@@ -342,11 +342,15 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr)
trace_hardirqs_off();
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
/*
-* Can't local_irq_enable in case we are in interrupt
-* context. Must replay directly.
+* Can't local_irq_restore to replay if we were in
+* interrupt context. Must replay directly.
 */
-   replay_soft_interrupts();
-   irq_soft_mask_set(flags);
+   if (irqs_disabled_flags(flags)) {
+   replay_soft_interrupts();
+   } else {
+   local_irq_restore(flags);
+   local_irq_save(flags);
+   }
/* Took an interrupt, may have more exit work to do. */
goto again;
}
-- 
2.23.0



[PATCH v4] powerpc/pseries: Handle UE event for memcpy_mcsafe

2020-03-26 Thread Ganesh Goudar
memcpy_mcsafe has been implemented for power machines and is used by
the pmem infrastructure, so that a UE encountered during memcpy from
pmem devices does not result in a panic; instead, an appropriate error
code is returned. The implementation expects the machine check handler
to ignore the event and set nip to continue execution from the fixup
code.

Appropriate changes have already been made to the powernv machine check
handler; make similar changes to the pseries machine check handler to
ignore the event and set nip to continue execution at the fixup entry
if we hit a UE at an instruction with a fixup entry.

While we are at it, add a common function which searches the exception
table and updates nip with the fixup address; any future common changes
valid for both platforms can be made in this function.

powernv changes are made by
commit 895e3dceeb97 ("powerpc/mce: Handle UE event for memcpy_mcsafe")

Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Santosh S 
Signed-off-by: Ganesh Goudar 
---
V2: Fixes a trivial checkpatch error in commit msg.
V3: Use proper subject prefix.
V4: Rephrase the commit message.
Define a common function to update nip with fixup address.
---
 arch/powerpc/include/asm/mce.h   |  2 ++
 arch/powerpc/kernel/mce.c| 14 ++
 arch/powerpc/kernel/mce_power.c  |  8 ++--
 arch/powerpc/platforms/pseries/ras.c |  3 +++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 6a6ddaabdb34..376a395daf32 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -218,6 +218,8 @@ extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
   bool user_mode, bool in_guest);
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
+extern void mce_common_process_ue(struct pt_regs *regs,
+ struct mce_error_info *mce_err);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 34c1001e9e8b..8077b5fb18a7 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -251,6 +252,19 @@ void machine_check_queue_event(void)
/* Queue irq work to process this event later. */
irq_work_queue(&mce_event_process_work);
 }
+
+void mce_common_process_ue(struct pt_regs *regs,
+  struct mce_error_info *mce_err)
+{
+   const struct exception_table_entry *entry;
+
+   entry = search_kernel_exception_table(regs->nip);
+   if (entry) {
+   mce_err->ignore_event = true;
+   regs->nip = extable_fixup(entry);
+   }
+}
+
 /*
  * process pending MCE event from the mce event queue. This function will be
  * called during syscall exit.
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 1cbf7f1a4e3d..067b094bfeff 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -579,14 +579,10 @@ static long mce_handle_ue_error(struct pt_regs *regs,
struct mce_error_info *mce_err)
 {
long handled = 0;
-   const struct exception_table_entry *entry;
 
-   entry = search_kernel_exception_table(regs->nip);
-   if (entry) {
-   mce_err->ignore_event = true;
-   regs->nip = extable_fixup(entry);
+   mce_common_process_ue(regs, mce_err);
+   if (mce_err->ignore_event)
return 1;
-   }
 
/*
 * On specific SCOM read via MMIO we may get a machine check
diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 43710b69e09e..1d1da639b8b7 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -558,6 +558,9 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
switch (mce_log->error_type) {
case MC_ERROR_TYPE_UE:
mce_err.error_type = MCE_ERROR_TYPE_UE;
+   mce_common_process_ue(regs, &mce_err);
+   if (mce_err.ignore_event)
+   disposition = RTAS_DISP_FULLY_RECOVERED;
switch (err_sub_type) {
case MC_ERROR_UE_IFETCH:
mce_err.u.ue_error_type = MCE_UE_ERROR_IFETCH;
-- 
2.17.2



Re: [PATCH] powerpc/64: Fix section mismatch warnings.

2020-03-26 Thread Naveen N. Rao

Michal Suchanek wrote:

Fixes the following warnings:

WARNING: vmlinux.o(.text+0x2d24): Section mismatch in reference from the 
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2fd0): Section mismatch in reference from the 
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/head_64.S | 4 
 1 file changed, 4 insertions(+)


Michael committed a similar patch just earlier today:
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=6eeb9b3b9ce588f14a697737a30d0702b5a20293


- Naveen



Re: [PATCH v2 12/12] powerpc/64s: system reset do not trace

2020-03-26 Thread Naveen N. Rao

Nicholas Piggin wrote:

Similarly to the previous patch, do not trace system reset. This code
is used when there is a crash or hang, and tracing disturbs the system
more and has been known to crash in the crash handling path.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/traps.c | 5 +
 1 file changed, 5 insertions(+)


For the ftrace bits:
Acked-by: Naveen N. Rao 



diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 1845fd7e161a..ed7b7a6e2dc0 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -443,6 +443,9 @@ void system_reset_exception(struct pt_regs *regs)
unsigned long hsrr0, hsrr1;
bool nested = in_nmi();
bool saved_hsrrs = false;
+   u8 ftrace_enabled = local_paca->ftrace_enabled;
+
+   local_paca->ftrace_enabled = 0;
 
 	/*
 	 * Avoid crashes in case of nested NMI exceptions. Recoverability
@@ -524,6 +527,8 @@ void system_reset_exception(struct pt_regs *regs)
if (!nested)
nmi_exit();
 
+	local_paca->ftrace_enabled = ftrace_enabled;
+


I suppose we could add helpers to save/restore the state. But, since we 
only have these two users as of now, we can revisit that if the need 
arises.
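Something like this, say (the helper names are made up for
illustration):

	static inline u8 nmi_ftrace_save(void)
	{
		u8 saved = local_paca->ftrace_enabled;

		local_paca->ftrace_enabled = 0;
		return saved;
	}

	static inline void nmi_ftrace_restore(u8 saved)
	{
		local_paca->ftrace_enabled = saved;
	}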



- Naveen



Re: [PATCH V2 1/3] mm/debug: Add tests validating arch page table helpers for core features

2020-03-26 Thread Zi Yan
On 25 Mar 2020, at 22:18, Anshuman Khandual wrote:

> On 03/24/2020 06:59 PM, Zi Yan wrote:
>> On 24 Mar 2020, at 1:22, Anshuman Khandual wrote:
>>
>>> This adds new tests validating arch page table helpers for these following
>>> core memory features. These tests create and test specific mapping types at
>>> various page table levels.
>>>
>>> 1. SPECIAL mapping
>>> 2. PROTNONE mapping
>>> 3. DEVMAP mapping
>>> 4. SOFTDIRTY mapping
>>> 5. SWAP mapping
>>> 6. MIGRATION mapping
>>> 7. HUGETLB mapping
>>> 8. THP mapping
>>>
>>> Cc: Andrew Morton 
>>> Cc: Mike Rapoport 
>>> Cc: Vineet Gupta 
>>> Cc: Catalin Marinas 
>>> Cc: Will Deacon 
>>> Cc: Benjamin Herrenschmidt 
>>> Cc: Paul Mackerras 
>>> Cc: Michael Ellerman 
>>> Cc: Heiko Carstens 
>>> Cc: Vasily Gorbik 
>>> Cc: Christian Borntraeger 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: Borislav Petkov 
>>> Cc: "H. Peter Anvin" 
>>> Cc: Kirill A. Shutemov 
>>> Cc: Paul Walmsley 
>>> Cc: Palmer Dabbelt 
>>> Cc: linux-snps-...@lists.infradead.org
>>> Cc: linux-arm-ker...@lists.infradead.org
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> Cc: linux-s...@vger.kernel.org
>>> Cc: linux-ri...@lists.infradead.org
>>> Cc: x...@kernel.org
>>> Cc: linux-a...@vger.kernel.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Suggested-by: Catalin Marinas 
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>>  mm/debug_vm_pgtable.c | 291 +-
>>>  1 file changed, 290 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>> index 98990a515268..15055a8f6478 100644
>>> --- a/mm/debug_vm_pgtable.c
>>> +++ b/mm/debug_vm_pgtable.c
>>> @@ -289,6 +289,267 @@ static void __init pmd_populate_tests(struct 
>>> mm_struct *mm, pmd_t *pmdp,
>>>  WARN_ON(pmd_bad(pmd));
>>>  }
>>>
>>> +static void __init pte_special_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pte_t pte = pfn_pte(pfn, prot);
>>> +
>>> +if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL))
>>> +return;
>>> +
>>> +WARN_ON(!pte_special(pte_mkspecial(pte)));
>>> +}
>>> +
>>> +static void __init pte_protnone_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pte_t pte = pfn_pte(pfn, prot);
>>> +
>>> +if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>>> +return;
>>> +
>>> +WARN_ON(!pte_protnone(pte));
>>> +WARN_ON(!pte_present(pte));
>>> +}
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pmd_t pmd = pfn_pmd(pfn, prot);
>>> +
>>> +if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>>> +return;
>>> +
>>> +WARN_ON(!pmd_protnone(pmd));
>>> +WARN_ON(!pmd_present(pmd));
>>> +}
>>> +#else
>>> +static void __init pmd_protnone_tests(unsigned long pfn, pgprot_t prot) { }
>>> +#endif
>>> +
>>> +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
>>> +static void __init pte_devmap_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pte_t pte = pfn_pte(pfn, prot);
>>> +
>>> +WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
>>> +}
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pmd_t pmd = pfn_pmd(pfn, prot);
>>> +
>>> +WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
>>> +}
>>> +
>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pud_t pud = pfn_pud(pfn, prot);
>>> +
>>> +WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
>>> +}
>>> +#else
>>> +static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +#endif
>>> +#else
>>> +static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +#endif
>>> +#else
>>> +static void __init pte_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +static void __init pmd_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +static void __init pud_devmap_tests(unsigned long pfn, pgprot_t prot) { }
>>> +#endif
>>> +
>>> +static void __init pte_soft_dirty_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> +pte_t pte = pfn_pte(pfn, prot);
>>> +
>>> +if (!IS_ENABLED(CONFIG_HAVE_ARCH_SOFT_DIRTY))
>>> +return;
>>> +
>>> +WARN_ON(!pte_soft_dirty(pte_mksoft_dirty(pte)));
>>> +WARN_ON(pte_soft_dirty(pte_clear_soft_dirty(pte)));
>>> +}
>>> +
>>> +static void __init pte_swap_soft_dirty_tests(unsigned long pfn, pgprot_t 
>>> prot)
>>> +{
>>> +pte_t pte = pfn_pte(pfn, prot);
>>> +
>>> +if (!IS_ENABLED(CONFIG_HAVE_ARCH_SOFT_DIRTY))
>>> +return;
>>> +
>>> +WARN_ON(!pte_swp_soft_dirty(pte_swp_mksoft_dirty(pte)));
>>> +WARN_ON(pte_swp_soft_dirty(pte_swp_clear_soft_dirty(pte)));
>>> +}
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot)
>>> +{
>>> 

[PATCH] powerpc/64: Fix section mismatch warnings.

2020-03-26 Thread Michal Suchanek
Fixes the following warnings:

WARNING: vmlinux.o(.text+0x2d24): Section mismatch in reference from the 
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2fd0): Section mismatch in reference from the 
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

Signed-off-by: Michal Suchanek 
---
 arch/powerpc/kernel/head_64.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index ad79fddb974d..4afb82f1b7c7 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -537,6 +537,7 @@ __start_initialization_multiplatform:
b   __after_prom_start
 #endif /* CONFIG_PPC_BOOK3E */
 
+__REF
 __boot_from_prom:
 #ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE
/* Save parameters */
@@ -574,6 +575,7 @@ __boot_from_prom:
/* We never return. We also hit that trap if trying to boot
 * from OF while CONFIG_PPC_OF_BOOT_TRAMPOLINE isn't selected */
trap
+   .previous
 
 __after_prom_start:
 #ifdef CONFIG_RELOCATABLE
@@ -980,6 +982,7 @@ start_here_multiplatform:
.previous
/* This is where all platforms converge execution */
 
+__REF
 start_here_common:
/* relocation is on at this point */
std r1,PACAKSAVE(r13)
@@ -1001,6 +1004,7 @@ start_here_common:
/* Not reached */
trap
EMIT_BUG_ENTRY 0b, __FILE__, __LINE__, 0
+   .previous
 
 /*
  * We put a few things here that have to be page-aligned.
-- 
2.16.4



Re: [powerpc] Intermittent crashes (link_path_walk) with linux-next

2020-03-26 Thread Sachin Sant



> On 26-Mar-2020, at 7:19 PM, Al Viro  wrote:
> 
> On Thu, Mar 26, 2020 at 10:40:06PM +1100, Michael Ellerman wrote:
> 
>>> The code in question (link_path_walk() in fs/namei.c ) was recently changed 
>>> by
>>> following commit:
>>> 
>>> commit 881386f7e46a: 
>>>  link_path_walk(): sample parent's i_uid and i_mode for the last component
>> 
>> That and about 10 other commits.
>> 
>> Unless Al can give us a clue we'll need to bisect.
> 
>   Already fixed yesterday.  It's not link_path_walk(), it's handle_dots()
> ignoring an error returned by step_into().
> 
> commit 5e3c3570ec97 is the broken one; commit 20971012f63e is its variant 
> with the
> fix folded in.  So next-20200325 has the bug and next-20200326 should have it
> fixed.  Could you check the current -next and see if you still observe that 
> crap?

Thanks Al for the information. 

I confirm that today’s next tree (20200326) works for me. I can no longer
recreate this problem.

Thanks
-Sachin



Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Pankaj Gupta
>
> Fixes the below crash
>
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
>
> The crash is due to NULL dereference at
>
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
> pfn_section_valid()
>
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does
>
> __section_mem_map_addr(__sec) + __pfn;
> where
>
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> unsigned long map = section->section_mem_map;
> map &= SECTION_MAP_MASK;
> return (struct page *)map;
> }
>
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
> to check the pfn validity (pfn_valid()). Since section_deactivate releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection_deactivate causes a kernel crash.
>
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
> return early_section(ms) || pfn_section_valid(ms, pfn);
> }
>
> where
>
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
> int idx = subsection_map_index(pfn);
>
> return test_bit(idx, ms->usage->subsection_map);
> }
>
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> For architectures like ppc64 where large pages are used for vmemmap mapping
> (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a
> vmemmap mapping page can be freed, the kernel needs to make sure there are no
> valid sections within that mapping. Clearing the section valid bit before
> depopulate_section_memmap enables this.
>
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Cc: Baoquan He 
> Cc: Michael Ellerman 
> Cc: Dan Williams 
> Cc: Pankaj Gupta 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/sparse.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..65599e8bd636 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
> ms->usage = NULL;
> }
> memmap = sparse_decode_mem_map(ms->section_mem_map, 
> section_nr);
> +   /*
> +* Mark the section invalid so that valid_section()
> +* returns false. This prevents code from dereferencing
> +* the ms->usage array.
> +*/
> +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> }
>
> if (section_is_early && memmap)
> --
Agree with Michal, comment for clearing the section would be nicer.

Fix looks good to me.
Acked-by: Pankaj Gupta 

> 2.25.1
>


Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-03-26 Thread Christophe Leroy




On 26/03/2020 at 03:23, Anshuman Khandual wrote:



On 03/24/2020 10:52 AM, Anshuman Khandual wrote:

This series adds more arch page table helper tests. The new tests here are
either related to core memory functions or to advanced arch pgtable helpers.
This also creates a documentation file enlisting all expected semantics as
suggested by Mike Rapoport (https://lkml.org/lkml/2020/1/30/40).

This series has been tested on arm64 and x86 platforms.


If folks can test these patches out on remaining ARCH_HAS_DEBUG_VM_PGTABLE
enabled platforms, i.e. s390, arc, powerpc (32 and 64), that will be really
appreciated. Thank you.



On powerpc 8xx (PPC32), I get:

[   53.338368] debug_vm_pgtable: debug_vm_pgtable: Validating 
architecture page table helpers

[   53.347403] [ cut here ]
[   53.351832] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:647 
debug_vm_pgtable+0x280/0x3f4
[   53.360140] CPU: 0 PID: 1 Comm: swapper Not tainted 
5.6.0-rc7-s3k-dev-01090-g92710e99881f #3544

[   53.368718] NIP:  c0777c04 LR: c0777bb8 CTR: 
[   53.373720] REGS: c9023df0 TRAP: 0700   Not tainted 
(5.6.0-rc7-s3k-dev-01090-g92710e99881f)

[   53.382042] MSR:  00029032   CR: 22000222  XER: 2000
[   53.388667]
[   53.388667] GPR00: c0777bb8 c9023ea8 c612 0001 1e41 
  007641c9
[   53.388667] GPR08:  0001   82000222 
 c00039b8 
[   53.388667] GPR16:    fff0 065fc000 
1e41 c660 01e4
[   53.388667] GPR24: 01d9 c062d14c c65fc000 c642d448 06c9 
 c65f8000 c65fc040

[   53.423400] NIP [c0777c04] debug_vm_pgtable+0x280/0x3f4
[   53.428559] LR [c0777bb8] debug_vm_pgtable+0x234/0x3f4
[   53.433593] Call Trace:
[   53.436048] [c9023ea8] [c0777bb8] debug_vm_pgtable+0x234/0x3f4 
(unreliable)

[   53.442936] [c9023f28] [c00039e0] kernel_init+0x28/0x124
[   53.448184] [c9023f38] [c000f174] ret_from_kernel_thread+0x14/0x1c
[   53.454245] Instruction dump:
[   53.457180] 41a20008 4bea3ed9 62890021 7d36b92e 7d36b82e 71290fd0 
3149 7d2a4910
[   53.464838] 0f09 5789077e 3149 7d2a4910 <0f09> 38c0 
38a0 3880

[   53.472671] ---[ end trace fd5dd92744dc0065 ]---
[   53.519778] Freeing unused kernel memory: 608K



[RFC PATCH] powerpc/lib: Fixing use a temporary mm for code patching

2020-03-26 Thread Christophe Leroy
This patch fixes the RFC series identified below.
It fixes three points:
- Failure with CONFIG_PPC_KUAP
- Failure to write due to the lack of the DIRTY bit being set on the 8xx
- Inadequately complex WARN post verification

However, it has an impact on the CPU load. Here is the time
needed on an 8xx to run the ftrace selftests without and
with this series:
- Without CONFIG_STRICT_KERNEL_RWX  ==> 38 seconds
- With CONFIG_STRICT_KERNEL_RWX ==> 40 seconds
- With CONFIG_STRICT_KERNEL_RWX + this series   ==> 43 seconds

Link: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=166003
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/lib/code-patching.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index f156132e8975..4ccff427592e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -97,6 +97,7 @@ static int map_patch(const void *addr, struct patch_mapping 
*patch_mapping)
}
 
pte = mk_pte(page, pgprot);
+   pte = pte_mkdirty(pte);
set_pte_at(patching_mm, patching_addr, ptep, pte);
 
init_temp_mm(&patch_mapping->temp_mm, patching_mm);
@@ -168,7 +169,9 @@ static int do_patch_instruction(unsigned int *addr, 
unsigned int instr)
(offset_in_page((unsigned long)addr) /
sizeof(unsigned int));
 
+   allow_write_to_user(patch_addr, sizeof(instr));
__patch_instruction(addr, instr, patch_addr);
+   prevent_write_to_user(patch_addr, sizeof(instr));
 
err = unmap_patch(&patch_mapping);
if (err)
@@ -179,7 +182,7 @@ static int do_patch_instruction(unsigned int *addr, 
unsigned int instr)
 * think we just wrote.
 * XXX: BUG_ON() instead?
 */
-   WARN_ON(memcmp(addr, &instr, sizeof(instr)));
+   WARN_ON(*addr != instr);
 
 out:
local_irq_restore(flags);
-- 
2.25.0



Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Thu 26-03-20 19:02:35, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> The crash is due to NULL dereference at
> 
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
> pfn_section_valid()
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
> to check the pfn validity (pfn_valid()). Since section_deactivate releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection_deactivate causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> For architectures like ppc64 where large pages are used for vmemmap mapping
> (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a
> vmemmap mapping page can be freed, the kernel needs to make sure there are no
> valid sections within that mapping. Clearing the section valid bit before
> depopulate_section_memmap enables this.

I believe that the necessity of clearing the section before the tear
down is worth a comment in the code. Because this is just way too easy
to miss or not be aware of at all while looking into the code without git
blame.

> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Cc: Baoquan He 
> Cc: Michael Ellerman 
> Cc: Dan Williams 
> Cc: Pankaj Gupta 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/sparse.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..65599e8bd636 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>   memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /*
> +  * Mark the section invalid so that valid_section()
> +  * returns false. This prevents code from dereferencing
> +  * the ms->usage array.
> +  */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>   }
>  
>   if (section_is_early && memmap)
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [powerpc] Intermittent crashes ( link_path_walk) with linux-next

2020-03-26 Thread Al Viro
On Thu, Mar 26, 2020 at 10:40:06PM +1100, Michael Ellerman wrote:

> > The code in question (link_path_walk() in fs/namei.c ) was recently changed 
> > by
> > following commit:
> >
> > commit 881386f7e46a: 
> >   link_path_walk(): sample parent's i_uid and i_mode for the last component
> 
> That and about 10 other commits.
> 
> Unless Al can give us a clue we'll need to bisect.

Already fixed yesterday.  It's not link_path_walk(), it's handle_dots()
ignoring an error returned by step_into().

commit 5e3c3570ec97 is the broken one; commit 20971012f63e is its variant with 
the
fix folded in.  So next-20200325 has the bug and next-20200326 should have it
fixed.  Could you check the current -next and see if you still observe that 
crap?


Re: [PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Baoquan He
On 03/26/20 at 07:02pm, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> The crash is due to NULL dereference at
> 
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
> pfn_section_valid()
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
> to check the pfn validity (pfn_valid()). Since section_deactivate releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection_deactivate causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> For architectures like ppc64 where large pages are used for vmemmap mapping
> (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a
> vmemmap mapping page can be freed, the kernel needs to make sure there are no
> valid sections within that mapping. Clearing the section valid bit before
> depopulate_section_memmap enables this.
> 
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Cc: Baoquan He 
> Cc: Michael Ellerman 
> Cc: Dan Williams 
> Cc: Pankaj Gupta 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/sparse.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..65599e8bd636 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>   memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /*
> +  * Mark the section invalid so that valid_section()
> +  * returns false. This prevents code from dereferencing
> +  * the ms->usage array.
> +  */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>   }

Reviewed-by: Baoquan He 



[PATCH v2] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Aneesh Kumar K.V
Fixes the below crash

BUG: Kernel NULL pointer dereference on read at 0x
Faulting instruction address: 0xc0c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c0c3447c] vmemmap_populated+0x98/0xc0
LR [c0088354] vmemmap_free+0x144/0x320
Call Trace:
 section_deactivate+0x220/0x240
 __remove_pages+0x118/0x170
 arch_remove_memory+0x3c/0x150
 memunmap_pages+0x1cc/0x2f0
 devm_action_release+0x30/0x50
 release_nodes+0x2f8/0x3e0
 device_release_driver_internal+0x168/0x270
 unbind_store+0x130/0x170
 drv_attr_store+0x44/0x60
 sysfs_kf_write+0x68/0x80
 kernfs_fop_write+0x100/0x290
 __vfs_write+0x3c/0x70
 vfs_write+0xcc/0x240
 ksys_write+0x7c/0x140
 system_call+0x5c/0x68

The crash is due to NULL dereference at

test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL; in 
pfn_section_valid()

With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
SPARSEMEM|!VMEMMAP case")
section_mem_map is set to NULL after depopulate_section_memmap(). This
was done so that pfn_to_page() can work correctly with a kernel config that
disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does

__section_mem_map_addr(__sec) + __pfn;
where

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}

Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
check the pfn validity (pfn_valid()). Since section_deactivate releases
mem_section->usage if a section is fully deactivated, a pfn_valid() check after
a subsection_deactivate causes a kernel crash.

static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}

where

static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
int idx = subsection_map_index(pfn);

return test_bit(idx, ms->usage->subsection_map);
}

Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
For architectures like ppc64 where large pages are used for vmemmap mapping
(16MB), a specific vmemmap mapping can cover multiple sections. Hence before a
vmemmap mapping page can be freed, the kernel needs to make sure there are no
valid sections within that mapping. Clearing the section valid bit before
depopulate_section_memmap enables this.
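
To illustrate the ordering, a simplified editorial sketch of the fixed
teardown follows (not the kernel code verbatim; the identifiers are the ones
quoted above):

        /* section_deactivate(), sketch of the fixed order */
        ms->usage = NULL;                               /* usage is gone ...    */
        ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;    /* ... invalidate first */
        /*
         * Only now is tearing down the memmap safe: a pfn_valid() reached
         * from ppc64's vmemmap_populated() sees valid_section() == false
         * instead of dereferencing the freed ms->usage->subsection_map.
         */
        depopulate_section_memmap(pfn, nr_pages, altmap);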

Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP 
case")
Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Cc: Baoquan He 
Cc: Michael Ellerman 
Cc: Dan Williams 
Cc: Pankaj Gupta 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: 
Signed-off-by: Aneesh Kumar K.V 
---
 mm/sparse.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/sparse.c b/mm/sparse.c
index aadb7298dcef..65599e8bd636 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -781,6 +781,12 @@ static void section_deactivate(unsigned long pfn, unsigned 
long nr_pages,
ms->usage = NULL;
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+   /*
+* Mark the section invalid so that valid_section()
+* returns false. This prevents code from dereferencing
+* the ms->usage array.
+*/
+   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
}
 
if (section_is_early && memmap)
-- 
2.25.1



Re: [PATCH v5 10/13] powerpc/ptrace: split out ADV_DEBUG_REGS related functions.

2020-03-26 Thread Christophe Leroy




On 24/03/2020 at 07:23, Michael Ellerman wrote:

Christophe Leroy  writes:

On 03/20/2020 02:12 AM, Michael Ellerman wrote:

Christophe Leroy  writes:

Move ADV_DEBUG_REGS functions out of ptrace.c, into
ptrace-adv.c and ptrace-noadv.c

Signed-off-by: Christophe Leroy 
---
v4: Leave hw_breakpoint.h for ptrace.c
---
   arch/powerpc/kernel/ptrace/Makefile   |   4 +
   arch/powerpc/kernel/ptrace/ptrace-adv.c   | 468 
   arch/powerpc/kernel/ptrace/ptrace-decl.h  |   5 +
   arch/powerpc/kernel/ptrace/ptrace-noadv.c | 236 
   arch/powerpc/kernel/ptrace/ptrace.c   | 650 --
   5 files changed, 713 insertions(+), 650 deletions(-)
   create mode 100644 arch/powerpc/kernel/ptrace/ptrace-adv.c
   create mode 100644 arch/powerpc/kernel/ptrace/ptrace-noadv.c


This is somehow breaking the ptrace-hwbreak selftest on Power8:

test: ptrace-hwbreak
tags: git_version:v5.6-rc6-892-g7a285a6067d6
PTRACE_SET_DEBUGREG, WO, len: 1: Ok
PTRACE_SET_DEBUGREG, WO, len: 2: Ok
PTRACE_SET_DEBUGREG, WO, len: 4: Ok
PTRACE_SET_DEBUGREG, WO, len: 8: Ok
PTRACE_SET_DEBUGREG, RO, len: 1: Ok
PTRACE_SET_DEBUGREG, RO, len: 2: Ok
PTRACE_SET_DEBUGREG, RO, len: 4: Ok
PTRACE_SET_DEBUGREG, RO, len: 8: Ok
PTRACE_SET_DEBUGREG, RW, len: 1: Ok
PTRACE_SET_DEBUGREG, RW, len: 2: Ok
PTRACE_SET_DEBUGREG, RW, len: 4: Ok
PTRACE_SET_DEBUGREG, RW, len: 8: Ok
PPC_PTRACE_SETHWDEBUG, MODE_EXACT, WO, len: 1: Ok
PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RO, len: 1: Ok
PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RW, len: 1: Ok
PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, WO, len: 6: Ok
PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RO, len: 6: Ok
PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RW, len: 6: Ok
PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, WO, len: 6: Ok
PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RO, len: 6: Fail
failure: ptrace-hwbreak

I haven't had time to work out why yet.



A (big) part of commit c3f68b0478e7 ("powerpc/watchpoint: Fix ptrace
code that muck around with address/len") was lost during rebase.

I'll send a fix, up to you to squash it in or commit it as is.


Thanks.



One person asked me if I sent the fix already. So yes, it is there:

https://patchwork.ozlabs.org/patch/1259348/

Christophe


Re: [PATCH] powerpc/prom_init: Remove leftover comment

2020-03-26 Thread Michael Ellerman
On Tue, 2020-03-24 at 18:29:12 UTC, Fabiano Rosas wrote:
> The if statement that this comment referred to was removed in
> commit 11fdb309341c ("powerpc/prom_init: Remove support for OPAL v2").
> 
> Signed-off-by: Fabiano Rosas 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7074695ac6fb965d478f373b95bc5c636e9f21b0

cheers


Re: [PATCH v6 1/2] powerpc/64: Setup a paca before parsing device tree etc.

2020-03-26 Thread Michael Ellerman
On Fri, 2020-03-20 at 03:21:15 UTC, Michael Ellerman wrote:
> From: Daniel Axtens 
> 
> Currently we set up the paca after parsing the device tree for CPU
> features. Prior to that, r13 contains random data, which means there
> is random data in r13 while we're running the generic dt parsing code.
> 
> This random data varies depending on whether we boot through a vmlinux
> or a zImage: for the vmlinux case it's usually around zero, but for
> zImages we see random values like 912a72603d420015.
> 
> This is poor practice, and can also lead to difficult-to-debug
> crashes. For example, when kcov is enabled, the kcov instrumentation
> attempts to read preempt_count out of the current task, which goes via
> the paca. This then crashes in the zImage case.
> 
> Similarly stack protector can cause crashes if r13 is bogus, by
> reading from the stack canary in the paca.
> 
> To resolve this:
> 
>  - move the paca setup to before the CPU feature parsing.
> 
>  - because we no longer have access to CPU feature flags in paca
>  setup, change the HV feature test in the paca setup path to consider
>  the actual value of the MSR rather than the CPU feature.
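
The changed test amounts to roughly the following (an editorial sketch; the
exact code and placement are in the commit):

        /* decide HV mode from the live MSR, since CPU features are not
         * parsed yet at paca-setup time */
        bool hv_mode = !!(mfmsr() & MSR_HV);    /* not cpu_has_feature(CPU_FTR_HVMODE) */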
> 
> Translations get switched on once we leave early_setup, so I think
> we'd already catch any other cases where the paca or task aren't set
> up.
> 
> Boot tested on a P9 guest and host.
> 
> Fixes: fb0b0a73b223 ("powerpc: Enable kcov")
> Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support")
> Cc: sta...@vger.kernel.org # v4.20+
> Reviewed-by: Andrew Donnellan 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Daniel Axtens 
> [mpe: Reword comments & change log a bit to mention stack protector]
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/d4a8e98621543d5798421eed177978bf2b3cdd11

cheers


Re: [PATCH] powerpc/pseries: avoid harmless preempt warning

2020-03-26 Thread Michael Ellerman
On Fri, 2020-03-20 at 15:24:36 UTC, Nicholas Piggin wrote:
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/adde8715cf0571878d37fcb20595aad57b923bab

cheers


Re: [PATCH] powerpc/maple: Fix declaration made after definition

2020-03-26 Thread Michael Ellerman
On Mon, 2020-03-23 at 22:27:29 UTC, Nathan Chancellor wrote:
> When building ppc64 defconfig, Clang errors (trimmed for brevity):
> 
> arch/powerpc/platforms/maple/setup.c:365:1: error: attribute declaration
> must precede definition [-Werror,-Wignored-attributes]
> machine_device_initcall(maple, maple_cpc925_edac_setup);
> ^
> 
> machine_device_initcall expands to __define_machine_initcall, which in
> turn has the macro machine_is used in it, which declares mach_##name
> with an __attribute__((weak)). define_machine actually defines
> mach_##name, which in this file happens before the declaration, hence
> the warning.
> 
> To fix this, move define_machine after machine_device_initcall so that
> the declaration occurs before the definition, which matches how
> machine_device_initcall and define_machine work throughout arch/powerpc.
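
The resulting order in the file is then (an editorial sketch; the initializer
fields are only illustrative):

        /* declares mach_maple as a weak symbol, via machine_is() */
        machine_device_initcall(maple, maple_cpc925_edac_setup);

        /* defines mach_maple, now after the declaration */
        define_machine(maple) {
                .name   = "Maple",
                /* ... */
        };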
> 
> While we're here, remove some spaces before tabs.
> 
> Fixes: 8f101a051ef0 ("edac: cpc925 MC platform device setup")
> Link: https://godbolt.org/z/kDoYSA
> Link: https://github.com/ClangBuiltLinux/linux/issues/662
> Reported-by: Nick Desaulniers 
> Suggested-by: Ilie Halip 
> Signed-off-by: Nathan Chancellor 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/af6cf95c4d003fccd6c2ecc99a598fb854b537e7

cheers


Re: [PATCHv2] selftests/powerpc: Turn off timeout setting for benchmarks, dscr, signal, tm

2020-03-26 Thread Michael Ellerman
On Wed, 2020-03-18 at 06:00:04 UTC, Po-Hsu Lin wrote:
> Some specific tests in powerpc can take longer to run than the default 45
> seconds that was added in commit 852c8cbf34d3 ("selftests/kselftest/runner.sh:
> Add 45 second timeout per test"). The following test result was
> collected across 2 Power8 nodes and 1 Power9 node in our pool:
>   powerpc/benchmarks/futex_bench - 52s
>   powerpc/dscr/dscr_sysfs_test - 116s
>   powerpc/signal/signal_fuzzer - 88s
>   powerpc/tm/tm_unavailable_test - 168s
>   powerpc/tm/tm-poison - 240s
> 
> Thus they will fail with TIMEOUT error. Disable the timeout setting
> for these sub-tests to allow them to finish properly.
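
For reference, runner.sh (since commit 852c8cbf34d3) reads a per-directory
`settings` file, so the fix amounts to dropping files like the following next
to the affected tests (a sketch; the exact paths are in the patch):

        # tools/testing/selftests/powerpc/tm/settings
        timeout=0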
> 
> https://bugs.launchpad.net/bugs/1864642
> Fixes: 852c8cbf34d3 ("selftests/kselftest/runner.sh: Add 45 second timeout 
> per test")
> Signed-off-by: Po-Hsu Lin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/850507f30c38dff21ed557cb98ab16db26c32bbc

cheers


Re: [PATCH] cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn

2020-03-26 Thread Michael Ellerman
On Mon, 2020-03-16 at 13:57:43 UTC, Pratik Rajesh Sampat wrote:
> The patch avoids allocating cpufreq_policy on the stack, hence fixing the
> frame-size overflow in 'powernv_cpufreq_work_fn'.
> 
> Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur 
> on unthrottling")
> Signed-off-by: Pratik Rajesh Sampat 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d95fe371ecd28901f11256c610b988ed44e36ee2

cheers


Re: [PATCH] powerpc/book3s/hash64/devmap: Use H_PAGE_THP_HUGE when setting up huge devmap pte entries

2020-03-26 Thread Michael Ellerman
On Fri, 2020-03-13 at 09:48:42 UTC, "Aneesh Kumar K.V" wrote:
> H_PAGE_THP_HUGE is used to differentiate between THP hugepage and hugetlb
> hugepage entries. The difference is w.r.t. how we handle a hash fault on these
> addresses. A THP address enables MPSS in segments. We want to manage devmap
> hugepage entries similar to THP pt entries. Hence use H_PAGE_THP_HUGE for
> devmap huge pte entries.
> 
> With current code while handling hash pte fault, we do set is_thp = true when
> finding devmap pte huge pte entries.
> 
> Current code also does the below sequence when setting up huge devmap entries.
>   entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pmd_mkdevmap(entry);
> 
> In that case we would find both H_PAGE_THP_HUGE and _PAGE_DEVMAP set for huge
> devmap pte entries. This results in a false positive error like below.
> 
>  kernel BUG at /home/kvaneesh/src/linux/mm/memory.c:4321!
>  Oops: Exception in kernel mode, sig: 5 [#1]
>  LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>  Modules linked in:
>  CPU: 56 PID: 67996 Comm: t_mmap_dio Not tainted 
> 5.6.0-rc4-59640-g371c804dedbc #128
>  
>  NIP [c044c9e4] __follow_pte_pmd+0x264/0x900
>  LR [c05d45f8] dax_writeback_one+0x1a8/0x740
>  Call Trace:
>  [c00c6e9f38c0] [c13f4130] str_spec.74809+0x22ffb4/0x2d116c 
> (unreliable)
>  [c00c6e9f3960] [c05d45f8] dax_writeback_one+0x1a8/0x740
>  [c00c6e9f3a40] [c05d4dfc] dax_writeback_mapping_range+0x26c/0x700
>  [c00c6e9f3b30] [c0666580] ext4_dax_writepages+0x150/0x5a0
>  [c00c6e9f3ba0] [c03fe278] do_writepages+0x68/0x180
>  [c00c6e9f3c10] [c03ecc58] __filemap_fdatawrite_range+0x138/0x180
>  [c00c6e9f3cc0] [c03ede74] file_write_and_wait_range+0xa4/0x110
>  [c00c6e9f3d10] [c06552d0] ext4_sync_file+0x370/0x6e0
>  [c00c6e9f3d70] [c057d330] vfs_fsync_range+0x70/0xf0
>  [c00c6e9f3db0] [c046a000] sys_msync+0x220/0x2e0
>  [c00c6e9f3e20] [c000b478] system_call+0x5c/0x68
>  Instruction dump:
>  7a941564 392a 7fbffc36 7a94f082 7d2907b4 78f4f00e 7fff4838 7bff1f24
>  7e54f82a 7e74fa14 724900a0 40820410 <0b08> 72490040 418201c8 2fb7000
> 
> This is because our pmd_trans_huge check doesn't exclude _PAGE_DEVMAP.
> 
> To make this all consistent, update pmd_mkdevmap to set H_PAGE_THP_HUGE and
> pmd_trans_huge check now excludes _PAGE_DEVMAP correctly.
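
In rough terms the fix gives hash the following semantics (an editorial sketch
of intent, not the exact kernel code):

        /* devmap huge PMDs now carry the THP marker as well */
        pmd = pmd_mkdevmap(pmd);   /* sets _PAGE_DEVMAP and H_PAGE_THP_HUGE */

        /* ... while pmd_trans_huge() rejects devmap entries, roughly: */
        return (pmd_val(pmd) & (H_PAGE_THP_HUGE | _PAGE_DEVMAP)) == H_PAGE_THP_HUGE;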
> 
> Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64")
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/36b78402d97a3b9aeab136feb9b00d8647ec2c20

cheers


Re: [PATCH -next] PCI: rpaphp: remove set but not used variable 'value'

2020-03-26 Thread Michael Ellerman
On Thu, 2020-03-12 at 14:04:12 UTC, Chen Zhou wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/pci/hotplug/rpaphp_core.c: In function is_php_type:
> drivers/pci/hotplug/rpaphp_core.c:291:16: warning:
>   variable value set but not used [-Wunused-but-set-variable]
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Chen Zhou 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9475af081ec1fb6cc794a17ae90f2c01aa8a7993

cheers


Re: [PATCH kernel] powerpc/prom_init: Pass the "os-term" message to hypervisor

2020-03-26 Thread Michael Ellerman
On Thu, 2020-03-12 at 07:44:04 UTC, Alexey Kardashevskiy wrote:
> The "os-term" RTAS call has one argument with the message address of the
> OS termination cause. rtas_os_term() already passes it but the recently
> added prom_init's version of that missed it; it also does not fill args
> correctly.
> 
> This passes the message address and initializes the number of arguments.
> 
> Signed-off-by: Alexey Kardashevskiy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/74bb84e5117146fa73eb9d01305975c53022b3c3

cheers


Re: [PATCH v4] powerpc: Replace setup_irq() by request_irq()

2020-03-26 Thread Michael Ellerman
On Thu, 2020-03-12 at 06:42:55 UTC, afzal mohammed wrote:
> request_irq() is preferred over setup_irq(). Invocations of setup_irq()
> occur after memory allocators are ready.
> 
> Per tglx[1], setup_irq() existed in olden days when allocators were not
> ready by the time early interrupts were initialized.
> 
> Hence replace setup_irq() by request_irq().
> 
> [1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos
> 
> Signed-off-by: afzal mohammed 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/b4f00d5b2098320a0d4c4a6d31099bc0c9a85b02

cheers


Re: [PATCH v2] powerpc test_emulate_step: fix DS operand in ld encoding to appropriate value

2020-03-26 Thread Michael Ellerman
On Wed, 2020-03-11 at 10:24:05 UTC, Balamuruhan S wrote:
> The ld instruction should have a 14-bit immediate field (DS) concatenated with
> 0b00 on the right; encode it accordingly. Introduce the macro `IMM_DS()`
> to encode DS form instructions with a 14-bit immediate field.
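
A sketch of what such a macro looks like (the exact definition is in the
commit; the mask follows from a 14-bit DS field with the two low bits forced
to 0b00):

        #define IMM_DS(i)       ((uintptr_t)(i) & 0xfffc)       /* DS || 0b00 */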
> 
> Fixes: 4ceae137bdab ("powerpc: emulate_step() tests for load/store 
> instructions")
> Reviewed-by: Sandipan Das 
> Signed-off-by: Balamuruhan S 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3e74a0e16342626511c43937c120beb990539307

cheers


Re: [PATCH -next 017/491] CELL BROADBAND ENGINE ARCHITECTURE: Use fallthrough;

2020-03-26 Thread Michael Ellerman
On Wed, 2020-03-11 at 04:51:31 UTC, Joe Perches wrote:
> Convert the various uses of fallthrough comments to fallthrough;
> 
> Done via script
> Link: 
> https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/
> 
> Signed-off-by: Joe Perches 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/addf3727ad28bd159ae2da433b48daf2ffb339f7

cheers


Re: [PATCH v3] powerpc/32s: reorder Linux PTE bits to better match Hash PTE bits.

2020-03-26 Thread Michael Ellerman
On Tue, 2020-03-10 at 17:29:12 UTC, Christophe Leroy wrote:
> Reorder Linux PTE bits to (almost) match Hash PTE bits.
> 
> RW Kernel : PP = 00
> RO Kernel : PP = 00
> RW User   : PP = 01
> RO User   : PP = 11
> 
> So naturally, we should have
> _PAGE_USER = 0x001
> _PAGE_RW   = 0x002
> 
> Today 0x001 and 0x002 are _PAGE_PRESENT and _PAGE_HASHPTE, which
> both are software-only bits.
> 
> Switch _PAGE_USER and _PAGE_PRESENT
> Switch _PAGE_RW and _PAGE_HASHPTE
> 
> This allows us to remove a few insns.
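
The resulting layout is roughly (an editorial sketch; 0x001/0x002 are stated
above, while the 0x004 and 0x400 old values are assumed from the 32s hash
layout):

        #define _PAGE_USER      0x001   /* takes the old _PAGE_PRESENT slot */
        #define _PAGE_RW        0x002   /* takes the old _PAGE_HASHPTE slot */
        #define _PAGE_PRESENT   0x004   /* old _PAGE_USER slot */
        #define _PAGE_HASHPTE   0x400   /* old _PAGE_RW slot */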
> 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/697ece78f8f749aeea40f2711389901f0974017a

cheers


Re: [PATCH] powerpc/pseries: fix of_read_drc_info_cell() to point at next record

2020-03-26 Thread Michael Ellerman
On Sat, 2020-03-07 at 02:45:47 UTC, Tyrel Datwyler wrote:
> The expectation is that when calling of_read_drc_info_cell()
> repeatedly to parse multiple drc-info records that the in/out curval
> parameter points at the start of the next record on return. However,
> the current behavior has curval still pointing at the final value of
> the record just parsed. The result of which is that if the
> ibm,drc-info property contains multiple properties the parsed value
> of the drc_type for any record after the first has the power_domain
> value of the previous record appended to the type string.
> 
> Ex: observed the following 0x prepended to PHB
> 
> [   69.485037] drc-info: type: \xff\xff\xff\xffPHB, prefix: PHB , 
> index_start: 0x2001
> [   69.485038] drc-info: suffix_start: 1, sequential_elems: 3072, 
> sequential_inc: 1
> [   69.485038] drc-info: power-domain: 0xffffffff, last_index: 0x2c00
> 
> Fix by incrementing curval past the power_domain value to point at
> drc_type string of next record.
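
The fix boils down to one extra advance of the parse cursor (an editorial
sketch using the names from the description; the exact code is in the patch):

        info->power_domain = of_read_number(p, 1);
        p++;                    /* step past power_domain: this was missing */
        *curval = p;            /* now points at the next record's drc_type */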
> 
> Fixes: a29396653b8bf ("pseries/drc-info: Search DRC properties for CPU 
> indexes")
> Signed-off-by: Tyrel Datwyler 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c5e76fa05b2df519b9f08571cc57e623c1569faa

cheers


Re: [PATCH] powerpc/kasan: Fix kasan_remap_early_shadow_ro()

2020-03-26 Thread Michael Ellerman
On Fri, 2020-03-06 at 15:09:40 UTC, Christophe Leroy wrote:
> At the moment kasan_remap_early_shadow_ro() does nothing, because
> k_end is 0 and k_cur < k_end is always false.
> 
> Change the test to k_cur != k_end, as done in
> kasan_init_shadow_page_tables()
> 
> Signed-off-by: Christophe Leroy 
> Fixes: cbd18991e24f ("powerpc/mm: Fix an Oops in kasan_mmu_init()")
> Cc: sta...@vger.kernel.org

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/af92bad615be75c6c0d1b1c5b48178360250a187

cheers


Re: [PATCH v2 1/6] powerpc/eeh: Add sysfs files in late probe

2020-03-26 Thread Michael Ellerman
On Fri, 2020-03-06 at 07:38:59 UTC, Oliver O'Halloran wrote:
> Move creating the EEH specific sysfs files into eeh_add_device_late()
> rather than being open-coded all over the place. Calling the function is
> generally done immediately after calling eeh_add_device_late() anyway. This
> is also a correctness fix since currently the sysfs files will be added
> even if the EEH probe happens to fail.
> 
> Similarly, on pseries we currently add the sysfs files before calling
> eeh_add_device_late(). This is flat-out broken since the sysfs files
> require the pci_dev->dev.archdata.edev pointer to be set, and that is done
> in eeh_add_device_late().
> 
> Reviewed-by: Sam Bobroff 
> Signed-off-by: Oliver O'Halloran 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8645aaa87963439007773ed8862ae6a29ea15eae

cheers


Re: [PATCH] selftests/powerpc: Add a test of sigreturn vs VDSO

2020-03-26 Thread Michael Ellerman
On Wed, 2020-03-04 at 11:04:02 UTC, Michael Ellerman wrote:
> There's two different paths through the sigreturn code, depending on
> whether the VDSO is mapped or not. We recently discovered a bug in the
> unmapped case, because it's not commonly used these days.
> 
> So add a test that sends itself a signal, then moves the VDSO, takes
> another signal and finally unmaps the VDSO before sending itself
> another signal. That tests the standard signal path, the code that
> handles the VDSO being moved, and also the signal path in the case
> where the VDSO is unmapped.
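
The three phases map to roughly this sequence (an editorial sketch; the VDSO
address and size come from parsing /proc/self/maps, as the test does):

        raise(SIGUSR1);                         /* 1: VDSO in its normal spot */
        mremap(vdso, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);
        raise(SIGUSR1);                         /* 2: VDSO moved */
        munmap(new_addr, size);
        raise(SIGUSR1);                         /* 3: VDSO unmapped */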
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/a0968a025c04702427a4aee2c618f451a5098cd8

cheers


Re: [PATCH] powerpc/kuap: PPC_KUAP_DEBUG should depend on PPC_KUAP

2020-03-26 Thread Michael Ellerman
On Sun, 2020-03-01 at 11:17:38 UTC, Michael Ellerman wrote:
> Currently you can enable PPC_KUAP_DEBUG when PPC_KUAP is disabled,
> even though the former has no effect without the latter.
> 
> Fix it so that PPC_KUAP_DEBUG can only be enabled when PPC_KUAP is
> enabled, not when the platform could support KUAP (PPC_HAVE_KUAP).
> 
> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/61da50b76b62fd815aa82d853bf82bf4f69568f5

cheers


Re: [PATCH] powerpc/64s: Fix section mismatch warnings from boot code

2020-03-26 Thread Michael Ellerman
On Tue, 2020-02-25 at 03:13:28 UTC, Michael Ellerman wrote:
> We currently have two section mismatch warnings:
> 
>   The function __boot_from_prom() references
>   the function __init prom_init().
> 
>   The function start_here_common() references
>   the function __init start_kernel().
> 
> The warnings are correct, we do have branches from non-init code into
> init code, which is freed after boot. But we don't expect to ever
> execute any of that early boot code after boot, if we did that would
> be a bug. In particular calling into OF after boot would be fatal
> because OF is no longer resident.
> 
> So for now fix the warnings by marking the relevant functions as
> __REF, which puts them in the ".ref.text" section.
> 
> This causes some reordering of the functions in the final link:
> 
>   @@ -217,10 +217,9 @@
>c000b088 t generic_secondary_common_init
>c000b124 t __mmu_off
>c000b14c t __start_initialization_multiplatform
>   -c000b1ac t __boot_from_prom
>   -c000b1ec t __after_prom_start
>   -c000b260 t p_end
>   -c000b27c T copy_and_flush
>   +c000b1ac t __after_prom_start
>   +c000b220 t p_end
>   +c000b23c T copy_and_flush
>c000b300 T __secondary_start
>c000b300 t copy_to_here
>c000b344 t start_secondary_prolog
>   @@ -228,8 +227,9 @@
>c000b36c t enable_64b_mode
>c000b388 T relative_toc
>c000b3a8 t p_toc
>   -c000b3b0 t start_here_common
>   -c000b3d0 t start_here_multiplatform
>   +c000b3b0 t __boot_from_prom
>   +c000b3f0 t start_here_multiplatform
>   +c000b480 t start_here_common
>c000b880 T system_call_common
>c000b974 t system_call
>c000b9dc t system_call_exit
> 
> In particular __boot_from_prom moves after copy_to_here, which means
> it's not copied to zero in the first stage of copy of the kernel to
> zero.
> 
> But that's OK, because we only call __boot_from_prom before we do the
> copy, so it makes no difference when it's copied. The call sequence
> is:
>   __start
>   -> __start_initialization_multiplatform
>  -> __boot_from_prom
> -> __start
>-> __start_initialization_multiplatform
>   -> __after_prom_start
>  -> copy_and_flush
>  -> copy_and_flush (relocated to 0)
> -> start_here_multiplatform
>-> early_setup
> 
> Reported-by: Mauricio Faria de Oliveira 
> Reported-by: Roman Bolshakov 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/6eeb9b3b9ce588f14a697737a30d0702b5a20293

cheers


Re: [PATCH] powerpc/xmon: Lower limits on nidump and ndump

2020-03-26 Thread Michael Ellerman
On Wed, 2020-02-19 at 11:00:07 UTC, Michael Ellerman wrote:
> In xmon we have two variables that are used by the dump commands.
> There's ndump which is the number of bytes to dump using 'd', and
> nidump which is the number of instructions to dump using 'di'.
> 
> ndump starts as 64 and nidump starts as 16, but both can be set by the
> user.
> 
> It's fairly common to be pasting addresses into xmon when trying to
> debug something, and if you inadvertently double paste an address like
> so:
> 
>   0:mon> di c2101f6c c2101f6c
> 
> The second value is interpreted as the number of instructions to dump.
> 
> Luckily it doesn't dump 13 quintillion instructions; the value is
> limited to MAX_DUMP (128K). But as each instruction is dumped on a
> single line, that's still a lot of output. If you're on a slow console
> that can take multiple minutes to print. If you were "just popping in
> and out of xmon quickly before the RCU/hardlockup detector fires" you
> are now having a bad day.
> 
> Things are not as bad with 'd' because we print 16 bytes per line, so
> it's fewer lines. But it's still quite a lot.
> 
> So shrink the maximum for 'd' to 64K (one page), which is 4096 lines.
> For 'di' add a new limit which is the above / 4 - because instructions
> are 4 bytes, meaning again we can dump one page.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/d64c7dbb4d98306b794401ca924ad053f84b59f8

cheers


Re: [PATCH v3] powerpc/kprobes: Ignore traps that happened in real mode

2020-03-26 Thread Michael Ellerman
On Tue, 2020-02-18 at 19:38:27 UTC, Christophe Leroy wrote:
> When a program check exception happens while MMU translation is
> disabled, following Oops happens in kprobe_handler() in the following
> code:
> 
>   } else if (*addr != BREAKPOINT_INSTRUCTION) {
> 
> [   33.098554] BUG: Unable to handle kernel data access on read at 0xe268
> [   33.105091] Faulting instruction address: 0xc000ec34
> [   33.110010] Oops: Kernel access of bad area, sig: 11 [#1]
> [   33.115348] BE PAGE_SIZE=16K PREEMPT CMPC885
> [   33.119540] Modules linked in:
> [   33.122591] CPU: 0 PID: 429 Comm: cat Not tainted 
> 5.6.0-rc1-s3k-dev-00824-g84195dc6c58a #3267
> [   33.131005] NIP:  c000ec34 LR: c000ecd8 CTR: c019cab8
> [   33.136002] REGS: ca4d3b58 TRAP: 0300   Not tainted  
> (5.6.0-rc1-s3k-dev-00824-g84195dc6c58a)
> [   33.144324] MSR:  1032   CR: 2a4d3c52  XER: 
> [   33.150699] DAR: e268 DSISR: c000
> [   33.150699] GPR00: c000b09c ca4d3c10 c66d0620  ca4d3c60  
> 9032 
> [   33.150699] GPR08: 0002  c087de44 c000afe0 c66d0ad0 100d3dd6 
> fff3 
> [   33.150699] GPR16:  0041  ca4d3d70   
> 416d 
> [   33.150699] GPR24: 0004 c53b6128  e268  c07c 
> c07bb6fc ca4d3c60
> [   33.188015] NIP [c000ec34] kprobe_handler+0x128/0x290
> [   33.192989] LR [c000ecd8] kprobe_handler+0x1cc/0x290
> [   33.197854] Call Trace:
> [   33.200340] [ca4d3c30] [c000b09c] program_check_exception+0xbc/0x6fc
> [   33.206590] [ca4d3c50] [c000e43c] ret_from_except_full+0x0/0x4
> [   33.212392] --- interrupt: 700 at 0xe268
> [   33.270401] Instruction dump:
> [   33.273335] 913e0008 8122 3861 3929 9122 80010024 bb410008 
> 7c0803a6
> [   33.280992] 38210020 4e800020 3860 4e800020 <813b> 6d2a7fe0 
> 2f8a0008 419e0154
> [   33.288841] ---[ end trace 5b9152d4cdadd06d ]---
> 
> kprobe is not prepared to handle events in real mode and functions
> running in real mode should have been blacklisted, so kprobe_handler()
> can safely bail out telling 'this trap is not mine' for any trap that
> happened while in real-mode.
> 
> If the trap happened with MSR_IR or MSR_DR cleared, return 0 immediately.
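
The guard described amounts to something like this near the top of
kprobe_handler() (a sketch; the exact form is in the patch):

        /* kprobes does not work in real mode, so this trap is not ours */
        if (unlikely(!(regs->msr & (MSR_IR | MSR_DR))))
                return 0;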
> 
> Reported-by: Larry Finger 
> Fixes: 6cc89bad60a6 ("powerpc/kprobes: Invoke handlers directly")
> Cc: sta...@vger.kernel.org
> Cc: Naveen N. Rao 
> Cc: Masami Hiramatsu 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/21f8b2fa3ca5b01f7a2b51b89ce97a3705a15aa0

cheers


Re: [PATCH v2] powerpc/kprobes: Remove redundant code

2020-03-26 Thread Michael Ellerman
On Wed, 2020-02-19 at 08:05:57 UTC, Christophe Leroy wrote:
> At the time being we have something like
> 
>   if (something) {
>   p = get();
>   if (p) {
>   if (something_wrong)
>   goto out;
>   ...
>   return;
>   } else if (a != b) {
>   if (some_error)
>   goto out;
>   ...
>   }
>   goto out;
>   }
>   p = get();
>   if (!p) {
>   if (a != b) {
>   if (some_error)
>   goto out;
>   ...
>   }
>   goto out;
>   }
> 
> This is similar to
> 
>   p = get();
>   if (!p) {
>   if (a != b) {
>   if (some_error)
>   goto out;
>   ...
>   }
>   goto out;
>   }
>   if (something) {
>   if (something_wrong)
>   goto out;
>   ...
>   return;
>   }
> 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/eb4f8e259acc37b91b62ca57e0d3c8960c357843

cheers


Re: [PATCH v3 3/3] selftests/powerpc: Don't rely on segfault to rerun the test

2020-03-26 Thread Michael Ellerman
On Tue, 2020-02-11 at 03:38:31 UTC, Gustavo Luiz Duarte wrote:
> The test case tm-signal-context-force-tm expects a segfault to happen on
> returning from signal handler, and then does a setcontext() to run the test
> again. However, the test doesn't always segfault, causing the test to run a
> single time.
> 
> This patch fixes the test by putting it within a loop and jumping, via
> setcontext, just prior to the loop in case it segfaults. This way we get the
> desired behavior (run the test COUNT_MAX times) regardless if it segfaults or
> not. This also reduces the use of setcontext for control flow logic, keeping 
> it
> only in the segfault handler.
> 
> Also, since 'count' is changed within the signal handler, it is declared as
> volatile to prevent any compiler optimization getting confused with
> asynchronous changes.
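
The described control flow is roughly (an editorial sketch; needs
<ucontext.h>, and restart_ctx, run_test and COUNT_MAX are assumed names):

        static ucontext_t restart_ctx;
        volatile int count = 0;

        getcontext(&restart_ctx);       /* the SEGV handler setcontext()s back here */
        while (count < COUNT_MAX)
                run_test(&count);       /* bumps count; may fault on sigreturn */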
> 
> Signed-off-by: Gustavo Luiz Duarte 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/0f8f554e5244f56f496b4ce30ada1126fe290345

cheers


Re: [PATCH v3 2/3] selftests/powerpc: Add tm-signal-pagefault test

2020-03-26 Thread Michael Ellerman
On Tue, 2020-02-11 at 03:38:30 UTC, Gustavo Luiz Duarte wrote:
> This test triggers a TM Bad Thing by raising a signal in transactional state
> and forcing a pagefault to happen in kernelspace when the kernel signal
> handling code first touches the user signal stack.
> 
> This is inspired by the test tm-signal-context-force-tm but uses userfaultfd 
> to
> make the test deterministic. While this test always triggers the bug in one
> run, I had to execute tm-signal-context-force-tm several times (the test runs
> 5000 times each execution) to trigger the same bug.
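
A sketch of the userfaultfd setup that makes the page fault deterministic
(error handling omitted; needs <linux/userfaultfd.h>, <sys/syscall.h>,
<sys/ioctl.h> and <fcntl.h>; ss_sp/ss_size stand for the signal stack and are
assumed names):

        int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)ss_sp, .len = ss_size },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };

        ioctl(ufd, UFFDIO_API, &api);
        ioctl(ufd, UFFDIO_REGISTER, &reg);      /* first kernel touch now faults */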
> 
> tm-signal-context-force-tm is kept instead of replaced because, while this 
> test
> is more reliable and triggers the same bug, tm-signal-context-force-tm has a
> better coverage, in the sense that by running the test several times it might
> trigger the pagefault and/or be preempted at different places.
> 
> v3: skip test if userfaultfd is unavailable.
> 
> Signed-off-by: Gustavo Luiz Duarte 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/915b7f6f9a5e232c138bb36743a1fdb0fcf2c432

cheers


Re: [PATCH] selftests/powerpc: Add a test of sigreturn vs VDSO

2020-03-26 Thread Michael Ellerman
Nathan Lynch  writes:
> Nathan Lynch  writes:
>> Michael Ellerman  writes:
>>> +static int search_proc_maps(char *needle, unsigned long *low, unsigned 
>>> long *high)
>>
>>^^ const?

Sorry I meant to do this but then forgot.

>>> +{
>>> +   unsigned long start, end;
>>> +   static char buf[4096];
>>> +   char name[128];
>>> +   FILE *f;
>>> +   int rc = -1;
>>> +
>>> +   f = fopen("/proc/self/maps", "r");
>>> +   if (!f) {
>>> +   perror("fopen");
>>> +   return -1;
>>> +   }
>>> +
>>> +   while (fgets(buf, sizeof(buf), f)) {
>>> +   rc = sscanf(buf, "%lx-%lx %*c%*c%*c%*c %*x %*d:%*d %*d %127s\n",
>>> +   &start, &end, name);
>>
>> I suspect it doesn't matter in practice for this particular test, but
>> since this looks like a generally useful function that could gain users
>> in the future: does this spuriously fail if the matching line straddles
>> a 4096-byte boundary? Maybe fscanf(3) should be used instead?
>
> Or maybe I should read the fgets man page more closely :-)
>
>   "Reading stops after an EOF or a newline."
>
> Sorry for the noise.

No worries, thanks for reviewing.

cheers


Re: [powerpc] Intermittent crashes ( link_path_walk) with linux-next

2020-03-26 Thread Michael Ellerman
Sachin Sant  writes:
> I am running into intermittent crashes with linux-next on POWER 9 PowerVM LPAR
> First it was against next-20200324 while running LTP tests. With next-20200325
> I ran into similar crash (a different stack trace but same failure point — 
> link_path_walk)
> while running sosreport command.
>
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc043f278
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in: loop iscsi_target_mod target_core_mod macsec tcp_diag 
> udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc overlay 
> dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6_tables 
> nft_compat ip_set rfkill nf_tables nfnetlink sunrpc sg pseries_rng 
> uio_pdrv_genirq uio sch_fq_codel ip_tables ext4 mbcache jbd2 sr_mod sd_mod 
> cdrom t10_pi ibmvscsi scsi_transport_srp ibmveth
> CPU: 26 PID: 7771 Comm: avocado Not tainted 5.6.0-rc7-next-20200324-autotest 
> #1
> NIP:  c043f278 LR: c043f330 CTR: 01fb
> REGS: c0082802f830 TRAP: 0300   Not tainted  
> (5.6.0-rc7-next-20200324-autotest)
> MSR:  80009033   CR: 28248442  XER: 2004
> CFAR: c000dec4 DAR:  DSISR: 4000 IRQMASK: 0 
> GPR00: c043f330 c0082802fac0 c155e900  
> GPR04: 0002  0002 c008b340 
> GPR08: 00031419    
> GPR12: 8000 c0001ec48600 7fffa08a53f8 0001 
> GPR16: 7fff9faf9a63 0100073bec00 7fff9f2493b0  
> GPR20: 7fffa1143bf8 7fffa1103b18 c0087f547cb3 2f2f2f2f2f2f2f2f 
> GPR24: 0003  c0082802fbc8 f000 
> GPR28: 0020  61c8864680b583eb  
> NIP [c043f278] link_path_walk.part.49+0x228/0x400
> LR [c043f330] link_path_walk.part.49+0x2e0/0x400
> Call Trace:
> [c0082802fac0] [c043f330] link_path_walk.part.49+0x2e0/0x400 
> (unreliable)
> [c0082802fb50] [c043f5a4] path_lookupat.isra.51+0x64/0x1f0
> [c0082802fba0] [c0441c00] filename_lookup.part.69+0xa0/0x1b0
> [c0082802fce0] [c042ff38] vfs_statx+0xa8/0x190
> [c0082802fd60] [c04302a0] __do_sys_newstat+0x40/0x90
> [c0082802fe20] [c000b278] system_call+0x5c/0x68
> Instruction dump:
> 3bff e93a0058 3880 7f43d378 7fff07b4 1d5f0030 7d295214 eac90020 
> 4bfffb21 2fa3 409e00c8 e93a0008 <8129> 55290256 7f89e000 419efecc 
>  ---[ end trace 34abf29ebd56e423 ]---
>
> Relevant snippet from obj dump:
>
>6dc4:   20 00 c9 ea ld  r22,32(r9)
> link = walk_component(nd, 0);
> 6db4:   78 d3 43 7f mr  r3,r26
> name = nd->stack[--depth].name;
> 6db8:   b4 07 ff 7f extsw   r31,r31
> 6dbc:   30 00 5f 1d mulli   r10,r31,48
> 6dc0:   14 52 29 7d add r9,r9,r10
> 6dc4:   20 00 c9 ea ld  r22,32(r9)
> link = walk_component(nd, 0);
> 6dc8:   01 00 00 48 bl  6dc8 
> if (unlikely(link)) {
> 6dcc:   00 00 a3 2f cmpdi   cr7,r3,0
> 6dd0:   c8 00 9e 40 bne cr7,6e98 
> 
> return dentry->d_flags & DCACHE_ENTRY_TYPE;
> 6dd4:   08 00 3a e9 ld  r9,8(r26)
> 6dd8:   00 00 29 81 lwz r9,0(r9)  <<=== crashes here ??

Yeah.

And r9 is NULL.

It came from r26 + 8, r26 is c0082802fbc8  which looks OK.

r26 is nd, the nameidata:

struct nameidata {
struct path path;

struct path {
struct vfsmount *mnt;
struct dentry *dentry;
} __randomize_layout;

r26 + 8 should be the dentry in the struct path.

So we have an nd.path with a NULL dentry.
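
In C terms, the two faulting loads in the disassembly below correspond to
(an editorial sketch of the dereference chain):

        struct dentry *dentry = nd->path.dentry;        /* ld  r9,8(r26) */
        unsigned int flags = dentry->d_flags;           /* lwz r9,0(r9) <- NULL deref */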


> 6ddc:   56 02 29 55 rlwinm  r9,r9,0,9,11
> if (unlikely(!d_can_lookup(nd->path.dentry))) {
> 6de0:   00 e0 89 7f cmpwcr7,r9,r28
>
> The code in question (link_path_walk() in fs/namei.c ) was recently changed by
> following commit:
>
> commit 881386f7e46a: 
>   link_path_walk(): sample parent's i_uid and i_mode for the last component

That and about 10 other commits.

Unless Al can give us a clue we'll need to bisect.

cheers


> Crash while running sosreport command:
>
> [ 1917.926113] BUG: Kernel NULL pointer dereference on read at 0x
> [ 1917.926126] Faulting instruction address: 0xc043f638
> [ 1917.926131] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 1917.926136] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [ 1917.926147] Dumping ftrace buffer:
> [ 1917.926157](ftrace buffer empty)
> [ 1917.926161] Modules linked in: iscsi_target_mod target_core_mod macsec 
> 

[PATCH v2] macintosh: convert to i2c_new_scanned_device

2020-03-26 Thread Wolfram Sang
Move from the deprecated i2c_new_probed_device() to the new
i2c_new_scanned_device(). No functional change for this driver because
it doesn't check the return code anyhow.

Signed-off-by: Wolfram Sang 
Acked-by: Michael Ellerman 
---

Change since v1: rebased on top of v5.6-rc7

 drivers/macintosh/therm_windtunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/therm_windtunnel.c 
b/drivers/macintosh/therm_windtunnel.c
index a0d87ed9da69..f55f6adf5e5f 100644
--- a/drivers/macintosh/therm_windtunnel.c
+++ b/drivers/macintosh/therm_windtunnel.c
@@ -323,7 +323,7 @@ static void do_attach(struct i2c_adapter *adapter)
of_node_put(np);
} else {
strlcpy(info.type, "MAC,ds1775", I2C_NAME_SIZE);
-   i2c_new_probed_device(adapter, &info, scan_ds1775, NULL);
+   i2c_new_scanned_device(adapter, &info, scan_ds1775, NULL);
}
 
np = of_find_compatible_node(adapter->dev.of_node, NULL, "MAC,adm1030");
@@ -331,7 +331,7 @@ static void do_attach(struct i2c_adapter *adapter)
of_node_put(np);
} else {
strlcpy(info.type, "MAC,adm1030", I2C_NAME_SIZE);
-   i2c_new_probed_device(adapter, &info, scan_adm1030, NULL);
+   i2c_new_scanned_device(adapter, &info, scan_adm1030, NULL);
}
 }
 
-- 
2.20.1



Re: [PATCH -next] usb: gadget: fsl: remove unused variable 'driver_desc'

2020-03-26 Thread Peter Chen
On 20-03-26 15:14:19, YueHaibing wrote:
> drivers/usb/gadget/udc/fsl_udc_core.c:56:19:
>  warning: 'driver_desc' defined but not used [-Wunused-const-variable=]
> 
> It is never used, so remove it.
> 
> Reported-by: Hulk Robot 
> Signed-off-by: YueHaibing 
> ---
>  drivers/usb/gadget/udc/fsl_udc_core.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/usb/gadget/udc/fsl_udc_core.c 
> b/drivers/usb/gadget/udc/fsl_udc_core.c
> index ec6eda426223..febabde62f71 100644
> --- a/drivers/usb/gadget/udc/fsl_udc_core.c
> +++ b/drivers/usb/gadget/udc/fsl_udc_core.c
> @@ -53,7 +53,6 @@
>  #define  DMA_ADDR_INVALID(~(dma_addr_t)0)
>  
>  static const char driver_name[] = "fsl-usb2-udc";
> -static const char driver_desc[] = DRIVER_DESC;
>  
>  static struct usb_dr_device __iomem *dr_regs;
>  
> -- 
> 2.17.1
> 
> 

Reviewed-by: Peter Chen 

-- 

Thanks,
Peter Chen

Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Thu 26-03-20 11:16:33, Michal Hocko wrote:
> On Thu 26-03-20 15:26:22, Aneesh Kumar K.V wrote:
> > On 3/26/20 3:10 PM, Michal Hocko wrote:
> > > On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> > > > Fixes the below crash
> > > > 
> > > > BUG: Kernel NULL pointer dereference on read at 0x
> > > > Faulting instruction address: 0xc0c3447c
> > > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > > > CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> > > > ...
> > > > NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> > > > LR [c0088354] vmemmap_free+0x144/0x320
> > > > Call Trace:
> > > >   section_deactivate+0x220/0x240
> > > 
> > > It would be great to match this to the specific source code.
> > 
> > The crash is due to NULL dereference at
> > 
> > test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL;
> 
> It would be nice to call that out here as well
> 
> [...]
> > > Why do we have to free usage before deactivating section memmap? Now that
> > > we have a late section_mem_map reset shouldn't we tear down the usage in
> > > the same branch?
> > > 
> > 
> > We still need to make the section invalid before we call into
> > depopulate_section_memmap(). Because architecture like powerpc can share
> > vmemmap area across sections (16MB mapping of vmemmap area) and we use
> > vmemmap_populated() to make that decision.
> 
> This should be noted in a comment as well.
> 
> > > > Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> > > > SPARSEMEM|!VMEMMAP case")
> > > > Cc: Baoquan He 
> > > > Reported-by: Sachin Sant 
> > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ---
> > > >   mm/sparse.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aadb7298dcef..3012d1f3771a 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > > > unsigned long nr_pages,
> > > > ms->usage = NULL;
> > > > }
> > > > memmap = sparse_decode_mem_map(ms->section_mem_map, 
> > > > section_nr);
> > > > +   /* Mark the section invalid */
> > > > +   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > > 
> > > Btw. this comment is not really helping at all.
> > 
> > That is marking the section invalid so that
> > 
> > static inline int valid_section(struct mem_section *section)
> > {
> > return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> > }
> > 
> > 
> > returns false.
> 
> Yes, that is obvious once you are clear where to look. I was really
> hoping for a comment that would simply point you in the right
> direction without chasing SECTION_HAS_MEM_MAP usage. This code is
> subtle, and useful comments, even when they state something that is
> obvious to you _right_now_, can be really helpful.

Btw. forgot to add. With the improved comment feel free to add
Acked-by: Michal Hocko 

-- 
Michal Hocko
SUSE Labs


Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

2020-03-26 Thread maddy




On 3/18/20 11:05 PM, Kim Phillips wrote:

Hi Maddy,

On 3/17/20 1:50 AM, maddy wrote:

On 3/13/20 4:08 AM, Kim Phillips wrote:

On 3/11/20 11:00 AM, Ravi Bangoria wrote:

On 3/6/20 3:36 AM, Kim Phillips wrote:

On 3/3/20 3:55 AM, Kim Phillips wrote:

On 3/2/20 2:21 PM, Stephane Eranian wrote:

On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra  wrote:

On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:

Modern processors export such hazard data in Performance
Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
AMD[3] provides similar information.

Implementation detail:

A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
If it's set, kernel converts arch specific hazard information
into generic format:

  struct perf_pipeline_haz_data {
     /* Instruction/Opcode type: Load, Store, Branch  */
     __u8    itype;
     /* Instruction Cache source */
     __u8    icache;
     /* Instruction suffered hazard in pipeline stage */
     __u8    hazard_stage;
     /* Hazard reason */
     __u8    hazard_reason;
     /* Instruction suffered stall in pipeline stage */
     __u8    stall_stage;
     /* Stall reason */
     __u8    stall_reason;
     __u16   pad;
  };

Kim, does this format indeed work for AMD IBS?

It's not really 1:1: we don't have these separations of stages
and reasons; we have, for example, missed in L2 cache.
So IBS output is flatter, with more cycle latency figures than
IBM's, AFAICT.

AMD IBS captures pipeline latency data in the case of Fetch sampling, like the
fetch latency, tag-to-retire latency, completion-to-retire latency, and
so on. Yes, Ops sampling does provide more data on load/store-centric
information. But it also captures more detailed data for branch instructions.
And we also looked at ARM SPE, which also captures more detailed pipeline
data and latency information.


Personally, I don't like the term hazard. This is too IBM Power
specific. We need to find a better term, maybe stall or penalty.

Right, IBS doesn't have a filter to only count stalled or otherwise
bad events.  IBS' PPR description has one occurrence of the
word stall, and no penalty.  The way I read IBS is it's just
reporting more sample data than just the precise IP: things like
hits, misses, cycle latencies, addresses, types, etc., so words
like 'extended', or the 'auxiliary' already used today, are even
more appropriate for IBS, although I'm the last person to
bikeshed.

We are thinking of using the word "pipeline" instead of "hazard".

Hm, the word 'pipeline' occurs 0 times in IBS documentation.

NP. We thought "pipeline" is a generic hardware term, so we proposed it.
We are open to any term that is generic enough.


I realize there are a couple of core pipeline-specific pieces
of information coming out of it, but the vast majority
are addresses, latencies of various components in the memory
hierarchy, and various component hit/miss bits.

Yes, we should capture core-pipeline-specific details. For example,
IBS generates branch unit information (IbsOpData1) and icache-related
data (IbsFetchCtl), which is something that shouldn't be extended as
part of perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
  __u64   mem_rsvd:24,
  mem_snoopx:2,   /* snoop mode, ext */
  mem_remote:1,   /* remote */
  mem_lvl_num:4,  /* memory hierarchy level number */
  mem_dtlb:7, /* tlb access */
  mem_lock:2, /* lock instr */
  mem_snoop:5,    /* snoop mode */
  mem_lvl:14, /* memory hierarchy level */
  mem_op:5;   /* type of opcode */


E.g., SIER[LDST] and SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num]; SIER_TYPE can be used to populate 'mem_op' and
'mem_lock'; and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?

Hi Kim,

Yes. We do expose these data as part of perf-mem for POWER.

OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
isa207_find_source now, thanks.
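
For anyone following along, a minimal sketch of the kind of decode such a
mapping performs; the index values below are made up for illustration and
are not POWER9's real reload-source encodings:

/* Illustrative only: map a (hypothetical) reload-source index into the
 * perf_mem_data_src union quoted above. PERF_MEM_LVL_* are the real
 * uapi flags. */
static void fill_data_src(unsigned int sub_idx, union perf_mem_data_src *dsrc)
{
	switch (sub_idx) {
	case 0:
		dsrc->mem_lvl = PERF_MEM_LVL_L1 | PERF_MEM_LVL_HIT;
		break;
	case 1:
		dsrc->mem_lvl = PERF_MEM_LVL_L2 | PERF_MEM_LVL_HIT;
		break;
	default:
		dsrc->mem_lvl = PERF_MEM_LVL_NA;
		break;
	}
}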


For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.


What's needed here is vendor-specific extended
sample information covering what all these technologies gather,
of which things like 'L1 TLB cycle latency' we
should all have in common.

Yes. We will include fields to capture the latency cycles (like issue
latency, instruction completion latency, etc.) along with other pipeline
details in the proposed structure.
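
Purely as a strawman, such an extension might look like the following;
the latency fields are hypothetical and not part of the posted RFC:

struct perf_pipeline_haz_data {
	__u8    itype;
	__u8    icache;
	__u8    hazard_stage;
	__u8    hazard_reason;
	__u8    stall_stage;
	__u8    stall_reason;
	/* hypothetical additions, not in the RFC: */
	__u16   issue_lat;      /* dispatch-to-issue cycles */
	__u16   compl_lat;      /* issue-to-completion cycles */
	__u16   pad;
};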

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has 

Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Thu 26-03-20 15:26:22, Aneesh Kumar K.V wrote:
> On 3/26/20 3:10 PM, Michal Hocko wrote:
> > On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> > > Fixes the below crash
> > > 
> > > BUG: Kernel NULL pointer dereference on read at 0x
> > > Faulting instruction address: 0xc0c3447c
> > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > > CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> > > ...
> > > NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> > > LR [c0088354] vmemmap_free+0x144/0x320
> > > Call Trace:
> > >   section_deactivate+0x220/0x240
> > 
> > It would be great to match this to the specific source code.
> 
> The crash is due to NULL dereference at
> 
> test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL;

It would be nice to call that out here as well

[...]
> > Why do we have to free usage before deactivating section memmap? Now that
> > we have a late section_mem_map reset shouldn't we tear down the usage in
> > the same branch?
> > 
> 
> We still need to make the section invalid before we call into
> depopulate_section_memmap(). Because architecture like powerpc can share
> vmemmap area across sections (16MB mapping of vmemmap area) and we use
> vmemmap_populated() to make that decision.

This should be noted in a comment as well.

> > > Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> > > SPARSEMEM|!VMEMMAP case")
> > > Cc: Baoquan He 
> > > Reported-by: Sachin Sant 
> > > Signed-off-by: Aneesh Kumar K.V 
> > > ---
> > >   mm/sparse.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > index aadb7298dcef..3012d1f3771a 100644
> > > --- a/mm/sparse.c
> > > +++ b/mm/sparse.c
> > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> > > unsigned long nr_pages,
> > >   ms->usage = NULL;
> > >   }
> > >   memmap = sparse_decode_mem_map(ms->section_mem_map, 
> > > section_nr);
> > > + /* Mark the section invalid */
> > > + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > 
> > Btw. this comment is not really helping at all.
> 
> That is marking the section invalid so that
> 
> static inline int valid_section(struct mem_section *section)
> {
>   return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> }
> 
> 
> returns false.

Yes, that is obvious once you are clear where to look. I was really
hoping for a comment that would simply point you in the right
direction without chasing SECTION_HAS_MEM_MAP usage. This code is
subtle, and useful comments, even when they state something that is
obvious to you _right_now_, can be really helpful.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] soc: fsl: dpio: QMAN performance improvement. Function pointer indirection.

2020-03-26 Thread Arnd Bergmann
On Thu, Dec 12, 2019 at 6:02 PM Youri Querry  wrote:
>
> We are making the access decision in the initialization and
> setting the function pointers accordingly.
>
> Signed-off-by: Youri Querry 
> ---
>  drivers/soc/fsl/dpio/qbman-portal.c | 451 
> +++-
>  drivers/soc/fsl/dpio/qbman-portal.h | 129 ++-
>  2 files changed, 507 insertions(+), 73 deletions(-)

While merging pull requests, I came across some style issues with this driver.
I'm pulling it anyway, but please have another look and fix these for the next
release, or send a follow-up patch for the coming merge window.

>
> diff --git a/drivers/soc/fsl/dpio/qbman-portal.c 
> b/drivers/soc/fsl/dpio/qbman-portal.c
> index 5a37ac8..0ffe018 100644
> --- a/drivers/soc/fsl/dpio/qbman-portal.c
> +++ b/drivers/soc/fsl/dpio/qbman-portal.c
> @@ -83,6 +83,82 @@ enum qbman_sdqcr_fc {
> qbman_sdqcr_fc_up_to_3 = 1
>  };
>
> +/* Internal Function declaration */
> +static int qbman_swp_enqueue_direct(struct qbman_swp *s,
> +   const struct qbman_eq_desc *d,
> +   const struct dpaa2_fd *fd);
> +static int qbman_swp_enqueue_mem_back(struct qbman_swp *s,
> + const struct qbman_eq_desc *d,
> + const struct dpaa2_fd *fd);
> +static int qbman_swp_enqueue_multiple_direct(struct qbman_swp *s,
> +const struct qbman_eq_desc *d,
> +const struct dpaa2_fd *fd,
> +uint32_t *flags,
> +int num_frames);
> +static int qbman_swp_enqueue_multiple_mem_back(struct qbman_swp *s,
> +  const struct qbman_eq_desc *d,
> +  const struct dpaa2_fd *fd,
> +  uint32_t *flags,
> +  int num_frames);

Please try to avoid all static forward declarations. The coding style for the
kernel generally mandates that you define the functions in the order they
are used in, and have no such declarations, unless there is a recursion
that requires it. If you do have recursion, then please add a comment that
explains how you limit it to avoid overrunning the kernel stack.
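
A self-contained illustration of the preferred ordering (the names are
generic, not the dpio driver's):

#include <stdio.h>

/* callee defined before its caller: no forward declaration needed */
static int enqueue_direct(int frame)
{
	return frame + 1;
}

static int enqueue(int frame)
{
	return enqueue_direct(frame);
}

int main(void)
{
	printf("%d\n", enqueue(41));
	return 0;
}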

> +const struct dpaa2_dq *qbman_swp_dqrr_next_direct(struct qbman_swp *s);
> +const struct dpaa2_dq *qbman_swp_dqrr_next_mem_back(struct qbman_swp *s);

Forward declarations for non-static functions in C code are much
worse, you should
never need these. If a function is shared between files, then put the
declaration
into a header file that is included by both, to ensure the prototypes match, and
if it's only used here, then make it 'static'.

> +/* Function pointers */
> +int (*qbman_swp_enqueue_ptr)(struct qbman_swp *s,
> +const struct qbman_eq_desc *d,
> +const struct dpaa2_fd *fd)
> +   = qbman_swp_enqueue_direct;
> +
> +int (*qbman_swp_enqueue_multiple_ptr)(struct qbman_swp *s,
> + const struct qbman_eq_desc *d,
> + const struct dpaa2_fd *fd,
> + uint32_t *flags,
> +int num_frames)
> +   = qbman_swp_enqueue_multiple_direct;

This looks like you just have an indirect function pointer with a
single possible
implementation. This is less of a problem, but until you have a way to safely
override these at runtime, it may be better to simplify this by using direct
function calls.
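
For contrast, a minimal sketch of the init-time selection the series is
aiming for (illustrative, not the driver's actual code); the indirection
only pays for itself once a second variant, such as a mem-back portal,
is actually wired in:

#include <stdio.h>

static int enqueue_direct(int frame)   { return frame + 1; }
static int enqueue_mem_back(int frame) { return frame + 2; }

/* chosen once at portal init, never changed afterwards */
static int (*enqueue)(int frame) = enqueue_direct;

static void portal_init(int mem_back_capable)
{
	if (mem_back_capable)
		enqueue = enqueue_mem_back;
}

int main(void)
{
	portal_init(1);
	printf("%d\n", enqueue(40));
	return 0;
}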

   Arnd


Re: [PATCH v2] powerpc/XIVE: SVM: share the event-queue page with the Hypervisor.

2020-03-26 Thread Greg Kurz
On Thu, 26 Mar 2020 01:38:47 -0700
Ram Pai  wrote:

> XIVE interrupt controller use an Event Queue (EQ) to enqueue event
> notifications when an exception occurs. The EQ is a single memory page
> provided by the O/S defining a circular buffer, one per server and
> priority couple.
> 
> On baremetal, the EQ page is configured with an OPAL call. On pseries,
> an extra hop is necessary and the guest OS uses the hcall
> H_INT_SET_QUEUE_CONFIG to configure the XIVE interrupt controller.
> 
> The XIVE controller being Hypervisor privileged, it will not be allowed
> to enqueue event notifications for a Secure VM unless the EQ pages are
> shared by the Secure VM.
> 
> Hypervisor/Ultravisor still requires support for the TIMA and ESB page
> fault handlers. Until this is complete, QEMU can use the emulated XIVE
> device for Secure VMs, option "kernel_irqchip=off" on the QEMU pseries
> machine.
> 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Michael Ellerman 
> Cc: Thiago Jung Bauermann 
> Cc: Michael Anderson 
> Cc: Sukadev Bhattiprolu 
> Cc: Alexey Kardashevskiy 
> Cc: Paul Mackerras 
> Cc: Greg Kurz 
> Cc: Cedric Le Goater 
> Cc: David Gibson 
> Signed-off-by: Ram Pai 
> 
> v2: better description of the patch from Cedric.
> ---

Reviewed-by: Greg Kurz 

>  arch/powerpc/sysdev/xive/spapr.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> b/arch/powerpc/sysdev/xive/spapr.c
> index 55dc61c..608b52f 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -26,6 +26,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include "xive-internal.h"
>  
> @@ -501,6 +503,9 @@ static int xive_spapr_configure_queue(u32 target, struct 
> xive_q *q, u8 prio,
>   rc = -EIO;
>   } else {
>   q->qpage = qpage;
> + if (is_secure_guest())
> + uv_share_page(PHYS_PFN(qpage_phys),
> + 1 << xive_alloc_order(order));
>   }
>  fail:
>   return rc;
> @@ -534,6 +539,8 @@ static void xive_spapr_cleanup_queue(unsigned int cpu, 
> struct xive_cpu *xc,
>  hw_cpu, prio);
>  
>   alloc_order = xive_alloc_order(xive_queue_shift);
> + if (is_secure_guest())
> + uv_unshare_page(PHYS_PFN(__pa(q->qpage)), 1 << alloc_order);
>   free_pages((unsigned long)q->qpage, alloc_order);
>   q->qpage = NULL;
>  }



Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Aneesh Kumar K.V

On 3/26/20 3:10 PM, Michal Hocko wrote:

On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:

Fixes the below crash

BUG: Kernel NULL pointer dereference on read at 0x
Faulting instruction address: 0xc0c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c0c3447c] vmemmap_populated+0x98/0xc0
LR [c0088354] vmemmap_free+0x144/0x320
Call Trace:
  section_deactivate+0x220/0x240


It would be great to match this to the specific source code.


The crash is due to NULL dereference at

test_bit(idx, ms->usage->subsection_map); due to ms->usage = NULL;

which is explained in a later part of the commit.



  __remove_pages+0x118/0x170
  arch_remove_memory+0x3c/0x150
  memunmap_pages+0x1cc/0x2f0
  devm_action_release+0x30/0x50
  release_nodes+0x2f8/0x3e0
  device_release_driver_internal+0x168/0x270
  unbind_store+0x130/0x170
  drv_attr_store+0x44/0x60
  sysfs_kf_write+0x68/0x80
  kernfs_fop_write+0x100/0x290
  __vfs_write+0x3c/0x70
  vfs_write+0xcc/0x240
  ksys_write+0x7c/0x140
  system_call+0x5c/0x68

With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP 
case")
section_mem_map is set to NULL after depopulate_section_memmap(). This
was done so that pfn_to_page() can work correctly with a kernel config that disables
SPARSEMEM_VMEMMAP. With that config pfn_to_page does

__section_mem_map_addr(__sec) + __pfn;
where

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
unsigned long map = section->section_mem_map;
map &= SECTION_MAP_MASK;
return (struct page *)map;
}

Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
check the pfn validity (pfn_valid()). Since section_deactivate() releases
mem_section->usage if a section is fully deactivated, a pfn_valid() check after
a subsection deactivate causes a kernel crash.

static inline int pfn_valid(unsigned long pfn)
{
...
return early_section(ms) || pfn_section_valid(ms, pfn);
}

where

static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{



int idx = subsection_map_index(pfn);

return test_bit(idx, ms->usage->subsection_map);
}

Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.


I am sorry, I hadn't noticed that during the review of the commit
mentioned above. This is all subtle as hell, I have to say.

Why do we have to free usage before deactivating section memmap? Now that
we have a late section_mem_map reset, shouldn't we tear down the usage in
the same branch?



We still need to make the section invalid before we call into
depopulate_section_memmap(), because architectures like powerpc can share
the vmemmap area across sections (16MB mapping of the vmemmap area) and we use
vmemmap_populated() to make that decision.
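
For readers outside powerpc, the check in question looks roughly like
this; a simplified sketch loosely based on arch/powerpc/mm/init_64.c:

/* Sketch: the 16MB vmemmap page backing 'start' may cover several
 * sections; it can only be freed once none of them is valid anymore,
 * which is why the section must be invalidated first. */
static int vmemmap_populated(unsigned long start, int page_size)
{
	unsigned long end = start + page_size;

	start = (unsigned long)pfn_to_page(vmemmap_section_start(start));
	for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
		if (pfn_valid(page_to_pfn((struct page *)start)))
			return 1;
	return 0;
}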





Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP 
case")
Cc: Baoquan He 
Reported-by: Sachin Sant 
Signed-off-by: Aneesh Kumar K.V 
---
  mm/sparse.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/mm/sparse.c b/mm/sparse.c
index aadb7298dcef..3012d1f3771a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned 
long nr_pages,
ms->usage = NULL;
}
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+   /* Mark the section invalid */
+   ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;


Btw. this comment is not really helping at all.


That is marking the section invalid so that

static inline int valid_section(struct mem_section *section)
{
return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
}


returns false.


/*
 * section->usage is gone and VMEMMAP's pfn_valid depends
 * on it (see pfn_section_valid)
 */

}
  
  	if (section_is_early && memmap)

--
2.25.1







Re: [PATCH] mm/sparse: Fix kernel crash with pfn_section_valid check

2020-03-26 Thread Michal Hocko
On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x
> Faulting instruction address: 0xc0c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c0c3447c] vmemmap_populated+0x98/0xc0
> LR [c0088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240

It would be great to match this to the specific source code.

>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> With commit: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does
> 
>   __section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
>   unsigned long map = section->section_mem_map;
>   map &= SECTION_MAP_MASK;
>   return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used
> to check the pfn validity (pfn_valid()). Since section_deactivate() releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection deactivate causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
>   return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {

>   int idx = subsection_map_index(pfn);
> 
>   return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.

I am sorry, I hadn't noticed that during the review of the commit
mentioned above. This is all subtle as hell, I have to say.

Why do we have to free usage before deactivating section memmap? Now that
we have a late section_mem_map reset, shouldn't we tear down the usage in
the same branch?

> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in 
> SPARSEMEM|!VMEMMAP case")
> Cc: Baoquan He 
> Reported-by: Sachin Sant 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, 
> unsigned long nr_pages,
>   ms->usage = NULL;
>   }
>   memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> + /* Mark the section invalid */
> + ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;

Btw. this comment is not really helping at all.
/*
 * section->usage is gone and VMEMMAP's pfn_valid depends
 * on it (see pfn_section_valid)
 */
>   }
>  
>   if (section_is_early && memmap)
> -- 
> 2.25.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] powerpc/XIVE: SVM: share the event-queue page with the Hypervisor.

2020-03-26 Thread Cédric Le Goater
On 3/26/20 9:38 AM, Ram Pai wrote:
> XIVE interrupt controller use an Event Queue (EQ) to enqueue event

The XIVE interrupt controller uses ... (my bad)

> notifications when an exception occurs. The EQ is a single memory page
> provided by the O/S defining a circular buffer, one per server and
> priority couple.
> 
> On baremetal, the EQ page is configured with an OPAL call. On pseries,
> an extra hop is necessary and the guest OS uses the hcall
> H_INT_SET_QUEUE_CONFIG to configure the XIVE interrupt controller.
> 
> The XIVE controller being Hypervisor privileged, it will not be allowed
> to enqueue event notifications for a Secure VM unless the EQ pages are
> shared by the Secure VM.
> 
> Hypervisor/Ultravisor still requires support for the TIMA and ESB page
> fault handlers. Until this is complete, QEMU can use the emulated XIVE
> device for Secure VMs, option "kernel_irqchip=off" on the QEMU pseries
> machine.
> 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Michael Ellerman 
> Cc: Thiago Jung Bauermann 
> Cc: Michael Anderson 
> Cc: Sukadev Bhattiprolu 
> Cc: Alexey Kardashevskiy 
> Cc: Paul Mackerras 
> Cc: Greg Kurz 
> Cc: Cedric Le Goater 

c...@fr.ibm.com is insecure. Please use c...@kaod.org.

> Cc: David Gibson 
> Signed-off-by: Ram Pai 

Reviewed-by: Cedric Le Goater 

Thanks,

C.

> 
> v2: better description of the patch from Cedric.
> ---
>  arch/powerpc/sysdev/xive/spapr.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> b/arch/powerpc/sysdev/xive/spapr.c
> index 55dc61c..608b52f 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -26,6 +26,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "xive-internal.h"
> 
> @@ -501,6 +503,9 @@ static int xive_spapr_configure_queue(u32 target, struct 
> xive_q *q, u8 prio,
>   rc = -EIO;
>   } else {
>   q->qpage = qpage;
> + if (is_secure_guest())
> + uv_share_page(PHYS_PFN(qpage_phys),
> + 1 << xive_alloc_order(order));
>   }
>  fail:
>   return rc;
> @@ -534,6 +539,8 @@ static void xive_spapr_cleanup_queue(unsigned int cpu, 
> struct xive_cpu *xc,
>  hw_cpu, prio);
> 
>   alloc_order = xive_alloc_order(xive_queue_shift);
> + if (is_secure_guest())
> + uv_unshare_page(PHYS_PFN(__pa(q->qpage)), 1 << alloc_order);
>   free_pages((unsigned long)q->qpage, alloc_order);
>   q->qpage = NULL;
>  }
> 



[PATCH v2] powerpc/XIVE: SVM: share the event-queue page with the Hypervisor.

2020-03-26 Thread Ram Pai
XIVE interrupt controller use an Event Queue (EQ) to enqueue event
notifications when an exception occurs. The EQ is a single memory page
provided by the O/S defining a circular buffer, one per server and
priority couple.

On baremetal, the EQ page is configured with an OPAL call. On pseries,
an extra hop is necessary and the guest OS uses the hcall
H_INT_SET_QUEUE_CONFIG to configure the XIVE interrupt controller.

The XIVE controller being Hypervisor privileged, it will not be allowed
to enqueue event notifications for a Secure VM unless the EQ pages are
shared by the Secure VM.

Hypervisor/Ultravisor still requires support for the TIMA and ESB page
fault handlers. Until this is complete, QEMU can use the emulated XIVE
device for Secure VMs, option "kernel_irqchip=off" on the QEMU pseries
machine.

Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Thiago Jung Bauermann 
Cc: Michael Anderson 
Cc: Sukadev Bhattiprolu 
Cc: Alexey Kardashevskiy 
Cc: Paul Mackerras 
Cc: Greg Kurz 
Cc: Cedric Le Goater 
Cc: David Gibson 
Signed-off-by: Ram Pai 

v2: better description of the patch from Cedric.
---
 arch/powerpc/sysdev/xive/spapr.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 55dc61c..608b52f 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "xive-internal.h"
 
@@ -501,6 +503,9 @@ static int xive_spapr_configure_queue(u32 target, struct 
xive_q *q, u8 prio,
rc = -EIO;
} else {
q->qpage = qpage;
+   if (is_secure_guest())
+   uv_share_page(PHYS_PFN(qpage_phys),
+   1 << xive_alloc_order(order));
}
 fail:
return rc;
@@ -534,6 +539,8 @@ static void xive_spapr_cleanup_queue(unsigned int cpu, 
struct xive_cpu *xc,
   hw_cpu, prio);
 
alloc_order = xive_alloc_order(xive_queue_shift);
+   if (is_secure_guest())
+   uv_unshare_page(PHYS_PFN(__pa(q->qpage)), 1 << alloc_order);
free_pages((unsigned long)q->qpage, alloc_order);
q->qpage = NULL;
 }
-- 
1.8.3.1



[powerpc] Intermittent crashes ( link_path_walk) with linux-next

2020-03-26 Thread Sachin Sant
I am running into intermittent crashes with linux-next on a POWER9 PowerVM LPAR.
First it was against next-20200324, while running LTP tests. With next-20200325
I ran into a similar crash (a different stack trace but the same failure point,
link_path_walk) while running the sosreport command.

BUG: Kernel NULL pointer dereference on read at 0x
Faulting instruction address: 0xc043f278
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in: loop iscsi_target_mod target_core_mod macsec tcp_diag 
udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc overlay 
dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6_tables 
nft_compat ip_set rfkill nf_tables nfnetlink sunrpc sg pseries_rng 
uio_pdrv_genirq uio sch_fq_codel ip_tables ext4 mbcache jbd2 sr_mod sd_mod 
cdrom t10_pi ibmvscsi scsi_transport_srp ibmveth
CPU: 26 PID: 7771 Comm: avocado Not tainted 5.6.0-rc7-next-20200324-autotest #1
NIP:  c043f278 LR: c043f330 CTR: 01fb
REGS: c0082802f830 TRAP: 0300   Not tainted  
(5.6.0-rc7-next-20200324-autotest)
MSR:  80009033   CR: 28248442  XER: 2004
CFAR: c000dec4 DAR:  DSISR: 4000 IRQMASK: 0 
GPR00: c043f330 c0082802fac0 c155e900  
GPR04: 0002  0002 c008b340 
GPR08: 00031419    
GPR12: 8000 c0001ec48600 7fffa08a53f8 0001 
GPR16: 7fff9faf9a63 0100073bec00 7fff9f2493b0  
GPR20: 7fffa1143bf8 7fffa1103b18 c0087f547cb3 2f2f2f2f2f2f2f2f 
GPR24: 0003  c0082802fbc8 f000 
GPR28: 0020  61c8864680b583eb  
NIP [c043f278] link_path_walk.part.49+0x228/0x400
LR [c043f330] link_path_walk.part.49+0x2e0/0x400
Call Trace:
[c0082802fac0] [c043f330] link_path_walk.part.49+0x2e0/0x400 
(unreliable)
[c0082802fb50] [c043f5a4] path_lookupat.isra.51+0x64/0x1f0
[c0082802fba0] [c0441c00] filename_lookup.part.69+0xa0/0x1b0
[c0082802fce0] [c042ff38] vfs_statx+0xa8/0x190
[c0082802fd60] [c04302a0] __do_sys_newstat+0x40/0x90
[c0082802fe20] [c000b278] system_call+0x5c/0x68
Instruction dump:
3bff e93a0058 3880 7f43d378 7fff07b4 1d5f0030 7d295214 eac90020 
4bfffb21 2fa3 409e00c8 e93a0008 <8129> 55290256 7f89e000 419efecc 
 ---[ end trace 34abf29ebd56e423 ]—

Relevant snippet from objdump:

link = walk_component(nd, 0);
6db4:   78 d3 43 7f mr  r3,r26
name = nd->stack[--depth].name;
6db8:   b4 07 ff 7f extsw   r31,r31
6dbc:   30 00 5f 1d mulli   r10,r31,48
6dc0:   14 52 29 7d add r9,r9,r10
6dc4:   20 00 c9 ea ld  r22,32(r9)
link = walk_component(nd, 0);
6dc8:   01 00 00 48 bl  6dc8 
if (unlikely(link)) {
6dcc:   00 00 a3 2f cmpdi   cr7,r3,0
6dd0:   c8 00 9e 40 bne cr7,6e98 
return dentry->d_flags & DCACHE_ENTRY_TYPE;
6dd4:   08 00 3a e9 ld  r9,8(r26)
6dd8:   00 00 29 81 lwz r9,0(r9)  <<=== crashes here ??
6ddc:   56 02 29 55 rlwinm  r9,r9,0,9,11
if (unlikely(!d_can_lookup(nd->path.dentry))) {
6de0:   00 e0 89 7f cmpwcr7,r9,r28

The code in question (link_path_walk() in fs/namei.c) was recently changed by
the following commit:

commit 881386f7e46a: 
  link_path_walk(): sample parent's i_uid and i_mode for the last component

Thanks
-Sachin






Re: [PATCHv3] powerpc/crashkernel: take "mem=" option into account

2020-03-26 Thread Hari Bathini
Hello Pingfan,

Thanks for the patch..

On 19/02/20 7:48 PM, Pingfan Liu wrote:
> The 'mem=' option is an easy way to put high pressure on memory during some
> test. Hence after applying the memory limit, instead of total mem, the
> actual usable memory should be considered when reserving mem for
> crashkernel. Otherwise the boot up may experience OOM issue.
> 
> E.g. it would reserve 4G prior to the change and 512M afterward, if passing
> crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and
> mem=5G on a 256G machine.
> 
> This issue is powerpc specific because it puts higher priority on fadump
> and kdump reservation than on "mem=". Referring the following code:
>   if (fadump_reserve_mem() == 0)
>   reserve_crashkernel();
>   ...
>   /* Ensure that total memory size is page-aligned. */
>   limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
>   memblock_enforce_memory_limit(limit);
> 
> While on other arches, the effect of "mem=" takes a higher priority and pass
> through memblock_phys_mem_size() before calling reserve_crashkernel().
> 
> Signed-off-by: Pingfan Liu 
> To: linuxppc-dev@lists.ozlabs.org
> Cc: Hari Bathini 
> Cc: Michael Ellerman 
> Cc: ke...@lists.infradead.org
> ---
> v2 -> v3: improve commit log
>  arch/powerpc/kernel/machine_kexec.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/machine_kexec.c 
> b/arch/powerpc/kernel/machine_kexec.c
> index c4ed328..eec96dc 100644
> --- a/arch/powerpc/kernel/machine_kexec.c
> +++ b/arch/powerpc/kernel/machine_kexec.c
> @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image)
>  
>  void __init reserve_crashkernel(void)
>  {
> - unsigned long long crash_size, crash_base;
> + unsigned long long crash_size, crash_base, total_mem_sz;
>   int ret;
>  
> + total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
>   /* use common parsing */
> - ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> + ret = parse_crashkernel(boot_command_line, total_mem_sz,
>   &crash_size, &crash_base);
>   if (ret == 0 && crash_size > 0) {
>   crashk_res.start = crash_base;

memory_limit is adjusted after this with the below snippet:

	/* Crash kernel trumps memory limit */
	if (memory_limit && memory_limit <= crashk_res.end) {
		memory_limit = crashk_res.end + 1;
		printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
		       memory_limit);
	}

So, either the above snippet must be dropped, or the print below should use an
updated total_mem_sz based on the adjusted memory_limit. I would prefer the
latter..
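
Something along these lines (an untested sketch of the latter option):

	/* Crash kernel trumps memory limit */
	if (memory_limit && memory_limit <= crashk_res.end) {
		memory_limit = crashk_res.end + 1;
		total_mem_sz = memory_limit;	/* keep the printk below honest */
		printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
		       memory_limit);
	}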

> @@ -185,7 +186,7 @@ void __init reserve_crashkernel(void)
>   "for crashkernel (System RAM: %ldMB)\n",
>   (unsigned long)(crash_size >> 20),
>   (unsigned long)(crashk_res.start >> 20),
> - (unsigned long)(memblock_phys_mem_size() >> 20));
> + (unsigned long)(total_mem_sz >> 20));
>  
>   if (!memblock_is_region_memory(crashk_res.start, crash_size) ||
>   memblock_reserve(crashk_res.start, crash_size)) {
> 

-- 
- Hari



[PATCH -next] usb: gadget: fsl: remove unused variable 'driver_desc'

2020-03-26 Thread YueHaibing
drivers/usb/gadget/udc/fsl_udc_core.c:56:19:
 warning: 'driver_desc' defined but not used [-Wunused-const-variable=]

It is never used, so remove it.

Reported-by: Hulk Robot 
Signed-off-by: YueHaibing 
---
 drivers/usb/gadget/udc/fsl_udc_core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/usb/gadget/udc/fsl_udc_core.c 
b/drivers/usb/gadget/udc/fsl_udc_core.c
index ec6eda426223..febabde62f71 100644
--- a/drivers/usb/gadget/udc/fsl_udc_core.c
+++ b/drivers/usb/gadget/udc/fsl_udc_core.c
@@ -53,7 +53,6 @@
 #defineDMA_ADDR_INVALID(~(dma_addr_t)0)
 
 static const char driver_name[] = "fsl-usb2-udc";
-static const char driver_desc[] = DRIVER_DESC;
 
 static struct usb_dr_device __iomem *dr_regs;
 
-- 
2.17.1




[PATCH v6 3/3] powerpc/powernv: Preference optimization for SPRs with constant values

2020-03-26 Thread Pratik Rajesh Sampat
There are SPRs whose values don't tend to change over time, and invoking
self-save on them, where the value is fetched each time, may turn out to
be inefficient. In that case calling self-restore, where the value is
passed in, makes more sense: if the value is the same, the memory location
is not updated.
SPRs that don't change are as follows:
SPRN_HSPRG0,
SPRN_LPCR,
SPRN_PTCR,
SPRN_HMEER,
SPRN_HID0,

There are also SPRs whose values change and/or whose value may not be
correctly determinable in the kernel. Eg: MSR and PSSCR

The value of LPCR is dynamic, based on whether the CPU enters a stop
state during CPU idle versus CPU hotplug.

Therefore this optimization patch introduces the concept of a
preference for each SPR, to choose from in the case both self-save and
self-restore are supported.

The preference bitmask is shown below:

------------------------------
| ... | 2nd pref | 1st pref |
------------------------------
MSB                        LSB

The preference from higher to lower is from LSB to MSB, with a shift of
4 bits (PREFERENCE_SHIFT).
Example:
Prefer self-save first; if not available, then prefer self-restore.
The preference mask for this scenario is:
((FIRMWARE_RESTORE << PREFERENCE_SHIFT) | FIRMWARE_SELF_SAVE)
----------------------------------
| ... | Self restore | Self save |
----------------------------------
MSB                            LSB
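
To make the decoding concrete, a helper of this shape could consume the
mask (illustrative only, not part of the patch; it uses the
PREFERENCE_SHIFT and PREFERENCE_MASK defines added below):

/* nth_preference(PREFER_SAVE_RESTORE, 0) == FIRMWARE_SELF_SAVE,
 * nth_preference(PREFER_SAVE_RESTORE, 1) == FIRMWARE_RESTORE. */
static u32 nth_preference(u32 preferred_mode, int n)
{
	return (preferred_mode >> (n * PREFERENCE_SHIFT)) & PREFERENCE_MASK;
}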

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 88 +--
 1 file changed, 70 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index e77b31621081..4d896df51582 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -43,6 +43,31 @@
 #define FIRMWARE_SELF_SAVE0x2
 #define KERNEL_SAVE_RESTORE   0x4
 
+#define NR_PREFERENCES2
+#define PREFERENCE_SHIFT  4
+#define PREFERENCE_MASK   0xf
+/*
+ * Bitmask defining the kind of preferences available.
+ * Note : The higher to lower preference is from LSB to MSB, with a shift of
+ * 4 bits.
+ * 
+ * ------------------------------
+ * | ... | 2nd pref | 1st pref |
+ * ------------------------------
+ * MSB                        LSB
+/* Prefer Restore if available, otherwise unsupported */
+#define PREFER_SELF_RESTORE_ONLY   FIRMWARE_RESTORE
+/* Prefer Save if available, otherwise unsupported */
+#define PREFER_SELF_SAVE_ONLY  FIRMWARE_SELF_SAVE
+/* Prefer Restore when available, otherwise prefer Save */
+#define PREFER_RESTORE_SAVE((FIRMWARE_SELF_SAVE << \
+ PREFERENCE_SHIFT)\
+ | FIRMWARE_RESTORE)
+/* Prefer Save when available, otherwise prefer Restore*/
+#define PREFER_SAVE_RESTORE((FIRMWARE_RESTORE <<\
+ PREFERENCE_SHIFT)\
+ | FIRMWARE_SELF_SAVE)
+
 static u32 supported_cpuidle_states;
 struct pnv_idle_states_t *pnv_idle_states;
 int nr_pnv_idle_states;
@@ -52,6 +77,7 @@ static bool is_ptcr_self_save;
 
 struct preferred_sprs {
u64 spr;
+   u32 preferred_mode;
u32 supported_mode;
 };
 
@@ -66,42 +92,52 @@ struct preferred_sprs {
 struct preferred_sprs preferred_sprs[] = {
{
.spr = SPRN_HSPRG0,
+   .preferred_mode = PREFER_RESTORE_SAVE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_LPCR,
+   .preferred_mode = PREFER_SAVE_RESTORE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_PTCR,
+   .preferred_mode = PREFER_RESTORE_SAVE,
.supported_mode = KERNEL_SAVE_RESTORE,
},
{
.spr = SPRN_HMEER,
+   .preferred_mode = PREFER_RESTORE_SAVE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_HID0,
+   .preferred_mode = PREFER_RESTORE_SAVE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = P9_STOP_SPR_MSR,
+   .preferred_mode = PREFER_SAVE_RESTORE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = P9_STOP_SPR_PSSCR,
+   .preferred_mode = PREFER_SAVE_RESTORE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_HID1,
+   .preferred_mode = PREFER_RESTORE_SAVE,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_HID4,
+   .preferred_mode = PREFER_SELF_RESTORE_ONLY,
.supported_mode = FIRMWARE_RESTORE,
},
{
.spr = SPRN_HID5,
+   .preferred_mode = PREFER_SELF_RESTORE_ONLY,
.supported_mode = FIRMWARE_RESTORE,
}
 };
@@ -218,7 +254,9 @@ static int 

[PATCH v6 2/3] powerpc/powernv: Introduce support and parsing for self-save API

2020-03-26 Thread Pratik Rajesh Sampat
This commit introduces and leverages the Self save API. The difference
between self-save and self-restore is that the value to be saved for the
SPR does not need to be passed to the call.

Add the new Self Save OPAL API call in the list of OPAL calls.
Implement the self saving of the SPRs based on the support populated.
This commit prefers self-save over self-restore in case both are
supported for a particular SPR.

Along with support for self-save, kernel-supported save-restore is also
populated in the list. This property is only populated for those SPRs
which encapsulate support from the kernel and have the possibility to
garner support from a firmware mode too.

In addition, the commit also parses the device tree for the self-save and
self-restore nodes and populates support for the preferred SPRs based on
what is advertised by the device tree.

In the case an SPR is supported by firmware self-save, self-restore and
kernel save-restore, the preference of execution is in that same order.

Signed-off-by: Pratik Rajesh Sampat 
---
 .../bindings/powerpc/opal/power-mgt.txt   |  18 +++
 arch/powerpc/include/asm/opal-api.h   |   3 +-
 arch/powerpc/include/asm/opal.h   |   1 +
 arch/powerpc/platforms/powernv/idle.c | 131 +-
 arch/powerpc/platforms/powernv/opal-call.c|   1 +
 5 files changed, 146 insertions(+), 8 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt 
b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
index 9d619e955576..5fb03c6d7de9 100644
--- a/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
+++ b/Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt
@@ -116,3 +116,21 @@ otherwise. The length of all the property arrays must be 
the same.
which of the fields of the PMICR are set in the corresponding
entries in ibm,cpu-idle-state-pmicr. This is an optional
property on POWER8 and is absent on POWER9.
+
+- self-restore:
+ Array of unsigned 64-bit values containing a property for sprn-mask
+ with each bit indicating the index of the supported SPR for the
+ functionality. This is an optional property for both Power8 and Power9
+
+- self-save:
+  Array of unsigned 64-bit values containing a property for sprn-mask
+  with each bit indicating the index of the supported SPR for the
+  functionality. This is an optional property for both Power8 and Power9
+
+Example of arrangement of self-restore and self-save arrays:
+For instance if PSSCR is supported, the value is 0x357 = 855.
+Since the array is of 64 bit values, the index of the array is determined by
+855 / 64 = 13th element. Within that index, the bit number is determined by
+855 % 64 = 23rd bit.
+This means that if the 23rd bit in array[13] is set, then that SPR is supported
+by the corresponding self-save or self-restore API.
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index c1f25a760eb1..1b6e1a68d431 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -214,7 +214,8 @@
 #define OPAL_SECVAR_GET176
 #define OPAL_SECVAR_GET_NEXT   177
 #define OPAL_SECVAR_ENQUEUE_UPDATE 178
-#define OPAL_LAST  178
+#define OPAL_SLW_SELF_SAVE_REG 181
+#define OPAL_LAST  181
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 9986ac34b8e2..a370b0e8d899 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -204,6 +204,7 @@ int64_t opal_handle_hmi2(__be64 *out_flags);
 int64_t opal_register_dump_region(uint32_t id, uint64_t start, uint64_t end);
 int64_t opal_unregister_dump_region(uint32_t id);
 int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
+int64_t opal_slw_self_save_reg(uint64_t cpu_pir, uint64_t sprn);
 int64_t opal_config_cpu_idle_state(uint64_t state, uint64_t flag);
 int64_t opal_pci_set_phb_cxl_mode(uint64_t phb_id, uint64_t mode, uint64_t 
pe_number);
 int64_t opal_pci_get_pbcq_tunnel_bar(uint64_t phb_id, uint64_t *addr);
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 858ceb86394d..e77b31621081 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -35,13 +35,20 @@
 /*
  * Type of support for each SPR
  * FIRMWARE_RESTORE: firmware restoration supported: calls self-restore OPAL 
API
+ * FIRMWARE_SELF_SAVE: firmware save and restore: calls self-save OPAL API
+ * KERNEL_SAVE_RESTORE: kernel handles the saving and restoring of SPR
  */
 #define UNSUPPORTED   0x0
 #define FIRMWARE_RESTORE  0x1
+#define FIRMWARE_SELF_SAVE0x2
+#define 
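
The sprn-mask lookup described in the power-mgt.txt hunk above boils down
to a sketch like the following, assuming a uint64_t array parsed from the
device tree and the MSB-first bit orientation documented in patch 4/4 of
this series:

#include <stdbool.h>
#include <stdint.h>

/* Sketch: test whether SPR 'sprn' is advertised in a 2048-bit sprn-mask.
 * Assumes the MSB of word 0 represents bit 0 (MSB-first orientation).
 * For PSSCR (855): word 855 / 64 = 13, bit 855 % 64 = 23 within it. */
static bool spr_advertised(const uint64_t *sprn_mask, unsigned int sprn)
{
	return sprn_mask[sprn / 64] & (1ULL << (63 - (sprn % 64)));
}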

[PATCH v6 1/3] powerpc/powernv: Introduce interface for self-restore support

2020-03-26 Thread Pratik Rajesh Sampat
Introduces an interface that helps determine support for the
self-restore API. The commit is isomorphic to the original interface of
declaring SPRs to self-restore.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 200 +++---
 1 file changed, 152 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 78599bca66c2..858ceb86394d 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -32,10 +32,67 @@
 #define P9_STOP_SPR_MSR 2000
 #define P9_STOP_SPR_PSSCR  855
 
+/*
+ * Type of support for each SPR
+ * FIRMWARE_RESTORE: firmware restoration supported: calls self-restore OPAL 
API
+ */
+#define UNSUPPORTED   0x0
+#define FIRMWARE_RESTORE  0x1
+
 static u32 supported_cpuidle_states;
 struct pnv_idle_states_t *pnv_idle_states;
 int nr_pnv_idle_states;
 
+struct preferred_sprs {
+   u64 spr;
+   u32 supported_mode;
+};
+
+/*
+ * Supported mode: Default support. Can be overwritten during system
+ *initialization
+ */
+struct preferred_sprs preferred_sprs[] = {
+   {
+   .spr = SPRN_HSPRG0,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_LPCR,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_HMEER,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_HID0,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = P9_STOP_SPR_MSR,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = P9_STOP_SPR_PSSCR,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_HID1,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_HID4,
+   .supported_mode = FIRMWARE_RESTORE,
+   },
+   {
+   .spr = SPRN_HID5,
+   .supported_mode = FIRMWARE_RESTORE,
+   }
+};
+
+const int nr_preferred_sprs = ARRAY_SIZE(preferred_sprs);
+
 /*
  * The default stop state that will be used by ppc_md.power_save
  * function on platforms that support stop instruction.
@@ -61,78 +118,125 @@ static bool deepest_stop_found;
 
 static unsigned long power7_offline_type;
 
-static int pnv_save_sprs_for_deep_states(void)
+static int pnv_self_restore_sprs(u64 pir, int cpu, u64 spr)
 {
-   int cpu;
+   u64 reg_val;
int rc;
 
-   /*
-* hid0, hid1, hid4, hid5, hmeer and lpcr values are symmetric across
-* all cpus at boot. Get these reg values of current cpu and use the
-* same across all cpus.
-*/
-   uint64_t lpcr_val   = mfspr(SPRN_LPCR);
-   uint64_t hid0_val   = mfspr(SPRN_HID0);
-   uint64_t hid1_val   = mfspr(SPRN_HID1);
-   uint64_t hid4_val   = mfspr(SPRN_HID4);
-   uint64_t hid5_val   = mfspr(SPRN_HID5);
-   uint64_t hmeer_val  = mfspr(SPRN_HMEER);
-   uint64_t msr_val = MSR_IDLE;
-   uint64_t psscr_val = pnv_deepest_stop_psscr_val;
-
-   for_each_present_cpu(cpu) {
-   uint64_t pir = get_hard_smp_processor_id(cpu);
-   uint64_t hsprg0_val = (uint64_t)paca_ptrs[cpu];
-
-   rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
+   switch (spr) {
+   case SPRN_HSPRG0:
+   reg_val = (uint64_t)paca_ptrs[cpu];
+   rc = opal_slw_set_reg(pir, SPRN_HSPRG0, reg_val);
if (rc != 0)
return rc;
-
-   rc = opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
+   break;
+   case SPRN_LPCR:
+   reg_val = mfspr(SPRN_LPCR);
+   rc = opal_slw_set_reg(pir, SPRN_LPCR, reg_val);
if (rc != 0)
return rc;
-
+   break;
+   case P9_STOP_SPR_MSR:
+   reg_val = MSR_IDLE;
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-   rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
+   rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, reg_val);
if (rc)
return rc;
-
-   rc = opal_slw_set_reg(pir,
- P9_STOP_SPR_PSSCR, psscr_val);
-
+   }
+   break;
+   case P9_STOP_SPR_PSSCR:
+   reg_val = pnv_deepest_stop_psscr_val;
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   rc = opal_slw_set_reg(pir, P9_STOP_SPR_PSSCR, reg_val);
if (rc)
return rc;
}
-
-   /* HIDs are per core registers */
+   break;
+   case SPRN_HMEER:
+   reg_val = 

[PATCH v6 0/3] powerpc/powernv: Introduce interface for self-restore support

2020-03-26 Thread Pratik Rajesh Sampat
v5: https://lkml.org/lkml/2020/3/17/944
Changelog
v5-->v6
1. Updated background, motivation and illuminated potential design
choices
2. Re-organization of patch-set
  a. Split introducing preference for optimization from 1/1 to patch 3/3
  b. Merge introducing self-save API and parsing the device-tree
  c. Introduce a supported mode called KERNEL_SAVE_RESTORE which
 outlines and makes kernel supported SPRs for save-restore more
 explicit

Background
==

The power management framework on POWER systems includes core idle
states that lose context. Deep idle states, namely "winkle" on POWER8
and "stop4" and "stop5" on POWER9, can be entered by a CPU to save
different levels of power, as a consequence of which all the
hypervisor resources such as SPRs and SCOMs are lost.

For most SPRs, saving and restoration of content for SPRs and SCOMs
is handled by the hypervisor kernel prior to entry into and post exit
from an idle state, respectively. However, there is a small set of
critical SPRs and XSCOMs that are expected to contain sane values even
before control is transferred to the hypervisor kernel at the system
reset vector.

For this purpose, microcode firmware provides a mechanism to restore
values on certain SPRs. The communication mechanism between the
hypervisor kernel and the microcode is a standard interface called
sleep-winkle-engine (SLW) on Power8 and Stop-API on Power9, which is
abstracted by OPAL calls from the hypervisor kernel. The Stop-API
provides an interface known as the self-restore API, to which the SPR
number and a predefined value to be restored on wake-up from a deep
stop state are supplied.


Motivation to introduce a new Stop-API
==

The self-restore API expects not just the SPR number but also the
value with which the SPR is restored. This is good for those SPRs such
as HSPRG0 whose values do not change at runtime, since for them, the
kernel can invoke the self-restore API at boot time once the values of
these SPRs are determined.

However, there are use-cases wherein the value to be saved cannot be
known, or cannot be updated in the layer it currently is in.
The shortcomings and the new use-cases which cannot be served by the
existing self-restore API, serves as motivation for a new API:

Shortcoming1:

In a special wakeup scenario, SPRs such as PSSCR, whose values can
change at runtime, are compelled to make the self-restore API call
every time before entering a deep-idle state, rendering it
prohibitively expensive.

Shortcoming2:

The value of LPCR is dynamic, based on whether the CPU enters a stop
state during CPU idle versus CPU hotplug.
Today, an additional self-restore call is made before entering
CPU hotplug to clear the PECE1 bit via the stop-API, so that if we are
woken up by a special wakeup on an offlined CPU, we go back to stop
with the bit cleared.
There is an overhead of an extra call.

New Use-case:
-
In the case where the hypervisor is running in an
ultravisor environment, boot time is too late in the cycle to make
the self-restore API calls, as these cannot be invoked from a
non-secure context anymore.

To address these shortcomings, the firmware provides another API known
as the self-save API. The self-save API only takes the SPR number as a
parameter and will ensure that on wakeup from a deep-stop state the
SPR is restored with the value that it contained prior to entering the
deep-stop.

Contrast between self-save and self-restore APIs


  Before entering
  deep idle           |-----------|
  ------------------> |  HCODE A  |
  |                   |-----------|
|-----|
| CPU |
|-----|
  |                   |-----------|
  ------------------> |  HCODE B  |
  On waking up        |-----------|
  from deep idle




When a self-restore API is invoked, the HCODE inserts instructions
into "HCODE B" region of the above figure to restore the content of
the SPR to the said value. The "HCODE B" region gets executed soon
after the CPU wakes up from a deep idle state, thus executing the
inserted instructions, thereby restoring the contents of the SPRs to
the required values.

When a self-save API is invoked, the HCODE inserts instructions into
the "HCODE A" region of the above figure to save the content of the
SPR into some location in memory. It also inserts instructions into
the "HCODE B" region to restore the content of the SPR to the
corresponding value saved in the memory by the instructions in "HCODE
A" region.

Thus, in contrast with self-restore, the self-save API *does not* need
a value to be passed to it, since it ensures that the value of the SPR
before entering deep stop is saved, and subsequently the same value is
restored.
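
In code terms, the contrast is just the following (a sketch; the
signatures are as declared in patch 2/3 of this series):

/* self-restore: the kernel must know and pass the value up front */
rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);

/* self-save: firmware snapshots the current value on stop entry,
 * so no value is passed */
rc = opal_slw_self_save_reg(pir, SPRN_HSPRG0);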

Self-save and self-restore are complementary features since,
self-restore 

[PATCH v6 4/4] Advertise the self-save and self-restore attributes in the device tree

2020-03-26 Thread Pratik Rajesh Sampat
Support for the self-save and self-restore interfaces is advertised in the
device tree, along with the list of SPRs each of them supports.

The Special Purpose Register identification is encoded in a 2048-bit
bitmask, where each bit signifies the identification key of that SPR,
which is consistent with that of the POWER architecture set for that
register.

Signed-off-by: Pratik Rajesh Sampat 
---
 .../ibm,opal/power-mgt/self-restore.rst   |  27 
 .../ibm,opal/power-mgt/self-save.rst  |  27 
 hw/slw.c  | 116 ++
 include/skiboot.h |   1 +
 4 files changed, 171 insertions(+)
 create mode 100644 doc/device-tree/ibm,opal/power-mgt/self-restore.rst
 create mode 100644 doc/device-tree/ibm,opal/power-mgt/self-save.rst

diff --git a/doc/device-tree/ibm,opal/power-mgt/self-restore.rst 
b/doc/device-tree/ibm,opal/power-mgt/self-restore.rst
new file mode 100644
index ..2a2269f7
--- /dev/null
+++ b/doc/device-tree/ibm,opal/power-mgt/self-restore.rst
@@ -0,0 +1,27 @@
+ibm,opal/power-mgt/self-restore device tree entries
+===
+
+This node exports the bitmask representing the special purpose registers that
+the self-restore API currently supports.
+
+Example:
+
+.. code-block:: dts
+
+  self-restore {
+sprn-bitmask = <0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x4201 0x0 0x0
+0x2 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x0 0x10 0x90 0x0 0x0 0x53 0x0 0x0 0x0
+0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x1>;
+phandle = <0x1c7>;
+  };
+
+sprn-bitmask
+
+
+This property is a bitmask of all the existing SPRs, and if an SPR is
+supported, the corresponding bit of the SPR number is set to 1.
+The representation of the bits is left-to-right, i.e. the MSB of the first
+doubleword represents the 0th bit.
diff --git a/doc/device-tree/ibm,opal/power-mgt/self-save.rst 
b/doc/device-tree/ibm,opal/power-mgt/self-save.rst
new file mode 100644
index ..c367720e
--- /dev/null
+++ b/doc/device-tree/ibm,opal/power-mgt/self-save.rst
@@ -0,0 +1,27 @@
+ibm,opal/power-mgt/self-save device tree entries
+===
+
+This node exports the bitmask representing the special purpose registers that
+the self-save API currently supports.
+
+Example:
+
+.. code-block:: dts
+
+  self-save {
+sprn-bitmask = <0x0 0x0 0x0 0x0 0x10 0x0 0x0 0x0 0x4201 0x0 0x0
+0x2 0x0 0x0 0x0 0x1 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x0 0x0 0x10 0x84 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
+0x0 0x1>;
+phandle = <0x1c8>;
+  };
+
+sprn-bitmask
+
+
+This property is a bitmask of all the existing SPRs, and if an SPR is
+supported, the corresponding bit of the SPR number is set to 1.
+The representation of the bits is left-to-right, i.e. the MSB of the first
+doubleword represents the 0th bit.
diff --git a/hw/slw.c b/hw/slw.c
index 6a09cc2c..9d1fe2c5 100644
--- a/hw/slw.c
+++ b/hw/slw.c
@@ -29,6 +29,7 @@
 #include 
 
 static uint32_t slw_saved_reset[0x100];
+#define SPR_BITMAP_LENGTH  2048
 
 static bool slw_current_le = false;
 
@@ -750,6 +751,119 @@ static void slw_late_init_p9(struct proc_chip *chip)
}
 }
 
+/* Add device tree properties to determine self-save | restore */
+void add_cpu_self_save_restore_properties(void)
+{
+   struct dt_node *self_restore, *self_save, *power_mgt;
+   uint64_t *self_save_mask, *self_restore_mask;
+   bool self_save_supported = true;
+   uint64_t compVector = -1;
+   struct proc_chip *chip;
+   int i, rc;
+
+   const uint64_t self_restore_regs[] = {
+   P8_SPR_HRMOR,
+   P8_SPR_HMEER,
+   P8_SPR_PMICR,
+   P8_SPR_PMCR,
+   P8_SPR_HID0,
+   P8_SPR_HID1,
+   P8_SPR_HID4,
+   P8_SPR_HID5,
+   P8_SPR_HSPRG0,
+   P8_SPR_LPCR,
+   P8_MSR_MSR
+   };
+
+   const uint64_t self_save_regs[] = {
+   P9_STOP_SPR_DAWR,
+   P9_STOP_SPR_HSPRG0,
+   P9_STOP_SPR_LDBAR,
+   P9_STOP_SPR_LPCR,
+   P9_STOP_SPR_PSSCR,
+   P9_STOP_SPR_MSR,
+   P9_STOP_SPR_HRMOR,
+   P9_STOP_SPR_HMEER,
+   P9_STOP_SPR_PMCR,
+   P9_STOP_SPR_PTCR
+   };
+
+   chip = next_chip(NULL);
+   assert(chip);
+   rc = proc_stop_api_discover_capability((void *) chip->homer_base,
+ 

[PATCH v6 3/4] API to verify the STOP API and image compatibility

2020-03-26 Thread Pratik Rajesh Sampat
From: Prem Shanker Jha 

The commit defines a new API, primarily intended for OPAL, to determine
the CPU register save API's compatibility with the HOMER layout and
self save restore. It can help OPAL determine if the version of the
API integrated with OPAL differs from that of hostboot.
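
For illustration, a hedged sketch of how OPAL might consume the new
call; the wrapper below is hypothetical and the policy of treating any
non-zero incompatibility vector as fatal is an assumption, but the
prototype, the STOP_SAVE_SUCCESS return code and the
VersionIncompList_t bits are the ones this patch adds:

    static bool stop_api_compatible(struct proc_chip *chip)
    {
            uint64_t incompat = 0;

            if (proc_stop_api_discover_capability((void *)chip->homer_base,
                                                  &incompat) != STOP_SAVE_SUCCESS)
                    return false;

            /* Bits from VersionIncompList_t describe what mismatched. */
            if (incompat & SELF_RESTORE_VER_MISMATCH)
                    prlog(PR_ERR, "SLW: self-restore version mismatch\n");

            return incompat == 0;
    }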

Change-Id: Ic0de45a336cfb8b6b6096a10ac1cd3ffbaa44fc0
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77612
Tested-by: FSP CI Jenkins 
Tested-by: Jenkins Server 
Tested-by: Hostboot CI 
Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA 
Reviewed-by: Gregory S Still 
Reviewed-by: Jennifer A Stofer 
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/77614
Tested-by: Jenkins OP Build CI 
Tested-by: Jenkins OP HW 
Reviewed-by: Daniel M Crowell 
Signed-off-by: Pratik Rajesh Sampat 
---
 include/p9_stop_api.H| 26 +++
 libpore/p9_cpu_reg_restore_instruction.H |  7 ++-
 libpore/p9_hcd_memmap_base.H |  7 +++
 libpore/p9_stop_api.C| 58 +++-
 libpore/p9_stop_api.H| 26 ++-
 libpore/p9_stop_util.H   | 20 
 6 files changed, 131 insertions(+), 13 deletions(-)

diff --git a/include/p9_stop_api.H b/include/p9_stop_api.H
index c304f70f..09ce3dc1 100644
--- a/include/p9_stop_api.H
+++ b/include/p9_stop_api.H
@@ -110,6 +110,7 @@ typedef enum
 STOP_SAVE_FAIL   = 14,  // for internal failure within 
firmware.
 STOP_SAVE_SPR_ENTRY_MISSING  =  15,
 STOP_SAVE_SPR_BIT_POS_RESERVE=  16,
+STOP_SAVE_API_IMG_INCOMPATIBLE   =  18,
 } StopReturnCode_t;
 
 /**
@@ -164,6 +165,14 @@ typedef enum
 
 } ScomSection_t;
 
+/**
+ * @brief   versions pertaining to the STOP API.
+ */
+typedef enum
+{
+STOP_API_VER=   0x00,
+STOP_API_VER_CONTROL=   0x02,
+} VersionList_t;
 
 
 /**
@@ -195,6 +204,14 @@ typedef enum
 BIT_POS_USPRG1  =   30,
 } SprBitPositionList_t;
 
+typedef enum
+{
+SMF_SUPPORT_MISSING_IN_HOMER =   0x01,
+SELF_SUPPORT_MISSING_FOR_LE_HYP  =   0x02,
+IPL_RUNTIME_CPU_SAVE_VER_MISMATCH=   0x04,
+SELF_RESTORE_VER_MISMATCH=   0x08,
+} VersionIncompList_t;
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -247,6 +264,15 @@ StopReturnCode_t p9_stop_save_scom( void* const   i_pImage,
 StopReturnCode_t
 p9_stop_save_cpureg_control( void* i_pImage, const uint64_t i_pir,
  const uint32_t  i_saveRegVector );
+
+/**
+ * @brief   verifies if API is compatible with the current HOMER image.
+ * @param[in]   i_pImagepoints to the start of HOMER image of P9 chip.
+ * @param[out]  o_inCompVector  list of incompatibilities found.
+ * @return  STOP_SAVE_SUCCESS if API succeeds, error code otherwise.
+ */
+StopReturnCode_t proc_stop_api_discover_capability( void* const i_pImage, 
uint64_t* o_inCompVector );
+
 #ifdef __cplusplus
 } // extern "C"
 };  // namespace stopImageSection ends
diff --git a/libpore/p9_cpu_reg_restore_instruction.H 
b/libpore/p9_cpu_reg_restore_instruction.H
index d69a4212..5f168855 100644
--- a/libpore/p9_cpu_reg_restore_instruction.H
+++ b/libpore/p9_cpu_reg_restore_instruction.H
@@ -5,7 +5,7 @@
 /**/
 /* OpenPOWER HostBoot Project */
 /**/
-/* Contributors Listed Below - COPYRIGHT 2015,2018*/
+/* Contributors Listed Below - COPYRIGHT 2015,2020*/
 /* [+] International Business Machines Corp.  */
 /**/
 /**/
@@ -69,6 +69,11 @@ enum
 OPCODE_18   =   18,
 SELF_SAVE_FUNC_ADD  =   0x2300,
 SELF_SAVE_OFFSET=   0x180,
+SKIP_SPR_REST_INST  =   0x481c, //b . +0x01c
+MFLR_R30=   0x7fc802a6,
+SKIP_SPR_SELF_SAVE  =   0x3bff0020, //addi r31 r31, 0x20
+MTLR_INST   =   0x7fc803a6,  //mtlr r30
+BRANCH_BE_INST  =   0x4820,
 };
 
 #define MR_R0_TO_R10    0x7c0a0378UL //mr r10 r0
diff --git a/libpore/p9_hcd_memmap_base.H b/libpore/p9_hcd_memmap_base.H
index 000fafef..ddb56728 100644
--- a/libpore/p9_hcd_memmap_base.H
+++ b/libpore/p9_hcd_memmap_base.H
@@ -444,6 +444,13 @@ HCD_CONST(CME_QUAD_PSTATE_SIZE, HALF_KB)
 
 HCD_CONST(CME_REGION_SIZE,  (64 * ONE_KB))
 
+
+// HOMER compatibility
+
+HCD_CONST(STOP_API_CPU_SAVE_VER,0x02)
+HCD_CONST(SELF_SAVE_RESTORE_VER,0x02)
+HCD_CONST(SMF_SUPPORT_SIGNATURE_OFFSET, 0x1300)
+HCD_CONST(SMF_SELF_SIGNATURE,   (0x5f534d46))
 // Debug
 
 HCD_CONST(CPMR_TRACE_REGION_OFFSET, (512 * ONE_KB))
diff --git a/libpore/p9_stop_api.C b/libpore/p9_stop_api.C
index 

[PATCH v6 2/4] Self save API integration

2020-03-26 Thread Pratik Rajesh Sampat
The commit makes the self save API available outside the firmware by defining
an OPAL wrapper.
This wrapper has a similar interface to that of self restore and expects the
cpu pir and the SPR number, but not the value of that SPR, to be passed in its
parameters, and returns OPAL_SUCCESS on success.
The commit also documents both the self-save and the self-restore API
calls along with their working and usage.
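
As an illustration of the interface, a hypothetical kernel-side caller
(not part of this commit) asking the firmware to self-save LDBAR (SPR
850) for one thread, mapping the return codes documented below:

    static int64_t self_save_ldbar(uint64_t pir)
    {
            int64_t rc = opal_slw_self_save_reg(pir, 850 /* LDBAR */);

            if (rc == OPAL_UNSUPPORTED)       /* engine can't restore this SPR */
                    pr_warn("self-save of LDBAR unsupported\n");
            else if (rc == OPAL_PARAMETER)    /* invalid handle for pir/chip */
                    pr_warn("self-save: bad pir 0x%llx\n",
                            (unsigned long long)pir);

            return rc;
    }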

Signed-off-by: Pratik Rajesh Sampat 
---
 doc/opal-api/opal-slw-self-save-reg-181.rst | 49 
 doc/opal-api/opal-slw-set-reg-100.rst   |  5 ++
 doc/power-management.rst| 44 ++
 hw/slw.c| 89 +
 include/opal-api.h  |  3 +-
 include/p9_stop_api.H   | 17 
 include/skiboot.h   |  3 +
 7 files changed, 209 insertions(+), 1 deletion(-)
 create mode 100644 doc/opal-api/opal-slw-self-save-reg-181.rst

diff --git a/doc/opal-api/opal-slw-self-save-reg-181.rst 
b/doc/opal-api/opal-slw-self-save-reg-181.rst
new file mode 100644
index ..5aa4c930
--- /dev/null
+++ b/doc/opal-api/opal-slw-self-save-reg-181.rst
@@ -0,0 +1,49 @@
+.. _OPAL_SLW_SELF_SAVE_REG:
+
+OPAL_SLW_SELF_SAVE_REG
+======================
+
+.. code-block:: c
+
+   #define OPAL_SLW_SELF_SAVE_REG  181
+
+   int64_t opal_slw_self_save_reg(uint64_t cpu_pir, uint64_t sprn);
+
+:ref:`OPAL_SLW_SELF_SAVE_REG` is used to inform low-level firmware to save
+the current contents of the SPR before entering a state of loss and
+also restore the content back on waking up from a deep stop state.
+
+An OPAL call `OPAL_SLW_SET_REG` exists which is similar in function,
+saving and restoring the SPR, with one difference: the value of the
+SPR must also be supplied in the parameters.
+Complete reference: doc/opal-api/opal-slw-set-reg-100.rst
+
+Parameters
+----------
+
+``uint64_t cpu_pir``
+  This parameter specifies the pir of the cpu for which the call is being made.
+``uint64_t sprn``
+  This parameter specifies the spr number as mentioned in p9_stop_api.H
+  The list of SPRs supported is as follows. This list is supplied through the
+  device tree:
+   P9_STOP_SPR_DAWR,
+   P9_STOP_SPR_HSPRG0,
+   P9_STOP_SPR_LDBAR,
+   P9_STOP_SPR_LPCR,
+   P9_STOP_SPR_PSSCR,
+   P9_STOP_SPR_MSR,
+   P9_STOP_SPR_HRMOR,
+   P9_STOP_SPR_HMEER,
+   P9_STOP_SPR_PMCR,
+   P9_STOP_SPR_PTCR
+
+Returns
+-------
+
+:ref:`OPAL_UNSUPPORTED`
+  If SPR restore is not supported by the pore engine.
+:ref:`OPAL_PARAMETER`
+  Invalid handle for the pir/chip
+:ref:`OPAL_SUCCESS`
+  On success
diff --git a/doc/opal-api/opal-slw-set-reg-100.rst 
b/doc/opal-api/opal-slw-set-reg-100.rst
index 2e8f1bd6..ee3e68ce 100644
--- a/doc/opal-api/opal-slw-set-reg-100.rst
+++ b/doc/opal-api/opal-slw-set-reg-100.rst
@@ -21,6 +21,11 @@ In Power 9, it uses p9_stop_save_cpureg(), api provided by 
self restore code,
 to inform the spr with their corresponding values with which they
 must be restored.
 
+An OPAL call `OPAL_SLW_SELF_SAVE_REG` exists which is similar in function,
+saving and restoring the SPR, with one difference: the value of the
+SPR doesn't need to be passed in the parameters; given only the SPR number,
+the firmware can identify, save and restore its value.
+Complete reference: doc/opal-api/opal-slw-self-save-reg-181.rst
 
 Parameters
 --
diff --git a/doc/power-management.rst b/doc/power-management.rst
index 76491a71..992a18d0 100644
--- a/doc/power-management.rst
+++ b/doc/power-management.rst
@@ -15,3 +15,47 @@ On boot, specific stop states can be disabled via setting a 
mask. For example,
 to disable all but stop 0,1,2, use ~0xE000. ::
 
   nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0x1FFF
+
+Saving and restoring Special Purpose Registers(SPRs)
+
+
+When a CPU wakes up from a deep stop state which can result in
+hypervisor state loss, all the SPRs are lost. The Linux Kernel expects
+a small set of SPRs to contain an expected value when the CPU wakes up
+from such a deep stop state. The microcode firmware provides the
+following two APIs, collectively known as the stop-APIs, to allow the
+kernel/OPAL to specify this set of SPRs and the value that they need
+to be restored with on waking up from a deep stop state.
+
+Self-restore:
+int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
+The SPR number, and the value with which that SPR must be restored on
+wakeup from the deep-stop state, must be specified. When this call is
+made, the microcode inserts instruction into the HOMER region to
+restore the content of the SPR to the specified value on wakeup from a
+deep-stop state. These instructions are executed by the CPU as soon as
+it wakes up from a deep stop state. The call is to be made once per
+SPR.
+
+Self-Save:
+int64_t 

[PATCH v6 1/4] Self Save: Introducing Support for SPR Self Save

2020-03-26 Thread Pratik Rajesh Sampat
From: Prem Shanker Jha 

The commit is a merger of commits that make the following changes:
1. Commit fixes some issues with code found during integration test
  -  replacement of addi with xor instruction during self save API.
  -  fixing instruction generation for MFMSR during self save
  -  data struct updates in STOP API
  -  error RC updates for hcode image build
  -  HOMER parser updates.
  -  removed self save support for URMOR and HRMOR
  -  code changes for compilation with OPAL
  -  populating CME Image header with unsecure HOMER address.

Key_Cronus_Test=PM_REGRESS

Change-Id: I7cedcc466267c4245255d8d75c01ed695e316720
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66580
Tested-by: FSP CI Jenkins 
Tested-by: HWSV CI 
Tested-by: PPE CI 
Tested-by: Jenkins Server 
Tested-by: Cronus HW CI 
Tested-by: Hostboot CI 
Reviewed-by: Gregory S. Still 
Reviewed-by: RAHUL BATRA 
Reviewed-by: Jennifer A. Stofer 
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/66587
Reviewed-by: Christian R. Geddes 
Signed-off-by: Prem Shanker Jha 
Signed-off-by: Akshay Adiga 
Signed-off-by: Pratik Rajesh Sampat 

2. The commit also incorporates changes that make the STOP API project
agnostic. The changes include defining wrapper functions which call the
legacy API. It also adds duplicate enum members which start with the
prefix PROC instead of P9.
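
For illustration only (the call below is hypothetical usage, with the
signature taken from p9_stop_api.H and made-up values), the duplicate
members let project-agnostic callers use the PROC_* spelling while the
legacy P9_* names keep working, since both carry the same values:

    /* PROC_STOP_SPR_PSSCR aliases P9_STOP_SPR_PSSCR (SPR 855). */
    StopReturnCode_t rc = p9_stop_save_cpureg(homer_image,
                                              PROC_STOP_SPR_PSSCR,
                                              psscr_val, pir);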

Key_Cronus_Test=PM_REGRESS

Change-Id: If87970f3e8cf9b507f33eb1be249e03eb3836a5e
RTC: 201128
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71307
Tested-by: FSP CI Jenkins 
Tested-by: Jenkins Server 
Tested-by: Hostboot CI 
Tested-by: Cronus HW CI 
Reviewed-by: RANGANATHPRASAD G. BRAHMASAMUDRA 
Reviewed-by: Gregory S. Still 
Reviewed-by: Jennifer A Stofer 
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/71314
Tested-by: Jenkins OP Build CI 
Tested-by: Jenkins OP HW 
Reviewed-by: Daniel M. Crowell 
Signed-off-by: Prem Shanker Jha 
Signed-off-by: Pratik Rajesh Sampat 
---
 include/p9_stop_api.H|  79 +-
 libpore/p9_cpu_reg_restore_instruction.H |   4 +
 libpore/p9_stop_api.C| 954 +--
 libpore/p9_stop_api.H| 115 ++-
 libpore/p9_stop_data_struct.H|   4 +-
 libpore/p9_stop_util.H   |   7 +-
 6 files changed, 721 insertions(+), 442 deletions(-)

diff --git a/include/p9_stop_api.H b/include/p9_stop_api.H
index 79abd000..9d3bc1e5 100644
--- a/include/p9_stop_api.H
+++ b/include/p9_stop_api.H
@@ -63,6 +63,26 @@ typedef enum
 P9_STOP_SPR_PMCR=884,   // core register
 P9_STOP_SPR_HID =   1008,   // core register
 P9_STOP_SPR_MSR =   2000,   // thread register
+
+//enum members which are project agnostic
+PROC_STOP_SPR_DAWR=180,   // thread register
+PROC_STOP_SPR_CIABR   =187,   // thread register
+PROC_STOP_SPR_DAWRX   =188,   // thread register
+PROC_STOP_SPR_HSPRG0  =304,   // thread register
+PROC_STOP_SPR_HRMOR   =313,   // core register
+PROC_STOP_SPR_LPCR=318,   // thread register
+PROC_STOP_SPR_HMEER   =337,   // core register
+PROC_STOP_SPR_PTCR=464,   // core register
+PROC_STOP_SPR_USPRG0  =496,   // thread register
+PROC_STOP_SPR_USPRG1  =497,   // thread register
+PROC_STOP_SPR_URMOR   =505,   // core register
+PROC_STOP_SPR_SMFCTRL =511,   // thread register
+PROC_STOP_SPR_LDBAR   =850,   // thread register
+PROC_STOP_SPR_PSSCR   =855,   // thread register
+PROC_STOP_SPR_PMCR=884,   // core register
+PROC_STOP_SPR_HID =   1008,   // core register
+PROC_STOP_SPR_MSR =   2000,   // thread register
+
 } CpuReg_t;
 
 /**
@@ -85,6 +105,8 @@ typedef enum
 STOP_SAVE_SCOM_ENTRY_UPDATE_FAILED   = 12,
 STOP_SAVE_INVALID_FUSED_CORE_STATUS  = 13,
 STOP_SAVE_FAIL   = 14,  // for internal failure within 
firmware.
+STOP_SAVE_SPR_ENTRY_MISSING  =  15,
+STOP_SAVE_SPR_BIT_POS_RESERVE=  16,
 } StopReturnCode_t;
 
 /**
@@ -101,7 +123,20 @@ typedef enum
 P9_STOP_SCOM_RESET  = 6,
 P9_STOP_SCOM_OR_APPEND  = 7,
 P9_STOP_SCOM_AND_APPEND = 8,
-P9_STOP_SCOM_OP_MAX = 9
+P9_STOP_SCOM_OP_MAX = 9,
+
+//enum members which are project agnostic
+PROC_STOP_SCOM_OP_MIN =   0,
+PROC_STOP_SCOM_APPEND =   1,
+PROC_STOP_SCOM_REPLACE=   2,
+PROC_STOP_SCOM_OR =   3,
+PROC_STOP_SCOM_AND=   4,
+PROC_STOP_SCOM_NOOP   =   5,
+PROC_STOP_SCOM_RESET  =   6,
+PROC_STOP_SCOM_OR_APPEND  =   7,
+PROC_STOP_SCOM_AND_APPEND =   8,
+PROC_STOP_SCOM_OP_MAX =   9,
+
 } ScomOperation_t;
 
 /**
@@ -114,9 +149,49 @@ typedef enum
 P9_STOP_SECTION_EQ_SCOM = 2,
 P9_STOP_SECTION_L2  = 3,
 P9_STOP_SECTION_L3  = 4,
-P9_STOP_SECTION_MAX = 5
+P9_STOP_SECTION_MAX = 5,
+
+//enum members which are project agnostic
+

[PATCH v6 0/4] Support for Self Save API in OPAL

2020-03-26 Thread Pratik Rajesh Sampat
v5:https://lists.ozlabs.org/pipermail/skiboot/2020-March/016538.html
Changelog
v5 --> v6
Updated background, motivation and illuminated potential design choices

Background
==========

The power management framework on POWER systems includes core idle
states that lose context. Deep idle states, namely "winkle" on POWER8
and "stop4" and "stop5" on POWER9, can be entered by a CPU to save
different levels of power, as a consequence of which all the
hypervisor resources such as SPRs and SCOMs are lost.

For most SPRs and SCOMs, saving and restoration of their contents
is handled by the hypervisor kernel prior to entering and post exit
from an idle state, respectively. However, there is a small set of
critical SPRs and XSCOMs that are expected to contain sane values even
before control is transferred to the hypervisor kernel at the system
reset vector.

For this purpose, the microcode firmware provides a mechanism to restore
values on certain SPRs. The communication mechanism between the
hypervisor kernel and the microcode is a standard interface called
sleep-winkle-engine (SLW) on Power8 and Stop-API on Power9, which is
abstracted by OPAL calls from the hypervisor kernel. The Stop-API
provides an interface known as the self-restore API, to which the SPR
number and a predefined value to be restored on wake-up from a deep
stop state are supplied.


Motivation to introduce a new Stop-API
======================================

The self-restore API expects not just the SPR number but also the
value with which the SPR is restored. This is good for those SPRs such
as HSPRG0 whose values do not change at runtime, since for them, the
kernel can invoke the self-restore API at boot time once the values of
these SPRs are determined.

However, there are use-cases wherein the value to be saved cannot be
known, or cannot be updated in the layer where it currently resides.
These shortcomings, and the new use-cases which cannot be served by the
existing self-restore API, serve as the motivation for a new API:

Shortcoming1:

In a special wakeup scenario, for SPRs such as PSSCR whose values can
change at runtime, the kernel is compelled to make the self-restore API
call every time before entering a deep-idle state, rendering it
prohibitively expensive.

Shortcoming2:

The value of LPCR is dynamic, based on whether the CPU enters a stop
state during cpu idle versus cpu hotplug.
Today, an additional self-restore call is made before entering
CPU-Hotplug to clear the PECE1 bit via the stop-API, so that if we are
woken up by a special wakeup on an offlined CPU, we go back to stop
with the bit cleared.
This incurs the overhead of an extra call.

New Use-case:
-------------
In the case where the hypervisor is running in an
ultravisor environment, boot time is too late in the cycle to make
the self-restore API calls, as these cannot be invoked from a
non-secure context anymore.

To address these shortcomings, the firmware provides another API known
as the self-save API. The self-save API only takes the SPR number as a
parameter and will ensure that on wakeup from a deep-stop state the
SPR is restored with the value that it contained prior to entering the
deep-stop.

Contrast between self-save and self-restore APIs


  Before entering
  deep idle               |-----------|
  ----------------------> | HCODE A   |
        |                 |-----------|
   |----------|
   |   CPU    |
   |----------|
        |                 |-----------|
        |---------------> | HCODE B   |
  On waking up            |-----------|
  from deep idle




When a self-restore API is invoked, the HCODE inserts instructions
into "HCODE B" region of the above figure to restore the content of
the SPR to the said value. The "HCODE B" region gets executed soon
after the CPU wakes up from a deep idle state, thus executing the
inserted instructions, thereby restoring the contents of the SPRs to
the required values.

When a self-save API is invoked, the HCODE inserts instructions into
the "HCODE A" region of the above figure to save the content of the
SPR into some location in memory. It also inserts instructions into
the "HCODE B" region to restore the content of the SPR to the
corresponding value saved in the memory by the instructions in "HCODE
A" region.

Thus, in contrast with self-restore, the self-save API *does not* need
a value to be passed to it, since it ensures that the value of the SPR
before entering deep stop is saved, and subsequently the same value is
restored.
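
To make the contrast concrete, a purely illustrative sketch; the SPR
numbers come from p9_stop_api.H, opal_slw_self_save_reg() is the
wrapper added by this series, and the choice of SPRs is only an
example:

    static int64_t register_deep_stop_sprs(uint64_t pir, uint64_t hsprg0_val)
    {
            int64_t rc;

            /* HSPRG0 is fixed after boot: self-restore, passing the
             * explicit value once. */
            rc = opal_slw_set_reg(pir, 304 /* HSPRG0 */, hsprg0_val);
            if (rc != OPAL_SUCCESS)
                    return rc;

            /* PSSCR changes at runtime: self-save, so "HCODE A" snapshots
             * it before deep stop and "HCODE B" restores the snapshot on
             * wakeup. */
            return opal_slw_self_save_reg(pir, 855 /* PSSCR */);
    }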

Self-save and self-restore are complementary features, since
self-restore can help in restoring a different value in the SPR on
wakeup from a deep-idle state than what it had before entering the
deep idle state. This was used on POWER8 for HSPRG0 to distinguish a
wakeup from Winkle vs Fastsleep.

Limitations of self-save

[PATCH v2] powerpc xmon: use `dcbf` inplace of `dcbi` instruction for 64bit Book3S

2020-03-26 Thread Balamuruhan S
The Data Cache Block Invalidate (dcbi) instruction was implemented back in
PowerPC architecture version 2.03. It is obsolete, and an attempt to use this
illegal instruction results in a hypervisor emulation assistance interrupt.
So, ifdef out the option `i` in xmon for 64-bit Book3S.

0:mon> fi
cpu 0x0: Vector: 700 (Program Check) at [c3be74a0]
pc: c0102030: cacheflush+0x180/0x1a0
lr: c0101f3c: cacheflush+0x8c/0x1a0
sp: c3be7730
   msr: 80081033
  current = 0xc35e5c00
  paca= 0xc191   irqmask: 0x03   irq_happened: 0x01
pid   = 1025, comm = bash
Linux version 5.6.0-rc5-g5aa19adac (root@ltc-wspoon6) (gcc version 7.4.0
(Ubuntu 7.4.0-1ubuntu1~18.04.1)) #1 SMP Tue Mar 10 04:38:41 CDT 2020
cpu 0x0: Exception 700 (Program Check) in xmon, returning to main loop
[c3be7c50] c084abb0 __handle_sysrq+0xf0/0x2a0
[c3be7d00] c084b3c0 write_sysrq_trigger+0xb0/0xe0
[c3be7d30] c04d1edc proc_reg_write+0x8c/0x130
[c3be7d60] c040dc7c __vfs_write+0x3c/0x70
[c3be7d80] c0410e70 vfs_write+0xd0/0x210
[c3be7dd0] c041126c ksys_write+0xdc/0x130
[c3be7e20] c000b9d0 system_call+0x5c/0x68
--- Exception: c01 (System Call) at 7fffa345e420
SP (70b08ab0) is in userspace
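
For reference, a hedged sketch (not from the patch) of the dcbf + icbi
sequence that cflush() uses and that the `f` command now always takes
on 64-bit Book3S; dcbf writes a modified line back and invalidates it,
making it a safe stand-in for the obsolete dcbi. A 128-byte L1 line
size is assumed here:

    static void flush_range(void *start, unsigned long len)
    {
            unsigned long p = (unsigned long)start & ~127UL;
            unsigned long end = (unsigned long)start + len;

            for (; p < end; p += 128)
                    asm volatile("dcbf 0,%0; icbi 0,%0" : : "r"(p) : "memory");
            asm volatile("sync; isync" : : : "memory");
    }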

Signed-off-by: Balamuruhan S 
---
 arch/powerpc/xmon/xmon.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)
---
changes in v2:
-
Fix review comments from Segher and Michael,
* change incorrect architecture version 2.01 to 2.03 in commit
  message.
* ifdef it out the option `i` for PPC_BOOK3S_64 instead to drop it
  and change the commit message accordingly.

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 0ec9640335bb..bfd5a97689cd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -335,10 +335,12 @@ static inline void cflush(void *p)
asm volatile ("dcbf 0,%0; icbi 0,%0" : : "r" (p));
 }
 
+#ifndef CONFIG_PPC_BOOK3S_64
 static inline void cinval(void *p)
 {
asm volatile ("dcbi 0,%0; icbi 0,%0" : : "r" (p));
 }
+#endif
 
 /**
  * write_ciabr() - write the CIABR SPR
@@ -1791,8 +1793,9 @@ static void prregs(struct pt_regs *fp)
 
 static void cacheflush(void)
 {
-   int cmd;
unsigned long nflush;
+#ifndef CONFIG_PPC_BOOK3S_64
+   int cmd;
 
cmd = inchar();
if (cmd != 'i')
@@ -1800,13 +1803,14 @@ static void cacheflush(void)
	scanhex((void *)&adrs);
if (termch != '\n')
termch = 0;
+#endif
nflush = 1;
	scanhex(&nflush);
nflush = (nflush + L1_CACHE_BYTES - 1) / L1_CACHE_BYTES;
if (setjmp(bus_error_jmp) == 0) {
catch_memory_errors = 1;
sync();
-
+#ifndef CONFIG_PPC_BOOK3S_64
if (cmd != 'i') {
for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES)
cflush((void *) adrs);
@@ -1814,6 +1818,10 @@ static void cacheflush(void)
for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES)
cinval((void *) adrs);
}
+#else
+   for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES)
+   cflush((void *)adrs);
+#endif
sync();
/* wait a little while to see if we get a machine check */
__delay(200);

base-commit: a87b93bdf800a4d7a42d95683624a4516e516b4f
-- 
2.24.1



[PATCH] selftests/eeh: Skip ahci adapters

2020-03-26 Thread Michael Ellerman
The ahci driver doesn't support error recovery, and if your root
filesystem is attached to it the eeh-basic.sh test will likely kill
your machine.

So skip any device we see using the ahci driver.

Signed-off-by: Michael Ellerman 
---
 tools/testing/selftests/powerpc/eeh/eeh-basic.sh | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh 
b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
index f988d2f42e8f..8a8d0f456946 100755
--- a/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
+++ b/tools/testing/selftests/powerpc/eeh/eeh-basic.sh
@@ -41,6 +41,11 @@ for dev in `ls -1 /sys/bus/pci/devices/ | grep '\.0$'` ; do
continue;
fi
 
+   if [ "ahci" = "$(basename $(realpath 
/sys/bus/pci/devices/$dev/driver))" ] ; then
+   echo "$dev, Skipped: ahci doesn't support recovery"
+   continue
+   fi
+
	# Don't inject errors into an already-frozen PE. This happens with
# PEs that contain multiple PCI devices (e.g. multi-function cards)
# and injecting new errors during the recovery process will probably

base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8
prerequisite-patch-id: 2d4d04fc1422098aacadc8743efec84d3929bc0b
prerequisite-patch-id: 6063d411906b0b6f41dfce6cfd3f4795f25e9c11
prerequisite-patch-id: 20393e4806e464b171d542d3de4e41bbba07d521
prerequisite-patch-id: b1c3d27aaaeca2b94cab75a82008db7e4d3fc9f4
prerequisite-patch-id: 721316c5f0021ac0dbc4d03f68c21a77eef88d08
prerequisite-patch-id: 69a161a8f9f2175cce8b4a62c0e86b4bbdab5f10
prerequisite-patch-id: e47c4610cf53f8a30fb4a3541751292dec47a1c9
prerequisite-patch-id: f39dd059e3ad8af1bd91712b3cdc3f4b1b4c2215
prerequisite-patch-id: 717bf28a8254643211185582bc16f73cdde337ad
prerequisite-patch-id: bf548a1a2f6ccd8a2afdd967ddcab3c21b6fa584
prerequisite-patch-id: 2a2e7a1230dff4cf3329406a745f04668d44deef
prerequisite-patch-id: 333c6a243645d61627cfa1db1b476d831f685733
prerequisite-patch-id: b501b455480e2b45f81e07d7f02856df979644f8
prerequisite-patch-id: 893d532f9f307918862be03babd8dedf7cc5c696
prerequisite-patch-id: cab07afe25c3eb8c7aa3bf4d80e045ac422df1d6
prerequisite-patch-id: 52d5c5e98ad0695d631e336c5746b2074f08469b
prerequisite-patch-id: c8ea5b2983fb74822ab08e2c8a478cb17ff3803d
prerequisite-patch-id: 6dc50fcb267fb29bf8a6ac2e9597371e0772aec1
prerequisite-patch-id: 8552363b8e81fa95829e48d31dd12c5659338ef2
prerequisite-patch-id: 46d96230b4c749208a533d730e4f75b500b3bfac
prerequisite-patch-id: 224ac5e18a1d8674d1e7cf49b7f3e2008c4a213e
prerequisite-patch-id: a52022bfd2c0f6391cfa476202e2ddd5cd676eb0
prerequisite-patch-id: 58cfb87f55f07f147f882f3c9109eb5ad70bfbe1
prerequisite-patch-id: c3752d1c343f9fdbfccd757746e0aa9b53c0c8ae
prerequisite-patch-id: 37e1aeadd0ee2f14c15e29b2e2faee113f12f1cd
prerequisite-patch-id: eb75889d6d31b8807e42c8f03956730c7c8d577a
prerequisite-patch-id: 67484b56a17bd7ac6988ef6a15159279fca59e98
prerequisite-patch-id: 6168f9ec8beb6cb4e711510cafd9a87e717002ea
prerequisite-patch-id: 7646f6fb17ff2fc4f4734d2aa246053dc7dca410
prerequisite-patch-id: d79707796ed09bcf0b495a18aa2d52d00674c69f
prerequisite-patch-id: 648730d1c7e784f873a3b2cf2b6447279a573b5b
prerequisite-patch-id: ef5983802b94d095a5c9cbf962b51cf461981750
prerequisite-patch-id: 2e95a065d81d983ed0056b238ff047c187f5cb91
prerequisite-patch-id: 607615b23dee8b92047aaef1ebb8e50c20f68bcb
prerequisite-patch-id: a30f22e4e13063eb4adc2209b88ec96059078ff5
prerequisite-patch-id: b80bbfd5aa13fc43d332968944f8e1e58f059878
prerequisite-patch-id: 94923ed7ac933489c5ed4f2d61d8f267424f4d43
prerequisite-patch-id: b4820918562f815e50d12e07f264286321f7f711
prerequisite-patch-id: a34fb1f33467b3215196d48ecafbf5483fbb74f3
prerequisite-patch-id: 49f0ce9ceb6adb8c208f57180a8b7348dc5bba06
prerequisite-patch-id: 252d2696d7a34bbae579f15e98324abf843ce08c
prerequisite-patch-id: efc754c834ee2bdcf99fb52f9b942cc5a3dc117d
prerequisite-patch-id: 3b26fa3c1eb1fdf26766a2d5d81e3e171f4539a9
prerequisite-patch-id: c8279b3c9f52ad6420f462cf60d4e52da28ad854
prerequisite-patch-id: 3bbc77d19e638aa15e7d587f6aa4d295a6cf13ca
prerequisite-patch-id: ff39259dc64c2bbe01a88d3b03fdbca4b4836742
prerequisite-patch-id: 720aa5f2bf14f55f28e5ca370c81e7f39bce2f78
prerequisite-patch-id: bfe24b8ed9962db12a48cf819880f4423b54ac1b
prerequisite-patch-id: f25df222a309609258a24bbd334bf91ef3a33ee4
prerequisite-patch-id: b0e3e990e1282bde385d92c0a044be24d30beb9c
prerequisite-patch-id: 30953eacc324ccfc71838fd6428572b14fa86568
prerequisite-patch-id: b59be8efdfc2ca1067275009a83cb74269f82d52
prerequisite-patch-id: f1baea64b666a8d85dff464c25a27b7d4fc616cb
prerequisite-patch-id: d821a76d44ebab904b3b0225d77e52f29c51b346
prerequisite-patch-id: 9a4c443a42166871abb4646c64a90802d4a50275
prerequisite-patch-id: 8b273fe00374c0df5d520206ff5ebdcd74e1b30d
prerequisite-patch-id: 2ac5d698941f4ff6ee12869496554abe201ff69c
prerequisite-patch-id: a9149c87dfe0a855ebfaa2dae5af255d0ca4c236
prerequisite-patch-id: 4cd89ebdfc17035c5e34b60968c3c48253d5c57b
prerequisite-patch-id: