Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function

2022-09-18 Thread Benjamin Gray
On Mon, 2022-09-19 at 06:04 +, Christophe Leroy wrote:
> With CONFIG_STRICT_KERNEL_RWX, this patch causes a 15% time increase
> for activation/deactivation of ftrace.

It's possible that the new alignment check is the cause. I'll investigate.


> Without CONFIG_STRICT_KERNEL_RWX, it doesn't build.

Yup, fixed for v2

> > +static int __patch_text(void *dest, const void *src, size_t size,
> > bool is_exec, void *exec_addr)
> 
> Is 'text' a good name ? For me text means executable code. Should it
> be 
> __patch_memory() ?

Well, patching regular memory is just a normal store. 'Text' to me implies
it's not writable. Though __patch_memory would be fine.

> Why pass src as a void * ? This forces data to go via the stack.
> Can't 
> you pass it as a 'long' ?

Probably, I wasn't aware that it would make a difference. I prefer
pointers in general for their semantic meaning, but will change if it
affects param passing.

> > +   if (virt_to_pfn(dest) != virt_to_pfn(dest + size - 1))
> > +   return -EFAULT;
> 
> Why do you need that new check ?

If the patch crosses a page boundary then letting it through would be
unpredictable. Though perhaps this requirement could just be documented in
a comment, or we could require that patches be aligned to the patch size
(see the sketch below).
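
For example, an alignment-based check could be as simple as this (sketch
only, not what is in this patch):

#include <linux/align.h>
#include <linux/types.h>

/*
 * Sketch: if dest is aligned to the patch size (1, 2, 4 or 8 bytes),
 * the naturally-sized store cannot cross a page boundary, so the
 * virt_to_pfn() comparison becomes unnecessary.
 */
static bool patch_within_one_page(const void *dest, size_t size)
{
	return IS_ALIGNED((unsigned long)dest, size);
}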

> > +   case 8:
> > +   __put_kernel_nofault(dest, src, u64,
> > failed);
> > +   break;
> 
> Is case 8 needed for PPC32 ?

I don't have a particular need for it, but the underlying
__put_kernel_nofault is capable of it so I included it.

> > +   }
> 
> Do you catch it when size is none of 1,2,4,8 ?
> 

Not yet. Perhaps I should wrap patch_text_data in a macro that checks
the size with BUILD_BUG_ON? I'd rather not check at runtime.
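
Something along these lines might do, assuming the C function is renamed
to __patch_text_data() (sketch only):

#include <linux/build_bug.h>

/*
 * Sketch: reject non-constant or unsupported sizes at compile time,
 * so __patch_text_data() only ever sees 1, 2, 4 or 8.
 */
#define patch_text_data(dest, src, size)				\
({									\
	BUILD_BUG_ON(!__builtin_constant_p(size) ||			\
		     ((size) != 1 && (size) != 2 &&			\
		      (size) != 4 && (size) != 8));			\
	__patch_text_data((dest), (src), (size));			\
})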

> > +
> > +   asm ("dcbst 0, %0; sync" :: "r" (dest));
> 
> Maybe write it in C:
> 
> dcbst(dest);
> mb(); /* sync */
> 
> > +
> > +   if (is_exec)
> > +   asm ("icbi 0,%0; sync; isync" :: "r" (exec_addr));
> 
> Same, can be:
> 
> if (is_exec) {
> icbi(exec_addr);
> mb(); /* sync */
> isync();
> }
> 
> Or keep it flat:
> 
> if (!is_exec)
> return 0;
> 
> icbi(exec_addr);
> mb(); /* sync */
> isync();
> 
> return 0;

Will try this.

> > +static int do_patch_text(void *dest, const void *src, size_t size,
> > bool is_exec)
> > +{
> > +   int err;
> > +   pte_t *pte;
> > +   u32 *patch_addr;
> > +
> > +   pte = start_text_patch(dest, &patch_addr);
> > +   err = __patch_text(patch_addr, src, size, is_exec, dest);
> > +   finish_text_patch(pte);
> 
> Why do you need to split this function in three parts ? I can't see
> the 
> added value, all it does is reduce readability.

It made it more readable to me, so the __patch_text didn't get buried.
It also made it easier to do the refactoring, and potentially add code
patching variants that use the poke area but not __patch_text. I'll
remove it for v2 though given this is the only use right now.

> Did you check the impact of calling __this_cpu_read() twice ?

I wasn't concerned about performance, but given I'll merge it back for v2
it will only be read once anyway.

> > +void *patch_memory(void *dest, const void *src, size_t size)
> 
> What is this function used for ?
> 

Build failure apparently :)

It's removed in v2.


Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function

2022-09-18 Thread Christophe Leroy


Le 16/09/2022 à 08:23, Benjamin Gray a écrit :
>   
> -static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
> +static int patch_text(void *dest, const void *src, size_t size, bool is_exec)
>   {
>   int err;
>   unsigned long flags;
>   
> - /*
> -  * During early early boot patch_instruction is called
> -  * when text_poke_area is not ready, but we still need
> -  * to allow patching. We just do the plain old patching
> -  */
> + /* Make sure we aren't patching a freed init section */
> + if (static_branch_likely(&init_mem_is_free) && 
> init_section_contains(dest, 4))
> + return 0;
> +
>   if (!static_branch_likely(&poking_init_done))
> - return raw_patch_instruction(addr, instr);
> + return __patch_text(dest, src, size, is_exec, dest);
>   

Test ordering looks odd. How can init_mem_is_free be true and 
poking_init_done be false ?

>   local_irq_save(flags);
> - err = __do_patch_instruction(addr, instr);
> + err = do_patch_text(dest, src, size, is_exec);
>   local_irq_restore(flags);
>   
>   return err;
>   }

Re: [PATCH 4/4] powerpc/64s: Enable KFENCE on book3s64

2022-09-18 Thread Christophe Leroy


Le 19/09/2022 à 03:44, Nicholas Miehlbradt a écrit :
> KFENCE support was added for ppc32 in commit 90cbac0e995d
> ("powerpc: Enable KFENCE for PPC32").
> Enable KFENCE on ppc64 architecture with hash and radix MMUs.
> It uses the same mechanism as debug pagealloc to
> protect/unprotect pages. All KFENCE kunit tests pass on both
> MMUs.
> 
> KFENCE memory is initially allocated using memblock but is
> later marked as SLAB allocated. This necessitates the change
> to __pud_free to ensure that the KFENCE pages are freed
> appropriately.
> 
> Based on previous work by Christophe Leroy and Jordan Niethe.
> 
> Signed-off-by: Nicholas Miehlbradt 
> ---
>   arch/powerpc/Kconfig |  2 +-
>   arch/powerpc/include/asm/book3s/64/pgalloc.h |  6 --
>   arch/powerpc/include/asm/book3s/64/pgtable.h |  2 +-
>   arch/powerpc/include/asm/kfence.h| 18 ++
>   arch/powerpc/mm/book3s64/hash_utils.c| 10 +-
>   arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +---
>   6 files changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a4f8a5276e5c..f7dd0f49510d 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -194,7 +194,7 @@ config PPC
>   select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
>   select HAVE_ARCH_KASAN  if PPC_RADIX_MMU
>   select HAVE_ARCH_KASAN_VMALLOC  if HAVE_ARCH_KASAN
> - select HAVE_ARCH_KFENCE if PPC_BOOK3S_32 || PPC_8xx || 
> 40x
> + select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC
>   select HAVE_ARCH_KGDB
>   select HAVE_ARCH_MMAP_RND_BITS
>   select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if COMPAT
> diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
> b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> index e1af0b394ceb..dd2cff53a111 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
> @@ -113,9 +113,11 @@ static inline void __pud_free(pud_t *pud)
>   
>   /*
>* Early pud pages allocated via memblock allocator
> -  * can't be directly freed to slab
> +  * can't be directly freed to slab. KFENCE pages have
> +  * both reserved and slab flags set so need to be freed
> +  * kmem_cache_free.
>*/
> - if (PageReserved(page))
> + if (PageReserved(page) && !PageSlab(page))
>   free_reserved_page(page);
>   else
>   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index cb9d5fd39d7f..fd5d800f2836 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1123,7 +1123,7 @@ static inline void vmemmap_remove_mapping(unsigned long 
> start,
>   }
>   #endif
>   
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
>   static inline void __kernel_map_pages(struct page *page, int numpages, int 
> enable)
>   {
>   if (radix_enabled())
> diff --git a/arch/powerpc/include/asm/kfence.h 
> b/arch/powerpc/include/asm/kfence.h
> index a9846b68c6b9..33edbc312a51 100644
> --- a/arch/powerpc/include/asm/kfence.h
> +++ b/arch/powerpc/include/asm/kfence.h
> @@ -11,11 +11,28 @@
>   #include 
>   #include 
>   
> +#if defined(CONFIG_PPC64) && !defined(CONFIG_PPC64_ELF_ABI_V2)

Can be replaced by:

#ifdef CONFIG_PPC64_ELF_ABI_V1

> +#define ARCH_FUNC_PREFIX "."
> +#endif
> +
>   static inline bool arch_kfence_init_pool(void)
>   {
>   return true;
>   }
>   
> +#ifdef CONFIG_PPC64
> +static inline bool kfence_protect_page(unsigned long addr, bool protect)
> +{
> + struct page *page = virt_to_page(addr);
> +
> + if (protect)
> + __kernel_map_pages(page, 1, 0);
> + else
> + __kernel_map_pages(page, 1, 1);

Can be:
__kernel_map_pages(virt_to_page(addr), 1, !protect);

> +
> + return true;
> +}
> +#else
>   static inline bool kfence_protect_page(unsigned long addr, bool protect)
>   {
>   pte_t *kpte = virt_to_kpte(addr);
> @@ -29,5 +46,6 @@ static inline bool kfence_protect_page(unsigned long addr, 
> bool protect)
>   
>   return true;
>   }
> +#endif
>   
>   #endif /* __ASM_POWERPC_KFENCE_H */
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
> b/arch/powerpc/mm/book3s64/hash_utils.c
> index b37412fe5930..9cceaa5998a3 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -424,7 +424,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
> vend,
>   break;
>   
>   cond_resched();
> - if (debug_pagealloc_enabled() &&
> + if (debug_pagealloc_enabled_or_kfence() &&
>   (paddr >> PAGE_SHIFT) < linear_map_hash_count)
>

Re: [PATCH 1/4] powerpc/64s: Add DEBUG_PAGEALLOC for radix

2022-09-18 Thread Christophe Leroy


Le 19/09/2022 à 03:44, Nicholas Miehlbradt a écrit :
> 
> There is support for DEBUG_PAGEALLOC on hash but not on radix.
> Add support on radix.
> 
> Signed-off-by: Nicholas Miehlbradt 
> ---
>   arch/powerpc/mm/book3s64/radix_pgtable.c | 16 +++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index db2f3d193448..483c99bfbde5 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -30,6 +30,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
> 
>   #include 
> 
> @@ -503,6 +504,9 @@ static unsigned long __init radix_memory_block_size(void)
>   {
>  unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
> 
> +   if (debug_pagealloc_enabled())
> +   return PAGE_SIZE;
> +
>  /*
>   * OPAL firmware feature is set by now. Hence we are ok
>   * to test OPAL feature.
> @@ -519,6 +523,9 @@ static unsigned long __init radix_memory_block_size(void)
> 
>   static unsigned long __init radix_memory_block_size(void)
>   {
> +   if (debug_pagealloc_enabled())
> +   return PAGE_SIZE;
> +
>  return 1UL * 1024 * 1024 * 1024;

While at it, maybe you can replace the above by SZ_1G

>   }
> 
> @@ -899,7 +906,14 @@ void __meminit radix__vmemmap_remove_mapping(unsigned 
> long start, unsigned long
>   #ifdef CONFIG_DEBUG_PAGEALLOC
>   void radix__kernel_map_pages(struct page *page, int numpages, int enable)
>   {
> -   pr_warn_once("DEBUG_PAGEALLOC not supported in radix mode\n");
> +   unsigned long addr;
> +
> +   addr = (unsigned long)page_address(page);
> +
> +   if (enable)
> +   set_memory_p(addr, numpages);
> +   else
> +   set_memory_np(addr, numpages);
>   }
>   #endif
> 
> --
> 2.34.1
> 

Re: [PATCH 3/6] powerpc/module: Optimise nearby branches in ELF V2 ABI stub

2022-09-18 Thread Christophe Leroy


Le 16/09/2022 à 08:23, Benjamin Gray a écrit :
> Inserts a direct branch to the stub target when possible, replacing the
> mtctr/btctr sequence.
> 
> The load into r12 could potentially be skipped too, but that change
> would need to refactor the arguments to indicate that the address
> does not have a separate local entry point.
> 
> This helps the static call implementation, where modules calling their
> own trampolines are called through this stub and the trampoline is
> easily within range of a direct branch.
> 
> Signed-off-by: Benjamin Gray 
> ---
>   arch/powerpc/kernel/module_64.c | 13 +++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 3656476097c2..03ab28d86008 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -432,8 +432,17 @@ static inline int create_stub(const Elf64_Shdr *sechdrs,
>   return create_ftrace_stub(entry, addr, me);
>   
>   for (i = 0; i < ARRAY_SIZE(ppc64_stub_insns); i++) {
> - if (patch_instruction(&entry->jump[i],
> -   ppc_inst(ppc64_stub_insns[i])))
> + ppc_inst_t inst = ppc_inst(ppc64_stub_insns[i]);
> +
> + // Replace the indirect branch with a direct branch where 
> possible
> + if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2) && i == 4) {
> + ppc_inst_t direct;

Add blank line here.

> + if (create_branch(&direct, (void*) entry + (i * 4), 
> addr, 0) == 0) {

No braces for single-line if statements.

> + inst = direct;
> + }
> + }
> +
> + if (patch_instruction(&entry->jump[i], inst))
>   return 0;
>   }
>   
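
With both comments above applied, the hunk would read roughly:

		ppc_inst_t inst = ppc_inst(ppc64_stub_insns[i]);

		// Replace the indirect branch with a direct branch where possible
		if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2) && i == 4) {
			ppc_inst_t direct;

			if (create_branch(&direct, (void *)entry + (i * 4), addr, 0) == 0)
				inst = direct;
		}

		if (patch_instruction(&entry->jump[i], inst))
			return 0;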

Re: [PATCH 2/6] powerpc/module: Handle caller-saved TOC in module linker

2022-09-18 Thread Christophe Leroy


Le 16/09/2022 à 08:23, Benjamin Gray a écrit :
> The callee may set a field in `st_other` to 1 to indicate r2 should be
> treated as caller-saved. This means a trampoline must be used to save
> the current TOC before calling it and restore it afterwards, much like
> external calls.
> 
> This is necessary for supporting V2 ABI static calls that do not
> preserve the TOC.
> 
> Signed-off-by: Benjamin Gray 
> ---
>   arch/powerpc/kernel/module_64.c | 12 +++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 7e45dc98df8a..3656476097c2 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -55,6 +55,11 @@ static unsigned int local_entry_offset(const Elf64_Sym 
> *sym)
>* of function and try to derive r2 from it). */
>   return PPC64_LOCAL_ENTRY_OFFSET(sym->st_other);
>   }
> +
> +static bool need_r2save_stub(unsigned char st_other) {

Please read 
https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces

" ... functions: they have the opening brace at the beginning of the 
next line ..."

> + return ((st_other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT) == 1;
> +}
> +
>   #else
>   
>   static func_desc_t func_desc(unsigned long addr)
> @@ -66,6 +71,10 @@ static unsigned int local_entry_offset(const Elf64_Sym 
> *sym)
>   return 0;
>   }
>   
> +static bool need_r2save_stub(unsigned char st_other) {
> + return false;
> +}
> +
>   void *dereference_module_function_descriptor(struct module *mod, void *ptr)
>   {
>   if (ptr < (void *)mod->arch.start_opd ||
> @@ -632,7 +641,8 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
>   case R_PPC_REL24:
>   /* FIXME: Handle weak symbols here --RR */
>   if (sym->st_shndx == SHN_UNDEF ||
> - sym->st_shndx == SHN_LIVEPATCH) {
> + sym->st_shndx == SHN_LIVEPATCH ||
> + need_r2save_stub(sym->st_other)) {
>   /* External: go via stub */
>   value = stub_for_addr(sechdrs, value, me,
>   strtab + sym->st_name);
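
To be explicit about the brace placement mentioned above, the helper
would read:

static bool need_r2save_stub(unsigned char st_other)
{
	return ((st_other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT) == 1;
}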

Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function

2022-09-18 Thread Christophe Leroy


Le 16/09/2022 à 08:23, Benjamin Gray a écrit :
> Adds a generic text patching mechanism for patches of 1, 2, 4, or 8
> bytes. The patcher conditionally syncs the icache depending on if
> the content will be executed (as opposed to, e.g., read-only data).
> 
> The `patch_instruction` function is reimplemented in terms of this
> more generic function. This generic implementation allows patching of
> arbitrary 64-bit data, whereas the original `patch_instruction` decided
> the size based on the 'instruction' opcode, so was not suitable for
> arbitrary data.

With CONFIG_STRICT_KERNEL_RWX, this patch causes a 15% time increase
for activation/deactivation of ftrace.

Without CONFIG_STRICT_KERNEL_RWX, it doesn't build.

> 
> Signed-off-by: Benjamin Gray 
> ---
>   arch/powerpc/include/asm/code-patching.h |   1 +
>   arch/powerpc/lib/code-patching.c | 135 +++
>   2 files changed, 89 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/code-patching.h 
> b/arch/powerpc/include/asm/code-patching.h
> index 1c6316ec4b74..6a52c19dae46 100644
> --- a/arch/powerpc/include/asm/code-patching.h
> +++ b/arch/powerpc/include/asm/code-patching.h
> @@ -76,6 +76,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
>   int patch_branch(u32 *addr, unsigned long target, int flags);
>   int patch_instruction(u32 *addr, ppc_inst_t instr);
>   int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
> +int patch_text_data(void *dest, const void *src, size_t size);
>   
>   static inline unsigned long patch_site_addr(s32 *site)
>   {
> diff --git a/arch/powerpc/lib/code-patching.c 
> b/arch/powerpc/lib/code-patching.c
> index ad0cf3108dd0..a09a0898c2ce 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -3,6 +3,7 @@
>*  Copyright 2008 Michael Ellerman, IBM Corporation.
>*/
>   
> +#include 
>   #include 
>   #include 
>   #include 
> @@ -14,32 +15,7 @@
>   #include 
>   #include 
>   #include 
> -
> -static int __patch_instruction(u32 *exec_addr, ppc_inst_t instr, u32 
> *patch_addr)
> -{
> - if (!ppc_inst_prefixed(instr)) {
> - u32 val = ppc_inst_val(instr);
> -
> - __put_kernel_nofault(patch_addr, &val, u32, failed);
> - } else {
> - u64 val = ppc_inst_as_ulong(instr);
> -
> - __put_kernel_nofault(patch_addr, &val, u64, failed);
> - }
> -
> - asm ("dcbst 0, %0; sync; icbi 0,%1; sync; isync" :: "r" (patch_addr),
> - "r" (exec_addr));
> -
> - return 0;
> -
> -failed:
> - return -EPERM;
> -}
> -
> -int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
> -{
> - return __patch_instruction(addr, instr, addr);
> -}
> +#include 
>   
>   #ifdef CONFIG_STRICT_KERNEL_RWX
>   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> @@ -147,16 +123,44 @@ static void unmap_patch_area(unsigned long addr)
>   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>   }
>   
> -static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
> +static int __patch_text(void *dest, const void *src, size_t size, bool 
> is_exec, void *exec_addr)

Is 'text' a good name ? For me text means executable code. Should it be 
__patch_memory() ?


Why pass src as a void * ? This forces data to go via the stack. Can't 
you pass it as a 'long' ?

> +{
> + if (virt_to_pfn(dest) != virt_to_pfn(dest + size - 1))
> + return -EFAULT;

Why do you need that new check ?

> +
> + switch (size) {
> + case 1:
> + __put_kernel_nofault(dest, src, u8, failed);
> + break;
> + case 2:
> + __put_kernel_nofault(dest, src, u16, failed);
> + break;
> + case 4:
> + __put_kernel_nofault(dest, src, u32, failed);
> + break;
> + case 8:
> + __put_kernel_nofault(dest, src, u64, failed);
> + break;

Is case 8 needed for PPC32 ?

> + }

Do you catch it when size is none of 1,2,4,8 ?

> +
> + asm ("dcbst 0, %0; sync" :: "r" (dest));

Maybe write it in C:

dcbst(dest);
mb(); /* sync */

> +
> + if (is_exec)
> + asm ("icbi 0,%0; sync; isync" :: "r" (exec_addr));

Same, can be:

if (is_exec) {
icbi(exec_addr);
mb(); /* sync */
isync();
}

Or keep it flat:

if (!is_exec)
return 0;

icbi(exec_addr);
mb(); /* sync */
isync();

return 0;

> +
> + return 0;
> +
> +failed:
> + return -EPERM;
> +}
> +
> +static pte_t *start_text_patch(void* dest, u32 **patch_addr)
>   {
> - int err;
> - u32 *patch_addr;
> - unsigned long text_poke_addr;
>   pte_t *pte;
> - unsigned long pfn = get_patch_pfn(addr);
> + unsigned long text_poke_addr = (unsigned 
> long

Re: [PATCH v2 3/7] powerpc/build: move got, toc, plt, branch_lt sections to read-only

2022-09-18 Thread Michael Ellerman
kernel test robot  writes:
> Hi Nicholas,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on linus/master v6.0-rc5 next-20220916]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:
> https://github.com/intel-lab-lkp/linux/commits/Nicholas-Piggin/powerpc-build-linker-improvements/20220916-121310
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-microwatt_defconfig 
> (https://download.01.org/0day-ci/archive/20220918/202209180437.4u3soljk-...@intel.com/config)
> compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 
> 791a7ae1ba3efd6bca96338e10ffde557ba83920)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install powerpc cross compiling tool for clang build
> # apt-get install binutils-powerpc-linux-gnu
> # 
> https://github.com/intel-lab-lkp/linux/commit/6c034c08f8d0addb6fecba38c9c428b1c4df7c29
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review 
> Nicholas-Piggin/powerpc-build-linker-improvements/20220916-121310
> git checkout 6c034c08f8d0addb6fecba38c9c428b1c4df7c29
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
> O=build_dir ARCH=powerpc SHELL=/bin/bash

> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot 
>
> All errors (new ones prefixed by >>):
>
>>> ld.lld: error: ./arch/powerpc/kernel/vmlinux.lds:46: { expected, but got 
>>> SPECIAL
>>>>  .got : AT(ADDR(.got) - (0xc000 -0x)) SPECIAL {
>>>>   ^

I guess SPECIAL is a binutils ld ism?

I can't find it documented anywhere.

Presumably we can just not use it, given we never did before?

cheers


Re: [PATCH] powerpc/microwatt: Remove unused early debug code

2022-09-18 Thread Andrew Donnellan
On Mon, 2022-09-19 at 15:27 +1000, Michael Ellerman wrote:
> The original microwatt submission[1] included some early debug code
> for
> using the Microwatt "potato" UART.
> 
> The series that was eventually merged switched to using a standard
> UART,
> and so doesn't need any special early debug handling. But some of the
> original code was merged accidentally under the non-existent
> CONFIG_PPC_EARLY_DEBUG_MICROWATT.
> 
> Drop the unused code.
> 
> 1: 
> https://lore.kernel.org/linuxppc-dev/20200509050340.gd1464...@thinks.paulus.ozlabs.org/
> 
> Fixes: 48b545b8018d ("powerpc/microwatt: Use standard 16550 UART for
> console")
> Reported-by: Lukas Bulwahn 
> Signed-off-by: Michael Ellerman 

grep confirms this is indeed dead code.

Reviewed-by: Andrew Donnellan 


-- 
Andrew Donnellan    OzLabs, ADL Canberra
a...@linux.ibm.com   IBM Australia Limited



[PATCH] powerpc/microwatt: Remove unused early debug code

2022-09-18 Thread Michael Ellerman
The original microwatt submission[1] included some early debug code for
using the Microwatt "potato" UART.

The series that was eventually merged switched to using a standard UART,
and so doesn't need any special early debug handling. But some of the
original code was merged accidentally under the non-existent
CONFIG_PPC_EARLY_DEBUG_MICROWATT.

Drop the unused code.

1: 
https://lore.kernel.org/linuxppc-dev/20200509050340.gd1464...@thinks.paulus.ozlabs.org/

Fixes: 48b545b8018d ("powerpc/microwatt: Use standard 16550 UART for console")
Reported-by: Lukas Bulwahn 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/udbg_16550.c | 39 
 1 file changed, 39 deletions(-)

diff --git a/arch/powerpc/kernel/udbg_16550.c b/arch/powerpc/kernel/udbg_16550.c
index d3942de254c6..ddfbc74bf85f 100644
--- a/arch/powerpc/kernel/udbg_16550.c
+++ b/arch/powerpc/kernel/udbg_16550.c
@@ -296,42 +296,3 @@ void __init udbg_init_40x_realmode(void)
 }
 
 #endif /* CONFIG_PPC_EARLY_DEBUG_40x */
-
-#ifdef CONFIG_PPC_EARLY_DEBUG_MICROWATT
-
-#define UDBG_UART_MW_ADDR  ((void __iomem *)0xc0002000)
-
-static u8 udbg_uart_in_isa300_rm(unsigned int reg)
-{
-   uint64_t msr = mfmsr();
-   uint8_t  c;
-
-   mtmsr(msr & ~(MSR_EE|MSR_DR));
-   isync();
-   eieio();
-   c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2));
-   mtmsr(msr);
-   isync();
-   return c;
-}
-
-static void udbg_uart_out_isa300_rm(unsigned int reg, u8 val)
-{
-   uint64_t msr = mfmsr();
-
-   mtmsr(msr & ~(MSR_EE|MSR_DR));
-   isync();
-   eieio();
-   __raw_rm_writeb(val, UDBG_UART_MW_ADDR + (reg << 2));
-   mtmsr(msr);
-   isync();
-}
-
-void __init udbg_init_debug_microwatt(void)
-{
-   udbg_uart_in = udbg_uart_in_isa300_rm;
-   udbg_uart_out = udbg_uart_out_isa300_rm;
-   udbg_use_uart();
-}
-
-#endif /* CONFIG_PPC_EARLY_DEBUG_MICROWATT */
-- 
2.37.2



Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-18 Thread Barry Song
On Mon, Sep 19, 2022 at 4:53 PM Barry Song <21cn...@gmail.com> wrote:
>
> On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual
>  wrote:
> >
> >
> >
> > On 9/15/22 12:12, Barry Song wrote:
> > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> > >  wrote:
> > >>
> > >>
> > >>
> > >> On 9/9/22 11:05, Barry Song wrote:
> > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> > >>>  wrote:
> > 
> > 
> > 
> >  On 8/22/22 13:51, Yicong Yang wrote:
> > > From: Barry Song 
> > >
> > > on x86, batched and deferred tlb shootdown has led to 90%
> > > performance increase on tlb shootdown. on arm64, HW can do
> > > tlb shootdown without software IPI. But sync tlbi is still
> > > quite expensive.
> > >
> > > Even running a simplest program which requires swapout can
> > > prove this is true,
> > >  #include 
> > >  #include 
> > >  #include 
> > >  #include 
> > >
> > >  int main()
> > >  {
> > >  #define SIZE (1 * 1024 * 1024)
> > >  volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | 
> > > PROT_WRITE,
> > >   MAP_SHARED | MAP_ANONYMOUS, 
> > > -1, 0);
> > >
> > >  memset(p, 0x88, SIZE);
> > >
> > >  for (int k = 0; k < 1; k++) {
> > >  /* swap in */
> > >  for (int i = 0; i < SIZE; i += 4096) {
> > >  (void)p[i];
> > >  }
> > >
> > >  /* swap out */
> > >  madvise(p, SIZE, MADV_PAGEOUT);
> > >  }
> > >  }
> > >
> > > Perf result on snapdragon 888 with 8 cores by using zRAM
> > > as the swap block device.
> > >
> > >  ~ # perf record taskset -c 4 ./a.out
> > >  [ perf record: Woken up 10 times to write data ]
> > >  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) 
> > > ]
> > >  ~ # perf report
> > >  # To display the perf.data header info, please use 
> > > --header/--header-only options.
> > >  # To display the perf.data header info, please use 
> > > --header/--header-only options.
> > >  #
> > >  #
> > >  # Total Lost Samples: 0
> > >  #
> > >  # Samples: 60K of event 'cycles'
> > >  # Event count (approx.): 35706225414
> > >  #
> > >  # Overhead  Command  Shared Object  Symbol
> > >  #   ...  .  
> > > .
> > >  #
> > > 21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
> > >  8.23%  a.out[kernel.kallsyms]  [k] 
> > > _raw_spin_unlock_irqrestore
> > >  6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
> > >  6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
> > >  5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
> > >  3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
> > >  3.49%  a.out[kernel.kallsyms]  [k] memset64
> > >  1.63%  a.out[kernel.kallsyms]  [k] clear_page
> > >  1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
> > >  1.26%  a.out[kernel.kallsyms]  [k] 
> > > mod_zone_state.llvm.8525150236079521930
> > >  1.23%  a.out[kernel.kallsyms]  [k] xas_load
> > >  1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock
> > >
> > > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> > > swapping in/out a page mapped by only one process. If the
> > > page is mapped by multiple processes, typically, like more
> > > than 100 on a phone, the overhead would be much higher as
> > > we have to run tlb flush 100 times for one single page.
> > > Plus, tlb flush overhead will increase with the number
> > > of CPU cores due to the bad scalability of tlb shootdown
> > > in HW, so those ARM64 servers should expect much higher
> > > overhead.
> > >
> > > Further perf annotate shows 95% cpu time of ptep_clear_flush
> > > is actually used by the final dsb() to wait for the completion
> > > of tlb flush. This provides us a very good chance to leverage
> > > the existing batched tlb in kernel. The minimum modification
> > > is that we only send async tlbi in the first stage and we send
> > > dsb while we have to sync in the second stage.
> > >
> > > With the above simplest micro benchmark, collapsed time to
> > > finish the program decreases around 5%.
> > >
> > > Typical collapsed time w/o patch:
> > >  ~ # time taskset -c 4 ./a.out
> > >  0.21user 14.34system 0:14.69elapsed
> > > w/ patch:
> > >  ~ # time taskset -c 4 ./a.out
> > >  0.22user 13.45system 0:13.80elapsed
> > >
> > > Also, Yicong Yang added the following observation.
> > >   Tested with benchmark in the commit on Kun

Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-18 Thread Barry Song
On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual
 wrote:
>
>
>
> On 9/15/22 12:12, Barry Song wrote:
> > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> >  wrote:
> >>
> >>
> >>
> >> On 9/9/22 11:05, Barry Song wrote:
> >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> >>>  wrote:
> 
> 
> 
>  On 8/22/22 13:51, Yicong Yang wrote:
> > From: Barry Song 
> >
> > on x86, batched and deferred tlb shootdown has led to 90%
> > performance increase on tlb shootdown. on arm64, HW can do
> > tlb shootdown without software IPI. But sync tlbi is still
> > quite expensive.
> >
> > Even running a simplest program which requires swapout can
> > prove this is true,
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> >
> >  int main()
> >  {
> >  #define SIZE (1 * 1024 * 1024)
> >  volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | 
> > PROT_WRITE,
> >   MAP_SHARED | MAP_ANONYMOUS, 
> > -1, 0);
> >
> >  memset(p, 0x88, SIZE);
> >
> >  for (int k = 0; k < 1; k++) {
> >  /* swap in */
> >  for (int i = 0; i < SIZE; i += 4096) {
> >  (void)p[i];
> >  }
> >
> >  /* swap out */
> >  madvise(p, SIZE, MADV_PAGEOUT);
> >  }
> >  }
> >
> > Perf result on snapdragon 888 with 8 cores by using zRAM
> > as the swap block device.
> >
> >  ~ # perf record taskset -c 4 ./a.out
> >  [ perf record: Woken up 10 times to write data ]
> >  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> >  ~ # perf report
> >  # To display the perf.data header info, please use 
> > --header/--header-only options.
> >  # To display the perf.data header info, please use 
> > --header/--header-only options.
> >  #
> >  #
> >  # Total Lost Samples: 0
> >  #
> >  # Samples: 60K of event 'cycles'
> >  # Event count (approx.): 35706225414
> >  #
> >  # Overhead  Command  Shared Object  Symbol
> >  #   ...  .  
> > .
> >  #
> > 21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
> >  8.23%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
> >  6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
> >  6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
> >  5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
> >  3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
> >  3.49%  a.out[kernel.kallsyms]  [k] memset64
> >  1.63%  a.out[kernel.kallsyms]  [k] clear_page
> >  1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
> >  1.26%  a.out[kernel.kallsyms]  [k] 
> > mod_zone_state.llvm.8525150236079521930
> >  1.23%  a.out[kernel.kallsyms]  [k] xas_load
> >  1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock
> >
> > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> > swapping in/out a page mapped by only one process. If the
> > page is mapped by multiple processes, typically, like more
> > than 100 on a phone, the overhead would be much higher as
> > we have to run tlb flush 100 times for one single page.
> > Plus, tlb flush overhead will increase with the number
> > of CPU cores due to the bad scalability of tlb shootdown
> > in HW, so those ARM64 servers should expect much higher
> > overhead.
> >
> > Further perf annotate shows 95% cpu time of ptep_clear_flush
> > is actually used by the final dsb() to wait for the completion
> > of tlb flush. This provides us a very good chance to leverage
> > the existing batched tlb in kernel. The minimum modification
> > is that we only send async tlbi in the first stage and we send
> > dsb while we have to sync in the second stage.
> >
> > With the above simplest micro benchmark, collapsed time to
> > finish the program decreases around 5%.
> >
> > Typical collapsed time w/o patch:
> >  ~ # time taskset -c 4 ./a.out
> >  0.21user 14.34system 0:14.69elapsed
> > w/ patch:
> >  ~ # time taskset -c 4 ./a.out
> >  0.22user 13.45system 0:13.80elapsed
> >
> > Also, Yicong Yang added the following observation.
> >   Tested with benchmark in the commit on Kunpeng920 arm64 server,
> >   observed an improvement around 12.5% with command
> >   `time ./swap_bench`.
> >            w/o         w/
> >   real     0m13.460s   0m11.771s
> >   user     0m0.248s    0m0.279s
> >   sys      0m12.039s   0m11.458s
> >
> >   Ori

Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-18 Thread Anshuman Khandual



On 9/15/22 12:12, Barry Song wrote:
> On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
>  wrote:
>>
>>
>>
>> On 9/9/22 11:05, Barry Song wrote:
>>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
>>>  wrote:



 On 8/22/22 13:51, Yicong Yang wrote:
> From: Barry Song 
>
> on x86, batched and deferred tlb shootdown has led to 90%
> performance increase on tlb shootdown. on arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running a simplest program which requires swapout can
> prove this is true,
>  #include 
>  #include 
>  #include 
>  #include 
>
>  int main()
>  {
>  #define SIZE (1 * 1024 * 1024)
>  volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | 
> PROT_WRITE,
>   MAP_SHARED | MAP_ANONYMOUS, -1, 
> 0);
>
>  memset(p, 0x88, SIZE);
>
>  for (int k = 0; k < 1; k++) {
>  /* swap in */
>  for (int i = 0; i < SIZE; i += 4096) {
>  (void)p[i];
>  }
>
>  /* swap out */
>  madvise(p, SIZE, MADV_PAGEOUT);
>  }
>  }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use 
> --header/--header-only options.
>  # To display the perf.data header info, please use 
> --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object  Symbol
>  #   ...  .  
> .
>  #
> 21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
>  8.23%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>  6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
>  6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
>  5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
>  3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
>  3.49%  a.out[kernel.kallsyms]  [k] memset64
>  1.63%  a.out[kernel.kallsyms]  [k] clear_page
>  1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
>  1.26%  a.out[kernel.kallsyms]  [k] 
> mod_zone_state.llvm.8525150236079521930
>  1.23%  a.out[kernel.kallsyms]  [k] xas_load
>  1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically, like more
> than 100 on a phone, the overhead would be much higher as
> we have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows 95% cpu time of ptep_clear_flush
> is actually used by the final dsb() to wait for the completion
> of tlb flush. This provides us a very good chance to leverage
> the existing batched tlb in kernel. The minimum modification
> is that we only send async tlbi in the first stage and we send
> dsb while we have to sync in the second stage.
>
> With the above simplest micro benchmark, collapsed time to
> finish the program decreases around 5%.
>
> Typical collapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
>
> Also, Yicong Yang added the following observation.
>   Tested with benchmark in the commit on Kunpeng920 arm64 server,
>   observed an improvement around 12.5% with command
>   `time ./swap_bench`.
>            w/o         w/
>   real     0m13.460s   0m11.771s
>   user     0m0.248s    0m0.279s
>   sys      0m12.039s   0m11.458s
>
>   Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>   which has been eliminated by this patch:
>
>   [root@localhost yang]# perf record -- ./swap_bench && perf report
>   [...]
>   16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>
> Cc:

Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-18 Thread Anshuman Khandual



On 9/15/22 20:01, Nadav Amit wrote:
> 
> 
>> On Sep 14, 2022, at 11:42 PM, Barry Song <21cn...@gmail.com> wrote:
>>
>>>
>>> The very idea behind TLB deferral is the opportunity it (might) provide
>>> to accumulate address ranges and cpu masks so that individual TLB flush
>>> can be replaced with a more cost effective range based TLB flush. Hence
>>> I guess unless address range or cpumask based cost effective TLB flush
>>> is available, deferral does not improve the unmap performance as much.
>>
>>
>> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
>> from all cpus in the system, tlbi is not scalable. The point here is that we
>> avoid waiting for each individual TLBi. Alternatively, they are batched. If
>> you read the benchmark in the commit log, you can find the great decline
>> in the cost to swap out a page.
> 
> Just a minor correction: arch_tlbbatch_flush() does not collect ranges.
> On x86 it only accumulates a CPU mask.

Thanks Nadav for the clarification.
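
For reference, the x86 helper that feeds the batch looks roughly like this
(quoted from memory, so treat it as a sketch): only the CPU mask and the
mm's TLB generation are recorded, no address ranges.

/* arch/x86/include/asm/tlbflush.h, roughly (as of ~v6.0) */
static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm)
{
	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}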


[PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask

2022-09-18 Thread Mike Kravetz
During discussions of this series [1], it was suggested that hugetlb
handling code in follow_page_mask could be simplified.  At the beginning
of follow_page_mask, there currently is a call to follow_huge_addr which
'may' handle hugetlb pages.  ia64 is the only architecture which provides
a follow_huge_addr routine that does not return error.  Instead, at each
level of the page table a check is made for a hugetlb entry.  If a hugetlb
entry is found, a call to a routine associated with that entry is made.

Currently, there are two checks for hugetlb entries at each page table
level.  The first check is of the form:
	if (p?d_huge())
		page = follow_huge_p?d();
the second check is of the form:
	if (is_hugepd())
		page = follow_huge_pd().

We can replace these checks, as well as the special handling routines
such as follow_huge_p?d() and follow_huge_pd() with a single routine to
handle hugetlb vmas.

A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the
beginning of follow_page_mask.  hugetlb_follow_page_mask will use the
existing routine huge_pte_offset to walk page tables looking for hugetlb
entries.  huge_pte_offset can be overwritten by architectures, and already
handles special cases such as hugepd entries.
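
Roughly, the new dispatch at the top of follow_page_mask() then looks like
the sketch below (illustrative only; the exact code is in the mm/gup.c hunk
of this patch):

static struct page *follow_page_mask(struct vm_area_struct *vma,
				     unsigned long address, unsigned int flags,
				     struct follow_page_context *ctx)
{
	pgd_t *pgd;
	struct mm_struct *mm = vma->vm_mm;

	ctx->page_mask = 0;

	/* Handle hugetlb vmas in one place instead of at every level */
	if (is_vm_hugetlb_page(vma))
		return hugetlb_follow_page_mask(vma, address, flags);

	pgd = pgd_offset(mm, address);

	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
		return no_page_table(vma, flags);

	return follow_p4d_mask(vma, address, pgd, flags, ctx);
}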

[1] 
https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.w...@linux.alibaba.com/

Suggested-by: David Hildenbrand 
Signed-off-by: Mike Kravetz 
---
v3 -Change WARN_ON_ONCE() to BUILD_BUG() as reminded by Christophe Leroy
v2 -Added WARN_ON_ONCE() and updated comment as suggested by David
Fixed build issue found by kernel test robot
Added vma (pmd sharing) locking to hugetlb_follow_page_mask
ReBased on Baolin's patch to fix issues with CONT_* entries

 arch/ia64/mm/hugetlbpage.c|  15 ---
 arch/powerpc/mm/hugetlbpage.c |  37 ---
 include/linux/hugetlb.h   |  50 ++
 mm/gup.c  |  80 +++
 mm/hugetlb.c  | 182 --
 5 files changed, 86 insertions(+), 278 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index f993cb36c062..380d2f3966c9 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -91,21 +91,6 @@ int prepare_hugepage_range(struct file *file,
return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int 
write)
-{
-   struct page *page;
-   pte_t *ptep;
-
-   if (REGION_NUMBER(addr) != RGN_HPAGE)
-   return ERR_PTR(-EINVAL);
-
-   ptep = huge_pte_offset(mm, addr, HPAGE_SIZE);
-   if (!ptep || pte_none(*ptep))
-   return NULL;
-   page = pte_page(*ptep);
-   page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
-   return page;
-}
 int pmd_huge(pmd_t pmd)
 {
return 0;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index bc84a594ca62..b0e037c75c12 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -506,43 +506,6 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
} while (addr = next, addr != end);
 }
 
-struct page *follow_huge_pd(struct vm_area_struct *vma,
-   unsigned long address, hugepd_t hpd,
-   int flags, int pdshift)
-{
-   pte_t *ptep;
-   spinlock_t *ptl;
-   struct page *page = NULL;
-   unsigned long mask;
-   int shift = hugepd_shift(hpd);
-   struct mm_struct *mm = vma->vm_mm;
-
-retry:
-   /*
-* hugepage directory entries are protected by mm->page_table_lock
-* Use this instead of huge_pte_lockptr
-*/
-   ptl = &mm->page_table_lock;
-   spin_lock(ptl);
-
-   ptep = hugepte_offset(hpd, address, pdshift);
-   if (pte_present(*ptep)) {
-   mask = (1UL << shift) - 1;
-   page = pte_page(*ptep);
-   page += ((address & mask) >> PAGE_SHIFT);
-   if (flags & FOLL_GET)
-   get_page(page);
-   } else {
-   if (is_hugetlb_entry_migration(*ptep)) {
-   spin_unlock(ptl);
-   __migration_entry_wait(mm, ptep, ptl);
-   goto retry;
-   }
-   }
-   spin_unlock(ptl);
-   return page;
-}
-
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
int shift = __ffs(size);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cfe15b32e2d4..32d45e96a894 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -149,6 +149,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 unsigned long len);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
struct vm_area_struct *, struct vm_area_struct *);
+struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
+   unsigned l

[PATCH 3/4] powerpc/64s: Allow double call of kernel_[un]map_linear_page()

2022-09-18 Thread Nicholas Miehlbradt
From: Christophe Leroy 

If the page is already mapped (when mapping) or already unmapped (when
unmapping), bail out.

Signed-off-by: Christophe Leroy 
Signed-off-by: Nicholas Miehlbradt 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index e63ff401a6ea..b37412fe5930 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -2000,6 +2000,9 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
if (!vsid)
return;
 
+   if (linear_map_hash_slots[lmi] & 0x80)
+   return;
+
ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
HPTE_V_BOLTED,
mmu_linear_psize, mmu_kernel_ssize);
@@ -2019,7 +2022,10 @@ static void kernel_unmap_linear_page(unsigned long 
vaddr, unsigned long lmi)
 
hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
spin_lock(&linear_map_hash_lock);
-   BUG_ON(!(linear_map_hash_slots[lmi] & 0x80));
+   if (!(linear_map_hash_slots[lmi] & 0x80)) {
+   spin_unlock(&linear_map_hash_lock);
+   return;
+   }
hidx = linear_map_hash_slots[lmi] & 0x7f;
linear_map_hash_slots[lmi] = 0;
spin_unlock(&linear_map_hash_lock);
-- 
2.34.1



[PATCH 2/4] powerpc/64s: Remove unneeded #ifdef CONFIG_DEBUG_PAGEALLOC in hash_utils

2022-09-18 Thread Nicholas Miehlbradt
From: Christophe Leroy 

debug_pagealloc_enabled() is always defined and constant folds to
'false' when CONFIG_DEBUG_PAGEALLOC is not enabled.

Remove the #ifdefs, the code and associated static variables will
be optimised out by the compiler when CONFIG_DEBUG_PAGEALLOC is
not defined.

Signed-off-by: Christophe Leroy 
Signed-off-by: Nicholas Miehlbradt 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index fc92613dc2bf..e63ff401a6ea 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -123,11 +123,8 @@ EXPORT_SYMBOL_GPL(mmu_slb_size);
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
 #endif
-#ifdef CONFIG_DEBUG_PAGEALLOC
 static u8 *linear_map_hash_slots;
 static unsigned long linear_map_hash_count;
-static DEFINE_SPINLOCK(linear_map_hash_lock);
-#endif /* CONFIG_DEBUG_PAGEALLOC */
 struct mmu_hash_ops mmu_hash_ops;
 EXPORT_SYMBOL(mmu_hash_ops);
 
@@ -427,11 +424,9 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-#ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
-#endif /* CONFIG_DEBUG_PAGEALLOC */
}
return ret < 0 ? ret : 0;
 }
@@ -1066,7 +1061,6 @@ static void __init htab_initialize(void)
 
prot = pgprot_val(PAGE_KERNEL);
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled()) {
linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
linear_map_hash_slots = memblock_alloc_try_nid(
@@ -1076,7 +1070,6 @@ static void __init htab_initialize(void)
panic("%s: Failed to allocate %lu bytes max_addr=%pa\n",
  __func__, linear_map_hash_count, &ppc64_rma_size);
}
-#endif /* CONFIG_DEBUG_PAGEALLOC */
 
/* create bolted the linear mapping in the hash table */
for_each_mem_range(i, &base, &end) {
@@ -1991,6 +1984,8 @@ long hpte_insert_repeating(unsigned long hash, unsigned 
long vpn,
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
+static DEFINE_SPINLOCK(linear_map_hash_lock);
+
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
 {
unsigned long hash;
-- 
2.34.1



[PATCH 4/4] powerpc/64s: Enable KFENCE on book3s64

2022-09-18 Thread Nicholas Miehlbradt
KFENCE support was added for ppc32 in commit 90cbac0e995d
("powerpc: Enable KFENCE for PPC32").
Enable KFENCE on ppc64 architecture with hash and radix MMUs.
It uses the same mechanism as debug pagealloc to
protect/unprotect pages. All KFENCE kunit tests pass on both
MMUs.

KFENCE memory is initially allocated using memblock but is
later marked as SLAB allocated. This necessitates the change
to __pud_free to ensure that the KFENCE pages are freed
appropriately.

Based on previous work by Christophe Leroy and Jordan Niethe.

Signed-off-by: Nicholas Miehlbradt 
---
 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  6 --
 arch/powerpc/include/asm/book3s/64/pgtable.h |  2 +-
 arch/powerpc/include/asm/kfence.h| 18 ++
 arch/powerpc/mm/book3s64/hash_utils.c| 10 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +---
 6 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a4f8a5276e5c..f7dd0f49510d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -194,7 +194,7 @@ config PPC
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN  if PPC_RADIX_MMU
select HAVE_ARCH_KASAN_VMALLOC  if HAVE_ARCH_KASAN
-   select HAVE_ARCH_KFENCE if PPC_BOOK3S_32 || PPC_8xx || 
40x
+   select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if COMPAT
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index e1af0b394ceb..dd2cff53a111 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -113,9 +113,11 @@ static inline void __pud_free(pud_t *pud)
 
/*
 * Early pud pages allocated via memblock allocator
-* can't be directly freed to slab
+* can't be directly freed to slab. KFENCE pages have
+* both reserved and slab flags set so need to be freed
+* kmem_cache_free.
 */
-   if (PageReserved(page))
+   if (PageReserved(page) && !PageSlab(page))
free_reserved_page(page);
else
kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index cb9d5fd39d7f..fd5d800f2836 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1123,7 +1123,7 @@ static inline void vmemmap_remove_mapping(unsigned long 
start,
 }
 #endif
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
 static inline void __kernel_map_pages(struct page *page, int numpages, int 
enable)
 {
if (radix_enabled())
diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index a9846b68c6b9..33edbc312a51 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -11,11 +11,28 @@
 #include 
 #include 
 
+#if defined(CONFIG_PPC64) && !defined(CONFIG_PPC64_ELF_ABI_V2)
+#define ARCH_FUNC_PREFIX "."
+#endif
+
 static inline bool arch_kfence_init_pool(void)
 {
return true;
 }
 
+#ifdef CONFIG_PPC64
+static inline bool kfence_protect_page(unsigned long addr, bool protect)
+{
+   struct page *page = virt_to_page(addr);
+
+   if (protect)
+   __kernel_map_pages(page, 1, 0);
+   else
+   __kernel_map_pages(page, 1, 1);
+
+   return true;
+}
+#else
 static inline bool kfence_protect_page(unsigned long addr, bool protect)
 {
pte_t *kpte = virt_to_kpte(addr);
@@ -29,5 +46,6 @@ static inline bool kfence_protect_page(unsigned long addr, 
bool protect)
 
return true;
 }
+#endif
 
 #endif /* __ASM_POWERPC_KFENCE_H */
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b37412fe5930..9cceaa5998a3 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -424,7 +424,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-   if (debug_pagealloc_enabled() &&
+   if (debug_pagealloc_enabled_or_kfence() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
}
@@ -773,7 +773,7 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();
 
-   if (!debug_pagealloc_enabled()) {
+   if (!debug_pagealloc_enabled_or_kfence()) {
/*
 * Pick a size for the linear mapping. Currently, we o

[PATCH 1/4] powerpc/64s: Add DEBUG_PAGEALLOC for radix

2022-09-18 Thread Nicholas Miehlbradt
There is support for DEBUG_PAGEALLOC on hash but not on radix.
Add support on radix.

Signed-off-by: Nicholas Miehlbradt 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index db2f3d193448..483c99bfbde5 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -503,6 +504,9 @@ static unsigned long __init radix_memory_block_size(void)
 {
unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
 
+   if (debug_pagealloc_enabled())
+   return PAGE_SIZE;
+
/*
 * OPAL firmware feature is set by now. Hence we are ok
 * to test OPAL feature.
@@ -519,6 +523,9 @@ static unsigned long __init radix_memory_block_size(void)
 
 static unsigned long __init radix_memory_block_size(void)
 {
+   if (debug_pagealloc_enabled())
+   return PAGE_SIZE;
+
return 1UL * 1024 * 1024 * 1024;
 }
 
@@ -899,7 +906,14 @@ void __meminit radix__vmemmap_remove_mapping(unsigned long 
start, unsigned long
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void radix__kernel_map_pages(struct page *page, int numpages, int enable)
 {
-   pr_warn_once("DEBUG_PAGEALLOC not supported in radix mode\n");
+   unsigned long addr;
+
+   addr = (unsigned long)page_address(page);
+
+   if (enable)
+   set_memory_p(addr, numpages);
+   else
+   set_memory_np(addr, numpages);
 }
 #endif
 
-- 
2.34.1



Re: [PATCH] cxl: fix repeated words in comments

2022-09-18 Thread Andrew Donnellan
On Sun, 2022-09-18 at 18:03 +0800, Jilin Yuan wrote:
> Delete the redundant word 'dont'.
> 
> Signed-off-by: Jilin Yuan 
> ---
>  drivers/misc/cxl/native.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
> index 50b0c44bb8d7..6957946a6463 100644
> --- a/drivers/misc/cxl/native.c
> +++ b/drivers/misc/cxl/native.c
> @@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct
> cxl_context *ctx, u64 wed, u64 amr)
>  * Ideally we should do a wmb() here to make sure the changes
> to the
>  * PE are visible to the card before we call afu_enable.
>  * On ppc64 though all mmios are preceded by a 'sync'
> instruction hence
> -    * we dont dont need one here.
> +    * we dont need one here.
>  */
>  
> result = cxl_ops->afu_reset(afu);

-- 
Andrew Donnellan    OzLabs, ADL Canberra
a...@linux.ibm.com   IBM Australia Limited



Re: [PATCH] EDAC/ppc_4xx: Reorder symbols to get rid of a few forward declarations

2022-09-18 Thread Borislav Petkov
On Sun, Sep 18, 2022 at 01:20:13AM +0200, Uwe Kleine-König wrote:
> When moving the definition of ppc4xx_edac_driver further down, the
> forward declarations can just be dropped.
> 
> Do this to reduce needless repetition.
> 
> Signed-off-by: Uwe Kleine-König 
> ---
>  drivers/edac/ppc4xx_edac.c | 23 +--
>  1 file changed, 9 insertions(+), 14 deletions(-)

Applied, thanks.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH linux-next] macintosh/windfarm: fix comparing pointer to 0

2022-09-18 Thread Arnd Bergmann
On Sun, Sep 18, 2022, at 5:50 PM, cgel@gmail.com wrote:
> @@ -970,7 +970,7 @@ static int pm121_init_pm(void)
> const struct smu_sdbp_header *hdr;
>
> hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL);
> -   if (hdr != 0) {
> +   if (hdr != NULL) {
> struct smu_sdbp_sensortree *st =
> (struct smu_sdbp_sensortree *)&hdr[1];

The more common way of writing this in the kernel is 
"if (hdr)".

 Arnd


[PATCH linux-next] macintosh/windfarm: fix comparing pointer to 0

2022-09-18 Thread cgel . zte
From: Xu Panda 

Compare the pointer with NULL instead of comparing it to 0.

Reported-by: Zeal Robot 
Signed-off-by: Xu Panda 
---
 drivers/macintosh/windfarm_pm121.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/windfarm_pm121.c 
b/drivers/macintosh/windfarm_pm121.c
index 36312f163aac..82dcd35f439a 100644
--- a/drivers/macintosh/windfarm_pm121.c
+++ b/drivers/macintosh/windfarm_pm121.c
@@ -651,7 +651,7 @@ static void pm121_create_cpu_fans(void)

/* First, locate the PID params in SMU SBD */
hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL);
-   if (hdr == 0) {
+   if (hdr == NULL) {
printk(KERN_WARNING "pm121: CPU PID fan config not found.\n");
goto fail;
}
@@ -970,7 +970,7 @@ static int pm121_init_pm(void)
const struct smu_sdbp_header *hdr;

hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL);
-   if (hdr != 0) {
+   if (hdr != NULL) {
struct smu_sdbp_sensortree *st =
(struct smu_sdbp_sensortree *)&hdr[1];
pm121_mach_model = st->model_id;
-- 
2.15.2



[PATCH] cxl: fix repeated words in comments

2022-09-18 Thread Jilin Yuan
Delete the redundant word 'dont'.

Signed-off-by: Jilin Yuan 
---
 drivers/misc/cxl/native.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 50b0c44bb8d7..6957946a6463 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct cxl_context 
*ctx, u64 wed, u64 amr)
 * Ideally we should do a wmb() here to make sure the changes to the
 * PE are visible to the card before we call afu_enable.
 * On ppc64 though all mmios are preceded by a 'sync' instruction hence
-* we dont dont need one here.
+* we dont need one here.
 */
 
result = cxl_ops->afu_reset(afu);
-- 
2.36.1



Re: [PATCH] powerpc: Save AMR/IAMR when switching tasks

2022-09-18 Thread Christophe Leroy


Le 17/09/2022 à 20:38, Samuel Holland a écrit :
> On 9/17/22 03:16, Christophe Leroy wrote:
>> Le 16/09/2022 à 07:05, Samuel Holland a écrit :
>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>> with userspace access enabled until after the next IRQ or privilege
>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>> switching back to the original task, the userspace access would fault:
>>
>> This is not supposed to happen. You never switch away from a task
>> magically. Task switch will always happen in an interrupt, that means
>> copy_{from,to}_user() get interrupted.
> 
> That makes sense, the interrupt handler is responsible for saving the
> KUAP status. It looks like neither DEFINE_INTERRUPT_HANDLER_RAW nor any
> of its users (performance_monitor_exception(), do_slb_fault()) do that.

As far as I can see, that's done earlier, in exceptions-64s.S.
Look for kuap_save_amr_and_lock.

Now, it may be possible that one of the exception paths misses it.

> Yet they still call one of the interrupt_return variants, which restores
> the status from the stack.
> 
>> Whenever an interrupt is taken, kuap_save_amr_and_lock() macro is used
>> to save KUAP status into the stack then lock KUAP access. At interrupt
>> exit, kuap_kernel_restore() macro or function is used to restore KUAP
>> access from the stack. At the time the task switch happens, KUAP access
>> is expected to be locked. During task switch, the stack is switched so
>> the KUAP status is taken back from the new task's stack.
> 
> What if another task calls schedule() from kernel process context, and
> the scheduler switches to a task that had been preempted inside
> copy_{from,to}_user()? Then there is no interrupt involved, and I don't
> see where kuap_kernel_restore() would get called.

Yes, there is an interrupt involved. If that task was preempted inside 
copy_from_user(), it must have gone through an interrupt, likely a timer 
interrupt. So switching back to that task means doing an interrupt return 
in the context of that task. That's when the KUAP status should be 
restored.

> 
>> Your fix suggests that there is some path where the KUAP status is not
>> properly saved and/or restored. Did you try running with
>> CONFIG_PPC_KUAP_DEBUG ? It should warn whenever a KUAP access is left
>> unlocked.
>>
>>>
>>> Kernel attempted to write user page (3fff7ab68190) - exploit attempt? 
>>> (uid: 65536)
>>> [ cut here ]
>>> Bug: Write fault blocked by KUAP!
>>> WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 
>>> ___do_page_fault+0x7b4/0xaa0
>>> CPU: 56 PID: 4939 Comm: git Tainted: GW 
>>> 5.19.8-5-gba424747260d #1
>>> NIP:  c00555e4 LR: c00555e0 CTR: c079d9d0
>>> REGS: c0008f507370 TRAP: 0700   Tainted: GW  
>>> (5.19.8-5-gba424747260d)
>>> MSR:  90021033   CR: 2804  XER: 
>>> 2004
>>> CFAR: c0123780 IRQMASK: 3
>>> NIP [c00555e4] ___do_page_fault+0x7b4/0xaa0
>>> LR [c00555e0] ___do_page_fault+0x7b0/0xaa0
>>> Call Trace:
>>> [c0008f507610] [c00555e0] ___do_page_fault+0x7b0/0xaa0 
>>> (unreliable)
>>> [c0008f5076c0] [c0055938] do_page_fault+0x68/0x130
>>> [c0008f5076f0] [c0008914] 
>>> data_access_common_virt+0x194/0x1f0
>>> --- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
>>
>> ...
>>
>>>
>>> Fix this by saving and restoring the kernel-side AMR/IAMR values when
>>> switching tasks.
>>
>> As explained above, KUAP access should be locked at that time, so saving
>> and restoring it should not have any effect. If it does, it means
>> something goes wrong somewhere else.
>>
>>>
>>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
>>> Signed-off-by: Samuel Holland 
>>> ---
>>> I have no idea if this is the right change to make, and it could be
>>> optimized, but my system has been stable with this patch for 5 days now.
>>>
>>> Without the patch, I hit the bug every few minutes when my load average
>>> is <1, and I hit it immediately if I try to do a parallel kernel build.
>>
>> Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ?
> 
> Yes, I will try this out in the next few days.
> 

Thanks
Christophe