Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function
On Mon, 2022-09-19 at 06:04 +, Christophe Leroy wrote:
> With CONFIG_STRICT_KERNEL_RWX, this patches causes a 15% time increase
> for activation/deactivation of ftrace.

It's possible that the new alignment check is the cause. I'll see.

> Without CONFIG_STRICT_KERNEL_RWX, it doesn't build.

Yup, fixed for v2.

> > +static int __patch_text(void *dest, const void *src, size_t size, bool is_exec, void *exec_addr)
>
> Is 'text' a good name ? For me text mean executable code. Should it be
> __patch_memory() ?

Well, patching regular memory is just a normal store. Text to me implies it's non-writeable. Though __patch_memory would be fine.

> Why pass src as a void * ? This forces data to go via the stack. Can't
> you pass it as a 'long' ?

Probably, I wasn't aware that it would make a difference. I prefer pointers in general for their semantic meaning, but will change if it affects param passing.

> > +	if (virt_to_pfn(dest) != virt_to_pfn(dest + size - 1))
> > +		return -EFAULT;
>
> Why do you need that new check ?

If the patch crosses a page boundary then letting it happen is unpredictable. Though perhaps this requirement can just be put as a comment, or require that patches be aligned to the patch size.

> > +	case 8:
> > +		__put_kernel_nofault(dest, src, u64, failed);
> > +		break;
>
> Is case 8 needed for PPC32 ?

I don't have a particular need for it, but the underlying __put_kernel_nofault is capable of it so I included it.

> > +	}
>
> Do you catch it when size if none of 1,2,4,8 ?

Not yet. Perhaps I should wrap patch_text_data in a macro that checks the size with BUILD_BUG_ON? I'd rather not check at runtime.

> > +
> > +	asm ("dcbst 0, %0; sync" :: "r" (dest));
>
> Maybe write it in C:
>
> 	dcbst(dest);
> 	mb(); /* sync */
>
> > +
> > +	if (is_exec)
> > +		asm ("icbi 0,%0; sync; isync" :: "r" (exec_addr));
>
> Same, can be:
>
> 	if (is_exec) {
> 		icbi(exec_addr);
> 		mb(); /* sync */
> 		isync();
> 	}
>
> Or keep it flat:
>
> 	if (!is_exec)
> 		return 0;
>
> 	icbi(exec_addr);
> 	mb(); /* sync */
> 	isync();
>
> 	return 0;

Will try this.

> > +static int do_patch_text(void *dest, const void *src, size_t size, bool is_exec)
> > +{
> > +	int err;
> > +	pte_t *pte;
> > +	u32 *patch_addr;
> > +
> > +	pte = start_text_patch(dest, &patch_addr);
> > +	err = __patch_text(patch_addr, src, size, is_exec, dest);
> > +	finish_text_patch(pte);
>
> Why do you need to split this function in three parts ? I can't see the
> added value, all it does is reduce readability.

It made it more readable to me, so the __patch_text didn't get buried. It also made it easier to do the refactoring, and potentially add code patching variants that use the poke area but not __patch_text. I'll remove it for v2 though given this is the only use right now.

> Did you check the impact of calling __this_cpu_read() twice ?

I wasn't concerned about performance, but given I'll merge it back again it will only be read once in v2 again.

> > +void *patch_memory(void *dest, const void *src, size_t size)
>
> What is this function used for ?

Build failure apparently :)  It's removed in v2.
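A minimal sketch of the compile-time size check idea floated above, assuming the kernel's BUILD_BUG_ON() and a hypothetical __patch_text_data() internal helper (the names are illustrative, not from the posted patch):

#define patch_text_data(dest, src, size)				\
({									\
	/* size must be a compile-time constant of 1, 2, 4 or 8 */	\
	BUILD_BUG_ON((size) != 1 && (size) != 2 &&			\
		     (size) != 4 && (size) != 8);			\
	__patch_text_data((dest), (src), (size));			\
})

A call like patch_text_data(p, &val, sizeof(val)) then only compiles for the supported sizes, and the runtime switch never sees an unexpected value.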
Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function
On 16/09/2022 at 08:23, Benjamin Gray wrote:
>
> -static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
> +static int patch_text(void *dest, const void *src, size_t size, bool is_exec)
>  {
>  	int err;
>  	unsigned long flags;
>
> -	/*
> -	 * During early early boot patch_instruction is called
> -	 * when text_poke_area is not ready, but we still need
> -	 * to allow patching. We just do the plain old patching
> -	 */
> +	/* Make sure we aren't patching a freed init section */
> +	if (static_branch_likely(&init_mem_is_free) && init_section_contains(dest, 4))
> +		return 0;
> +
>  	if (!static_branch_likely(&poking_init_done))
> -		return raw_patch_instruction(addr, instr);
> +		return __patch_text(dest, src, size, is_exec, dest);

Test ordering looks odd. How can init_mem_is_free be true and
poking_init_done be false ?

>  	local_irq_save(flags);
> -	err = __do_patch_instruction(addr, instr);
> +	err = do_patch_text(dest, src, size, is_exec);
>  	local_irq_restore(flags);
>
>  	return err;
>  }
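One reading of the ordering comment is simply to test the early-boot case first, since init memory is only freed well after poking_init() has run; a rough sketch using the same static keys as the patch (not a tested change):

	/* Early boot: the text_poke_area is not set up yet, patch in place. */
	if (!static_branch_likely(&poking_init_done))
		return __patch_text(dest, src, size, is_exec, dest);

	/* Make sure we aren't patching a freed init section. */
	if (static_branch_likely(&init_mem_is_free) && init_section_contains(dest, 4))
		return 0;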
Re: [PATCH 4/4] powerpc/64s: Enable KFENCE on book3s64
Le 19/09/2022 à 03:44, Nicholas Miehlbradt a écrit : > KFENCE support was added for ppc32 in commit 90cbac0e995d > ("powerpc: Enable KFENCE for PPC32"). > Enable KFENCE on ppc64 architecture with hash and radix MMUs. > It uses the same mechanism as debug pagealloc to > protect/unprotect pages. All KFENCE kunit tests pass on both > MMUs. > > KFENCE memory is initially allocated using memblock but is > later marked as SLAB allocated. This necessitates the change > to __pud_free to ensure that the KFENCE pages are freed > appropriately. > > Based on previous work by Christophe Leroy and Jordan Niethe. > > Signed-off-by: Nicholas Miehlbradt > --- > arch/powerpc/Kconfig | 2 +- > arch/powerpc/include/asm/book3s/64/pgalloc.h | 6 -- > arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +- > arch/powerpc/include/asm/kfence.h| 18 ++ > arch/powerpc/mm/book3s64/hash_utils.c| 10 +- > arch/powerpc/mm/book3s64/radix_pgtable.c | 8 +--- > 6 files changed, 34 insertions(+), 12 deletions(-) > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index a4f8a5276e5c..f7dd0f49510d 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -194,7 +194,7 @@ config PPC > select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 > select HAVE_ARCH_KASAN if PPC_RADIX_MMU > select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN > - select HAVE_ARCH_KFENCE if PPC_BOOK3S_32 || PPC_8xx || > 40x > + select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC > select HAVE_ARCH_KGDB > select HAVE_ARCH_MMAP_RND_BITS > select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT > diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h > b/arch/powerpc/include/asm/book3s/64/pgalloc.h > index e1af0b394ceb..dd2cff53a111 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h > +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h > @@ -113,9 +113,11 @@ static inline void __pud_free(pud_t *pud) > > /* >* Early pud pages allocated via memblock allocator > - * can't be directly freed to slab > + * can't be directly freed to slab. KFENCE pages have > + * both reserved and slab flags set so need to be freed > + * kmem_cache_free. >*/ > - if (PageReserved(page)) > + if (PageReserved(page) && !PageSlab(page)) > free_reserved_page(page); > else > kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud); > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h > b/arch/powerpc/include/asm/book3s/64/pgtable.h > index cb9d5fd39d7f..fd5d800f2836 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -1123,7 +1123,7 @@ static inline void vmemmap_remove_mapping(unsigned long > start, > } > #endif > > -#ifdef CONFIG_DEBUG_PAGEALLOC > +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE) > static inline void __kernel_map_pages(struct page *page, int numpages, int > enable) > { > if (radix_enabled()) > diff --git a/arch/powerpc/include/asm/kfence.h > b/arch/powerpc/include/asm/kfence.h > index a9846b68c6b9..33edbc312a51 100644 > --- a/arch/powerpc/include/asm/kfence.h > +++ b/arch/powerpc/include/asm/kfence.h > @@ -11,11 +11,28 @@ > #include > #include > > +#if defined(CONFIG_PPC64) && !defined(CONFIG_PPC64_ELF_ABI_V2) Can be replaced by: #ifdef CONFIG_PPC64_ELF_ABI_V1 > +#define ARCH_FUNC_PREFIX "." 
> +#endif > + > static inline bool arch_kfence_init_pool(void) > { > return true; > } > > +#ifdef CONFIG_PPC64 > +static inline bool kfence_protect_page(unsigned long addr, bool protect) > +{ > + struct page *page = virt_to_page(addr); > + > + if (protect) > + __kernel_map_pages(page, 1, 0); > + else > + __kernel_map_pages(page, 1, 1); Can be: __kernel_map_pages(virt_to_page(addr), !protect); > + > + return true; > +} > +#else > static inline bool kfence_protect_page(unsigned long addr, bool protect) > { > pte_t *kpte = virt_to_kpte(addr); > @@ -29,5 +46,6 @@ static inline bool kfence_protect_page(unsigned long addr, > bool protect) > > return true; > } > +#endif > > #endif /* __ASM_POWERPC_KFENCE_H */ > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c > b/arch/powerpc/mm/book3s64/hash_utils.c > index b37412fe5930..9cceaa5998a3 100644 > --- a/arch/powerpc/mm/book3s64/hash_utils.c > +++ b/arch/powerpc/mm/book3s64/hash_utils.c > @@ -424,7 +424,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long > vend, > break; > > cond_resched(); > - if (debug_pagealloc_enabled() && > + if (debug_pagealloc_enabled_or_kfence() && > (paddr >> PAGE_SHIFT) < linear_map_hash_count) >
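The collapsed form Christophe is suggesting would look roughly like this, assuming the usual __kernel_map_pages(page, numpages, enable) signature used elsewhere in the patch:

static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
	/* Protecting a page means unmapping it, so enable == !protect. */
	__kernel_map_pages(virt_to_page(addr), 1, !protect);

	return true;
}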
Re: [PATCH 1/4] powerpc/64s: Add DEBUG_PAGEALLOC for radix
Le 19/09/2022 à 03:44, Nicholas Miehlbradt a écrit : > [Vous ne recevez pas souvent de courriers de nicho...@linux.ibm.com. > Découvrez pourquoi ceci est important à > https://aka.ms/LearnAboutSenderIdentification ] > > There is support for DEBUG_PAGEALLOC on hash but not on radix. > Add support on radix. > > Signed-off-by: Nicholas Miehlbradt > --- > arch/powerpc/mm/book3s64/radix_pgtable.c | 16 +++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c > b/arch/powerpc/mm/book3s64/radix_pgtable.c > index db2f3d193448..483c99bfbde5 100644 > --- a/arch/powerpc/mm/book3s64/radix_pgtable.c > +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c > @@ -30,6 +30,7 @@ > #include > #include > #include > +#include > > #include > > @@ -503,6 +504,9 @@ static unsigned long __init radix_memory_block_size(void) > { > unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE; > > + if (debug_pagealloc_enabled()) > + return PAGE_SIZE; > + > /* > * OPAL firmware feature is set by now. Hence we are ok > * to test OPAL feature. > @@ -519,6 +523,9 @@ static unsigned long __init radix_memory_block_size(void) > > static unsigned long __init radix_memory_block_size(void) > { > + if (debug_pagealloc_enabled()) > + return PAGE_SIZE; > + > return 1UL * 1024 * 1024 * 1024; While at it, maybe you can replace the above by SZ_1G > } > > @@ -899,7 +906,14 @@ void __meminit radix__vmemmap_remove_mapping(unsigned > long start, unsigned long > #ifdef CONFIG_DEBUG_PAGEALLOC > void radix__kernel_map_pages(struct page *page, int numpages, int enable) > { > - pr_warn_once("DEBUG_PAGEALLOC not supported in radix mode\n"); > + unsigned long addr; > + > + addr = (unsigned long)page_address(page); > + > + if (enable) > + set_memory_p(addr, numpages); > + else > + set_memory_np(addr, numpages); > } > #endif > > -- > 2.34.1 >
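With the SZ_1G suggestion applied (SZ_1G comes from linux/sizes.h), the radix helper would read roughly:

static unsigned long __init radix_memory_block_size(void)
{
	/* With DEBUG_PAGEALLOC the linear map must be mapped at page size. */
	if (debug_pagealloc_enabled())
		return PAGE_SIZE;

	return SZ_1G;
}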
Re: [PATCH 3/6] powerpc/module: Optimise nearby branches in ELF V2 ABI stub
Le 16/09/2022 à 08:23, Benjamin Gray a écrit : > Inserts a direct branch to the stub target when possible, replacing the > mtctr/btctr sequence. > > The load into r12 could potentially be skipped too, but that change > would need to refactor the arguments to indicate that the address > does not have a separate local entry point. > > This helps the static call implementation, where modules calling their > own trampolines are called through this stub and the trampoline is > easily within range of a direct branch. > > Signed-off-by: Benjamin Gray > --- > arch/powerpc/kernel/module_64.c | 13 +++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c > index 3656476097c2..03ab28d86008 100644 > --- a/arch/powerpc/kernel/module_64.c > +++ b/arch/powerpc/kernel/module_64.c > @@ -432,8 +432,17 @@ static inline int create_stub(const Elf64_Shdr *sechdrs, > return create_ftrace_stub(entry, addr, me); > > for (i = 0; i < ARRAY_SIZE(ppc64_stub_insns); i++) { > - if (patch_instruction(&entry->jump[i], > - ppc_inst(ppc64_stub_insns[i]))) > + ppc_inst_t inst = ppc_inst(ppc64_stub_insns[i]); > + > + // Replace the indirect branch with a direct branch where > possible > + if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2) && i == 4) { > + ppc_inst_t direct; Add blank line here. > + if (create_branch(&direct, (void*) entry + (i * 4), > addr, 0) == 0) { No braces for single lines if statement. > + inst = direct; > + } > + } > + > + if (patch_instruction(&entry->jump[i], inst)) > return 0; > } >
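For reference, the loop with both style comments applied (blank line after the declaration, no braces around the single-statement if), otherwise unchanged from the quoted patch:

	for (i = 0; i < ARRAY_SIZE(ppc64_stub_insns); i++) {
		ppc_inst_t inst = ppc_inst(ppc64_stub_insns[i]);

		// Replace the indirect branch with a direct branch where possible
		if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2) && i == 4) {
			ppc_inst_t direct;

			if (create_branch(&direct, (void *)entry + (i * 4), addr, 0) == 0)
				inst = direct;
		}

		if (patch_instruction(&entry->jump[i], inst))
			return 0;
	}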
Re: [PATCH 2/6] powerpc/module: Handle caller-saved TOC in module linker
Le 16/09/2022 à 08:23, Benjamin Gray a écrit : > The callee may set a field in `st_other` to 1 to indicate r2 should be > treated as caller-saved. This means a trampoline must be used to save > the current TOC before calling it and restore it afterwards, much like > external calls. > > This is necessary for supporting V2 ABI static calls that do not > preserve the TOC. > > Signed-off-by: Benjamin Gray > --- > arch/powerpc/kernel/module_64.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c > index 7e45dc98df8a..3656476097c2 100644 > --- a/arch/powerpc/kernel/module_64.c > +++ b/arch/powerpc/kernel/module_64.c > @@ -55,6 +55,11 @@ static unsigned int local_entry_offset(const Elf64_Sym > *sym) >* of function and try to derive r2 from it). */ > return PPC64_LOCAL_ENTRY_OFFSET(sym->st_other); > } > + > +static bool need_r2save_stub(unsigned char st_other) { Please read https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces " ... functions: they have the opening brace at the beginning of the next line ..." > + return ((st_other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT) == 1; > +} > + > #else > > static func_desc_t func_desc(unsigned long addr) > @@ -66,6 +71,10 @@ static unsigned int local_entry_offset(const Elf64_Sym > *sym) > return 0; > } > > +static bool need_r2save_stub(unsigned char st_other) { > + return false; > +} > + > void *dereference_module_function_descriptor(struct module *mod, void *ptr) > { > if (ptr < (void *)mod->arch.start_opd || > @@ -632,7 +641,8 @@ int apply_relocate_add(Elf64_Shdr *sechdrs, > case R_PPC_REL24: > /* FIXME: Handle weak symbols here --RR */ > if (sym->st_shndx == SHN_UNDEF || > - sym->st_shndx == SHN_LIVEPATCH) { > + sym->st_shndx == SHN_LIVEPATCH || > + need_r2save_stub(sym->st_other)) { > /* External: go via stub */ > value = stub_for_addr(sechdrs, value, me, > strtab + sym->st_name);
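With the brace placement fixed per the coding-style reference, the 64-bit helper becomes:

static bool need_r2save_stub(unsigned char st_other)
{
	/* st_other == 1 marks an entry that treats r2 as caller-saved. */
	return ((st_other & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT) == 1;
}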
Re: [PATCH 1/6] powerpc/code-patching: Implement generic text patching function
Le 16/09/2022 à 08:23, Benjamin Gray a écrit : > Adds a generic text patching mechanism for patches of 1, 2, 4, or 8 > bytes. The patcher conditionally syncs the icache depending on if > the content will be executed (as opposed to, e.g., read-only data). > > The `patch_instruction` function is reimplemented in terms of this > more generic function. This generic implementation allows patching of > arbitrary 64-bit data, whereas the original `patch_instruction` decided > the size based on the 'instruction' opcode, so was not suitable for > arbitrary data. With CONFIG_STRICT_KERNEL_RWX, this patches causes a 15% time increase for activation/deactivation of ftrace. Without CONFIG_STRICT_KERNEL_RWX, it doesn't build. > > Signed-off-by: Benjamin Gray > --- > arch/powerpc/include/asm/code-patching.h | 1 + > arch/powerpc/lib/code-patching.c | 135 +++ > 2 files changed, 89 insertions(+), 47 deletions(-) > > diff --git a/arch/powerpc/include/asm/code-patching.h > b/arch/powerpc/include/asm/code-patching.h > index 1c6316ec4b74..6a52c19dae46 100644 > --- a/arch/powerpc/include/asm/code-patching.h > +++ b/arch/powerpc/include/asm/code-patching.h > @@ -76,6 +76,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr, > int patch_branch(u32 *addr, unsigned long target, int flags); > int patch_instruction(u32 *addr, ppc_inst_t instr); > int raw_patch_instruction(u32 *addr, ppc_inst_t instr); > +int patch_text_data(void *dest, const void *src, size_t size); > > static inline unsigned long patch_site_addr(s32 *site) > { > diff --git a/arch/powerpc/lib/code-patching.c > b/arch/powerpc/lib/code-patching.c > index ad0cf3108dd0..a09a0898c2ce 100644 > --- a/arch/powerpc/lib/code-patching.c > +++ b/arch/powerpc/lib/code-patching.c > @@ -3,6 +3,7 @@ >* Copyright 2008 Michael Ellerman, IBM Corporation. >*/ > > +#include > #include > #include > #include > @@ -14,32 +15,7 @@ > #include > #include > #include > - > -static int __patch_instruction(u32 *exec_addr, ppc_inst_t instr, u32 > *patch_addr) > -{ > - if (!ppc_inst_prefixed(instr)) { > - u32 val = ppc_inst_val(instr); > - > - __put_kernel_nofault(patch_addr, &val, u32, failed); > - } else { > - u64 val = ppc_inst_as_ulong(instr); > - > - __put_kernel_nofault(patch_addr, &val, u64, failed); > - } > - > - asm ("dcbst 0, %0; sync; icbi 0,%1; sync; isync" :: "r" (patch_addr), > - "r" (exec_addr)); > - > - return 0; > - > -failed: > - return -EPERM; > -} > - > -int raw_patch_instruction(u32 *addr, ppc_inst_t instr) > -{ > - return __patch_instruction(addr, instr, addr); > -} > +#include > > #ifdef CONFIG_STRICT_KERNEL_RWX > static DEFINE_PER_CPU(struct vm_struct *, text_poke_area); > @@ -147,16 +123,44 @@ static void unmap_patch_area(unsigned long addr) > flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > } > > -static int __do_patch_instruction(u32 *addr, ppc_inst_t instr) > +static int __patch_text(void *dest, const void *src, size_t size, bool > is_exec, void *exec_addr) Is 'text' a good name ? For me text mean executable code. Should it be __patch_memory() ? Why pass src as a void * ? This forces data to go via the stack. Can't you pass it as a 'long' ? > +{ > + if (virt_to_pfn(dest) != virt_to_pfn(dest + size - 1)) > + return -EFAULT; Why do you need that new check ? 
> + > + switch (size) { > + case 1: > + __put_kernel_nofault(dest, src, u8, failed); > + break; > + case 2: > + __put_kernel_nofault(dest, src, u16, failed); > + break; > + case 4: > + __put_kernel_nofault(dest, src, u32, failed); > + break; > + case 8: > + __put_kernel_nofault(dest, src, u64, failed); > + break; Is case 8 needed for PPC32 ? > + } Do you catch it when size if none of 1,2,4,8 ? > + > + asm ("dcbst 0, %0; sync" :: "r" (dest)); Maybe write it in C: dcbst(dest); mb(); /* sync */ > + > + if (is_exec) > + asm ("icbi 0,%0; sync; isync" :: "r" (exec_addr)); Same, can be: if (is_exec) { icbi(exec_addr); mb(); /* sync */ isync(); } Or keep it flat: if (!is_exec) return 0; icbi(exec_addr); mb(); /* sync */ isync(); return 0; > + > + return 0; > + > +failed: > + return -EPERM; > +} > + > +static pte_t *start_text_patch(void* dest, u32 **patch_addr) > { > - int err; > - u32 *patch_addr; > - unsigned long text_poke_addr; > pte_t *pte; > - unsigned long pfn = get_patch_pfn(addr); > + unsigned long text_poke_addr = (unsigned > long
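Pulling the two review suggestions together, the tail of the patching helper written in C (assuming the dcbst(), icbi(), mb() and isync() helpers are usable in this context) would be roughly:

	dcbst(dest);
	mb(); /* sync: order the data cache flush before any icbi */

	if (!is_exec)
		return 0;

	icbi(exec_addr);
	mb(); /* sync */
	isync();

	return 0;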
Re: [PATCH v2 3/7] powerpc/build: move got, toc, plt, branch_lt sections to read-only
kernel test robot writes: > Hi Nicholas, > > I love your patch! Yet something to improve: > > [auto build test ERROR on powerpc/next] > [also build test ERROR on linus/master v6.0-rc5 next-20220916] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: > https://github.com/intel-lab-lkp/linux/commits/Nicholas-Piggin/powerpc-build-linker-improvements/20220916-121310 > base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next > config: powerpc-microwatt_defconfig > (https://download.01.org/0day-ci/archive/20220918/202209180437.4u3soljk-...@intel.com/config) > compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project > 791a7ae1ba3efd6bca96338e10ffde557ba83920) > reproduce (this is a W=1 build): > wget > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O > ~/bin/make.cross > chmod +x ~/bin/make.cross > # install powerpc cross compiling tool for clang build > # apt-get install binutils-powerpc-linux-gnu > # > https://github.com/intel-lab-lkp/linux/commit/6c034c08f8d0addb6fecba38c9c428b1c4df7c29 > git remote add linux-review https://github.com/intel-lab-lkp/linux > git fetch --no-tags linux-review > Nicholas-Piggin/powerpc-build-linker-improvements/20220916-121310 > git checkout 6c034c08f8d0addb6fecba38c9c428b1c4df7c29 > # save the config file > mkdir build_dir && cp config build_dir/.config > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 > O=build_dir ARCH=powerpc SHELL=/bin/bash ^ > If you fix the issue, kindly add following tag where applicable > Reported-by: kernel test robot > > All errors (new ones prefixed by >>): > >>> ld.lld: error: ./arch/powerpc/kernel/vmlinux.lds:46: { expected, but got >>> SPECIAL >>>> .got : AT(ADDR(.got) - (0xc000 -0x)) SPECIAL { >>>> ^ I guess SPECIAL is a binutils ld ism? I can't find it documented anywhere. Presumably we can just not use it, given we never did before? cheers
Re: [PATCH] powerpc/microwatt: Remove unused early debug code
On Mon, 2022-09-19 at 15:27 +1000, Michael Ellerman wrote:
> The original microwatt submission[1] included some early debug code for
> using the Microwatt "potato" UART.
>
> The series that was eventually merged switched to using a standard UART,
> and so doesn't need any special early debug handling. But some of the
> original code was merged accidentally under the non-existent
> CONFIG_PPC_EARLY_DEBUG_MICROWATT.
>
> Drop the unused code.
>
> 1: https://lore.kernel.org/linuxppc-dev/20200509050340.gd1464...@thinks.paulus.ozlabs.org/
>
> Fixes: 48b545b8018d ("powerpc/microwatt: Use standard 16550 UART for console")
> Reported-by: Lukas Bulwahn
> Signed-off-by: Michael Ellerman

grep confirms this is indeed dead code.

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com             IBM Australia Limited
[PATCH] powerpc/microwatt: Remove unused early debug code
The original microwatt submission[1] included some early debug code for using the Microwatt "potato" UART. The series that was eventually merged switched to using a standard UART, and so doesn't need any special early debug handling. But some of the original code was merged accidentally under the non-existent CONFIG_PPC_EARLY_DEBUG_MICROWATT. Drop the unused code. 1: https://lore.kernel.org/linuxppc-dev/20200509050340.gd1464...@thinks.paulus.ozlabs.org/ Fixes: 48b545b8018d ("powerpc/microwatt: Use standard 16550 UART for console") Reported-by: Lukas Bulwahn Signed-off-by: Michael Ellerman --- arch/powerpc/kernel/udbg_16550.c | 39 1 file changed, 39 deletions(-) diff --git a/arch/powerpc/kernel/udbg_16550.c b/arch/powerpc/kernel/udbg_16550.c index d3942de254c6..ddfbc74bf85f 100644 --- a/arch/powerpc/kernel/udbg_16550.c +++ b/arch/powerpc/kernel/udbg_16550.c @@ -296,42 +296,3 @@ void __init udbg_init_40x_realmode(void) } #endif /* CONFIG_PPC_EARLY_DEBUG_40x */ - -#ifdef CONFIG_PPC_EARLY_DEBUG_MICROWATT - -#define UDBG_UART_MW_ADDR ((void __iomem *)0xc0002000) - -static u8 udbg_uart_in_isa300_rm(unsigned int reg) -{ - uint64_t msr = mfmsr(); - uint8_t c; - - mtmsr(msr & ~(MSR_EE|MSR_DR)); - isync(); - eieio(); - c = __raw_rm_readb(UDBG_UART_MW_ADDR + (reg << 2)); - mtmsr(msr); - isync(); - return c; -} - -static void udbg_uart_out_isa300_rm(unsigned int reg, u8 val) -{ - uint64_t msr = mfmsr(); - - mtmsr(msr & ~(MSR_EE|MSR_DR)); - isync(); - eieio(); - __raw_rm_writeb(val, UDBG_UART_MW_ADDR + (reg << 2)); - mtmsr(msr); - isync(); -} - -void __init udbg_init_debug_microwatt(void) -{ - udbg_uart_in = udbg_uart_in_isa300_rm; - udbg_uart_out = udbg_uart_out_isa300_rm; - udbg_use_uart(); -} - -#endif /* CONFIG_PPC_EARLY_DEBUG_MICROWATT */ -- 2.37.2
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On Mon, Sep 19, 2022 at 4:53 PM Barry Song <21cn...@gmail.com> wrote: > > On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual > wrote: > > > > > > > > On 9/15/22 12:12, Barry Song wrote: > > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > > > wrote: > > >> > > >> > > >> > > >> On 9/9/22 11:05, Barry Song wrote: > > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > > >>> wrote: > > > > > > > > On 8/22/22 13:51, Yicong Yang wrote: > > > From: Barry Song > > > > > > on x86, batched and deferred tlb shootdown has lead to 90% > > > performance increase on tlb shootdown. on arm64, HW can do > > > tlb shootdown without software IPI. But sync tlbi is still > > > quite expensive. > > > > > > Even running a simplest program which requires swapout can > > > prove this is true, > > > #include > > > #include > > > #include > > > #include > > > > > > int main() > > > { > > > #define SIZE (1 * 1024 * 1024) > > > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | > > > PROT_WRITE, > > > MAP_SHARED | MAP_ANONYMOUS, > > > -1, 0); > > > > > > memset(p, 0x88, SIZE); > > > > > > for (int k = 0; k < 1; k++) { > > > /* swap in */ > > > for (int i = 0; i < SIZE; i += 4096) { > > > (void)p[i]; > > > } > > > > > > /* swap out */ > > > madvise(p, SIZE, MADV_PAGEOUT); > > > } > > > } > > > > > > Perf result on snapdragon 888 with 8 cores by using zRAM > > > as the swap block device. > > > > > > ~ # perf record taskset -c 4 ./a.out > > > [ perf record: Woken up 10 times to write data ] > > > [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) > > > ] > > > ~ # perf report > > > # To display the perf.data header info, please use > > > --header/--header-only options. > > > # To display the perf.data header info, please use > > > --header/--header-only options. > > > # > > > # > > > # Total Lost Samples: 0 > > > # > > > # Samples: 60K of event 'cycles' > > > # Event count (approx.): 35706225414 > > > # > > > # Overhead Command Shared Object Symbol > > > # ... . > > > . > > > # > > > 21.07% a.out[kernel.kallsyms] [k] _raw_spin_unlock_irq > > > 8.23% a.out[kernel.kallsyms] [k] > > > _raw_spin_unlock_irqrestore > > > 6.67% a.out[kernel.kallsyms] [k] filemap_map_pages > > > 6.16% a.out[kernel.kallsyms] [k] __zram_bvec_write > > > 5.36% a.out[kernel.kallsyms] [k] ptep_clear_flush > > > 3.71% a.out[kernel.kallsyms] [k] _raw_spin_lock > > > 3.49% a.out[kernel.kallsyms] [k] memset64 > > > 1.63% a.out[kernel.kallsyms] [k] clear_page > > > 1.42% a.out[kernel.kallsyms] [k] _raw_spin_unlock > > > 1.26% a.out[kernel.kallsyms] [k] > > > mod_zone_state.llvm.8525150236079521930 > > > 1.23% a.out[kernel.kallsyms] [k] xas_load > > > 1.15% a.out[kernel.kallsyms] [k] zram_slot_lock > > > > > > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > > > swapping in/out a page mapped by only one process. If the > > > page is mapped by multiple processes, typically, like more > > > than 100 on a phone, the overhead would be much higher as > > > we have to run tlb flush 100 times for one single page. > > > Plus, tlb flush overhead will increase with the number > > > of CPU cores due to the bad scalability of tlb shootdown > > > in HW, so those ARM64 servers should expect much higher > > > overhead. > > > > > > Further perf annonate shows 95% cpu time of ptep_clear_flush > > > is actually used by the final dsb() to wait for the completion > > > of tlb flush. This provides us a very good chance to leverage > > > the existing batched tlb in kernel. 
The minimum modification > > > is that we only send async tlbi in the first stage and we send > > > dsb while we have to sync in the second stage. > > > > > > With the above simplest micro benchmark, collapsed time to > > > finish the program decreases around 5%. > > > > > > Typical collapsed time w/o patch: > > > ~ # time taskset -c 4 ./a.out > > > 0.21user 14.34system 0:14.69elapsed > > > w/ patch: > > > ~ # time taskset -c 4 ./a.out > > > 0.22user 13.45system 0:13.80elapsed > > > > > > Also, Yicong Yang added the following observation. > > > Tested with benchmark in the commit on Kun
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual wrote: > > > > On 9/15/22 12:12, Barry Song wrote: > > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > > wrote: > >> > >> > >> > >> On 9/9/22 11:05, Barry Song wrote: > >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual > >>> wrote: > > > > On 8/22/22 13:51, Yicong Yang wrote: > > From: Barry Song > > > > on x86, batched and deferred tlb shootdown has lead to 90% > > performance increase on tlb shootdown. on arm64, HW can do > > tlb shootdown without software IPI. But sync tlbi is still > > quite expensive. > > > > Even running a simplest program which requires swapout can > > prove this is true, > > #include > > #include > > #include > > #include > > > > int main() > > { > > #define SIZE (1 * 1024 * 1024) > > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | > > PROT_WRITE, > > MAP_SHARED | MAP_ANONYMOUS, > > -1, 0); > > > > memset(p, 0x88, SIZE); > > > > for (int k = 0; k < 1; k++) { > > /* swap in */ > > for (int i = 0; i < SIZE; i += 4096) { > > (void)p[i]; > > } > > > > /* swap out */ > > madvise(p, SIZE, MADV_PAGEOUT); > > } > > } > > > > Perf result on snapdragon 888 with 8 cores by using zRAM > > as the swap block device. > > > > ~ # perf record taskset -c 4 ./a.out > > [ perf record: Woken up 10 times to write data ] > > [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > > ~ # perf report > > # To display the perf.data header info, please use > > --header/--header-only options. > > # To display the perf.data header info, please use > > --header/--header-only options. > > # > > # > > # Total Lost Samples: 0 > > # > > # Samples: 60K of event 'cycles' > > # Event count (approx.): 35706225414 > > # > > # Overhead Command Shared Object Symbol > > # ... . > > . > > # > > 21.07% a.out[kernel.kallsyms] [k] _raw_spin_unlock_irq > > 8.23% a.out[kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > > 6.67% a.out[kernel.kallsyms] [k] filemap_map_pages > > 6.16% a.out[kernel.kallsyms] [k] __zram_bvec_write > > 5.36% a.out[kernel.kallsyms] [k] ptep_clear_flush > > 3.71% a.out[kernel.kallsyms] [k] _raw_spin_lock > > 3.49% a.out[kernel.kallsyms] [k] memset64 > > 1.63% a.out[kernel.kallsyms] [k] clear_page > > 1.42% a.out[kernel.kallsyms] [k] _raw_spin_unlock > > 1.26% a.out[kernel.kallsyms] [k] > > mod_zone_state.llvm.8525150236079521930 > > 1.23% a.out[kernel.kallsyms] [k] xas_load > > 1.15% a.out[kernel.kallsyms] [k] zram_slot_lock > > > > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > > swapping in/out a page mapped by only one process. If the > > page is mapped by multiple processes, typically, like more > > than 100 on a phone, the overhead would be much higher as > > we have to run tlb flush 100 times for one single page. > > Plus, tlb flush overhead will increase with the number > > of CPU cores due to the bad scalability of tlb shootdown > > in HW, so those ARM64 servers should expect much higher > > overhead. > > > > Further perf annonate shows 95% cpu time of ptep_clear_flush > > is actually used by the final dsb() to wait for the completion > > of tlb flush. This provides us a very good chance to leverage > > the existing batched tlb in kernel. The minimum modification > > is that we only send async tlbi in the first stage and we send > > dsb while we have to sync in the second stage. > > > > With the above simplest micro benchmark, collapsed time to > > finish the program decreases around 5%. 
> > > > Typical collapsed time w/o patch: > > ~ # time taskset -c 4 ./a.out > > 0.21user 14.34system 0:14.69elapsed > > w/ patch: > > ~ # time taskset -c 4 ./a.out > > 0.22user 13.45system 0:13.80elapsed > > > > Also, Yicong Yang added the following observation. > > Tested with benchmark in the commit on Kunpeng920 arm64 server, > > observed an improvement around 12.5% with command > > `time ./swap_bench`. > > w/o w/ > > real0m13.460s 0m11.771s > > user0m0.248s0m0.279s > > sys 0m12.039s 0m11.458s > > > > Ori
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On 9/15/22 12:12, Barry Song wrote: > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual > wrote: >> >> >> >> On 9/9/22 11:05, Barry Song wrote: >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual >>> wrote: On 8/22/22 13:51, Yicong Yang wrote: > From: Barry Song > > on x86, batched and deferred tlb shootdown has lead to 90% > performance increase on tlb shootdown. on arm64, HW can do > tlb shootdown without software IPI. But sync tlbi is still > quite expensive. > > Even running a simplest program which requires swapout can > prove this is true, > #include > #include > #include > #include > > int main() > { > #define SIZE (1 * 1024 * 1024) > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | > PROT_WRITE, > MAP_SHARED | MAP_ANONYMOUS, -1, > 0); > > memset(p, 0x88, SIZE); > > for (int k = 0; k < 1; k++) { > /* swap in */ > for (int i = 0; i < SIZE; i += 4096) { > (void)p[i]; > } > > /* swap out */ > madvise(p, SIZE, MADV_PAGEOUT); > } > } > > Perf result on snapdragon 888 with 8 cores by using zRAM > as the swap block device. > > ~ # perf record taskset -c 4 ./a.out > [ perf record: Woken up 10 times to write data ] > [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] > ~ # perf report > # To display the perf.data header info, please use > --header/--header-only options. > # To display the perf.data header info, please use > --header/--header-only options. > # > # > # Total Lost Samples: 0 > # > # Samples: 60K of event 'cycles' > # Event count (approx.): 35706225414 > # > # Overhead Command Shared Object Symbol > # ... . > . > # > 21.07% a.out[kernel.kallsyms] [k] _raw_spin_unlock_irq > 8.23% a.out[kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > 6.67% a.out[kernel.kallsyms] [k] filemap_map_pages > 6.16% a.out[kernel.kallsyms] [k] __zram_bvec_write > 5.36% a.out[kernel.kallsyms] [k] ptep_clear_flush > 3.71% a.out[kernel.kallsyms] [k] _raw_spin_lock > 3.49% a.out[kernel.kallsyms] [k] memset64 > 1.63% a.out[kernel.kallsyms] [k] clear_page > 1.42% a.out[kernel.kallsyms] [k] _raw_spin_unlock > 1.26% a.out[kernel.kallsyms] [k] > mod_zone_state.llvm.8525150236079521930 > 1.23% a.out[kernel.kallsyms] [k] xas_load > 1.15% a.out[kernel.kallsyms] [k] zram_slot_lock > > ptep_clear_flush() takes 5.36% CPU in the micro-benchmark > swapping in/out a page mapped by only one process. If the > page is mapped by multiple processes, typically, like more > than 100 on a phone, the overhead would be much higher as > we have to run tlb flush 100 times for one single page. > Plus, tlb flush overhead will increase with the number > of CPU cores due to the bad scalability of tlb shootdown > in HW, so those ARM64 servers should expect much higher > overhead. > > Further perf annonate shows 95% cpu time of ptep_clear_flush > is actually used by the final dsb() to wait for the completion > of tlb flush. This provides us a very good chance to leverage > the existing batched tlb in kernel. The minimum modification > is that we only send async tlbi in the first stage and we send > dsb while we have to sync in the second stage. > > With the above simplest micro benchmark, collapsed time to > finish the program decreases around 5%. > > Typical collapsed time w/o patch: > ~ # time taskset -c 4 ./a.out > 0.21user 14.34system 0:14.69elapsed > w/ patch: > ~ # time taskset -c 4 ./a.out > 0.22user 13.45system 0:13.80elapsed > > Also, Yicong Yang added the following observation. 
> Tested with benchmark in the commit on Kunpeng920 arm64 server, > observed an improvement around 12.5% with command > `time ./swap_bench`. > w/o w/ > real0m13.460s 0m11.771s > user0m0.248s0m0.279s > sys 0m12.039s 0m11.458s > > Originally it's noticed a 16.99% overhead of ptep_clear_flush() > which has been eliminated by this patch: > > [root@localhost yang]# perf record -- ./swap_bench && perf report > [...] > 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush > > Cc:
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On 9/15/22 20:01, Nadav Amit wrote: > > >> On Sep 14, 2022, at 11:42 PM, Barry Song <21cn...@gmail.com> wrote: >> >>> >>> The very idea behind TLB deferral is the opportunity it (might) provide >>> to accumulate address ranges and cpu masks so that individual TLB flush >>> can be replaced with a more cost effective range based TLB flush. Hence >>> I guess unless address range or cpumask based cost effective TLB flush >>> is available, deferral does not improve the unmap performance as much. >> >> >> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack >> from all cpus in the system, tlbi is not scalable. The point here is that we >> avoid waiting for each individual TLBi. Alternatively, they are batched. If >> you read the benchmark in the commit log, you can find the great decline >> in the cost to swap out a page. > > Just a minor correction: arch_tlbbatch_flush() does not collect ranges. > On x86 it only accumulate CPU mask. Thanks Nadav for the clarification.
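To make the two-stage idea in this thread concrete: the arm64 batching hooks being discussed boil down to issuing the broadcast TLBI without waiting in the first stage, then paying for a single barrier in the second. A very rough sketch; the no-wait per-page TLBI helper is assumed here, and its exact name and signature differ across versions of the series:

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm, unsigned long uaddr)
{
	/* Stage 1: broadcast the TLBI for this page, but do not wait for it. */
	__flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* Stage 2: a single dsb(ish) waits for every TLBI issued above. */
	dsb(ish);
}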
[PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask
During discussions of this series [1], it was suggested that hugetlb handling code in follow_page_mask could be simplified. At the beginning of follow_page_mask, there currently is a call to follow_huge_addr which 'may' handle hugetlb pages. ia64 is the only architecture which provides a follow_huge_addr routine that does not return error. Instead, at each level of the page table a check is made for a hugetlb entry. If a hugetlb entry is found, a call to a routine associated with that entry is made. Currently, there are two checks for hugetlb entries at each page table level. The first check is of the form: if (p?d_huge()) page = follow_huge_p?d(); the second check is of the form: if (is_hugepd()) page = follow_huge_pd(). We can replace these checks, as well as the special handling routines such as follow_huge_p?d() and follow_huge_pd() with a single routine to handle hugetlb vmas. A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the beginning of follow_page_mask. hugetlb_follow_page_mask will use the existing routine huge_pte_offset to walk page tables looking for hugetlb entries. huge_pte_offset can be overwritten by architectures, and already handles special cases such as hugepd entries. [1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.w...@linux.alibaba.com/ Suggested-by: David Hildenbrand Signed-off-by: Mike Kravetz --- v3 -Change WARN_ON_ONCE() to BUILD_BUG() as reminded by Christophe Leroy v2 -Added WARN_ON_ONCE() and updated comment as suggested by David Fixed build issue found by kernel test robot Added vma (pmd sharing) locking to hugetlb_follow_page_mask ReBased on Baolin's patch to fix issues with CONT_* entries arch/ia64/mm/hugetlbpage.c| 15 --- arch/powerpc/mm/hugetlbpage.c | 37 --- include/linux/hugetlb.h | 50 ++ mm/gup.c | 80 +++ mm/hugetlb.c | 182 -- 5 files changed, 86 insertions(+), 278 deletions(-) diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c index f993cb36c062..380d2f3966c9 100644 --- a/arch/ia64/mm/hugetlbpage.c +++ b/arch/ia64/mm/hugetlbpage.c @@ -91,21 +91,6 @@ int prepare_hugepage_range(struct file *file, return 0; } -struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write) -{ - struct page *page; - pte_t *ptep; - - if (REGION_NUMBER(addr) != RGN_HPAGE) - return ERR_PTR(-EINVAL); - - ptep = huge_pte_offset(mm, addr, HPAGE_SIZE); - if (!ptep || pte_none(*ptep)) - return NULL; - page = pte_page(*ptep); - page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT); - return page; -} int pmd_huge(pmd_t pmd) { return 0; diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index bc84a594ca62..b0e037c75c12 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -506,43 +506,6 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, } while (addr = next, addr != end); } -struct page *follow_huge_pd(struct vm_area_struct *vma, - unsigned long address, hugepd_t hpd, - int flags, int pdshift) -{ - pte_t *ptep; - spinlock_t *ptl; - struct page *page = NULL; - unsigned long mask; - int shift = hugepd_shift(hpd); - struct mm_struct *mm = vma->vm_mm; - -retry: - /* -* hugepage directory entries are protected by mm->page_table_lock -* Use this instead of huge_pte_lockptr -*/ - ptl = &mm->page_table_lock; - spin_lock(ptl); - - ptep = hugepte_offset(hpd, address, pdshift); - if (pte_present(*ptep)) { - mask = (1UL << shift) - 1; - page = pte_page(*ptep); - page += ((address & mask) >> PAGE_SHIFT); - if (flags & FOLL_GET) - get_page(page); - } else { - if 
(is_hugetlb_entry_migration(*ptep)) { - spin_unlock(ptl); - __migration_entry_wait(mm, ptep, ptl); - goto retry; - } - } - spin_unlock(ptl); - return page; -} - bool __init arch_hugetlb_valid_size(unsigned long size) { int shift = __ffs(size); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index cfe15b32e2d4..32d45e96a894 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -149,6 +149,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma, unsigned long len); int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *); +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma, + unsigned l
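A deliberately simplified sketch of what the new hugetlb_follow_page_mask() walk described in the changelog amounts to; it leans on huge_pte_offset() as stated, but omits the migration-entry wait, FOLL_PIN handling and the pmd-sharing lock that the real patch has to deal with:

struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
				      unsigned long address, unsigned int flags)
{
	struct hstate *h = hstate_vma(vma);
	struct mm_struct *mm = vma->vm_mm;
	struct page *page = NULL;
	spinlock_t *ptl;
	pte_t *pte;

	/* huge_pte_offset() already understands hugepd and CONT-* layouts. */
	pte = huge_pte_offset(mm, address & huge_page_mask(h), huge_page_size(h));
	if (!pte)
		return NULL;

	ptl = huge_pte_lock(h, mm, pte);
	if (pte_present(huge_ptep_get(pte))) {
		page = pte_page(huge_ptep_get(pte)) +
		       ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
		if (flags & FOLL_GET)
			get_page(page);
	}
	spin_unlock(ptl);

	return page;
}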
[PATCH 3/4] powerpc/64s: Allow double call of kernel_[un]map_linear_page()
From: Christophe Leroy If the page is already mapped resp. already unmapped, bail out. Signed-off-by: Christophe Leroy Signed-off-by: Nicholas Miehlbradt --- arch/powerpc/mm/book3s64/hash_utils.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index e63ff401a6ea..b37412fe5930 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -2000,6 +2000,9 @@ static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi) if (!vsid) return; + if (linear_map_hash_slots[lmi] & 0x80) + return; + ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode, HPTE_V_BOLTED, mmu_linear_psize, mmu_kernel_ssize); @@ -2019,7 +2022,10 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi) hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize); spin_lock(&linear_map_hash_lock); - BUG_ON(!(linear_map_hash_slots[lmi] & 0x80)); + if (!(linear_map_hash_slots[lmi] & 0x80)) { + spin_unlock(&linear_map_hash_lock); + return; + } hidx = linear_map_hash_slots[lmi] & 0x7f; linear_map_hash_slots[lmi] = 0; spin_unlock(&linear_map_hash_lock); -- 2.34.1
[PATCH 2/4] powerpc/64s: Remove unneeded #ifdef CONFIG_DEBUG_PAGEALLOC in hash_utils
From: Christophe Leroy debug_pagealloc_enabled() is always defined and constant folds to 'false' when CONFIG_DEBUG_PAGEALLOC is not enabled. Remove the #ifdefs, the code and associated static variables will be optimised out by the compiler when CONFIG_DEBUG_PAGEALLOC is not defined. Signed-off-by: Christophe Leroy Signed-off-by: Nicholas Miehlbradt --- arch/powerpc/mm/book3s64/hash_utils.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index fc92613dc2bf..e63ff401a6ea 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -123,11 +123,8 @@ EXPORT_SYMBOL_GPL(mmu_slb_size); #ifdef CONFIG_PPC_64K_PAGES int mmu_ci_restrictions; #endif -#ifdef CONFIG_DEBUG_PAGEALLOC static u8 *linear_map_hash_slots; static unsigned long linear_map_hash_count; -static DEFINE_SPINLOCK(linear_map_hash_lock); -#endif /* CONFIG_DEBUG_PAGEALLOC */ struct mmu_hash_ops mmu_hash_ops; EXPORT_SYMBOL(mmu_hash_ops); @@ -427,11 +424,9 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long vend, break; cond_resched(); -#ifdef CONFIG_DEBUG_PAGEALLOC if (debug_pagealloc_enabled() && (paddr >> PAGE_SHIFT) < linear_map_hash_count) linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80; -#endif /* CONFIG_DEBUG_PAGEALLOC */ } return ret < 0 ? ret : 0; } @@ -1066,7 +1061,6 @@ static void __init htab_initialize(void) prot = pgprot_val(PAGE_KERNEL); -#ifdef CONFIG_DEBUG_PAGEALLOC if (debug_pagealloc_enabled()) { linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT; linear_map_hash_slots = memblock_alloc_try_nid( @@ -1076,7 +1070,6 @@ static void __init htab_initialize(void) panic("%s: Failed to allocate %lu bytes max_addr=%pa\n", __func__, linear_map_hash_count, &ppc64_rma_size); } -#endif /* CONFIG_DEBUG_PAGEALLOC */ /* create bolted the linear mapping in the hash table */ for_each_mem_range(i, &base, &end) { @@ -1991,6 +1984,8 @@ long hpte_insert_repeating(unsigned long hash, unsigned long vpn, } #ifdef CONFIG_DEBUG_PAGEALLOC +static DEFINE_SPINLOCK(linear_map_hash_lock); + static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi) { unsigned long hash; -- 2.34.1
[PATCH 4/4] powerpc/64s: Enable KFENCE on book3s64
KFENCE support was added for ppc32 in commit 90cbac0e995d ("powerpc: Enable KFENCE for PPC32"). Enable KFENCE on ppc64 architecture with hash and radix MMUs. It uses the same mechanism as debug pagealloc to protect/unprotect pages. All KFENCE kunit tests pass on both MMUs. KFENCE memory is initially allocated using memblock but is later marked as SLAB allocated. This necessitates the change to __pud_free to ensure that the KFENCE pages are freed appropriately. Based on previous work by Christophe Leroy and Jordan Niethe. Signed-off-by: Nicholas Miehlbradt --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgalloc.h | 6 -- arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +- arch/powerpc/include/asm/kfence.h| 18 ++ arch/powerpc/mm/book3s64/hash_utils.c| 10 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 8 +--- 6 files changed, 34 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index a4f8a5276e5c..f7dd0f49510d 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -194,7 +194,7 @@ config PPC select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 select HAVE_ARCH_KASAN if PPC_RADIX_MMU select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN - select HAVE_ARCH_KFENCE if PPC_BOOK3S_32 || PPC_8xx || 40x + select HAVE_ARCH_KFENCE if ARCH_SUPPORTS_DEBUG_PAGEALLOC select HAVE_ARCH_KGDB select HAVE_ARCH_MMAP_RND_BITS select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index e1af0b394ceb..dd2cff53a111 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -113,9 +113,11 @@ static inline void __pud_free(pud_t *pud) /* * Early pud pages allocated via memblock allocator -* can't be directly freed to slab +* can't be directly freed to slab. KFENCE pages have +* both reserved and slab flags set so need to be freed +* kmem_cache_free. */ - if (PageReserved(page)) + if (PageReserved(page) && !PageSlab(page)) free_reserved_page(page); else kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud); diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index cb9d5fd39d7f..fd5d800f2836 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1123,7 +1123,7 @@ static inline void vmemmap_remove_mapping(unsigned long start, } #endif -#ifdef CONFIG_DEBUG_PAGEALLOC +#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE) static inline void __kernel_map_pages(struct page *page, int numpages, int enable) { if (radix_enabled()) diff --git a/arch/powerpc/include/asm/kfence.h b/arch/powerpc/include/asm/kfence.h index a9846b68c6b9..33edbc312a51 100644 --- a/arch/powerpc/include/asm/kfence.h +++ b/arch/powerpc/include/asm/kfence.h @@ -11,11 +11,28 @@ #include #include +#if defined(CONFIG_PPC64) && !defined(CONFIG_PPC64_ELF_ABI_V2) +#define ARCH_FUNC_PREFIX "." 
+#endif + static inline bool arch_kfence_init_pool(void) { return true; } +#ifdef CONFIG_PPC64 +static inline bool kfence_protect_page(unsigned long addr, bool protect) +{ + struct page *page = virt_to_page(addr); + + if (protect) + __kernel_map_pages(page, 1, 0); + else + __kernel_map_pages(page, 1, 1); + + return true; +} +#else static inline bool kfence_protect_page(unsigned long addr, bool protect) { pte_t *kpte = virt_to_kpte(addr); @@ -29,5 +46,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect) return true; } +#endif #endif /* __ASM_POWERPC_KFENCE_H */ diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c index b37412fe5930..9cceaa5998a3 100644 --- a/arch/powerpc/mm/book3s64/hash_utils.c +++ b/arch/powerpc/mm/book3s64/hash_utils.c @@ -424,7 +424,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long vend, break; cond_resched(); - if (debug_pagealloc_enabled() && + if (debug_pagealloc_enabled_or_kfence() && (paddr >> PAGE_SHIFT) < linear_map_hash_count) linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80; } @@ -773,7 +773,7 @@ static void __init htab_init_page_sizes(void) bool aligned = true; init_hpte_page_sizes(); - if (!debug_pagealloc_enabled()) { + if (!debug_pagealloc_enabled_or_kfence()) { /* * Pick a size for the linear mapping. Currently, we o
[PATCH 1/4] powerpc/64s: Add DEBUG_PAGEALLOC for radix
There is support for DEBUG_PAGEALLOC on hash but not on radix. Add support on radix. Signed-off-by: Nicholas Miehlbradt --- arch/powerpc/mm/book3s64/radix_pgtable.c | 16 +++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index db2f3d193448..483c99bfbde5 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -30,6 +30,7 @@ #include #include #include +#include #include @@ -503,6 +504,9 @@ static unsigned long __init radix_memory_block_size(void) { unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE; + if (debug_pagealloc_enabled()) + return PAGE_SIZE; + /* * OPAL firmware feature is set by now. Hence we are ok * to test OPAL feature. @@ -519,6 +523,9 @@ static unsigned long __init radix_memory_block_size(void) static unsigned long __init radix_memory_block_size(void) { + if (debug_pagealloc_enabled()) + return PAGE_SIZE; + return 1UL * 1024 * 1024 * 1024; } @@ -899,7 +906,14 @@ void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long #ifdef CONFIG_DEBUG_PAGEALLOC void radix__kernel_map_pages(struct page *page, int numpages, int enable) { - pr_warn_once("DEBUG_PAGEALLOC not supported in radix mode\n"); + unsigned long addr; + + addr = (unsigned long)page_address(page); + + if (enable) + set_memory_p(addr, numpages); + else + set_memory_np(addr, numpages); } #endif -- 2.34.1
Re: [PATCH] cxl: fix repeated words in comments
On Sun, 2022-09-18 at 18:03 +0800, Jilin Yuan wrote: > Delete the redundant word 'dont'. > > Signed-off-by: Jilin Yuan --- > drivers/misc/cxl/native.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c > index 50b0c44bb8d7..6957946a6463 100644 > --- a/drivers/misc/cxl/native.c > +++ b/drivers/misc/cxl/native.c > @@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct > cxl_context *ctx, u64 wed, u64 amr) > * Ideally we should do a wmb() here to make sure the changes > to the > * PE are visible to the card before we call afu_enable. > * On ppc64 though all mmios are preceded by a 'sync' > instruction hence > - * we dont dont need one here. > + * we dont need one here. > */ > > result = cxl_ops->afu_reset(afu); -- Andrew DonnellanOzLabs, ADL Canberra a...@linux.ibm.com IBM Australia Limited
Re: [PATCH] EDAC/ppc_4xx: Reorder symbols to get rid of a few forward declarations
On Sun, Sep 18, 2022 at 01:20:13AM +0200, Uwe Kleine-König wrote: > When moving the definition of ppc4xx_edac_driver further down, the > forward declarations can just be dropped. > > Do this to reduce line needless repetition. > > Signed-off-by: Uwe Kleine-König > --- > drivers/edac/ppc4xx_edac.c | 23 +-- > 1 file changed, 9 insertions(+), 14 deletions(-) Applied, thanks. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH linux-next] macintosh/windfarm: fix comparing pointer to 0
On Sun, Sep 18, 2022, at 5:50 PM, cgel@gmail.com wrote:
> @@ -970,7 +970,7 @@ static int pm121_init_pm(void)
>  	const struct smu_sdbp_header *hdr;
>
>  	hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL);
> -	if (hdr != 0) {
> +	if (hdr != NULL) {
>  		struct smu_sdbp_sensortree *st =
>  			(struct smu_sdbp_sensortree *)&hdr[1];

The more common way of writing this in the kernel is "if (hdr)".

     Arnd
[PATCH linux-next] macintosh/windfarm: fix comparing pointer to 0
From: Xu Panda Comparing pointer whith NULL instead of comparing pointer to 0. Reported-by: Zeal Robot Signed-off-by: Xu Panda --- drivers/macintosh/windfarm_pm121.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/macintosh/windfarm_pm121.c b/drivers/macintosh/windfarm_pm121.c index 36312f163aac..82dcd35f439a 100644 --- a/drivers/macintosh/windfarm_pm121.c +++ b/drivers/macintosh/windfarm_pm121.c @@ -651,7 +651,7 @@ static void pm121_create_cpu_fans(void) /* First, locate the PID params in SMU SBD */ hdr = smu_get_sdb_partition(SMU_SDB_CPUPIDDATA_ID, NULL); - if (hdr == 0) { + if (hdr == NULL) { printk(KERN_WARNING "pm121: CPU PID fan config not found.\n"); goto fail; } @@ -970,7 +970,7 @@ static int pm121_init_pm(void) const struct smu_sdbp_header *hdr; hdr = smu_get_sdb_partition(SMU_SDB_SENSORTREE_ID, NULL); - if (hdr != 0) { + if (hdr != NULL) { struct smu_sdbp_sensortree *st = (struct smu_sdbp_sensortree *)&hdr[1]; pm121_mach_model = st->model_id; -- 2.15.2
[PATCH] cxl: fix repeated words in comments
Delete the redundant word 'dont'. Signed-off-by: Jilin Yuan --- drivers/misc/cxl/native.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c index 50b0c44bb8d7..6957946a6463 100644 --- a/drivers/misc/cxl/native.c +++ b/drivers/misc/cxl/native.c @@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct cxl_context *ctx, u64 wed, u64 amr) * Ideally we should do a wmb() here to make sure the changes to the * PE are visible to the card before we call afu_enable. * On ppc64 though all mmios are preceded by a 'sync' instruction hence -* we dont dont need one here. +* we dont need one here. */ result = cxl_ops->afu_reset(afu); -- 2.36.1
Re: [PATCH] powerpc: Save AMR/IAMR when switching tasks
On 17/09/2022 at 20:38, Samuel Holland wrote:
> On 9/17/22 03:16, Christophe Leroy wrote:
>> On 16/09/2022 at 07:05, Samuel Holland wrote:
>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is possible
>>> to switch away from a task inside copy_{from,to}_user. This left the CPU
>>> with userspace access enabled until after the next IRQ or privilege
>>> level switch, when AMR/IAMR got reset to AMR_KU[AE]P_BLOCKED. Then, when
>>> switching back to the original task, the userspace access would fault:
>>
>> This is not supposed to happen. You never switch away from a task
>> magically. Task switch will always happen in an interrupt, that means
>> copy_{from,to}_user() get interrupted.
>
> That makes sense, the interrupt handler is responsible for saving the
> KUAP status. It looks like neither DEFINE_INTERRUPT_HANDLER_RAW nor any
> of its users (performance_monitor_exception(), do_slb_fault()) do that.

As far as I can see, that's done earlier, in exceptions-64s.S. Look for
kuap_save_amr_and_lock.

Now, it may be possible that one of the exception paths misses it.

> Yet they still call one of the interrupt_return variants, which restores
> the status from the stack.
>
>> Whenever an interrupt is taken, kuap_save_amr_and_lock() macro is used
>> to save KUAP status into the stack then lock KUAP access. At interrupt
>> exit, kuap_kernel_restore() macro or function is used to restore KUAP
>> access from the stack. At the time the task switch happens, KUAP access
>> is expected to be locked. During task switch, the stack is switched so
>> the KUAP status is taken back from the new task's stack.
>
> What if another task calls schedule() from kernel process context, and
> the scheduler switches to a task that had been preempted inside
> copy_{from,to}_user()? Then there is no interrupt involved, and I don't
> see where kuap_kernel_restore() would get called.

Yes, there is an interrupt involved. If that task has been preempted inside
copy_from_user(), it must have been through an interrupt, likely a timer
interrupt. So switching back to that task means doing an interrupt return
in the context of that task. That's when the KUAP status should be restored.

>>
>> Your fix suggests that there is some path where the KUAP status is not
>> properly saved and/or restored. Did you try running with
>> CONFIG_PPC_KUAP_DEBUG ? It should warn whenever a KUAP access is left
>> unlocked.
>>
>>>
>>> Kernel attempted to write user page (3fff7ab68190) - exploit attempt? (uid: 65536)
>>> [ cut here ]
>>> Bug: Write fault blocked by KUAP!
>>> WARNING: CPU: 56 PID: 4939 at arch/powerpc/mm/fault.c:228 ___do_page_fault+0x7b4/0xaa0
>>> CPU: 56 PID: 4939 Comm: git Tainted: GW 5.19.8-5-gba424747260d #1
>>> NIP: c00555e4 LR: c00555e0 CTR: c079d9d0
>>> REGS: c0008f507370 TRAP: 0700 Tainted: GW (5.19.8-5-gba424747260d)
>>> MSR: 90021033 CR: 2804 XER: 2004
>>> CFAR: c0123780 IRQMASK: 3
>>> NIP [c00555e4] ___do_page_fault+0x7b4/0xaa0
>>> LR [c00555e0] ___do_page_fault+0x7b0/0xaa0
>>> Call Trace:
>>> [c0008f507610] [c00555e0] ___do_page_fault+0x7b0/0xaa0 (unreliable)
>>> [c0008f5076c0] [c0055938] do_page_fault+0x68/0x130
>>> [c0008f5076f0] [c0008914] data_access_common_virt+0x194/0x1f0
>>> --- interrupt: 300 at __copy_tofrom_user_base+0x9c/0x5a4
>>
>> ...
>>
>>>
>>> Fix this by saving and restoring the kernel-side AMR/IAMR values when
>>> switching tasks.
>>
>> As explained above, KUAP access should be locked at that time, so saving
>> and restoring it should not have any effect. If it does, it means
>> something goes wrong somewhere else.
>>
>>>
>>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
>>> Signed-off-by: Samuel Holland
>>> ---
>>> I have no idea if this is the right change to make, and it could be
>>> optimized, but my system has been stable with this patch for 5 days now.
>>>
>>> Without the patch, I hit the bug every few minutes when my load average
>>> is <1, and I hit it immediately if I try to do a parallel kernel build.
>>
>> Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ?
>
> Yes, I will try this out in the next few days.

Thanks
Christophe
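Purely to illustrate the lifecycle described above (the real code lives in exceptions-64s.S and the kup headers; this is not a literal transcription), the expected flow is roughly:

/* Interrupt entry, in asm (exceptions-64s.S):
 *   kuap_save_amr_and_lock   - save the current AMR into the pt_regs on the
 *                              stack, then set AMR to the locked value.
 *
 * The scheduler may switch tasks while handling the interrupt; each task's
 * saved AMR stays in its own interrupt stack frame.
 *
 * Interrupt return, now possibly on another task's stack:
 *   kuap_kernel_restore()    - restore AMR from that task's pt_regs.
 *
 * So at task-switch time KUAP is expected to be locked, which is why saving
 * and restoring AMR in __switch_to() itself should be a no-op unless some
 * entry path skipped kuap_save_amr_and_lock.
 */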
If it does, it means >> something goes wrong somewhere else. >> >>> >>> Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU") >>> Signed-off-by: Samuel Holland >>> --- >>> I have no idea if this is the right change to make, and it could be >>> optimized, but my system has been stable with this patch for 5 days now. >>> >>> Without the patch, I hit the bug every few minutes when my load average >>> is <1, and I hit it immediately if I try to do a parallel kernel build. >> >> Great, then can you make a try with CONFIG_PPC_KUAP_DEBUG ? > > Yes, I will try this out in the next few days. > Thanks Christophe