Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/24/20 3:10 PM, Waiman Long wrote:
> On 7/24/20 4:16 AM, Will Deacon wrote:
>> On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:
>>> On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
>>>> BTW, do you have any comment on my v2 lock holder cpu info qspinlock
>>>> patch? I will have to update the patch to fix the reported 0-day test
>>>> problem, but I want to collect other feedback before sending out v3.
>>>
>>> I want to say I hate it all, it adds instructions to a path we spend an
>>> awful lot of time optimizing without really getting anything back for it.
>>>
>>> Will, how do you feel about it?
>>
>> I can see it potentially being useful for debugging, but I hate the
>> limitation to 256 CPUs. Even arm64 is hitting that now.
>
> After thinking more about that, I think we can use all the remaining
> bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit
> for pending, there are 14 bits left. So as long as NR_CPUS < 16k (the
> requirement for a 16-bit locked_pending), we can put all possible cpu
> numbers into the lock. We can also just use smp_processor_id() without
> additional percpu data.

Sorry, that doesn't work. The extra bits in the pending byte won't get
cleared on unlock. That will have a noticeable performance impact.
Clearing the pending byte on unlock will cause other performance
problems. So I guess we will have to limit the cpu number to the locked
byte.

Regards,
Longman
Re: [PATCH v4 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel
Hari Bathini writes:

> On 24/07/20 5:36 am, Thiago Jung Bauermann wrote:
>>
>> Hari Bathini writes:
>>
>>> Kdump kernel, used for capturing the kernel core image, is supposed
>>> to use only specific memory regions to avoid corrupting the image to
>>> be captured. The regions are the crashkernel range - the memory
>>> reserved explicitly for the kdump kernel, memory used for the
>>> tce-table, the OPAL region and the RTAS region as applicable.
>>> Restrict kdump kernel memory to use only these regions by setting up
>>> the usable-memory DT property. Also, tell the kdump kernel to run at
>>> the loaded address by setting the magic word at 0x5c.
>>>
>>> Signed-off-by: Hari Bathini
>>> Tested-by: Pingfan Liu
>>> ---
>>>
>>> v3 -> v4:
>>> * Updated get_node_path() to be an iterative function instead of a
>>>   recursive one.
>>> * Added comment explaining why low memory is added to kdump kernel's
>>>   usable memory ranges though it doesn't fall in crashkernel region.
>>> * For correctness, added fdt_add_mem_rsv() for the low memory being
>>>   added to kdump kernel's usable memory ranges.
>>
>> Good idea.
>>
>>> * Fixed prop pointer update in add_usable_mem_property() and changed
>>>   duple to tuple as suggested by Thiago.
>>
>>> +/**
>>> + * get_node_pathlen - Get the full path length of the given node.
>>> + * @dn: Node.
>>> + *
>>> + * Also, counts '/' at the end of the path.
>>> + * For example, /memory@0 will be "/memory@0/\0" => 11 bytes.
>>
>> Wouldn't this function return 10 in the case of /memory@0?
>
> Actually, it does return 11. The +1 while returning is for counting the
> NUL. On top of that, we count an extra '/' for the root node, so it ends
> up as 11: '/' 'memory@0' '/' '\0'. Note the extra '/' before '\0'. Let
> me handle the root node separately. That should avoid the confusion.

Ah, that is true. I forgot to count the iteration for the root node.
Sorry about that.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center
Re: [PATCH v4 0/6] powerpc: queued spinlocks and rwlocks
On 7/24/20 9:14 AM, Nicholas Piggin wrote:
> Updated with everybody's feedback (thanks all), and more performance
> results. What I've found is I might have been measuring the worst load
> point for the paravirt case, and by looking at a range of loads it's
> clear that queued spinlocks are overall better even on PV, doubly so
> when you look at the generally much improved worst case latencies.
>
> I have defaulted it to N even though I'm less concerned about the PV
> numbers now, just because I think it needs more stress testing. But
> it's very nicely selectable so should be low risk to include.
>
> All in all this is a very cool technology and great results especially
> on the big systems but even on smaller ones there are nice gains.
> Thanks Waiman and everyone who developed it.
>
> Thanks,
> Nick
>
> Nicholas Piggin (6):
>   powerpc/pseries: move some PAPR paravirt functions to their own file
>   powerpc: move spinlock implementation to simple_spinlock
>   powerpc/64s: implement queued spinlocks and rwlocks
>   powerpc/pseries: implement paravirt qspinlocks for SPLPAR
>   powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
>     lock hint
>   powerpc: implement smp_cond_load_relaxed
>
>  arch/powerpc/Kconfig                          |  15 +
>  arch/powerpc/include/asm/Kbuild               |   1 +
>  arch/powerpc/include/asm/atomic.h             |  28 ++
>  arch/powerpc/include/asm/barrier.h            |  14 +
>  arch/powerpc/include/asm/paravirt.h           |  87 +
>  arch/powerpc/include/asm/qspinlock.h          |  91 ++
>  arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
>  arch/powerpc/include/asm/simple_spinlock.h    | 288
>  .../include/asm/simple_spinlock_types.h       |  21 ++
>  arch/powerpc/include/asm/spinlock.h           | 308 +-
>  arch/powerpc/include/asm/spinlock_types.h     |  17 +-
>  arch/powerpc/lib/Makefile                     |   3 +
>  arch/powerpc/lib/locks.c                      |  12 +-
>  arch/powerpc/platforms/pseries/Kconfig        |   9 +-
>  arch/powerpc/platforms/pseries/setup.c        |   4 +-
>  include/asm-generic/qspinlock.h               |   4 +
>  16 files changed, 588 insertions(+), 321 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/paravirt.h
>  create mode 100644 arch/powerpc/include/asm/qspinlock.h
>  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
>  create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
>  create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h

That patch series looks good to me. Thanks for working on this.

For the series,

Acked-by: Waiman Long
Re: [PATCH v4 6/6] powerpc: implement smp_cond_load_relaxed
On 7/24/20 9:14 AM, Nicholas Piggin wrote:
> This implements smp_cond_load_relaed with the slowpath busy loop using the

Nit: "smp_cond_load_relaxed"

Cheers,
Longman
[PATCH v5 02/11] powerpc/kexec_file: mark PPC64 specific code
Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
rename purgatory/trampoline.S to purgatory/trampoline_64.S in the same
spirit. No functional changes.

Signed-off-by: Hari Bathini
Tested-by: Pingfan Liu
Reviewed-by: Laurent Dufour
Reviewed-by: Thiago Jung Bauermann
---

v4 -> v5:
* Unchanged.

v3 -> v4:
* Moved common code back to setup_new_fdt() from setup_new_fdt_ppc64()
  function. Added Reviewed-by tags from Laurent & Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* No changes.

 arch/powerpc/include/asm/kexec.h       |   9 ++
 arch/powerpc/kexec/Makefile            |   2 -
 arch/powerpc/kexec/elf_64.c            |   7 +-
 arch/powerpc/kexec/file_load.c         |  19 +
 arch/powerpc/kexec/file_load_64.c      |  87
 arch/powerpc/purgatory/Makefile        |   4 +
 arch/powerpc/purgatory/trampoline.S    | 117
 arch/powerpc/purgatory/trampoline_64.S | 117
 8 files changed, 222 insertions(+), 140 deletions(-)
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 delete mode 100644 arch/powerpc/purgatory/trampoline.S
 create mode 100644 arch/powerpc/purgatory/trampoline_64.S

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index c684768..ac8fd48 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -116,6 +116,15 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 			unsigned long initrd_load_addr, unsigned long initrd_len,
 			const char *cmdline);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
+
+#ifdef CONFIG_PPC64
+int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
+			  const void *fdt, unsigned long kernel_load_addr,
+			  unsigned long fdt_load_addr);
+int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
+			unsigned long initrd_load_addr,
+			unsigned long initrd_len, const char *cmdline);
+#endif /* CONFIG_PPC64 */
 #endif /* CONFIG_KEXEC_FILE */
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 86380c6..67c3553 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y	+= core.o crash.o core_$(BITS).o
 obj-$(CONFIG_PPC32)	+= relocate_32.o
-obj-$(CONFIG_KEXEC_FILE)	+= file_load.o elf_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)	+= file_load.o file_load_$(BITS).o elf_$(BITS).o
 ifdef CONFIG_HAVE_IMA_KEXEC
 ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 3072fd6..23ad04c 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -88,7 +88,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 		goto out;
 	}
-	ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
+	ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
+				  initrd_len, cmdline);
 	if (ret)
 		goto out;
@@ -107,8 +108,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
 	slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;
-	ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
-			      fdt_load_addr);
+	ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
+				    fdt_load_addr);
 	if (ret)
 		pr_err("Error setting up the purgatory.\n");
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 143c917..38439ab 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * ppc64 code to implement the kexec_file_load syscall
+ * powerpc code to implement the kexec_file_load syscall
  *
  * Copyright (C) 2004 Adam Litke (a...@us.ibm.com)
  * Copyright (C) 2004 IBM Corp.
@@ -20,22 +20,7 @@
 #include
 #include
-#define SLAVE_CODE_SIZE	256
-
-const struct kexec_file_ops * const kexec_file_loaders[] = {
-	&kexec_elf64_ops,
-	NULL
-};
-
-int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
-				  unsigned long buf_len)
-{
-	/* We don't support crash kernels yet. */
-	if (image->type == KEXEC_TYPE_CRASH)
-		return -EOPNOTSUPP;
-
-	return kexec_image_probe_default(image, buf, buf_len);
-}
+#define SLAVE_CODE_SIZE	256	/* First 0x100 bytes */

 /**
  * setup_purgatory - initialize the purgatory's global variables
diff --git a/arch/
[PATCH v5 01/11] kexec_file: allow archs to handle special regions while locating memory hole
Some architectures may have special memory regions, within the given
memory range, which can't be used for the buffer in a kexec segment.
Implement a weak arch_kexec_locate_mem_hole() definition which arch code
may override, to take care of special regions, while trying to locate a
memory hole. Also, add the missing declarations for arch overridable
functions and drop the __weak descriptors in the declarations to avoid
non-weak definitions from becoming weak.

Reported-by: kernel test robot
[lkp: In v1, arch_kimage_file_post_load_cleanup() declaration was missing]
Signed-off-by: Hari Bathini
Tested-by: Pingfan Liu
Acked-by: Dave Young
Reviewed-by: Thiago Jung Bauermann
---

v4 -> v5:
* Unchanged.

v3 -> v4:
* Unchanged. Added Reviewed-by tag from Thiago.

v2 -> v3:
* Unchanged. Added Acked-by & Tested-by tags from Dave & Pingfan.

v1 -> v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped weak
  arch_kexec_add_buffer().
* Dropped __weak identifier for arch overridable functions.
* Fixed the missing declaration for arch_kimage_file_post_load_cleanup()
  reported by lkp. lkp report for reference:
  https://lore.kernel.org/patchwork/patch/1264418/

 include/linux/kexec.h | 29 ++---
 kernel/kexec_file.c   | 16 ++--
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ea67910..9e93bef 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -183,17 +183,24 @@ int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
 				   bool get_value);
 void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name);
-int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
-					 unsigned long buf_len);
-void * __weak arch_kexec_kernel_image_load(struct kimage *image);
-int __weak arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-					    Elf_Shdr *section,
-					    const Elf_Shdr *relsec,
-					    const Elf_Shdr *symtab);
-int __weak arch_kexec_apply_relocations(struct purgatory_info *pi,
-					Elf_Shdr *section,
-					const Elf_Shdr *relsec,
-					const Elf_Shdr *symtab);
+/* Architectures may override the below functions */
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+				  unsigned long buf_len);
+void *arch_kexec_kernel_image_load(struct kimage *image);
+int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
+				     Elf_Shdr *section,
+				     const Elf_Shdr *relsec,
+				     const Elf_Shdr *symtab);
+int arch_kexec_apply_relocations(struct purgatory_info *pi,
+				 Elf_Shdr *section,
+				 const Elf_Shdr *relsec,
+				 const Elf_Shdr *symtab);
+int arch_kimage_file_post_load_cleanup(struct kimage *image);
+#ifdef CONFIG_KEXEC_SIG
+int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
+				 unsigned long buf_len);
+#endif
+int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);

 extern int kexec_add_buffer(struct kexec_buf *kbuf);
 int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 09cc78d..e89912d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -636,6 +636,19 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }

 /**
+ * arch_kexec_locate_mem_hole - Find free memory to place the segments.
+ * @kbuf:	Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+	return kexec_locate_mem_hole(kbuf);
+}
+
+/**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:	Buffer contents and memory parameters.
  *
@@ -647,7 +660,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
  */
 int kexec_add_buffer(struct kexec_buf *kbuf)
 {
-
 	struct kexec_segment *ksegment;
 	int ret;
@@ -675,7 +687,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
 	kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);

 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	ret = kexec_locate_mem_hole(kbuf);
+	ret = arch_kexec_locate_mem_hole(kbuf);
 	if (ret)
 		return ret;
[PATCH v5 00/11] ppc64: enable kdump support for kexec_file_load syscall
This patch series enables kdump support for the kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired from kexec-tools
code but heavily modified for kernel consumption.

The first patch adds a weak arch_kexec_locate_mem_hole() function to
override the locate-memory-hole logic to suit arch needs. There are
some special regions in ppc64 which should be avoided while loading the
buffer, and there are multiple callers to kexec_add_buffer, making it
complicated to maintain range sanity and use the generic lookup at the
same time.

The second patch marks ppc64 specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easy to
understand.

The next patch adds helper functions to set up the different memory
ranges needed for loading the kdump kernel, booting into it and
exporting the crashing kernel's elfcore.

The fourth patch overrides the arch_kexec_locate_mem_hole() function to
locate a memory hole for kdump segments by accounting for the special
memory regions, referred to as excluded memory ranges, and sets
kbuf->mem when a suitable memory region is found.

The fifth patch moves walk_drmem_lmbs() out of the .init section with a
few changes to reuse it for setting up the kdump kernel's usable memory
ranges.

The next patch uses walk_drmem_lmbs() to look up the LMBs and set the
linux,drconf-usable-memory & linux,usable-memory properties in order to
restrict the kdump kernel's memory usage.

The seventh patch updates purgatory to set up r8 & r9 with the opal
base and opal entry addresses respectively, to aid kernels built with
CONFIG_PPC_EARLY_DEBUG_OPAL enabled.

The next patch sets up the backup region as a kexec segment while
loading the kdump kernel and teaches purgatory to copy data from source
to destination.

Patch 09 builds the elfcore header for the running kernel & passes the
info to the kdump kernel via the "elfcorehdr=" parameter to export it
as the /proc/vmcore file.

The next patch sets up the memory reserve map for the kexec kernel and
also claims kdump support, as all the necessary changes are added.

The last patch fixes a lookup issue for the `kexec -l -s` case when
memory is reserved for crashkernel.

Tested the changes successfully on P8, P9 lpars, a couple of OpenPOWER
boxes, one with secureboot enabled, a KVM guest and a simulator.

v4 -> v5:
* Dropped patches 07/12 & 08/12 and updated purgatory to do everything
  in assembly.
* Added a new patch (which was part of patch 08/12 in v4) to update
  r8 & r9 registers with opal base & opal entry addresses as it is
  expected on kernels built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled.
* Fixed kexec load issue on KVM guest.

v3 -> v4:
* Updated get_node_path() function to be iterative instead of a
  recursive one.
* Added comment explaining why low memory is added to kdump kernel's
  usable memory ranges though it doesn't fall in crashkernel region.
* Fixed stack_buf to be quadword aligned in accordance with ABI.
* Added missing of_node_put() in setup_purgatory_ppc64().
* Added a FIXME tag to indicate issue in adding opal/rtas regions to
  core image.

v2 -> v3:
* Fixed TOC pointer calculation for purgatory by using section info
  that has relocations applied.
* Fixed arch_kexec_locate_mem_hole() function to fallback to generic
  kexec_locate_mem_hole() lookup if exclude ranges list is empty.
* Dropped check for backup_start in trampoline_64.S as purgatory()
  function takes care of it anyway.

v1 -> v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped
  weak arch_kexec_add_buffer().
* Addressed warnings reported by lkp.
* Added patch to address kexec load issue when memory is reserved for
  crashkernel.
* Used the appropriate license header for the new files added.
* Added an option to merge ranges to minimize reallocations while
  adding memory ranges.
* Dropped within_crashkernel parameter for add_opal_mem_range() &
  add_rtas_mem_range() functions as it is not really needed.

---

Hari Bathini (11):
  kexec_file: allow archs to handle special regions while locating
    memory hole
  powerpc/kexec_file: mark PPC64 specific code
  powerpc/kexec_file: add helper functions for getting memory ranges
  ppc64/kexec_file: avoid stomping memory used by special regions
  powerpc/drmem: make lmb walk a bit more flexible
  ppc64/kexec_file: restrict memory usage of kdump kernel
  ppc64/kexec_file: enable early kernel's OPAL calls
  ppc64/kexec_file: setup backup region for kdump kernel
  ppc64/kexec_file: prepare elfcore header for crashing kernel
  ppc64/kexec_file: add appropriate regions for memory reserve map
  ppc64/kexec_file: fix kexec load failure with lack of memory hole

 arch/powerpc/include/asm/crashdump-ppc64.h | 19
 arch/powerpc/include/asm/drmem.h           |  9
 arch/powerpc/include/asm/kexec.h           | 29 +
 arch/powerpc/include/asm/kexec_ranges.h    | 25 +
 arch/powerpc/kernel/prom.c                 | 13
 arch/powerpc/kexec/Makefile                |  2
 arch/powerpc/kexec/e
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/24/20 4:16 AM, Will Deacon wrote:
> On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:
>> On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
>>> BTW, do you have any comment on my v2 lock holder cpu info qspinlock
>>> patch? I will have to update the patch to fix the reported 0-day test
>>> problem, but I want to collect other feedback before sending out v3.
>>
>> I want to say I hate it all, it adds instructions to a path we spend an
>> awful lot of time optimizing without really getting anything back for it.
>>
>> Will, how do you feel about it?
>
> I can see it potentially being useful for debugging, but I hate the
> limitation to 256 CPUs. Even arm64 is hitting that now.

After thinking more about that, I think we can use all the remaining
bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit
for pending, there are 14 bits left. So as long as NR_CPUS < 16k (the
requirement for a 16-bit locked_pending), we can put all possible cpu
numbers into the lock. We can also just use smp_processor_id() without
additional percpu data.

> Also, you're talking ~1% gains here. I think our collective time would
> be better spent off reviewing the CNA series and trying to make it
> more deterministic.

I thought you guys are not interested in CNA. I do want to get CNA
merged, if possible. Let me review the current version again and see if
there are ways we can further improve it.

Cheers,
Longman
Re: [PATCH 5/9] powerpc/32s: Fix CONFIG_BOOK3S_601 uses
Michael Ellerman wrote:
> We have two uses of CONFIG_BOOK3S_601, which doesn't exist. Fix them to
> use CONFIG_PPC_BOOK3S_601 which is the correct symbol.
>
> Fixes: 12c3f1fd87bf ("powerpc/32s: get rid of CPU_FTR_601 feature")
> Signed-off-by: Michael Ellerman
> ---
> I think the bug in get_cycles() at least demonstrates that no one has
> booted a 601 since v5.4. Time to drop 601?

Would be great. I can submit a patch for that in August.

Christophe

> ---
>  arch/powerpc/include/asm/ptrace.h | 2 +-
>  arch/powerpc/include/asm/timex.h  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
> index f194339cef3b..155a197c0aa1 100644
> --- a/arch/powerpc/include/asm/ptrace.h
> +++ b/arch/powerpc/include/asm/ptrace.h
> @@ -243,7 +243,7 @@ static inline void set_trap_norestart(struct pt_regs *regs)
>  }
>
>  #define arch_has_single_step()	(1)
> -#ifndef CONFIG_BOOK3S_601
> +#ifndef CONFIG_PPC_BOOK3S_601
>  #define arch_has_block_step()	(true)
>  #else
>  #define arch_has_block_step()	(false)
>
> diff --git a/arch/powerpc/include/asm/timex.h b/arch/powerpc/include/asm/timex.h
> index d2d2c4bd8435..6047402b0a4d 100644
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -17,7 +17,7 @@ typedef unsigned long cycles_t;
>  static inline cycles_t get_cycles(void)
>  {
> -	if (IS_ENABLED(CONFIG_BOOK3S_601))
> +	if (IS_ENABLED(CONFIG_PPC_BOOK3S_601))
>  		return 0;
>
>  	return mftb();
> --
> 2.25.1
Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
> On 24-Jul-2020, at 5:56 PM, Ravi Bangoria wrote:
>
> Hi Athira,
>
>> +/* Function to return the extended register values */
>> +static u64 get_ext_regs_value(int idx)
>> +{
>> +	switch (idx) {
>> +	case PERF_REG_POWERPC_MMCR0:
>> +		return mfspr(SPRN_MMCR0);
>> +	case PERF_REG_POWERPC_MMCR1:
>> +		return mfspr(SPRN_MMCR1);
>> +	case PERF_REG_POWERPC_MMCR2:
>> +		return mfspr(SPRN_MMCR2);
>> +	default: return 0;
>> +	}
>> +}
>> +
>>  u64 perf_reg_value(struct pt_regs *regs, int idx)
>>  {
>> -	if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
>> -		return 0;
>> +	u64 PERF_REG_EXTENDED_MAX;
>
> PERF_REG_EXTENDED_MAX should be initialized. Otherwise ...
>
>> +
>> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
>> +		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;
>>
>>  	if (idx == PERF_REG_POWERPC_SIER &&
>>  	    (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) ||
>> @@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>>  	     IS_ENABLED(CONFIG_PPC32)))
>>  		return 0;
>>
>> +	if (idx >= PERF_REG_POWERPC_MAX && idx < PERF_REG_EXTENDED_MAX)
>> +		return get_ext_regs_value(idx);
>
> On a non-p9/p10 machine, PERF_REG_EXTENDED_MAX may contain a random
> value which will allow the user to pass this if condition
> unintentionally.
>
> Neat: PERF_REG_EXTENDED_MAX is a local variable so it should be in
> lowercase. Any specific reason to define it in capitals?

Hi Ravi,

There is no specific reason. I will include both these changes in the
next version.

Thanks,
Athira Rajeev
Re: [v3 13/15] tools/perf: Add perf tools support for extended register capability in powerpc
> On 24-Jul-2020, at 4:32 PM, Ravi Bangoria wrote:
>
> Hi Athira,
>
> On 7/17/20 8:08 PM, Athira Rajeev wrote:
>> From: Anju T Sudhakar
>>
>> Add extended regs to sample_reg_mask in the tool side to use
>> with `-I?` option. Perf tools side uses extended mask to display
>> the platform supported register names (with -I? option) to the user
>> and also send this mask to the kernel to capture the extended
>> registers in each sample. Hence decide the mask value based on the
>> processor version.
>>
>> Currently definitions for `mfspr`, `SPRN_PVR` are part of
>> `arch/powerpc/util/header.c`. Move this to a header file so that
>> these definitions can be re-used in other source files as well.
>
> It seems this patch has a regression.
>
> Without this patch:
>
> $ sudo ./perf record -I
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.458 MB perf.data (318 samples) ]
>
> With this patch:
>
> $ sudo ./perf record -I
> Error:
> dummy:HG: PMU Hardware doesn't support sampling/overflow-interrupts.
> Try 'perf stat'

Hi Ravi,

Thanks for reviewing this patch and also testing. The above issue
happens since commit 0a892c1c9472 ("perf record: Add dummy event during
system wide synthesis") which adds a dummy event. The fix for this
issue is currently discussed here:
https://lkml.org/lkml/2020/7/19/413

So once this fix is in, the issue will be resolved.

Thanks,
Athira

> Ravi
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
Pingfan Liu writes:

> On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch wrote:
>> Pingfan Liu writes:
>>> This will introduce extra dt updating payload for each involved lmb
>>> when hotplug. But it should be fine since drmem_update_dt() is memory
>>> based operation and hotplug is not a hot path.
>>
>> This is great analysis but the performance implications of the change
>> are grave. The add/remove paths here are already O(n) where n is the
>> quantity of memory assigned to the LP, this change would make it
>> O(n^2):
>>
>>   dlpar_memory_add_by_count
>>     for_each_drmem_lmb             <--
>>       dlpar_add_lmb
>>         drmem_update_dt(_v1|_v2)
>>           for_each_drmem_lmb       <--
>>
>> Memory add/remove isn't a hot path but quadratic runtime complexity
>> isn't acceptable. Its current performance is bad enough that I have
> Yes, the quadratic runtime complexity sounds terrible.
> And I am curious about the bug. Does the system have thousands of lmb?

Yes.

>> Not to mention we leak memory every time drmem_update_dt is called
>> because we can't safely free device tree properties :-(
> Do you know what block us to free it?

It's a longstanding problem. References to device tree properties
aren't counted or tracked so there's no way to safely free them unless
the node itself is released. But the ibm,dynamic-reconfiguration-memory
node does not ever go away and its properties are only subject to
updates.

Maybe there's a way to address the specific case of
ibm,dynamic-reconfiguration-memory and the ibm,dynamic-memory(-v2)
properties, instead of tackling the general problem.

Regardless of all that, the drmem code needs better data structures and
lookup functions.
Re: [PATCH v4 06/12] ppc64/kexec_file: restrict memory usage of kdump kernel
On 24/07/20 5:36 am, Thiago Jung Bauermann wrote:
>
> Hari Bathini writes:
>
>> Kdump kernel, used for capturing the kernel core image, is supposed
>> to use only specific memory regions to avoid corrupting the image to
>> be captured. The regions are the crashkernel range - the memory
>> reserved explicitly for the kdump kernel, memory used for the
>> tce-table, the OPAL region and the RTAS region as applicable.
>> Restrict kdump kernel memory to use only these regions by setting up
>> the usable-memory DT property. Also, tell the kdump kernel to run at
>> the loaded address by setting the magic word at 0x5c.
>>
>> Signed-off-by: Hari Bathini
>> Tested-by: Pingfan Liu
>> ---
>>
>> v3 -> v4:
>> * Updated get_node_path() to be an iterative function instead of a
>>   recursive one.
>> * Added comment explaining why low memory is added to kdump kernel's
>>   usable memory ranges though it doesn't fall in crashkernel region.
>> * For correctness, added fdt_add_mem_rsv() for the low memory being
>>   added to kdump kernel's usable memory ranges.
>
> Good idea.
>
>> * Fixed prop pointer update in add_usable_mem_property() and changed
>>   duple to tuple as suggested by Thiago.
>
>> +/**
>> + * get_node_pathlen - Get the full path length of the given node.
>> + * @dn: Node.
>> + *
>> + * Also, counts '/' at the end of the path.
>> + * For example, /memory@0 will be "/memory@0/\0" => 11 bytes.
>
> Wouldn't this function return 10 in the case of /memory@0?

Actually, it does return 11. The +1 while returning is for counting the
NUL. On top of that, we count an extra '/' for the root node, so it
ends up as 11: '/' 'memory@0' '/' '\0'. Note the extra '/' before '\0'.
Let me handle the root node separately. That should avoid the
confusion.

>> + *
>> + * Returns the string length of the node's full path.
>> + */
>
> Maybe it's me (by analogy with strlen()), but I would expect "string
> length" to not include the terminating \0. I suggest renaming the
> function to something like get_node_path_size() and do s/length/size/
> in the comment above if it's supposed to count the terminating \0.

Sure, will update the function name.

Thanks,
Hari
Re: [PATCH 1/1 V4] : PCIE PHB reset
On Mon, 13 Jul 2020 09:39:33 -0500, wenxi...@linux.vnet.ibm.com wrote:
> Several device drivers hit EEH (Extended Error Handling) when
> triggering kdump on pseries PowerVM. This patch implements a reset of
> the PHBs in generic PCI code when triggering kdump. The PHB reset
> stops all PCI transactions from the normal kernel. We have tested the
> patch in several environments:
> - direct slot adapters
> - adapters under the switch
> - a VF adapter in PowerVM
> - a VF adapter/adapter in KVM guest.
>
> [...]

Applied to powerpc/next.

[1/1] powerpc/pseries: PCIE PHB reset
      https://git.kernel.org/powerpc/c/5a090f7c363fdc09b99222eae679506a58e7cc68

cheers
Re: [PATCH -next] powerpc: Remove unneeded inline functions
On Fri, 17 Jul 2020 19:27:14 +0800, YueHaibing wrote:
> Both of those functions are only called from 64-bit only code, so the
> stubs should not be needed at all.

Applied to powerpc/next.

[1/1] powerpc: Remove unneeded inline functions
      https://git.kernel.org/powerpc/c/a3f3f8aa1f72dafe1450ccf8cbdfb1d12d42853a

cheers
Re: [PATCH trivial] ppc64/mm: remove comment that is no longer valid
On Tue, 21 Jul 2020 14:49:15 +0530, Santosh Sivaraj wrote:
> hash_low_64.S was removed in [1], and since then flush_hash_page() is
> not called from any assembly routine.
>
> [1]: commit a43c0eb8364c0 ("powerpc/mm: Convert 4k insert from asm to C")

Applied to powerpc/next.

[1/1] powerpc/mm/hash64: Remove comment that is no longer valid
      https://git.kernel.org/powerpc/c/69507b984ddce803df81215cc7813825189adafa

cheers
Re: [PATCH v2 1/2] powerpc/mce: Add MCE notification chain
On Thu, 9 Jul 2020 19:21:41 +0530, Santosh Sivaraj wrote:
> Introduce a notification chain which lets us know about uncorrected
> memory errors (UE). This would help prospective users in the pmem or
> nvdimm subsystem to track bad blocks for better handling of persistent
> memory allocations.

Applied to powerpc/next.

[1/2] powerpc/mce: Add MCE notification chain
      https://git.kernel.org/powerpc/c/c37a63afc429ce959402168f67e4f094ab639ace
[2/2] powerpc/papr/scm: Add bad memory ranges to nvdimm bad ranges
      https://git.kernel.org/powerpc/c/85343a8da2d969df1a10ada8f7cb857d52ea70a6

cheers
Re: [PATCH v4 0/3] powernv/idle: Power9 idle cleanup
On Tue, 21 Jul 2020 21:07:05 +0530, Pratik Rajesh Sampat wrote:
> v3: https://lkml.org/lkml/2020/7/17/1093
>
> Changelog v3 --> v4:
> Based on comments from Nicholas Piggin and Gautham Shenoy,
> 1. Changed the naming of pnv_first_spr_loss_level from
>    pnv_first_fullstate_loss_level to deep_spr_loss_state
> 2. Make the P9 PVR check only on the top level function
>    pnv_probe_idle_states and let the rest of the checks be DT based
>    because it is faster to do so
>
> [...]

Applied to powerpc/next.

[1/3] powerpc/powernv/idle: Replace CPU feature check with PVR check
      https://git.kernel.org/powerpc/c/8747bf36f312356f8a295a0c39ff092d65ce75ae
[2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
      https://git.kernel.org/powerpc/c/dcbbfa6b05daca94ebcdbce80a7cf05c717d2942
[3/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
      https://git.kernel.org/powerpc/c/5c92fb1b46102e1efe0eed69e743f711bc1c7d2e

cheers
Re: [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering
On Thu, 16 Jul 2020 12:38:20 -0700, Palmer Dabbelt wrote:
> This primitive has been renamed, but because it was spelled incorrectly in the
> first place it must have escaped the fixup patch. As far as I can tell this
> logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
> implementation, which is "sync" rather than "hwsync" but those are the same
> (though I'm not that familiar with PowerPC).

Applied to powerpc/next.

[1/1] powerpc/64: Fix an out of date comment about MMIO ordering
      https://git.kernel.org/powerpc/c/147c13413c04bc6a2bd76f2503402905e5e98cff

cheers
Re: [PATCH v3] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
On Thu, 16 Jul 2020 11:35:22 +1000, Nicholas Piggin wrote:
> powerpc return from interrupt and return from system call sequences are
> context synchronising.

Applied to powerpc/next.

[1/1] powerpc: Select ARCH_HAS_MEMBARRIER_SYNC_CORE
      https://git.kernel.org/powerpc/c/2384b36f9156c3b815a5ce5f694edc5054ab7625

cheers
Re: [PATCH v2 0/3] remove PROT_SAO support and disable
On Fri, 3 Jul 2020 11:19:55 +1000, Nicholas Piggin wrote:
> It was suggested that I post this to a wider audience on account of
> the change to supported userspace features in patch 2 particularly.
>
> Thanks,
> Nick
>
> Nicholas Piggin (3):
>   powerpc: remove stale calc_vm_prot_bits comment
>   powerpc/64s: remove PROT_SAO support
>   powerpc/64s/hash: disable subpage_prot syscall by default
>
> [...]

Applied to powerpc/next.

[1/3] powerpc: Remove stale calc_vm_prot_bits() comment
      https://git.kernel.org/powerpc/c/f4ac1774f2cba44994ce9ac0a65772e4656ac2df
[2/3] powerpc/64s: Remove PROT_SAO support
      https://git.kernel.org/powerpc/c/5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05
[3/3] powerpc/64s/hash: Disable subpage_prot syscall by default
      https://git.kernel.org/powerpc/c/63396ada804c676e070bd1b8663046f18698ab27

cheers
Re: [PATCH] powerpc/powernv: machine check handler for POWER10
On Fri, 3 Jul 2020 09:33:43 +1000, Nicholas Piggin wrote:
>

Applied to powerpc/next.

[1/1] powerpc/powernv: Machine check handler for POWER10
      https://git.kernel.org/powerpc/c/201220bb0e8cbc163ec7f550b3b7b3da46eb5877

cheers
Re: [PATCH v2] powerpc/spufs: Rework fcheck() usage
On Fri, 8 May 2020 23:06:33 +1000, Michael Ellerman wrote:
> Currently the spu coredump code triggers an RCU warning:
>
>   =
>   WARNING: suspicious RCU usage
>   5.7.0-rc3-01755-g7cd49f0b7ec7 #1 Not tainted
>   -
>   include/linux/fdtable.h:95 suspicious rcu_dereference_check() usage!
>
> [...]

Applied to powerpc/next.

[1/1] powerpc/spufs: Rework fcheck() usage
      https://git.kernel.org/powerpc/c/38b407be172d3d15afdbfe172691b7caad98120f

cheers
Re: [PATCH 1/2] powerpc/64s/exception: treat NIA below __end_interrupts as soft-masked
On Thu, 11 Jun 2020 18:12:02 +1000, Nicholas Piggin wrote:
> The scv instruction causes an interrupt which can enter the kernel with
> MSR[EE]=1, thus allowing interrupts to hit at any time. These must not
> be taken as normal interrupts, because they come from MSR[PR]=0 context,
> and yet the kernel stack is not yet set up and r13 is not set to the
> PACA).
>
> Treat this as a soft-masked interrupt regardless of the soft masked
> state. This does not affect behaviour yet, because currently all
> interrupts are taken with MSR[EE]=0.

Applied to powerpc/next.

[1/2] powerpc/64s/exception: treat NIA below __end_interrupts as soft-masked
      https://git.kernel.org/powerpc/c/b2dc2977cba48990df45e0a96150663d4f342700
[2/2] powerpc/64s: system call support for scv/rfscv instructions
      https://git.kernel.org/powerpc/c/7fa95f9adaee7e5cbb195d3359741120829e488b

cheers
Re: [PATCH] selftests/powerpc: Add test of memcmp at end of page
On Wed, 22 Jul 2020 15:53:15 +1000, Michael Ellerman wrote:
> Update our memcmp selftest, to test the case where we're comparing up
> to the end of a page and the subsequent page is not mapped. We have to
> make sure we don't read off the end of the page and cause a fault.
>
> We had a bug there in the past, fixed in commit
> d9470757398a ("powerpc/64: Fix memcmp reading past the end of src/dest").

Applied to powerpc/next.

[1/1] selftests/powerpc: Add test of memcmp at end of page
      https://git.kernel.org/powerpc/c/8ac9b9d61f0eceba6ce571e7527798465ae9a7c5

cheers
Re: [PATCH] selftests/powerpc: Run per_event_excludes test on Power8 or later
On Thu, 16 Jul 2020 22:21:42 +1000, Michael Ellerman wrote:
> The per_event_excludes test wants to run on Power8 or later. But
> currently it checks that AT_BASE_PLATFORM *equals* power8, which means
> it only runs on Power8.
>
> Fix it to check for the ISA 2.07 feature, which will be set on Power8
> and later CPUs.

Applied to powerpc/next.

[1/1] selftests/powerpc: Run per_event_excludes test on Power8 or later
      https://git.kernel.org/powerpc/c/9d1ebe9a98c1d7bf7cfbe1dba0052230c042ecdb

cheers
Re: [PATCH] powerpc/perf: fix missing is_sier_aviable() during build
On Sun, 14 Jun 2020 14:06:04 +0530, Madhavan Srinivasan wrote:
> Compilation error:
>
>   arch/powerpc/perf/perf_regs.c:80:undefined reference to `.is_sier_available'
>
> Currently is_sier_available() is part of core-book3s.c.
> But then, core-book3s.c is added to build based on
> CONFIG_PPC_PERF_CTRS. A config with CONFIG_PERF_EVENTS
> and without CONFIG_PPC_PERF_CTRS will have a build break
> because of missing is_sier_available(). Patch adds
> is_sier_available() in asm/perf_event.h to fix the build
> break for configs missing CONFIG_PPC_PERF_CTRS.

Applied to powerpc/next.

[1/1] powerpc/perf: Fix missing is_sier_aviable() during build
      https://git.kernel.org/powerpc/c/3c9450c053f88e525b2db1e6990cdf34d14e7696

cheers
Re: [PATCH 1/1] KVM/PPC: Fix typo on H_DISABLE_AND_GET hcall
On Mon, 6 Jul 2020 21:48:12 -0300, Leonardo Bras wrote:
> On PAPR+ the hcall() on 0x1B0 is called H_DISABLE_AND_GET, but got
> defined as H_DISABLE_AND_GETC instead.
>
> This define was introduced with a typo in commit
> ("[PATCH] powerpc: Extends HCALL interface for InfiniBand usage"), and was
> later used without having the typo noticed.

Applied to powerpc/next.

[1/1] KVM: PPC: Fix typo on H_DISABLE_AND_GET hcall
      https://git.kernel.org/powerpc/c/0f10228c6ff6af36cbb31af35b01f76cdb0b3fc1

cheers
Re: [PATCH 1/5] powerpc sstep: Add tests for prefixed integer load/stores
On Mon, 25 May 2020 12:59:19 +1000, Jordan Niethe wrote:
> Add tests for the prefixed versions of the integer load/stores that are
> currently tested. This includes the following instructions:
>   * Prefixed Load Doubleword (pld)
>   * Prefixed Load Word and Zero (plwz)
>   * Prefixed Store Doubleword (pstd)
>
> Skip the new tests if ISA v3.1 is unsupported.

Applied to powerpc/next.

[1/5] powerpc/sstep: Add tests for prefixed integer load/stores
      https://git.kernel.org/powerpc/c/b6b54b42722a2393056c891c0d05cd8cc40eb776
[2/5] powerpc/sstep: Add tests for prefixed floating-point load/stores
      https://git.kernel.org/powerpc/c/0396de6d8561c721b03fce386eb9682b37a26013
[3/5] powerpc/sstep: Set NIP in instruction emulation tests
      https://git.kernel.org/powerpc/c/1c89cf7fbed36f078b20fd47d308b4fc6dbff5f6
[4/5] powerpc/sstep: Let compute tests specify a required cpu feature
      https://git.kernel.org/powerpc/c/301ebf7d69f6709575d137a41a0291f69f343aed
[5/5] powerpc/sstep: Add tests for Prefixed Add Immediate
      https://git.kernel.org/powerpc/c/4f825900786e1c24e4c48622e12eb493a6cd27b6

cheers
Re: [PATCH 1/4] powerpc: Add a ppc_inst_as_str() helper
On Tue, 2 Jun 2020 15:27:25 +1000, Jordan Niethe wrote:
> There are quite a few places where instructions are printed, this is
> done using a '%x' format specifier. With the introduction of prefixed
> instructions, this does not work well. Currently in these places,
> ppc_inst_val() is used for the value for %x so only the first word of
> prefixed instructions are printed.
>
> When the instructions are word instructions, only a single word should
> be printed. For prefixed instructions both the prefix and suffix should
> be printed. To accommodate both of these situations, instead of a '%x'
> specifier use '%s' and introduce a helper, __ppc_inst_as_str() which
> returns a char *. The char * __ppc_inst_as_str() returns is buffer that
> is passed to it by the caller.
>
> [...]

Patches 1-2 applied to powerpc/next.

[1/4] powerpc: Add a ppc_inst_as_str() helper
      https://git.kernel.org/powerpc/c/50428fdc53ba48f6936b10dfdc0d644972403908
[2/4] powerpc/xmon: Improve dumping prefixed instructions
      https://git.kernel.org/powerpc/c/8b98afc117aaf825c66d7ddd59f1849e559b42cd

cheers
Re: [PATCH] powerpc/spufs: fix the type of ret in spufs_arch_write_note
On Wed, 10 Jun 2020 10:55:54 +0200, Christoph Hellwig wrote:
> Both the ->dump method and snprintf return an int. So switch to an
> int and properly handle errors from ->dump.

Applied to powerpc/next.

[1/1] powerpc/spufs: Fix the type of ret in spufs_arch_write_note
      https://git.kernel.org/powerpc/c/7c7ff885c7bce40a487e41c68f1dac14dd2c8033

cheers
Re: [PATCH] powerpc: Replace HTTP links with HTTPS ones
On Sat, 18 Jul 2020 12:39:58 +0200, Alexander A. Klimov wrote:
> Rationale:
> Reduces attack surface on kernel devs opening the links for MITM
> as HTTPS traffic is much harder to manipulate.
>
> Deterministic algorithm:
> For each file:
>   If not .svg:
>     For each line:
>       If doesn't contain `\bxmlns\b`:
>         For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
>           If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
>             If both the HTTP and HTTPS versions
>             return 200 OK and serve the same content:
>               Replace HTTP with HTTPS.

Applied to powerpc/next.

[1/1] powerpc: Replace HTTP links with HTTPS ones
      https://git.kernel.org/powerpc/c/c8ed9fc9d29e24dafd08971e6a0c6b302a8ade2d

cheers
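The quoted algorithm is concrete enough to sketch in code. Below is a minimal, hypothetical Python re-implementation of the per-line rewrite step (not Alexander's actual script); the regexes are taken from the description above, and the `serves_same_content` check is injected as a callable because the real step, fetching both URLs and comparing the responses, is deliberately left out of this offline sketch.

```python
import re

# Link pattern and exclusions quoted from the patch description
# (the link character class is written with \s for brevity).
LINK_RE = re.compile(r"\bhttp://[^#\s]*(?:\w|/)")
EXCLUDE_RE = re.compile(r"\bgnu\.org/license|\bmozilla\.org/MPL\b")

def upgrade_links(line: str, serves_same_content) -> str:
    """Replace http:// links with https:// on one line, when the
    injected serves_same_content(url) check approves the upgrade."""
    if re.search(r"\bxmlns\b", line):
        return line  # skip XML namespace declarations

    def repl(match: re.Match) -> str:
        url = match.group(0)
        if EXCLUDE_RE.search(url):
            return url  # keep license links untouched
        # The real script fetches both URLs and compares the bodies;
        # here that decision is delegated to the caller.
        if serves_same_content(url):
            return "https://" + url[len("http://"):]
        return url

    return LINK_RE.sub(repl, line)

# Example with a stubbed check that approves every upgrade.
print(upgrade_links("See http://www.kernel.org/doc/ for details.",
                    lambda url: True))
# -> See https://www.kernel.org/doc/ for details.
```

The injected callable also makes the rewrite step testable without any network access, which matches the deterministic spirit of the quoted algorithm.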
Re: [PATCH] macintosh/therm_adt746x: Replace HTTP links with HTTPS ones
On Fri, 17 Jul 2020 20:29:40 +0200, Alexander A. Klimov wrote:
> Rationale:
> Reduces attack surface on kernel devs opening the links for MITM
> as HTTPS traffic is much harder to manipulate.
>
> Deterministic algorithm:
> For each file:
>   If not .svg:
>     For each line:
>       If doesn't contain `\bxmlns\b`:
>         For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
>           If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
>             If both the HTTP and HTTPS versions
>             return 200 OK and serve the same content:
>               Replace HTTP with HTTPS.

Applied to powerpc/next.

[1/1] macintosh/therm_adt746x: Replace HTTP links with HTTPS ones
      https://git.kernel.org/powerpc/c/1666e5ea2f838f4266e50e4f3d973c0824256429

cheers
Re: [PATCH] macintosh/adb: Replace HTTP links with HTTPS ones
On Fri, 17 Jul 2020 20:35:22 +0200, Alexander A. Klimov wrote:
> Rationale:
> Reduces attack surface on kernel devs opening the links for MITM
> as HTTPS traffic is much harder to manipulate.
>
> Deterministic algorithm:
> For each file:
>   If not .svg:
>     For each line:
>       If doesn't contain `\bxmlns\b`:
>         For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
>           If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
>             If both the HTTP and HTTPS versions
>             return 200 OK and serve the same content:
>               Replace HTTP with HTTPS.

Applied to powerpc/next.

[1/1] macintosh/adb: Replace HTTP links with HTTPS ones
      https://git.kernel.org/powerpc/c/a7beab413e2e67dd1abe6bdd0001576892a89e81

cheers
Re: [PATCH v2 0/4] Prefixed instruction tests to cover negative cases
On Fri, 26 Jun 2020 15:21:54 +0530, Balamuruhan S wrote:
> This patchset adds support to test negative scenarios and adds testcase
> for paddi with few fixes. It is based on powerpc/next and on top of
> Jordan's tests for prefixed instructions patchsets,
>
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-May/211394.html
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-June/211768.html
>
> [...]

Applied to powerpc/next.

[1/4] powerpc/test_emulate_step: Enhancement to test negative scenarios
      https://git.kernel.org/powerpc/c/93c3a0ba2a0863a5c82a518d64044434f82a57f5
[2/4] powerpc/test_emulate_step: Add negative tests for prefixed addi
      https://git.kernel.org/powerpc/c/7e67c73b939b25d4ad18a536e52282aa35d8ee56
[3/4] powerpc/sstep: Introduce macros to retrieve Prefix instruction operands
      https://git.kernel.org/powerpc/c/68a180a44c29d7e918ae7d3c18a01b0751d1c22f
[4/4] powerpc/test_emulate_step: Move extern declaration to sstep.h
      https://git.kernel.org/powerpc/c/e93ad65e3611b06288efdf0cfd76c012df3feec1

cheers
Re: [v3 00/15] powerpc/perf: Add support for power10 PMU Hardware
On Fri, 17 Jul 2020 10:38:12 -0400, Athira Rajeev wrote:
> The patch series adds support for power10 PMU hardware.
>
> Patches 1..3 are the clean up patches which refactors the way how
> PMU SPR's are stored in core-book3s and in KVM book3s, as well as update
> data type for PMU cache_events.
>
> Patches 12 and 13 adds base support for perf extended register
> capability in powerpc. Support for extended regs in power10 is
> covered in patches 14,15
>
> [...]

Patches 1-11 applied to powerpc/next.

[01/15] powerpc/perf: Update cpu_hw_event to use `struct` for storing MMCR registers
        https://git.kernel.org/powerpc/c/78d76819e6f04672989506e7792895a51438516e
[02/15] KVM: PPC: Book3S HV: Cleanup updates for kvm vcpu MMCR
        https://git.kernel.org/powerpc/c/7e4a145e5b675d5a9182f756950f001eaa256795
[03/15] powerpc/perf: Update Power PMU cache_events to u64 type
        https://git.kernel.org/powerpc/c/9d4fc86dcd510dab5521a6c891f9bf379b85a7e0
[04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs
        https://git.kernel.org/powerpc/c/c718547e4a92d74089f862457adf1f617c498e16
[05/15] KVM: PPC: Book3S HV: Save/restore new PMU registers
        https://git.kernel.org/powerpc/c/5752fe0b811bb3cee531c52074921c6dd09dc42d
[06/15] powerpc/xmon: Add PowerISA v3.1 PMU SPRs
        https://git.kernel.org/powerpc/c/1979ae8c7215718c7a98f038bad0122034ad6529
[07/15] powerpc/perf: Add Power10 PMU feature to DT CPU features
        https://git.kernel.org/powerpc/c/9908c826d5ed150637a3a4c0eec5146a0c438f21
[08/15] powerpc/perf: power10 Performance Monitoring support
        https://git.kernel.org/powerpc/c/a64e697cef23b3d24bac700f6d66c8e2bf8efccc
[09/15] powerpc/perf: Ignore the BHRB kernel address filtering for P10
        https://git.kernel.org/powerpc/c/bfe3b1945d5e0531103b3d4ab3a367a1a156d99a
[10/15] powerpc/perf: Add Power10 BHRB filter support for PERF_SAMPLE_BRANCH_IND_CALL/COND
        https://git.kernel.org/powerpc/c/80350a4bac992e3404067d31ff901ae9ff76aaa8
[11/15] powerpc/perf: BHRB control to disable BHRB logic when not used
        https://git.kernel.org/powerpc/c/1cade527f6e9bec6a6412d0641643c359ada8096

cheers
Re: [PATCH v6 00/23] powerpc/book3s/64/pkeys: Simplify the code
On Thu, 9 Jul 2020 08:59:23 +0530, Aneesh Kumar K.V wrote:
> This patch series update the pkey subsystem with more documentation and
> rename variables so that it is easy to follow the code. We drop the changes
> to support KUAP/KUEP with hash translation in this update. The changes
> are adding 200 cycles to null syscalls benchmark and I want to look at that
> closely before requesting a merge. The rest of the patches are included
> in this series. This should avoid having to carry a large patchset across
> the upstream merge. Some of the changes in here make the hash KUEP/KUAP
> addition simpler.
>
> [...]

Applied to powerpc/next.

[01/23] powerpc/book3s64/pkeys: Use PVR check instead of cpu feature
        https://git.kernel.org/powerpc/c/d79e7a5f26f1d179cbb915a8bf2469b6d7431c29
[02/23] powerpc/book3s64/pkeys: Fixup bit numbering
        https://git.kernel.org/powerpc/c/33699023f51f96ac9be38747e64967ea05e00bab
[03/23] powerpc/book3s64/pkeys: pkeys are supported only on hash on book3s.
        https://git.kernel.org/powerpc/c/b9658f83e721ddfcee3e08b16a6628420de424c3
[04/23] powerpc/book3s64/pkeys: Move pkey related bits in the linux page table
        https://git.kernel.org/powerpc/c/ee8b39331f89950b0a011c7965db5694f0153166
[05/23] powerpc/book3s64/pkeys: Explain key 1 reservation details
        https://git.kernel.org/powerpc/c/1f404058e2911afe08417ef82f17aba6adccfc63
[06/23] powerpc/book3s64/pkeys: Simplify the key initialization
        https://git.kernel.org/powerpc/c/f491fe3fb41eafc7a159874040e032ad41ade210
[07/23] powerpc/book3s64/pkeys: Prevent key 1 modification from userspace.
        https://git.kernel.org/powerpc/c/718d9b380174eb8fe16d67769395737b79654a02
[08/23] powerpc/book3s64/pkeys: kill cpu feature key CPU_FTR_PKEY
        https://git.kernel.org/powerpc/c/a24204c307962214996627e3f4caa8772b9b0cf4
[09/23] powerpc/book3s64/pkeys: Simplify pkey disable branch
        https://git.kernel.org/powerpc/c/a4678d4b477c3d2901f101986ca01406f3b7eaea
[10/23] powerpc/book3s64/pkeys: Convert pkey_total to num_pkey
        https://git.kernel.org/powerpc/c/c529afd7cbc71ae1dc44a31efc7c1c9db3c3a143
[11/23] powerpc/book3s64/pkeys: Make initial_allocation_mask static
        https://git.kernel.org/powerpc/c/3c8ab47362fe9a74f61b48efe957666a423c55a2
[12/23] powerpc/book3s64/pkeys: Mark all the pkeys above max pkey as reserved
        https://git.kernel.org/powerpc/c/3e4352aeb8b17eb1040ba288f586620e8294389d
[13/23] powerpc/book3s64/pkeys: Add MMU_FTR_PKEY
        https://git.kernel.org/powerpc/c/d3cd91fb8d2e202cf8ebb6f271898aaf37ecda8f
[14/23] powerpc/book3s64/kuep: Add MMU_FTR_KUEP
        https://git.kernel.org/powerpc/c/e10cc8715d180509a367d3ab25d40e4a1612cb2f
[15/23] powerpc/book3s64/pkeys: Use pkey_execute_disable_supported
        https://git.kernel.org/powerpc/c/2daf298de728dc37f32d0749fa4f59db36fa7d96
[16/23] powerpc/book3s64/pkeys: Use MMU_FTR_PKEY instead of pkey_disabled static key
        https://git.kernel.org/powerpc/c/f7045a45115b17fe695ea7075f5213706f202edb
[17/23] powerpc/book3s64/keys: Print information during boot.
        https://git.kernel.org/powerpc/c/7cdd3745f2d75aecc2b61368e2563ae54bfac59a
[18/23] powerpc/book3s64/keys/kuap: Reset AMR/IAMR values on kexec
        https://git.kernel.org/powerpc/c/000a42b35a54372597f0657f6b9875b38c641864
[19/23] powerpc/book3s64/kuap: Move UAMOR setup to key init function
        https://git.kernel.org/powerpc/c/e0d8e991be641ba0034c67785bf86f6c097869d6
[20/23] selftests/powerpc: ptrace-pkey: Rename variables to make it easier to follow code
        https://git.kernel.org/powerpc/c/9a11f12e0a6c374b3ef1ce81e32ce477d28eb1b8
[21/23] selftests/powerpc: ptrace-pkey: Update the test to mark an invalid pkey correctly
        https://git.kernel.org/powerpc/c/0eaa3b5ca7b5a76e3783639c828498343be66a01
[22/23] selftests/powerpc: ptrace-pkey: Don't update expected UAMOR value
        https://git.kernel.org/powerpc/c/3563b9bea0ca7f53e4218b5e268550341a49f333
[23/23] powerpc/book3s64/pkeys: Remove is_pkey_enabled()
        https://git.kernel.org/powerpc/c/482b9b3948675df60c015b2155011c1f93234992

cheers
Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
On Thu, 9 Jul 2020 18:49:21 +0530, Aneesh Kumar K.V wrote:
> This is the next version of the fixes for memory unplug on radix.
> The issues and the fix are described in the actual patches.
>
> Changes from v2:
> - Address review feedback
>
> Changes from v1:
> - Added back patch to drop split_kernel_mapping
> - Most of the split_kernel_mapping related issues are now described
>   in the removal patch
> - drop pte fragment change
> - use lmb size as the max mapping size.
> - Radix baremetal now use memory block size of 1G.
>
> [...]

Applied to powerpc/next.

[1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
      https://git.kernel.org/powerpc/c/645d5ce2f7d6cb4dcf6a4e087fb550e238d24283
[2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
      https://git.kernel.org/powerpc/c/9ce8853b4a735c8115f55ac0e9c2b27a4c8f80b5
[3/4] powerpc/mm/radix: Remove split_kernel_mapping()
      https://git.kernel.org/powerpc/c/d6d6ebfc5dbb4008be21baa4ec2ad45606578966
[4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
      https://git.kernel.org/powerpc/c/af9d00e93a4f062c5f160325d7b8f6f6744e

cheers
[PATCH 9/9] powerpc: Drop old comment about CONFIG_POWER
There's a comment in time.h referring to CONFIG_POWER, which doesn't
exist. That confuses scripts/checkkconfigsymbols.py.

Presumably the comment was referring to a CONFIG_POWER vs CONFIG_PPC, in
which case for CONFIG_POWER we would #define __USE_RTC to 1. But instead
we have CONFIG_PPC_BOOK3S_601, and these days we have IS_ENABLED().

So the comment is no longer relevant, drop it.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/include/asm/time.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index b287cfc2dd85..cb326720a8a1 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -39,7 +39,6 @@ struct div_result {
 };
 
 /* Accessor functions for the timebase (RTC on 601) registers. */
-/* If one day CONFIG_POWER is added just define __USE_RTC as 1 */
 #define __USE_RTC()	(IS_ENABLED(CONFIG_PPC_BOOK3S_601))
 
 #ifdef CONFIG_PPC64
-- 
2.25.1
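Several patches in this series were flagged by scripts/checkkconfigsymbols.py, which cross-references CONFIG_ usages in the tree against symbols actually declared in Kconfig files. A minimal, hypothetical sketch of that idea (not the real script, which also handles ifdef expressions, defconfig fragments and git diff mode) might look like:

```python
import re

# CONFIG_ references as they appear in sources, comments and Makefiles,
# versus symbols declared by `config FOO` / `menuconfig FOO` lines.
REFERENCE_RE = re.compile(r"\bCONFIG_([A-Za-z0-9_]+)")
KCONFIG_DECL_RE = re.compile(r"^\s*(?:menu)?config\s+([A-Za-z0-9_]+)", re.M)

def undefined_references(source_texts, kconfig_texts):
    """Return CONFIG_ names referenced in the sources but never
    declared in any of the given Kconfig texts."""
    declared = set()
    for text in kconfig_texts:
        declared.update(KCONFIG_DECL_RE.findall(text))
    referenced = set()
    for text in source_texts:
        referenced.update(REFERENCE_RE.findall(text))
    return sorted(referenced - declared)

# The stale comment from time.h trips exactly this kind of check:
source = "/* If one day CONFIG_POWER is added just define __USE_RTC as 1 */"
kconfig = "config PPC_BOOK3S_601\n\tbool\n"
print(undefined_references([source], [kconfig]))
# -> ['POWER']
```

Even a comment counts as a "reference" under this scheme, which is why the patch above removes the comment rather than the code.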
[PATCH 8/9] powerpc/kvm: Use correct CONFIG symbol in comment
This comment refers to the non-existent CONFIG_PPC_BOOK3S_XX, which
confuses scripts/checkkconfigsymbols.py. Change it to use the correct
symbol.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/kvm/book3s_interrupts.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S
index f7ad99d972ce..607a9b99c334 100644
--- a/arch/powerpc/kvm/book3s_interrupts.S
+++ b/arch/powerpc/kvm/book3s_interrupts.S
@@ -26,7 +26,7 @@
 #define FUNC(name)	name
 #define GET_SHADOW_VCPU(reg)	lwz	reg, (THREAD + THREAD_KVM_SVCPU)(r2)
 
-#endif /* CONFIG_PPC_BOOK3S_XX */
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 #define VCPU_LOAD_NVGPRS(vcpu) \
 	PPC_LL	r14, VCPU_GPR(R14)(vcpu);	\
-- 
2.25.1
[PATCH 7/9] powerpc/boot: Fix CONFIG_PPC_MPC52XX references
Commit 866bfc75f40e ("powerpc: conditionally compile platform-specific
serial drivers") made some code depend on CONFIG_PPC_MPC52XX, which
doesn't exist. Fix it to use CONFIG_PPC_MPC52xx.

Fixes: 866bfc75f40e ("powerpc: conditionally compile platform-specific serial drivers")
Signed-off-by: Michael Ellerman
---
 arch/powerpc/boot/Makefile | 2 +-
 arch/powerpc/boot/serial.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 4d43cb59b4a4..44af71543380 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -117,7 +117,7 @@ src-wlib-y := string.S crt0.S stdio.c decompress.c main.c \
 		elf_util.c $(zlib-y) devtree.c stdlib.c \
 		oflib.c ofconsole.c cuboot.c
 
-src-wlib-$(CONFIG_PPC_MPC52XX) += mpc52xx-psc.c
+src-wlib-$(CONFIG_PPC_MPC52xx) += mpc52xx-psc.c
 src-wlib-$(CONFIG_PPC64_BOOT_WRAPPER) += opal-calls.S opal.c
 ifndef CONFIG_PPC64_BOOT_WRAPPER
 src-wlib-y += crtsavres.S
diff --git a/arch/powerpc/boot/serial.c b/arch/powerpc/boot/serial.c
index 0bfa7e87e546..9a19e5905485 100644
--- a/arch/powerpc/boot/serial.c
+++ b/arch/powerpc/boot/serial.c
@@ -128,7 +128,7 @@ int serial_console_init(void)
 		 dt_is_compatible(devp, "fsl,cpm2-smc-uart"))
 		rc = cpm_console_init(devp, &serial_cd);
 #endif
-#ifdef CONFIG_PPC_MPC52XX
+#ifdef CONFIG_PPC_MPC52xx
 	else if (dt_is_compatible(devp, "fsl,mpc5200-psc-uart"))
 		rc = mpc5200_psc_console_init(devp, &serial_cd);
 #endif
-- 
2.25.1
[PATCH 5/9] powerpc/32s: Fix CONFIG_BOOK3S_601 uses
We have two uses of CONFIG_BOOK3S_601, which doesn't exist. Fix them to
use CONFIG_PPC_BOOK3S_601 which is the correct symbol.

Fixes: 12c3f1fd87bf ("powerpc/32s: get rid of CPU_FTR_601 feature")
Signed-off-by: Michael Ellerman
---
I think the bug in get_cycles() at least demonstrates that no one has
booted a 601 since v5.4. Time to drop 601?
---
 arch/powerpc/include/asm/ptrace.h | 2 +-
 arch/powerpc/include/asm/timex.h  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index f194339cef3b..155a197c0aa1 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -243,7 +243,7 @@ static inline void set_trap_norestart(struct pt_regs *regs)
 }
 
 #define arch_has_single_step()	(1)
-#ifndef CONFIG_BOOK3S_601
+#ifndef CONFIG_PPC_BOOK3S_601
 #define arch_has_block_step()	(true)
 #else
 #define arch_has_block_step()	(false)
diff --git a/arch/powerpc/include/asm/timex.h b/arch/powerpc/include/asm/timex.h
index d2d2c4bd8435..6047402b0a4d 100644
--- a/arch/powerpc/include/asm/timex.h
+++ b/arch/powerpc/include/asm/timex.h
@@ -17,7 +17,7 @@ typedef unsigned long cycles_t;
 
 static inline cycles_t get_cycles(void)
 {
-	if (IS_ENABLED(CONFIG_BOOK3S_601))
+	if (IS_ENABLED(CONFIG_PPC_BOOK3S_601))
 		return 0;
 
 	return mftb();
-- 
2.25.1
[PATCH 6/9] powerpc/32s: Remove TAUException wart in traps.c
All 32 and 64-bit builds that don't have CONFIG_TAU_INT enabled (all of
them), get a definition of TAUException() in traps.c.

On 64-bit it's completely useless, and just wastes ~120 bytes of text.
On 32-bit it allows the kernel to link because head_32.S calls it
unconditionally.

Instead follow the example of altivec_assist_exception(), and if
CONFIG_TAU_INT is not enabled just point it at unknown_exception using
the preprocessor.

Signed-off-by: Michael Ellerman
---
Can we just remove TAU_INT entirely? It's in zero defconfigs and doesn't
sound like something anyone really wants to enable:

  However, on some cpus it appears that the TAU interrupt hardware
  is buggy and can cause a situation which would lead unexplained
  hard lockups. Unless you are extending the TAU driver, or enjoy
  kernel/hardware debugging, leave this option off.
---
 arch/powerpc/kernel/head_32.S | 4 ++++
 arch/powerpc/kernel/traps.c   | 8 --------
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 705c042309d8..dcfb7dceb6d6 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -671,6 +671,10 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_NEED_DTLB_SW_LRU)
 
 #ifndef CONFIG_ALTIVEC
 #define altivec_assist_exception	unknown_exception
+#endif
+
+#ifndef CONFIG_TAU_INT
+#define TAUException	unknown_exception
 #endif
 
 EXCEPTION(0x1300, Trap_13, instruction_breakpoint_exception, EXC_XFER_STD)
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 97413a385720..d1ebe152f210 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -2060,14 +2060,6 @@
 NOKPROBE_SYMBOL(DebugException);
 #endif /* CONFIG_PPC_ADV_DEBUG_REGS */
 
-#if !defined(CONFIG_TAU_INT)
-void TAUException(struct pt_regs *regs)
-{
-	printk("TAU trap at PC: %lx, MSR: %lx, vector=%lx%s\n",
-	       regs->nip, regs->msr, regs->trap, print_tainted());
-}
-#endif /* CONFIG_INT_TAU */
-
 #ifdef CONFIG_ALTIVEC
 void altivec_assist_exception(struct pt_regs *regs)
 {
-- 
2.25.1
[PATCH 3/9] powerpc/52xx: Fix comment about CONFIG_BDI*
There's a comment in lite5200_sleep.S that refers to "CONFIG_BDI*". This
confuses scripts/checkkconfigsymbols.py, which thinks it should be able
to find CONFIG_BDI.

Change the comment to refer to CONFIG_BDI_SWITCH which is presumably
roughly what it was referring to. AFAICS there never has been a
CONFIG_BDI.

Signed-off-by: Michael Ellerman
---
If anyone has a better idea what it means feel free to reply.
---
 arch/powerpc/platforms/52xx/lite5200_sleep.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/52xx/lite5200_sleep.S b/arch/powerpc/platforms/52xx/lite5200_sleep.S
index 70083649c9ea..11475c58ea43 100644
--- a/arch/powerpc/platforms/52xx/lite5200_sleep.S
+++ b/arch/powerpc/platforms/52xx/lite5200_sleep.S
@@ -56,7 +56,7 @@
 /*
  * save stuff BDI overwrites
  * 0xf0 (0xe0->0x100 gets overwritten when BDI connected;
- *   even when CONFIG_BDI* is disabled and MMU XLAT commented; heisenbug?))
+ *   even when CONFIG_BDI_SWITCH is disabled and MMU XLAT commented; heisenbug?))
  * WARNING: self-refresh doesn't seem to work when BDI2000 is connected,
  * possibly because BDI sets SDRAM registers before wakeup code does
  */
-- 
2.25.1
[PATCH 4/9] powerpc/64e: Drop dead BOOK3E_MMU_TLB_STATS code
This code was merged 11 years ago in commit 13363ab9b9d0 ("powerpc: Add
definitions used by exception handling on 64-bit Book3E") but was never
able to be built because CONFIG_BOOK3E_MMU_TLB_STATS never existed.

Remove it.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/include/asm/exception-64e.h | 53 +---
 arch/powerpc/mm/nohash/tlb_low_64e.S     | 47 ++---
 2 files changed, 4 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64e.h b/arch/powerpc/include/asm/exception-64e.h
index 72b6657acd2d..40cdcb2fb057 100644
--- a/arch/powerpc/include/asm/exception-64e.h
+++ b/arch/powerpc/include/asm/exception-64e.h
@@ -66,14 +66,7 @@
 #define EX_TLB_SRR0	(10 * 8)
 #define EX_TLB_SRR1	(11 * 8)
 #define EX_TLB_R7	(12 * 8)
-#ifdef CONFIG_BOOK3E_MMU_TLB_STATS
-#define EX_TLB_R8	(13 * 8)
-#define EX_TLB_R9	(14 * 8)
-#define EX_TLB_LR	(15 * 8)
-#define EX_TLB_SIZE	(16 * 8)
-#else
 #define EX_TLB_SIZE	(13 * 8)
-#endif
 
 #define	START_EXCEPTION(label)				\
 	.globl exc_##label##_book3e;			\
@@ -110,8 +103,7 @@ exc_##label##_book3e:
 	std	r11,EX_TLB_R12(r12);			\
 	mtspr	SPRN_SPRG_TLB_EXFRAME,r14;		\
 	std	r15,EX_TLB_SRR1(r12);			\
-	std	r16,EX_TLB_SRR0(r12);			\
-	TLB_MISS_PROLOG_STATS
+	std	r16,EX_TLB_SRR0(r12);
 
 /* And these are the matching epilogs that restores things
  *
@@ -143,7 +135,6 @@ exc_##label##_book3e:
 	mtspr	SPRN_SRR0,r15;				\
 	ld	r15,EX_TLB_R15(r12);			\
 	mtspr	SPRN_SRR1,r16;				\
-	TLB_MISS_RESTORE_STATS				\
 	ld	r16,EX_TLB_R16(r12);			\
 	ld	r12,EX_TLB_R12(r12);			\
 
@@ -158,48 +149,6 @@ exc_##label##_book3e:
 	addi	r11,r13,PACA_EXTLB;			\
 	TLB_MISS_RESTORE(r11)
 
-#ifdef CONFIG_BOOK3E_MMU_TLB_STATS
-#define TLB_MISS_PROLOG_STATS				\
-	mflr	r10;					\
-	std	r8,EX_TLB_R8(r12);			\
-	std	r9,EX_TLB_R9(r12);			\
-	std	r10,EX_TLB_LR(r12);
-#define TLB_MISS_RESTORE_STATS				\
-	ld	r16,EX_TLB_LR(r12);			\
-	ld	r9,EX_TLB_R9(r12);			\
-	ld	r8,EX_TLB_R8(r12);			\
-	mtlr	r16;
-#define TLB_MISS_STATS_D(name)				\
-	addi	r9,r13,MMSTAT_DSTATS+name;		\
-	bl	tlb_stat_inc;
-#define TLB_MISS_STATS_I(name)				\
-	addi	r9,r13,MMSTAT_ISTATS+name;		\
-	bl	tlb_stat_inc;
-#define TLB_MISS_STATS_X(name)				\
-	ld	r8,PACA_EXTLB+EX_TLB_ESR(r13);		\
-	cmpdi	cr2,r8,-1;				\
-	beq	cr2,61f;				\
-	addi	r9,r13,MMSTAT_DSTATS+name;		\
-	b	62f;					\
-61:	addi	r9,r13,MMSTAT_ISTATS+name;		\
-62:	bl	tlb_stat_inc;
-#define TLB_MISS_STATS_SAVE_INFO			\
-	std	r14,EX_TLB_ESR(r12);	/* save ESR */
-#define TLB_MISS_STATS_SAVE_INFO_BOLTED			\
-	std	r14,PACA_EXTLB+EX_TLB_ESR(r13);	/* save ESR */
-#else
-#define TLB_MISS_PROLOG_STATS
-#define TLB_MISS_RESTORE_STATS
-#define TLB_MISS_PROLOG_STATS_BOLTED
-#define TLB_MISS_RESTORE_STATS_BOLTED
-#define TLB_MISS_STATS_D(name)
-#define TLB_MISS_STATS_I(name)
-#define TLB_MISS_STATS_X(name)
-#define TLB_MISS_STATS_Y(name)
-#define TLB_MISS_STATS_SAVE_INFO
-#define TLB_MISS_STATS_SAVE_INFO_BOLTED
-#endif
-
 #define SET_IVOR(vector_number, vector_offset)	\
 	LOAD_REG_ADDR(r3,interrupt_base_book3e);\
 	ori	r3,r3,vector_offset@l;		\
diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
index d5e2704d0096..bf24451f3e71 100644
--- a/arch/powerpc/mm/nohash/tlb_low
[PATCH 2/9] powerpc/configs: Remove dead symbols
Remove references to symbols that no longer exist as reported by
scripts/checkkconfigsymbols.py.

Signed-off-by: Michael Ellerman
---
 arch/powerpc/configs/44x/akebono_defconfig      | 1 -
 arch/powerpc/configs/85xx/xes_mpc85xx_defconfig | 3 ---
 arch/powerpc/configs/86xx-hw.config             | 2 --
 arch/powerpc/configs/fsl-emb-nonhw.config       | 1 -
 arch/powerpc/configs/g5_defconfig               | 1 -
 arch/powerpc/configs/linkstation_defconfig      | 1 -
 arch/powerpc/configs/mpc512x_defconfig          | 1 -
 arch/powerpc/configs/mpc83xx_defconfig          | 1 -
 arch/powerpc/configs/mvme5100_defconfig         | 1 -
 arch/powerpc/configs/pasemi_defconfig           | 1 -
 arch/powerpc/configs/pmac32_defconfig           | 8 --------
 arch/powerpc/configs/powernv_defconfig          | 1 -
 arch/powerpc/configs/ppc40x_defconfig           | 3 ---
 arch/powerpc/configs/ppc64_defconfig            | 1 -
 arch/powerpc/configs/pseries_defconfig          | 1 -
 15 files changed, 27 deletions(-)

diff --git a/arch/powerpc/configs/44x/akebono_defconfig b/arch/powerpc/configs/44x/akebono_defconfig
index 60d5fa2c3b93..3894ba8f8ffc 100644
--- a/arch/powerpc/configs/44x/akebono_defconfig
+++ b/arch/powerpc/configs/44x/akebono_defconfig
@@ -56,7 +56,6 @@ CONFIG_BLK_DEV_SD=y
 # CONFIG_NET_VENDOR_DEC is not set
 # CONFIG_NET_VENDOR_DLINK is not set
 # CONFIG_NET_VENDOR_EMULEX is not set
-# CONFIG_NET_VENDOR_EXAR is not set
 CONFIG_IBM_EMAC=y
 # CONFIG_NET_VENDOR_MARVELL is not set
 # CONFIG_NET_VENDOR_MELLANOX is not set
diff --git a/arch/powerpc/configs/85xx/xes_mpc85xx_defconfig b/arch/powerpc/configs/85xx/xes_mpc85xx_defconfig
index d50aca608736..3a6381aa9fdc 100644
--- a/arch/powerpc/configs/85xx/xes_mpc85xx_defconfig
+++ b/arch/powerpc/configs/85xx/xes_mpc85xx_defconfig
@@ -51,9 +51,6 @@ CONFIG_NET_IPIP=y
 CONFIG_IP_MROUTE=y
 CONFIG_IP_PIMSM_V1=y
 CONFIG_IP_PIMSM_V2=y
-# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
-# CONFIG_INET_XFRM_MODE_TUNNEL is not set
-# CONFIG_INET_XFRM_MODE_BEET is not set
 CONFIG_MTD=y
 CONFIG_MTD_REDBOOT_PARTS=y
 CONFIG_MTD_CMDLINE_PARTS=y
diff --git a/arch/powerpc/configs/86xx-hw.config b/arch/powerpc/configs/86xx-hw.config
index 151164cf8cb3..0cb24b33c88e 100644
--- a/arch/powerpc/configs/86xx-hw.config
+++ b/arch/powerpc/configs/86xx-hw.config
@@ -32,8 +32,6 @@ CONFIG_HW_RANDOM=y
 CONFIG_HZ_1000=y
 CONFIG_I2C_MPC=y
 CONFIG_I2C=y
-# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
-# CONFIG_INET_XFRM_MODE_TUNNEL is not set
 CONFIG_INPUT_FF_MEMLESS=m
 # CONFIG_INPUT_KEYBOARD is not set
 # CONFIG_INPUT_MOUSEDEV is not set
diff --git a/arch/powerpc/configs/fsl-emb-nonhw.config b/arch/powerpc/configs/fsl-emb-nonhw.config
index 3c7dad19a691..df37efed0aec 100644
--- a/arch/powerpc/configs/fsl-emb-nonhw.config
+++ b/arch/powerpc/configs/fsl-emb-nonhw.config
@@ -56,7 +56,6 @@ CONFIG_IKCONFIG=y
 CONFIG_INET_AH=y
 CONFIG_INET_ESP=y
 CONFIG_INET_IPCOMP=y
-# CONFIG_INET_XFRM_MODE_BEET is not set
 CONFIG_INET=y
 CONFIG_IP_ADVANCED_ROUTER=y
 CONFIG_IP_MROUTE=y
diff --git a/arch/powerpc/configs/g5_defconfig b/arch/powerpc/configs/g5_defconfig
index a68c7f3af10e..1c674c4c1d86 100644
--- a/arch/powerpc/configs/g5_defconfig
+++ b/arch/powerpc/configs/g5_defconfig
@@ -51,7 +51,6 @@ CONFIG_NF_CONNTRACK_FTP=m
 CONFIG_NF_CONNTRACK_IRC=m
 CONFIG_NF_CONNTRACK_TFTP=m
 CONFIG_NF_CT_NETLINK=m
-CONFIG_NF_CONNTRACK_IPV4=m
 CONFIG_DEVTMPFS=y
 CONFIG_DEVTMPFS_MOUNT=y
 CONFIG_BLK_DEV_LOOP=y
diff --git a/arch/powerpc/configs/linkstation_defconfig b/arch/powerpc/configs/linkstation_defconfig
index ea59f3d146df..d4be64f190ff 100644
--- a/arch/powerpc/configs/linkstation_defconfig
+++ b/arch/powerpc/configs/linkstation_defconfig
@@ -37,7 +37,6 @@ CONFIG_NF_CONNTRACK_TFTP=m
 CONFIG_NETFILTER_XT_MATCH_MAC=m
 CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
 CONFIG_NETFILTER_XT_MATCH_STATE=m
-CONFIG_NF_CONNTRACK_IPV4=m
 CONFIG_IP_NF_IPTABLES=m
 CONFIG_IP_NF_FILTER=m
 CONFIG_IP_NF_TARGET_REJECT=m
diff --git a/arch/powerpc/configs/mpc512x_defconfig b/arch/powerpc/configs/mpc512x_defconfig
index e39346b3dc3b..e75d3f3060c9 100644
--- a/arch/powerpc/configs/mpc512x_defconfig
+++ b/arch/powerpc/configs/mpc512x_defconfig
@@ -47,7 +47,6 @@ CONFIG_MTD_UBI=y
 CONFIG_BLK_DEV_RAM=y
 CONFIG_BLK_DEV_RAM_COUNT=1
 CONFIG_BLK_DEV_RAM_SIZE=8192
-CONFIG_BLK_DEV_RAM_DAX=y
 CONFIG_EEPROM_AT24=y
 CONFIG_EEPROM_AT25=y
 CONFIG_SCSI=y
diff --git a/arch/powerpc/configs/mpc83xx_defconfig b/arch/powerpc/configs/mpc83xx_defconfig
index be125729635c..95d43f8a3869 100644
--- a/arch/powerpc/configs/mpc83xx_defconfig
+++ b/arch/powerpc/configs/mpc83xx_defconfig
@@ -19,7 +19,6 @@ CONFIG_MPC836x_MDS=y
 CONFIG_MPC836x_RDK=y
 CONFIG_MPC837x_MDS=y
 CONFIG_MPC837x_RDB=y
-CONFIG_SBC834x=y
 CONFIG_ASP834x=y
 CONFIG_QE_GPIO=y
 CONFIG_MATH_EMULATION=y
diff --git a/arch/powerpc/configs/mvme5100_defconfig b/arch/powerpc/configs/mvme5100_defconfig
index 3d53d69ed36c..1fed6be95d53 100644
--- a/arch/powerpc/configs/mvme5100_defconfig
+++ b/arch/powerpc/configs/mvme5100_defconfig
@@
[PATCH 1/9] powerpc/configs: Drop old symbols from ppc6xx_defconfig
ppc6xx_defconfig refers to quite a few symbols that no longer exist, as reported by scripts/checkkconfigsymbols.py; remove them. Signed-off-by: Michael Ellerman --- arch/powerpc/configs/ppc6xx_defconfig | 39 --- 1 file changed, 39 deletions(-) diff --git a/arch/powerpc/configs/ppc6xx_defconfig b/arch/powerpc/configs/ppc6xx_defconfig index feb5d47d8d1e..5e6f92ba3210 100644 --- a/arch/powerpc/configs/ppc6xx_defconfig +++ b/arch/powerpc/configs/ppc6xx_defconfig @@ -53,7 +53,6 @@ CONFIG_MPC836x_MDS=y CONFIG_MPC836x_RDK=y CONFIG_MPC837x_MDS=y CONFIG_MPC837x_RDB=y -CONFIG_SBC834x=y CONFIG_ASP834x=y CONFIG_PPC_86xx=y CONFIG_MPC8641_HPCN=y @@ -187,7 +186,6 @@ CONFIG_NETFILTER_XT_MATCH_STRING=m CONFIG_NETFILTER_XT_MATCH_TCPMSS=m CONFIG_NETFILTER_XT_MATCH_TIME=m CONFIG_NETFILTER_XT_MATCH_U32=m -CONFIG_NF_CONNTRACK_IPV4=m CONFIG_IP_NF_IPTABLES=m CONFIG_IP_NF_MATCH_AH=m CONFIG_IP_NF_MATCH_ECN=m @@ -203,7 +201,6 @@ CONFIG_IP_NF_SECURITY=m CONFIG_IP_NF_ARPTABLES=m CONFIG_IP_NF_ARPFILTER=m CONFIG_IP_NF_ARP_MANGLE=m -CONFIG_NF_CONNTRACK_IPV6=m CONFIG_IP6_NF_IPTABLES=m CONFIG_IP6_NF_MATCH_AH=m CONFIG_IP6_NF_MATCH_EUI64=m @@ -241,7 +238,6 @@ CONFIG_BRIDGE_EBT_SNAT=m CONFIG_BRIDGE_EBT_LOG=m CONFIG_BRIDGE_EBT_NFLOG=m CONFIG_IP_DCCP=m -CONFIG_NET_DCCPPROBE=m CONFIG_TIPC=m CONFIG_ATM=m CONFIG_ATM_CLIP=m @@ -251,7 +247,6 @@ CONFIG_BRIDGE=m CONFIG_VLAN_8021Q=m CONFIG_DECNET=m CONFIG_DECNET_ROUTER=y -CONFIG_IPX=m CONFIG_ATALK=m CONFIG_DEV_APPLETALK=m CONFIG_IPDDP=m @@ -297,26 +292,6 @@ CONFIG_NET_ACT_NAT=m CONFIG_NET_ACT_PEDIT=m CONFIG_NET_ACT_SIMP=m CONFIG_NET_ACT_SKBEDIT=m -CONFIG_IRDA=m -CONFIG_IRLAN=m -CONFIG_IRNET=m -CONFIG_IRCOMM=m -CONFIG_IRDA_CACHE_LAST_LSAP=y -CONFIG_IRDA_FAST_RR=y -CONFIG_IRTTY_SIR=m -CONFIG_KINGSUN_DONGLE=m -CONFIG_KSDAZZLE_DONGLE=m -CONFIG_KS959_DONGLE=m -CONFIG_USB_IRDA=m -CONFIG_SIGMATEL_FIR=m -CONFIG_NSC_FIR=m -CONFIG_WINBOND_FIR=m -CONFIG_TOSHIBA_FIR=m -CONFIG_SMC_IRCC_FIR=m -CONFIG_ALI_FIR=m -CONFIG_VLSI_FIR=m -CONFIG_VIA_FIR=m -CONFIG_MCS_FIR=m 
CONFIG_BT=m CONFIG_BT_RFCOMM=m CONFIG_BT_RFCOMM_TTY=y @@ -332,7 +307,6 @@ CONFIG_BT_HCIBFUSB=m CONFIG_BT_HCIDTL1=m CONFIG_BT_HCIBT3C=m CONFIG_BT_HCIBLUECARD=m -CONFIG_BT_HCIBTUART=m CONFIG_BT_HCIVHCI=m CONFIG_CFG80211=m CONFIG_MAC80211=m @@ -366,7 +340,6 @@ CONFIG_EEPROM_93CX6=m CONFIG_RAID_ATTRS=m CONFIG_BLK_DEV_SD=y CONFIG_CHR_DEV_ST=m -CONFIG_CHR_DEV_OSST=m CONFIG_BLK_DEV_SR=m CONFIG_CHR_DEV_SG=y CONFIG_CHR_DEV_SCH=m @@ -663,7 +636,6 @@ CONFIG_I2C_MPC=m CONFIG_I2C_PCA_PLATFORM=m CONFIG_I2C_SIMTEC=m CONFIG_I2C_PARPORT=m -CONFIG_I2C_PARPORT_LIGHT=m CONFIG_I2C_TINY_USB=m CONFIG_I2C_PCA_ISA=m CONFIG_I2C_STUB=m @@ -676,7 +648,6 @@ CONFIG_W1_SLAVE_THERM=m CONFIG_W1_SLAVE_SMEM=m CONFIG_W1_SLAVE_DS2433=m CONFIG_W1_SLAVE_DS2433_CRC=y -CONFIG_W1_SLAVE_DS2760=m CONFIG_APM_POWER=m CONFIG_BATTERY_PMU=m CONFIG_HWMON=m @@ -1065,15 +1036,6 @@ CONFIG_CIFS_UPCALL=y CONFIG_CIFS_XATTR=y CONFIG_CIFS_POSIX=y CONFIG_CIFS_DFS_UPCALL=y -CONFIG_NCP_FS=m -CONFIG_NCPFS_PACKET_SIGNING=y -CONFIG_NCPFS_IOCTL_LOCKING=y -CONFIG_NCPFS_STRONG=y -CONFIG_NCPFS_NFS_NS=y -CONFIG_NCPFS_OS2_NS=y -CONFIG_NCPFS_SMALLDOS=y -CONFIG_NCPFS_NLS=y -CONFIG_NCPFS_EXTRAS=y CONFIG_CODA_FS=m CONFIG_9P_FS=m CONFIG_NLS_DEFAULT="utf8" @@ -1117,7 +1079,6 @@ CONFIG_NLS_KOI8_U=m CONFIG_DEBUG_INFO=y CONFIG_UNUSED_SYMBOLS=y CONFIG_HEADERS_INSTALL=y -CONFIG_HEADERS_CHECK=y CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_KERNEL=y CONFIG_DEBUG_OBJECTS=y -- 2.25.1
[PATCH] powerpc/sstep: Fix incorrect CONFIG symbol in scv handling
When I "fixed" the ppc64e build in Nick's recent patch, I typoed the CONFIG symbol, resulting in one that doesn't exist. Fix it to use the correct symbol. Reported-by: Christophe Leroy Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions") Signed-off-by: Michael Ellerman --- arch/powerpc/lib/sstep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 4194119eff82..c58ea9e787cb 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -3382,7 +3382,7 @@ int emulate_step(struct pt_regs *regs, struct ppc_inst instr) regs->msr = MSR_KERNEL; return 1; -#ifdef CONFIG_PPC64_BOOK3S +#ifdef CONFIG_PPC_BOOK3S_64 case SYSCALL_VECTORED_0:/* scv 0 */ regs->gpr[9] = regs->gpr[13]; regs->gpr[10] = MSR_KERNEL; -- 2.25.1
[PATCH v4 6/6] powerpc: implement smp_cond_load_relaxed
This implements smp_cond_load_relaxed with the slowpath busy loop using the preferred SMT priority pattern. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/barrier.h | 14 ++ 1 file changed, 14 insertions(+) diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index 123adcefd40f..9b4671d38674 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -76,6 +76,20 @@ do { \ ___p1; \ }) +#define smp_cond_load_relaxed(ptr, cond_expr) ({ \ + typeof(ptr) __PTR = (ptr); \ + __unqual_scalar_typeof(*ptr) VAL; \ + VAL = READ_ONCE(*__PTR);\ + if (unlikely(!(cond_expr))) { \ + spin_begin(); \ + do {\ + VAL = READ_ONCE(*__PTR);\ + } while (!(cond_expr)); \ + spin_end(); \ + } \ + (typeof(*ptr))VAL; \ +}) + #ifdef CONFIG_PPC_BOOK3S_64 #define NOSPEC_BARRIER_SLOT nop #elif defined(CONFIG_PPC_FSL_BOOK3E) -- 2.23.0
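For readers outside the kernel tree, the shape of the macro above can be sketched in portable C11. This is an illustrative stand-in only: spin_begin()/spin_end() model powerpc's SMT-priority calls (HMT_low()/HMT_medium()) and are no-ops here, and READ_ONCE() is approximated with a relaxed atomic load.

```c
#include <stdatomic.h>
#include <pthread.h>

/* Stand-ins for powerpc's spin_begin()/spin_end() SMT priority hints;
 * no-ops in this portable sketch. */
static void spin_begin(void) { }
static void spin_end(void) { }

/* Sketch of the smp_cond_load_relaxed() shape from the patch: load once,
 * and only if the condition fails, drop into the low-priority poll loop. */
static int cond_load_until_nonzero(_Atomic int *ptr)
{
    int val = atomic_load_explicit(ptr, memory_order_relaxed);

    if (val == 0) {
        spin_begin();
        do {
            val = atomic_load_explicit(ptr, memory_order_relaxed);
        } while (val == 0);
        spin_end();
    }
    return val;
}

static _Atomic int flag;

static void *setter(void *unused)
{
    (void)unused;
    atomic_store(&flag, 42);
    return NULL;
}

/* Spawn a setter thread and wait for the flag with the sketch above. */
int demo_cond_load(void)
{
    pthread_t t;
    pthread_create(&t, NULL, setter, NULL);
    int v = cond_load_until_nonzero(&flag);
    pthread_join(t, NULL);
    return v;
}
```

Whether the waiter observes the value on the first load or inside the poll loop, it returns once the condition holds.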
[PATCH v4 5/6] powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint
This brings the behaviour of the uncontended fast path back to roughly equivalent to simple spinlocks -- a single atomic op with lock hint. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/atomic.h| 28 arch/powerpc/include/asm/qspinlock.h | 2 +- 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h index 498785ffc25f..f6a3d145ffb7 100644 --- a/arch/powerpc/include/asm/atomic.h +++ b/arch/powerpc/include/asm/atomic.h @@ -193,6 +193,34 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new)) +/* + * Don't want to override the generic atomic_try_cmpxchg_acquire, because + * we add a lock hint to the lwarx, which may not be wanted for the + * _acquire case (and is not used by the other _acquire variants so it + * would be a surprise). + */ +static __always_inline bool +atomic_try_cmpxchg_lock(atomic_t *v, int *old, int new) +{ + int r, o = *old; + + __asm__ __volatile__ ( +"1:\t" PPC_LWARX(%0,0,%2,1) " # atomic_try_cmpxchg_acquire\n" +" cmpw0,%0,%3 \n" +" bne-2f \n" +" stwcx. 
%4,0,%2 \n" +" bne-1b \n" +"\t" PPC_ACQUIRE_BARRIER " \n" +"2:\n" + : "=&r" (r), "+m" (v->counter) + : "r" (&v->counter), "r" (o), "r" (new) + : "cr0", "memory"); + + if (unlikely(r != o)) + *old = r; + return likely(r == o); +} + /** * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index f5066f00a08c..b752d34517b3 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -37,7 +37,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock) { u32 val = 0; - if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) + if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL))) return; queued_spin_lock_slowpath(lock, val); -- 2.23.0
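The try_cmpxchg contract the patch relies on (return true on success; on failure, write the observed value back through the old pointer so the caller can hand it to the slowpath) maps directly onto C11's compare-exchange. A hedged, portable sketch of the fast path, with Q_LOCKED_VAL as an illustrative constant rather than the kernel's _Q_LOCKED_VAL:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define Q_LOCKED_VAL 1u   /* illustrative stand-in for _Q_LOCKED_VAL */

/* Portable analogue of atomic_try_cmpxchg_lock(): one compare-and-swap
 * of *old -> Q_LOCKED_VAL with acquire ordering on success. C11's
 * compare_exchange already stores the observed value into *old on
 * failure, matching the kernel's try_cmpxchg semantics (minus the
 * powerpc-specific lwarx lock hint, which has no portable equivalent). */
static bool try_lock_fastpath(_Atomic unsigned int *lock, unsigned int *old)
{
    return atomic_compare_exchange_strong_explicit(
            lock, old, Q_LOCKED_VAL,
            memory_order_acquire, memory_order_relaxed);
}
```

On success the caller holds the lock; on failure *old now carries the contended value that queued_spin_lock_slowpath() receives.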
[PATCH v4 4/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
This implements the generic paravirt qspinlocks using H_PROD and H_CONFER to kick and wait. This uses an un-directed yield to any CPU rather than the directed yield to a pre-empted lock holder that paravirtualised simple spinlocks use, that requires no kick hcall. This is something that could be investigated and improved in future. Performance results can be found in the commit which added queued spinlocks. Acked-by: Peter Zijlstra (Intel) Acked-by: Waiman Long Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 28 arch/powerpc/include/asm/qspinlock.h | 66 +++ arch/powerpc/include/asm/qspinlock_paravirt.h | 7 ++ arch/powerpc/include/asm/spinlock.h | 4 ++ arch/powerpc/platforms/pseries/Kconfig| 9 ++- arch/powerpc/platforms/pseries/setup.c| 4 +- include/asm-generic/qspinlock.h | 2 + 7 files changed, 118 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 339e8533464b..21e5f29ca251 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -28,6 +28,16 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); } + +static inline void prod_cpu(int cpu) +{ + plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); +} + +static inline void yield_to_any(void) +{ + plpar_hcall_norets(H_CONFER, -1, 0); +} #else static inline bool is_shared_processor(void) { @@ -44,6 +54,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { ___bad_yield_to_preempted(); /* This would be a bug */ } + +extern void ___bad_yield_to_any(void); +static inline void yield_to_any(void) +{ + ___bad_yield_to_any(); /* This would be a bug */ +} + +extern void ___bad_prod_cpu(void); +static inline void prod_cpu(int cpu) +{ + ___bad_prod_cpu(); /* This would be a bug */ +} + #endif #define vcpu_is_preempted 
vcpu_is_preempted @@ -56,4 +79,9 @@ static inline bool vcpu_is_preempted(int cpu) return false; } +static inline bool pv_is_native_spin_unlock(void) +{ + return !is_shared_processor(); +} + #endif /* _ASM_POWERPC_PARAVIRT_H */ diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index c49e33e24edd..f5066f00a08c 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -3,9 +3,47 @@ #define _ASM_POWERPC_QSPINLOCK_H #include +#include #define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_unlock(struct qspinlock *lock); + +static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + if (!is_shared_processor()) + native_queued_spin_lock_slowpath(lock, val); + else + __pv_queued_spin_lock_slowpath(lock, val); +} + +#define queued_spin_unlock queued_spin_unlock +static inline void queued_spin_unlock(struct qspinlock *lock) +{ + if (!is_shared_processor()) + smp_store_release(&lock->locked, 0); + else + __pv_queued_spin_unlock(lock); +} + +#else +extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +#endif + +static __always_inline void queued_spin_lock(struct qspinlock *lock) +{ + u32 val = 0; + + if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) + return; + + queued_spin_lock_slowpath(lock, val); +} +#define queued_spin_lock queued_spin_lock + #define smp_mb__after_spinlock() smp_mb() static __always_inline int queued_spin_is_locked(struct qspinlock *lock) @@ -20,6 +58,34 @@ static __always_inline int queued_spin_is_locked(struct qspinlock *lock) } #define queued_spin_is_locked queued_spin_is_locked +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define SPIN_THRESHOLD (1<<15) /* not tuned */ + +static __always_inline void 
pv_wait(u8 *ptr, u8 val) +{ + if (*ptr != val) + return; + yield_to_any(); + /* +* We could pass in a CPU here if waiting in the queue and yield to +* the previous CPU in the queue. +*/ +} + +static __always_inline void pv_kick(int cpu) +{ + prod_cpu(cpu); +} + +extern void __pv_init_lock_hash(void); + +static inline void pv_spinlocks_init(void) +{ + __pv_init_lock_hash(); +} + +#endif + #include #endif /* _ASM_POWERPC_QSPINLOCK_H */ diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h new file mode 100644 index ..6b60e7736a47 --- /dev/null +++
[PATCH v4 3/6] powerpc/64s: implement queued spinlocks and rwlocks
These have shown significantly improved performance and fairness when spinlock contention is moderate to high on very large systems. With this series including subsequent patches, on a 16 socket 1536 thread POWER9, a stress test such as same-file open/close from all CPUs gets big speedups, 11620op/s aggregate with simple spinlocks vs 384158op/s (33x faster), where the difference in throughput between the fastest and slowest thread goes from 7x to 1.4x. Thanks to the fast path being identical in terms of atomics and barriers (after a subsequent optimisation patch), single threaded performance is not changed (no measurable difference). On smaller systems, performance and fairness seems to be generally improved. Using dbench on tmpfs as a test (that starts to run into kernel spinlock contention), a 2-socket OpenPOWER POWER9 system was tested with bare metal and KVM guest configurations. Results can be found here: https://github.com/linuxppc/issues/issues/305#issuecomment-663487453 Observations are: - Queued spinlocks are equal when contention is insignificant, as expected and as measured with microbenchmarks. - When there is contention, on bare metal queued spinlocks have better throughput and max latency at all points. - When virtualised, queued spinlocks are slightly worse approaching peak throughput, but significantly better throughput and max latency at all points beyond peak, until queued spinlock maximum latency rises when clients are 2x vCPUs. The regressions haven't been analysed very well yet, there are a lot of things that can be tuned, particularly the paravirtualised locking, but the numbers already look like a good net win even on relatively small systems. 
Acked-by: Peter Zijlstra (Intel) Signed-off-by: Nicholas Piggin --- arch/powerpc/Kconfig | 15 ++ arch/powerpc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/qspinlock.h | 25 +++ arch/powerpc/include/asm/spinlock.h | 5 + arch/powerpc/include/asm/spinlock_types.h | 5 + arch/powerpc/lib/Makefile | 3 +++ include/asm-generic/qspinlock.h | 2 ++ 7 files changed, 56 insertions(+) create mode 100644 arch/powerpc/include/asm/qspinlock.h diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 9fa23eb320ff..641946052d67 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -145,6 +145,8 @@ config PPC select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if PPC64 + select ARCH_USE_QUEUED_RWLOCKS if PPC_QUEUED_SPINLOCKS + select ARCH_USE_QUEUED_SPINLOCKS if PPC_QUEUED_SPINLOCKS select ARCH_WANT_IPC_PARSE_VERSION select ARCH_WEAK_RELEASE_ACQUIRE select BINFMT_ELF @@ -490,6 +492,19 @@ config HOTPLUG_CPU Say N if you are unsure. +config PPC_QUEUED_SPINLOCKS + bool "Queued spinlocks" + depends on SMP + help + Say Y here to use queued spinlocks which give better + scalability and fairness on large SMP and NUMA systems without + harming single threaded performance. + + This option is currently experimental, the code is more complex + and less tested so it defaults to "N" for the moment. + + If unsure, say "N". 
+ config ARCH_CPU_PROBE_RELEASE def_bool y depends on HOTPLUG_CPU diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild index dadbcf3a0b1e..27c2268dfd6c 100644 --- a/arch/powerpc/include/asm/Kbuild +++ b/arch/powerpc/include/asm/Kbuild @@ -6,5 +6,6 @@ generated-y += syscall_table_spu.h generic-y += export.h generic-y += local64.h generic-y += mcs_spinlock.h +generic-y += qrwlock.h generic-y += vtime.h generic-y += early_ioremap.h diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h new file mode 100644 index ..c49e33e24edd --- /dev/null +++ b/arch/powerpc/include/asm/qspinlock.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_QSPINLOCK_H +#define _ASM_POWERPC_QSPINLOCK_H + +#include + +#define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ + +#define smp_mb__after_spinlock() smp_mb() + +static __always_inline int queued_spin_is_locked(struct qspinlock *lock) +{ + /* +* This barrier was added to simple spinlocks by commit 51d7d5205d338, +* but it should now be possible to remove it, as arm64 has done with +* commit c6f5d02b6a0f. +*/ + smp_mb(); + return atomic_read(&lock->val); +} +#define queued_spin_is_locked queued_spin_is_locked + +#include + +#endif /* _ASM_POWERPC_QSPINLOCK_H */ diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 21357fe05fe0..434615f1d761 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -3,7 +3,12 @@
[PATCH v4 2/6] powerpc: move spinlock implementation to simple_spinlock
To prepare for queued spinlocks. This is a simple rename except to update preprocessor guard name and a file reference. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/simple_spinlock.h| 288 ++ .../include/asm/simple_spinlock_types.h | 21 ++ arch/powerpc/include/asm/spinlock.h | 285 + arch/powerpc/include/asm/spinlock_types.h | 12 +- 4 files changed, 311 insertions(+), 295 deletions(-) create mode 100644 arch/powerpc/include/asm/simple_spinlock.h create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h diff --git a/arch/powerpc/include/asm/simple_spinlock.h b/arch/powerpc/include/asm/simple_spinlock.h new file mode 100644 index ..fe6cff7df48e --- /dev/null +++ b/arch/powerpc/include/asm/simple_spinlock.h @@ -0,0 +1,288 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _ASM_POWERPC_SIMPLE_SPINLOCK_H +#define _ASM_POWERPC_SIMPLE_SPINLOCK_H + +/* + * Simple spin lock operations. + * + * Copyright (C) 2001-2004 Paul Mackerras , IBM + * Copyright (C) 2001 Anton Blanchard , IBM + * Copyright (C) 2002 Dave Engebretsen , IBM + * Rework to support virtual processors + * + * Type of int is used as a full 64b word is not necessary. + * + * (the type definitions are in asm/simple_spinlock_types.h) + */ +#include +#include +#include +#include +#include + +#ifdef CONFIG_PPC64 +/* use 0x80yy when locked, where yy == CPU number */ +#ifdef __BIG_ENDIAN__ +#define LOCK_TOKEN (*(u32 *)(&get_paca()->lock_token)) +#else +#define LOCK_TOKEN (*(u32 *)(&get_paca()->paca_index)) +#endif +#else +#define LOCK_TOKEN 1 +#endif + +static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) +{ + return lock.slock == 0; +} + +static inline int arch_spin_is_locked(arch_spinlock_t *lock) +{ + smp_mb(); + return !arch_spin_value_unlocked(*lock); +} + +/* + * This returns the old value in the lock, so we succeeded + * in getting the lock if the return value is 0. 
+ */ +static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock) +{ + unsigned long tmp, token; + + token = LOCK_TOKEN; + __asm__ __volatile__( +"1:" PPC_LWARX(%0,0,%2,1) "\n\ + cmpwi 0,%0,0\n\ + bne-2f\n\ + stwcx. %1,0,%2\n\ + bne-1b\n" + PPC_ACQUIRE_BARRIER +"2:" + : "=&r" (tmp) + : "r" (token), "r" (&lock->slock) + : "cr0", "memory"); + + return tmp; +} + +static inline int arch_spin_trylock(arch_spinlock_t *lock) +{ + return __arch_spin_trylock(lock) == 0; +} + +/* + * On a system with shared processors (that is, where a physical + * processor is multiplexed between several virtual processors), + * there is no point spinning on a lock if the holder of the lock + * isn't currently scheduled on a physical processor. Instead + * we detect this situation and ask the hypervisor to give the + * rest of our timeslice to the lock holder. + * + * So that we can tell which virtual processor is holding a lock, + * we put 0x8000 | smp_processor_id() in the lock when it is + * held. Conveniently, we have a word in the paca that holds this + * value. 
+ */ + +#if defined(CONFIG_PPC_SPLPAR) +/* We only yield to the hypervisor if we are in shared processor mode */ +void splpar_spin_yield(arch_spinlock_t *lock); +void splpar_rw_yield(arch_rwlock_t *lock); +#else /* SPLPAR */ +static inline void splpar_spin_yield(arch_spinlock_t *lock) {}; +static inline void splpar_rw_yield(arch_rwlock_t *lock) {}; +#endif + +static inline void spin_yield(arch_spinlock_t *lock) +{ + if (is_shared_processor()) + splpar_spin_yield(lock); + else + barrier(); +} + +static inline void rw_yield(arch_rwlock_t *lock) +{ + if (is_shared_processor()) + splpar_rw_yield(lock); + else + barrier(); +} + +static inline void arch_spin_lock(arch_spinlock_t *lock) +{ + while (1) { + if (likely(__arch_spin_trylock(lock) == 0)) + break; + do { + HMT_low(); + if (is_shared_processor()) + splpar_spin_yield(lock); + } while (unlikely(lock->slock != 0)); + HMT_medium(); + } +} + +static inline +void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags) +{ + unsigned long flags_dis; + + while (1) { + if (likely(__arch_spin_trylock(lock) == 0)) + break; + local_save_flags(flags_dis); + local_irq_restore(flags); + do { + HMT_low(); + if (is_shared_processor()) + splpar_spin_yield(lock); + } while (unlikely(lock->slock != 0)); + HMT_medium(); + local_irq_restore(flags_dis); +
[PATCH v4 1/6] powerpc/pseries: move some PAPR paravirt functions to their own file
These functions will be used by queued spinlock implementation, and may be useful elsewhere too, so move them out of spinlock.h. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 59 + arch/powerpc/include/asm/spinlock.h | 24 +--- arch/powerpc/lib/locks.c| 12 +++--- 3 files changed, 66 insertions(+), 29 deletions(-) create mode 100644 arch/powerpc/include/asm/paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h new file mode 100644 index ..339e8533464b --- /dev/null +++ b/arch/powerpc/include/asm/paravirt.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _ASM_POWERPC_PARAVIRT_H +#define _ASM_POWERPC_PARAVIRT_H + +#include +#include +#ifdef CONFIG_PPC64 +#include +#include +#endif + +#ifdef CONFIG_PPC_SPLPAR +DECLARE_STATIC_KEY_FALSE(shared_processor); + +static inline bool is_shared_processor(void) +{ + return static_branch_unlikely(&shared_processor); +} + +/* If bit 0 is set, the cpu has been preempted */ +static inline u32 yield_count_of(int cpu) +{ + __be32 yield_count = READ_ONCE(lppaca_of(cpu).yield_count); + return be32_to_cpu(yield_count); +} + +static inline void yield_to_preempted(int cpu, u32 yield_count) +{ + plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); +} +#else +static inline bool is_shared_processor(void) +{ + return false; +} + +static inline u32 yield_count_of(int cpu) +{ + return 0; +} + +extern void ___bad_yield_to_preempted(void); +static inline void yield_to_preempted(int cpu, u32 yield_count) +{ + ___bad_yield_to_preempted(); /* This would be a bug */ +} +#endif + +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + if (!is_shared_processor()) + return false; + if (yield_count_of(cpu) & 1) + return true; + return false; +} + +#endif /* _ASM_POWERPC_PARAVIRT_H */ diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 
2d620896cdae..79be9bb10bbb 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -15,11 +15,10 @@ * * (the type definitions are in asm/spinlock_types.h) */ -#include #include +#include #ifdef CONFIG_PPC64 #include -#include #endif #include #include @@ -35,18 +34,6 @@ #define LOCK_TOKEN 1 #endif -#ifdef CONFIG_PPC_PSERIES -DECLARE_STATIC_KEY_FALSE(shared_processor); - -#define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) -{ - if (!static_branch_unlikely(&shared_processor)) - return false; - return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1); -} -#endif - static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) { return lock.slock == 0; @@ -110,15 +97,6 @@ static inline void splpar_spin_yield(arch_spinlock_t *lock) {}; static inline void splpar_rw_yield(arch_rwlock_t *lock) {}; #endif -static inline bool is_shared_processor(void) -{ -#ifdef CONFIG_PPC_SPLPAR - return static_branch_unlikely(&shared_processor); -#else - return false; -#endif -} - static inline void spin_yield(arch_spinlock_t *lock) { if (is_shared_processor()) diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c index 6440d5943c00..04165b7a163f 100644 --- a/arch/powerpc/lib/locks.c +++ b/arch/powerpc/lib/locks.c @@ -27,14 +27,14 @@ void splpar_spin_yield(arch_spinlock_t *lock) return; holder_cpu = lock_value & 0x; BUG_ON(holder_cpu >= NR_CPUS); - yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); + + yield_count = yield_count_of(holder_cpu); if ((yield_count & 1) == 0) return; /* virtual cpu is currently running */ rmb(); if (lock->slock != lock_value) return; /* something has changed */ - plpar_hcall_norets(H_CONFER, - get_hard_smp_processor_id(holder_cpu), yield_count); + yield_to_preempted(holder_cpu, yield_count); } EXPORT_SYMBOL_GPL(splpar_spin_yield); @@ -53,13 +53,13 @@ void splpar_rw_yield(arch_rwlock_t *rw) return; /* no write lock at present */ holder_cpu = lock_value & 
0x; BUG_ON(holder_cpu >= NR_CPUS); - yield_count = be32_to_cpu(lppaca_of(holder_cpu).yield_count); + + yield_count = yield_count_of(holder_cpu); if ((yield_count & 1) == 0) return; /* virtual cpu is currently running */ rmb(); if (rw->lock != lock_value) return; /* something has changed */ - plpar_hcall_norets(H_CONFER, - get_hard_smp_processor_id(holder_cpu), yield_count); + yield_to_preempted(holder_cpu, yield_count); } #endif -- 2.23.0
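The yield-count convention the moved helpers encode can be shown in isolation. Under PAPR, the hypervisor bumps a per-vCPU dispatch counter such that an odd value means the vCPU is currently preempted; the sketch below models only that parity test (the real yield_count_of() reads the counter big-endian from the lppaca):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the vcpu_is_preempted() parity check: bit 0 of the PAPR
 * yield count is set while the virtual CPU is preempted, clear while
 * it is dispatched on a physical CPU. */
static bool preempted_from_yield_count(uint32_t yield_count)
{
    return (yield_count & 1) != 0;
}
```

This is also why yield_to_preempted() passes the sampled count to H_CONFER: the hypervisor can discard the yield if the counter has moved on, i.e. if the holder has since been dispatched.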
[PATCH v4 0/6] powerpc: queued spinlocks and rwlocks
Updated with everybody's feedback (thanks all), and more performance results. What I've found is I might have been measuring the worst load point for the paravirt case, and by looking at a range of loads it's clear that queued spinlocks are overall better even on PV, doubly so when you look at the generally much improved worst case latencies. I have defaulted it to N even though I'm less concerned about the PV numbers now, just because I think it needs more stress testing. But it's very nicely selectable so should be low risk to include. All in all this is a very cool technology and great results especially on the big systems but even on smaller ones there are nice gains. Thanks Waiman and everyone who developed it. Thanks, Nick Nicholas Piggin (6): powerpc/pseries: move some PAPR paravirt functions to their own file powerpc: move spinlock implementation to simple_spinlock powerpc/64s: implement queued spinlocks and rwlocks powerpc/pseries: implement paravirt qspinlocks for SPLPAR powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint powerpc: implement smp_cond_load_relaxed arch/powerpc/Kconfig | 15 + arch/powerpc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/atomic.h | 28 ++ arch/powerpc/include/asm/barrier.h| 14 + arch/powerpc/include/asm/paravirt.h | 87 + arch/powerpc/include/asm/qspinlock.h | 91 ++ arch/powerpc/include/asm/qspinlock_paravirt.h | 7 + arch/powerpc/include/asm/simple_spinlock.h| 288 .../include/asm/simple_spinlock_types.h | 21 ++ arch/powerpc/include/asm/spinlock.h | 308 +- arch/powerpc/include/asm/spinlock_types.h | 17 +- arch/powerpc/lib/Makefile | 3 + arch/powerpc/lib/locks.c | 12 +- arch/powerpc/platforms/pseries/Kconfig| 9 +- arch/powerpc/platforms/pseries/setup.c| 4 +- include/asm-generic/qspinlock.h | 4 + 16 files changed, 588 insertions(+), 321 deletions(-) create mode 100644 arch/powerpc/include/asm/paravirt.h create mode 100644 arch/powerpc/include/asm/qspinlock.h create mode 100644 
arch/powerpc/include/asm/qspinlock_paravirt.h create mode 100644 arch/powerpc/include/asm/simple_spinlock.h create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h -- 2.23.0
Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
Hi Athira, +/* Function to return the extended register values */ +static u64 get_ext_regs_value(int idx) +{ + switch (idx) { + case PERF_REG_POWERPC_MMCR0: + return mfspr(SPRN_MMCR0); + case PERF_REG_POWERPC_MMCR1: + return mfspr(SPRN_MMCR1); + case PERF_REG_POWERPC_MMCR2: + return mfspr(SPRN_MMCR2); + default: return 0; + } +} + u64 perf_reg_value(struct pt_regs *regs, int idx) { - if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX)) - return 0; + u64 PERF_REG_EXTENDED_MAX; PERF_REG_EXTENDED_MAX should be initialized. Otherwise ... + + if (cpu_has_feature(CPU_FTR_ARCH_300)) + PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300; if (idx == PERF_REG_POWERPC_SIER && (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) || @@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx) IS_ENABLED(CONFIG_PPC32))) return 0; + if (idx >= PERF_REG_POWERPC_MAX && idx < PERF_REG_EXTENDED_MAX) + return get_ext_regs_value(idx); On a non-P9/P10 machine, PERF_REG_EXTENDED_MAX may contain a random value, which will allow a user to pass this if condition unintentionally. Nit: PERF_REG_EXTENDED_MAX is a local variable so it should be in lowercase. Any specific reason to define it in capitals? Ravi
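A minimal model of the fix being requested: initialise the bound so the extended-register range is empty unless the CPU feature is present. The enum values and the returned constants here are illustrative stand-ins, not the kernel's actual register numbering:

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    PERF_REG_POWERPC_MAX = 45,   /* illustrative, not the kernel's value */
    PERF_REG_MAX_ISA_300 = 48,   /* illustrative */
};

/* Sketch of perf_reg_value() with the reviewer's fix applied: default
 * ext_max to PERF_REG_POWERPC_MAX so "idx >= MAX && idx < ext_max" can
 * never match on CPUs without ISA 3.00, instead of reading an
 * uninitialised local. */
static uint64_t reg_value(int idx, bool has_arch_300)
{
    uint64_t ext_max = PERF_REG_POWERPC_MAX;   /* safe default: empty range */

    if (has_arch_300)
        ext_max = PERF_REG_MAX_ISA_300;

    if (idx >= PERF_REG_POWERPC_MAX && (uint64_t)idx < ext_max)
        return 0x1234;   /* stand-in for get_ext_regs_value(idx) */

    return 0;            /* stand-in for the regular pt_regs path */
}
```

With the uninitialised local, the extended-index branch could be taken (or skipped) at random on older CPUs; with the default it is deterministic.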
Re: [PATCHv3 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents
On Thu, Jul 23, 2020 at 9:27 PM Nathan Lynch wrote: > > Pingfan Liu writes: > > A bug is observed on pseries by taking the following steps on rhel: > > -1. drmgr -c mem -r -q 5 > > -2. echo c > /proc/sysrq-trigger > > > > And then, the failure looks like: > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/ > > kdump: saving vmcore-dmesg.txt > > kdump: saving vmcore-dmesg.txt complete > > kdump: saving vmcore > > Checking for memory holes : [ 0.0 %] / > >Checking for memory holes : [100.0 %] | > > Excluding unnecessary pages : [100.0 %] > > \ Copying data : [ > > 0.3 %] - eta: 38s[ 44.337636] hash-mmu: mm: Hashing failure ! > > EA=0x7fffba40 access=0x8004 current=makedumpfile > > [ 44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > > psize 2 pte=0xc0005504 > > [ 44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 > > access=0x8004 current=makedumpfile > > [ 44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 > > psize 2 pte=0xc0005504 > > [ 44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 > > nip 7fffbbc4d7fc lr 00011356ca3c code 2 > > [ 44.338548] Core dump to |/bin/false pipe failed > > /lib/kdump-lib-initramfs.sh: line 98: 469 Bus error > > $CORE_COLLECTOR /proc/vmcore > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete > > kdump: saving vmcore failed > > > > * Root cause * > > After analyzing, it turns out that in the current implementation, > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt updating > > as > > the code __remove_memory() comes before drmem_update_dt(). > > So in kdump kernel, when read_from_oldmem() resorts to > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, as it > > can be observed "Bus error" > > > > From a viewpoint of listener and publisher, the publisher notifies the > > listener before data is ready. 
This introduces a problem where udev
> > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > updating. And in the capture kernel, makedumpfile will access memory
> > based on the stale dt info, and hit a SIGBUS error due to a nonexistent
> > lmb.
> >
> > * Fix *
> > In order to fix this issue, update the dt before __remove_memory(), and
> > accordingly apply the same rule in the hot-add path.
> >
> > This will introduce extra dt updating payload for each involved lmb when
> > hotplugging. But it should be fine since drmem_update_dt() is a
> > memory-based operation and hotplug is not a hot path.
>
> This is great analysis but the performance implications of the change
> are grave. The add/remove paths here are already O(n) where n is the
> quantity of memory assigned to the LP, this change would make it O(n^2):
>
>   dlpar_memory_add_by_count
>     for_each_drmem_lmb          <--
>       dlpar_add_lmb
>         drmem_update_dt(_v1|_v2)
>           for_each_drmem_lmb    <--
>
> Memory add/remove isn't a hot path but quadratic runtime complexity
> isn't acceptable. Its current performance is bad enough that I have

Yes, the quadratic runtime complexity sounds terrible. And I am curious
about the bug: does the system have thousands of LMBs?

> internal bugs open on it.
>
> Not to mention we leak memory every time drmem_update_dt is called
> because we can't safely free device tree properties :-(

Do you know what blocks us from freeing them?

>
> Also note that this sort of reverts (fixes?) 063b8b1251fd
> ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> request").

Yes. I think I need to come up with another method to fix it.

Thanks,
Pingfan
Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
On Fri, Jul 24, 2020 at 09:52:14PM +1000, Michael Ellerman wrote: > Bharata B Rao writes: > > On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote: > >> Bharata B Rao writes: > >> > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote: > >> >> Nathan Lynch writes: > >> >> > "Aneesh Kumar K.V" writes: > >> >> >> This is the next version of the fixes for memory unplug on radix. > >> >> >> The issues and the fix are described in the actual patches. > >> >> > > >> >> > I guess this isn't actually causing problems at runtime right now, > >> >> > but I > >> >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and > >> >> > arch_remove_memory(), which ought to be mmu-agnostic: > >> >> > > >> >> > int __ref arch_add_memory(int nid, u64 start, u64 size, > >> >> > struct mhp_params *params) > >> >> > { > >> >> > unsigned long start_pfn = start >> PAGE_SHIFT; > >> >> > unsigned long nr_pages = size >> PAGE_SHIFT; > >> >> > int rc; > >> >> > > >> >> > resize_hpt_for_hotplug(memblock_phys_mem_size()); > >> >> > > >> >> > start = (unsigned long)__va(start); > >> >> > rc = create_section_mapping(start, start + size, nid, > >> >> > params->pgprot); > >> >> > ... > >> >> > >> >> Hmm well spotted. > >> >> > >> >> That does return early if the ops are not setup: > >> >> > >> >> int resize_hpt_for_hotplug(unsigned long new_mem_size) > >> >> { > >> >> unsigned target_hpt_shift; > >> >> > >> >> if (!mmu_hash_ops.resize_hpt) > >> >> return 0; > >> >> > >> >> > >> >> And: > >> >> > >> >> void __init hpte_init_pseries(void) > >> >> { > >> >> ... > >> >> if (firmware_has_feature(FW_FEATURE_HPT_RESIZE)) > >> >> mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt; > >> >> > >> >> And that comes in via ibm,hypertas-functions: > >> >> > >> >> {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"}, > >> >> > >> >> > >> >> But firmware is not necessarily going to add/remove that call based on > >> >> whether we're using hash/radix. 
> >> > > >> > Correct but hpte_init_pseries() will not be called for radix guests. > >> > >> Yeah, duh. You'd think the function name would have been a sufficient > >> clue for me :) > >> > >> >> So I think a follow-up patch is needed to make this more robust. > >> >> > >> >> Aneesh/Bharata what platform did you test this series on? I'm curious > >> >> how this didn't break. > >> > > >> > I have tested memory hotplug/unplug for radix guest on zz platform and > >> > sanity-tested this for hash guest on P8. > >> > > >> > As noted above, mmu_hash_ops.resize_hpt will not be set for radix > >> > guest and hence we won't see any breakage. > >> > >> OK. > >> > >> That's probably fine as it is then. Or maybe just a comment in > >> resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if > >> we're using radix. > > > > Or we could move these calls to hpt-only routines like below? > > That looks like it would be equivalent, and would nicely isolate those > calls in hash specific code. So yeah I think that's worth sending as a > proper patch, even better if you can test it. Sure I will send it as a proper patch. I did test minimal hotplug/unplug for hash guest with that patch, will do more extensive test and resend. > > > David - Do you remember if there was any particular reason to have > > these two hpt-resize calls within powerpc-generic memory hotplug code? > > I think the HPT resizing was developed before or concurrently with the > radix support, so I would guess it was just not something we thought > about at the time. Right. Regards, Bharata.
Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
Bharata B Rao writes: > On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote: >> Bharata B Rao writes: >> > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote: >> >> Nathan Lynch writes: >> >> > "Aneesh Kumar K.V" writes: >> >> >> This is the next version of the fixes for memory unplug on radix. >> >> >> The issues and the fix are described in the actual patches. >> >> > >> >> > I guess this isn't actually causing problems at runtime right now, but I >> >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and >> >> > arch_remove_memory(), which ought to be mmu-agnostic: >> >> > >> >> > int __ref arch_add_memory(int nid, u64 start, u64 size, >> >> > struct mhp_params *params) >> >> > { >> >> > unsigned long start_pfn = start >> PAGE_SHIFT; >> >> > unsigned long nr_pages = size >> PAGE_SHIFT; >> >> > int rc; >> >> > >> >> > resize_hpt_for_hotplug(memblock_phys_mem_size()); >> >> > >> >> > start = (unsigned long)__va(start); >> >> > rc = create_section_mapping(start, start + size, nid, >> >> > params->pgprot); >> >> > ... >> >> >> >> Hmm well spotted. >> >> >> >> That does return early if the ops are not setup: >> >> >> >> int resize_hpt_for_hotplug(unsigned long new_mem_size) >> >> { >> >> unsigned target_hpt_shift; >> >> >> >> if (!mmu_hash_ops.resize_hpt) >> >> return 0; >> >> >> >> >> >> And: >> >> >> >> void __init hpte_init_pseries(void) >> >> { >> >> ... >> >> if (firmware_has_feature(FW_FEATURE_HPT_RESIZE)) >> >> mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt; >> >> >> >> And that comes in via ibm,hypertas-functions: >> >> >> >> {FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize"}, >> >> >> >> >> >> But firmware is not necessarily going to add/remove that call based on >> >> whether we're using hash/radix. >> > >> > Correct but hpte_init_pseries() will not be called for radix guests. >> >> Yeah, duh. 
You'd think the function name would have been a sufficient >> clue for me :) >> >> >> So I think a follow-up patch is needed to make this more robust. >> >> >> >> Aneesh/Bharata what platform did you test this series on? I'm curious >> >> how this didn't break. >> > >> > I have tested memory hotplug/unplug for radix guest on zz platform and >> > sanity-tested this for hash guest on P8. >> > >> > As noted above, mmu_hash_ops.resize_hpt will not be set for radix >> > guest and hence we won't see any breakage. >> >> OK. >> >> That's probably fine as it is then. Or maybe just a comment in >> resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if >> we're using radix. > > Or we could move these calls to hpt-only routines like below? That looks like it would be equivalent, and would nicely isolate those calls in hash specific code. So yeah I think that's worth sending as a proper patch, even better if you can test it. > David - Do you remember if there was any particular reason to have > these two hpt-resize calls within powerpc-generic memory hotplug code? I think the HPT resizing was developed before or concurrently with the radix support, so I would guess it was just not something we thought about at the time. 
cheers > diff --git a/arch/powerpc/include/asm/sparsemem.h > b/arch/powerpc/include/asm/sparsemem.h > index c89b32443cff..1e6fa371cc38 100644 > --- a/arch/powerpc/include/asm/sparsemem.h > +++ b/arch/powerpc/include/asm/sparsemem.h > @@ -17,12 +17,6 @@ extern int create_section_mapping(unsigned long start, > unsigned long end, > int nid, pgprot_t prot); > extern int remove_section_mapping(unsigned long start, unsigned long end); > > -#ifdef CONFIG_PPC_BOOK3S_64 > -extern int resize_hpt_for_hotplug(unsigned long new_mem_size); > -#else > -static inline int resize_hpt_for_hotplug(unsigned long new_mem_size) { > return 0; } > -#endif > - > #ifdef CONFIG_NUMA > extern int hot_add_scn_to_nid(unsigned long scn_addr); > #else > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c > b/arch/powerpc/mm/book3s64/hash_utils.c > index eec6f4e5e481..5daf53ec7600 100644 > --- a/arch/powerpc/mm/book3s64/hash_utils.c > +++ b/arch/powerpc/mm/book3s64/hash_utils.c > @@ -787,7 +787,7 @@ static unsigned long __init htab_get_table_size(void) > } > > #ifdef CONFIG_MEMORY_HOTPLUG > -int resize_hpt_for_hotplug(unsigned long new_mem_size) > +static int resize_hpt_for_hotplug(unsigned long new_mem_size) > { > unsigned target_hpt_shift; > > @@ -821,6 +821,8 @@ int hash__create_section_mapping(unsigned long start, > unsigned long end, > return -1; > } > > + resize_hpt_for_hotplug(memblock_phys_mem_size()); > + > rc = htab_bolt_mapping(start, end, __pa(start), > pgprot_val(prot), mmu_linear_psize, > mmu_kernel_ssize); > @@ -838,6 +840,10 @@ int hash__remove_section_mappi
RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
Ram Pai writes: > On Wed, Jul 22, 2020 at 12:06:06PM +1000, Michael Ellerman wrote: >> Ram Pai writes: >> > An instruction accessing a mmio address, generates a HDSI fault. This >> > fault is >> > appropriately handled by the Hypervisor. However in the case of >> > secureVMs, the >> > fault is delivered to the ultravisor. >> > >> > Unfortunately the Ultravisor has no correct-way to fetch the faulting >> > instruction. The PEF architecture does not allow Ultravisor to enable MMU >> > translation. Walking the two level page table to read the instruction can >> > race >> > with other vcpus modifying the SVM's process scoped page table. >> >> You're trying to read the guest's kernel text IIUC, that mapping should >> be stable. Possibly permissions on it could change over time, but the >> virtual -> real mapping should not. > > Actually the code does not capture the address of the instruction in the > sprg0 register. It captures the instruction itself. So should the mapping > matter? >> >> > This problem can be correctly solved with some help from the kernel. >> > >> > Capture the faulting instruction in SPRG0 register, before executing the >> > faulting instruction. This enables the ultravisor to easily procure the >> > faulting instruction and emulate it. >> >> This is not something I'm going to merge. Sorry. > > Ok. Will consider other approaches. To elaborate ... You've basically invented a custom ucall ABI. But a really strange one which takes an instruction as its first parameter in SPRG0, and then subsequent parameters in any GPR depending on what the instruction was. The UV should either emulate the instruction, which means the guest should not be expected to do anything other than execute the instruction. Or it should be done with a proper ucall that the guest explicitly makes with a well defined ABI. cheers
Re: [v3 13/15] tools/perf: Add perf tools support for extended register capability in powerpc
Hi Athira, On 7/17/20 8:08 PM, Athira Rajeev wrote: From: Anju T Sudhakar Add extended regs to sample_reg_mask in the tool side to use with `-I?` option. Perf tools side uses extended mask to display the platform supported register names (with -I? option) to the user and also send this mask to the kernel to capture the extended registers in each sample. Hence decide the mask value based on the processor version. Currently definitions for `mfspr`, `SPRN_PVR` are part of `arch/powerpc/util/header.c`. Move this to a header file so that these definitions can be re-used in other source files as well. It seems this patch has a regression. Without this patch: $ sudo ./perf record -I ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.458 MB perf.data (318 samples) ] With this patch: $ sudo ./perf record -I Error: dummy:HG: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat' Ravi
[PATCH v2] powerpc/numa: Limit possible nodes to within num_possible_nodes
MAX_NUMNODES is the theoretical maximum number of nodes that is supported by the kernel. Device tree properties expose the number of possible nodes on the current platform. The kernel detects this and uses it for most of its resource allocations. If the platform now increases the nodes beyond what was already exposed, it may lead to inconsistencies. Hence limit it to the already exposed nodes.

Suggested-by: Nathan Lynch
Cc: linuxppc-dev
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Nathan Lynch
Cc: Tyrel Datwyler
Signed-off-by: Srikar Dronamraju

Changelog v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20200715120534.3673-1-sri...@linux.vnet.ibm.com/t/#u
Use nr_node_ids instead of num_possible_nodes(). When nodes are sparse, as on PowerNV, nr_node_ids gets the right value, unlike num_possible_nodes().

Signed-off-by: Srikar Dronamraju
---
 arch/powerpc/mm/numa.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index e437a9ac4956..383359272270 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,7 +221,7 @@ static void initialize_distance_lookup_table(int nid,
 	}
 }
 
-/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
+/* Returns nid in the range [0..nr_node_ids], or -1 if no useful numa
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
@@ -235,7 +235,7 @@ static int associativity_to_nid(const __be32 *associativity)
 	nid = of_read_number(&associativity[min_common_depth], 1);
 
 	/* POWER4 LPAR uses 0xffff as invalid node */
-	if (nid == 0xffff || nid >= MAX_NUMNODES)
+	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = NUMA_NO_NODE;
 
 	if (nid > 0 &&
@@ -448,7 +448,7 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 	index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
 	nid = of_read_number(&aa.arrays[index], 1);
 
-	if (nid == 0xffff || nid >= MAX_NUMNODES)
+	if (nid == 0xffff || nid >= nr_node_ids)
 		nid = default_nid;
 
 	if (nid > 0) {
-- 
2.17.1
Re: [PATCH v 1/1] powerpc/64s: allow for clang's objdump differences
Hi Bill, Bill Wendling writes: > Clang's objdump emits slightly different output from GNU's objdump, > causing a list of warnings to be emitted during relocatable builds. > E.g., clang's objdump emits this: > >c004: 2c 00 00 48 b 0xc030 >... >c0005c6c: 10 00 82 40 bf 2, 0xc0005c7c > > while GNU objdump emits: > >c004: 2c 00 00 48 bc030 <__start+0x30> >... >c0005c6c: 10 00 82 40 bne c0005c7c > > > Adjust llvm-objdump's output to remove the extraneous '0x' and convert > 'bf' and 'bt' to 'bne' and 'beq' resp. to more closely match GNU > objdump's output. > > Note that clang's objdump doesn't yet output the relocation symbols on > PPC. > > Signed-off-by: Bill Wendling > --- > arch/powerpc/tools/unrel_branch_check.sh | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/tools/unrel_branch_check.sh > b/arch/powerpc/tools/unrel_branch_check.sh > index 77114755dc6f..71ce86b68d18 100755 > --- a/arch/powerpc/tools/unrel_branch_check.sh > +++ b/arch/powerpc/tools/unrel_branch_check.sh > @@ -31,6 +31,9 @@ grep -e > "^c[0-9a-f]*:[[:space:]]*\([0-9a-f][0-9a-f][[:space:]]\)\{4\}[[:space:]] > grep -v '\<__start_initialization_multiplatform>' | > grep -v -e 'b.\?.\?ctr' | > grep -v -e 'b.\?.\?lr' | > +sed 's/\bbt.\?[[:space:]]*[[:digit:]][[:digit:]]*,/beq/' | > +sed 's/\bbf.\?[[:space:]]*[[:digit:]][[:digit:]]*,/bne/' | > +sed 's/[[:space:]]0x/ /' | > sed 's/://' | I know you followed the example in the script of just doing everything as a separate entry in the pipeline, but I think we could consolidate all the seds into one? eg: sed -e 's/\bbt.\?[[:space:]]*[[:digit:]][[:digit:]]*,/beq/' \ -e 's/\bbf.\?[[:space:]]*[[:digit:]][[:digit:]]*,/bne/' \ -e 's/[[:space:]]0x/ /' \ -e 's/://' | Does that work? cheers
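The combined sed Michael suggests can be sanity-checked by feeding it a clang-objdump-style line and comparing against GNU objdump's rendering. This assumes GNU sed, which supports the `\b` and `\?` extensions in basic regular expressions used by the script:

```shell
# A line in llvm-objdump's format, as quoted in the patch description.
line='c0005c6c: 10 00 82 40   bf 2, 0xc0005c7c'

out=$(printf '%s\n' "$line" | sed \
	-e 's/\bbt.\?[[:space:]]*[[:digit:]][[:digit:]]*,/beq/' \
	-e 's/\bbf.\?[[:space:]]*[[:digit:]][[:digit:]]*,/bne/' \
	-e 's/[[:space:]]0x/ /' \
	-e 's/://')

printf '%s\n' "$out"   # "bf 2," becomes "bne", "0x" and ":" are stripped
```

The conditional mnemonic, the `0x` prefix, and the address colon are all normalised in one sed invocation, matching what the four separate pipeline stages did.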
Re: [PATCH 2/2] powerpc/64s: system call support for scv/rfscv instructions
Christophe Leroy writes: > Michael Ellerman a écrit : > >> Nicholas Piggin writes: >>> diff --git a/arch/powerpc/include/asm/ppc-opcode.h >>> b/arch/powerpc/include/asm/ppc-opcode.h >>> index 2a39c716c343..b2bdc4de1292 100644 >>> --- a/arch/powerpc/include/asm/ppc-opcode.h >>> +++ b/arch/powerpc/include/asm/ppc-opcode.h >>> @@ -257,6 +257,7 @@ >>> #define PPC_INST_MFVSRD0x7c66 >>> #define PPC_INST_MTVSRD0x7c000166 >>> #define PPC_INST_SC0x4402 >>> +#define PPC_INST_SCV 0x4401 >> ... >>> @@ -411,6 +412,7 @@ >> ... >>> +#define __PPC_LEV(l) (((l) & 0x7f) << 5) >> >> These conflicted and didn't seem to be used so I dropped them. >> >>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c >>> index 5abe98216dc2..161bfccbc309 100644 >>> --- a/arch/powerpc/lib/sstep.c >>> +++ b/arch/powerpc/lib/sstep.c >>> @@ -3378,6 +3382,16 @@ int emulate_step(struct pt_regs *regs, >>> struct ppc_inst instr) >>> regs->msr = MSR_KERNEL; >>> return 1; >>> >>> + case SYSCALL_VECTORED_0:/* scv 0 */ >>> + regs->gpr[9] = regs->gpr[13]; >>> + regs->gpr[10] = MSR_KERNEL; >>> + regs->gpr[11] = regs->nip + 4; >>> + regs->gpr[12] = regs->msr & MSR_MASK; >>> + regs->gpr[13] = (unsigned long) get_paca(); >>> + regs->nip = (unsigned long) &system_call_vectored_emulate; >>> + regs->msr = MSR_KERNEL; >>> + return 1; >>> + >> >> This broke the ppc64e build: >> >> ld: arch/powerpc/lib/sstep.o:(.toc+0x0): undefined reference to >> `system_call_vectored_emulate' >> make[1]: *** [/home/michael/linux/Makefile:1139: vmlinux] Error 1 >> >> I wrapped it in #ifdef CONFIG_PPC64_BOOK3S. > > You mean CONFIG_PPC_BOOK3S_64 ? I hope so ... ## . Will send a fixup. Thanks for noticing. cheers
[PATCH v2 5/5] selftests/powerpc: Remove powerpc special cases from stack expansion test
Now that the powerpc code behaves the same as other architectures we can drop the special cases we had. Signed-off-by: Michael Ellerman --- .../powerpc/mm/stack_expansion_ldst.c | 41 +++ 1 file changed, 5 insertions(+), 36 deletions(-) v2: no change just rebased. diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c index 8dbfb51acf0f..ed9143990888 100644 --- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c +++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c @@ -56,13 +56,7 @@ int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta, #else asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp)); #endif - - // Kludge, delta < 0 indicates relative to SP - if (delta < 0) - target = stack_top_sp + delta; - else - target = stack_high - delta + 1; - + target = stack_high - delta + 1; volatile char *p = (char *)target; if (type == STORE) @@ -162,41 +156,16 @@ static int test_one(unsigned int stack_used, int delta, enum access_type type) static void test_one_type(enum access_type type, unsigned long page_size, unsigned long rlim_cur) { - assert(test_one(DEFAULT_SIZE, 512 * _KB, type) == 0); + unsigned long delta; - // powerpc has a special case to allow up to 1MB - assert(test_one(DEFAULT_SIZE, 1 * _MB, type) == 0); - -#ifdef __powerpc__ - // This fails on powerpc because it's > 1MB and is not a stdu & - // not close to r1 - assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) != 0); -#else - assert(test_one(DEFAULT_SIZE, 1 * _MB + 8, type) == 0); -#endif - -#ifdef __powerpc__ - // Accessing way past the stack pointer is not allowed on powerpc - assert(test_one(DEFAULT_SIZE, rlim_cur, type) != 0); -#else // We should be able to access anywhere within the rlimit + for (delta = page_size; delta <= rlim_cur; delta += page_size) + assert(test_one(DEFAULT_SIZE, delta, type) == 0); + assert(test_one(DEFAULT_SIZE, rlim_cur, type) == 0); -#endif // But if we 
go past the rlimit it should fail assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0); - - // Above 1MB powerpc only allows accesses within 4224 bytes of - // r1 for accesses that aren't stdu - assert(test_one(1 * _MB + page_size - 128, -4224, type) == 0); -#ifdef __powerpc__ - assert(test_one(1 * _MB + page_size - 128, -4225, type) != 0); -#else - assert(test_one(1 * _MB + page_size - 128, -4225, type) == 0); -#endif - - // By consuming 2MB of stack we test the stdu case - assert(test_one(2 * _MB + page_size - 128, -4224, type) == 0); } static int test(void) -- 2.25.1
[PATCH v2 4/5] powerpc/mm: Remove custom stack expansion checking
We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA. The logic aims to prevent userspace from doing bad accesses below the stack pointer. However as long as the stack is < 1MB in size, we allow all accesses without further checks. Adding some debug I see that I can do a full kernel build and LTP run, and not a single process has used more than 1MB of stack. So for the majority of processes the logic never even fires. We also recently found a nasty bug in this code which could cause userspace programs to be killed during signal delivery. It went unnoticed presumably because most processes use < 1MB of stack. The generic mm code has also grown support for stack guard pages since this code was originally written, so the most heinous case of the stack expanding into other mappings is now handled for us. Finally although some other arches have special logic in this path, from what I can tell none of x86, arm64, arm and s390 impose any extra checks other than those in expand_stack(). So drop our complicated logic and like other architectures just let the stack expand as long as its within the rlimit. Signed-off-by: Michael Ellerman --- arch/powerpc/mm/fault.c | 109 ++-- 1 file changed, 5 insertions(+), 104 deletions(-) v2: no change just rebased. diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 3ebb1792e636..925a7231abb3 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -42,39 +42,7 @@ #include #include -/* - * Check whether the instruction inst is a store using - * an update addressing form which will update r1. 
- */ -static bool store_updates_sp(struct ppc_inst inst) -{ - /* check for 1 in the rA field */ - if (((ppc_inst_val(inst) >> 16) & 0x1f) != 1) - return false; - /* check major opcode */ - switch (ppc_inst_primary_opcode(inst)) { - case OP_STWU: - case OP_STBU: - case OP_STHU: - case OP_STFSU: - case OP_STFDU: - return true; - case OP_STD:/* std or stdu */ - return (ppc_inst_val(inst) & 3) == 1; - case OP_31: - /* check minor opcode */ - switch ((ppc_inst_val(inst) >> 1) & 0x3ff) { - case OP_31_XOP_STDUX: - case OP_31_XOP_STWUX: - case OP_31_XOP_STBUX: - case OP_31_XOP_STHUX: - case OP_31_XOP_STFSUX: - case OP_31_XOP_STFDUX: - return true; - } - } - return false; -} + /* * do_page_fault error handling helpers */ @@ -267,57 +235,6 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code, return false; } -// This comes from 64-bit struct rt_sigframe + __SIGNAL_FRAMESIZE -#define SIGFRAME_MAX_SIZE (4096 + 128) - -static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address, - struct vm_area_struct *vma, unsigned int flags, - bool *must_retry) -{ - /* -* N.B. The POWER/Open ABI allows programs to access up to -* 288 bytes below the stack pointer. -* The kernel signal delivery code writes a bit over 4KB -* below the stack pointer (r1) before decrementing it. -* The exec code can write slightly over 640kB to the stack -* before setting the user r1. Thus we allow the stack to -* expand to 1MB without further checks. -*/ - if (address + 0x10 < vma->vm_end) { - struct ppc_inst __user *nip = (struct ppc_inst __user *)regs->nip; - /* get user regs even if this fault is in kernel mode */ - struct pt_regs *uregs = current->thread.regs; - if (uregs == NULL) - return true; - - /* -* A user-mode access to an address a long way below -* the stack pointer is only valid if the instruction -* is one which would update the stack pointer to the -* address accessed if the instruction completed, -* i.e. 
either stwu rs,n(r1) or stwux rs,r1,rb -* (or the byte, halfword, float or double forms). -* -* If we don't check this then any write to the area -* between the last mapped region and the stack will -* expand the stack rather than segfaulting. -*/ - if (address + SIGFRAME_MAX_SIZE >= uregs->gpr[1]) - return false; - - if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) && - access_ok(nip, sizeof(*nip))) { - struct ppc_inst inst; - - if (!probe_user_read_inst(&inst, nip)) -
[PATCH v2 3/5] selftests/powerpc: Update the stack expansion test
Update the stack expansion load/store test to take into account the new allowance of 4224 bytes below the stack pointer. Signed-off-by: Michael Ellerman --- .../selftests/powerpc/mm/stack_expansion_ldst.c| 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) v2: Update for change of size to 4224. diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c index 0587e11437f5..8dbfb51acf0f 100644 --- a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c +++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c @@ -186,17 +186,17 @@ static void test_one_type(enum access_type type, unsigned long page_size, unsign // But if we go past the rlimit it should fail assert(test_one(DEFAULT_SIZE, rlim_cur + 1, type) != 0); - // Above 1MB powerpc only allows accesses within 2048 bytes of + // Above 1MB powerpc only allows accesses within 4224 bytes of // r1 for accesses that aren't stdu - assert(test_one(1 * _MB + page_size - 128, -2048, type) == 0); + assert(test_one(1 * _MB + page_size - 128, -4224, type) == 0); #ifdef __powerpc__ - assert(test_one(1 * _MB + page_size - 128, -2049, type) != 0); + assert(test_one(1 * _MB + page_size - 128, -4225, type) != 0); #else - assert(test_one(1 * _MB + page_size - 128, -2049, type) == 0); + assert(test_one(1 * _MB + page_size - 128, -4225, type) == 0); #endif // By consuming 2MB of stack we test the stdu case - assert(test_one(2 * _MB + page_size - 128, -2048, type) == 0); + assert(test_one(2 * _MB + page_size - 128, -4224, type) == 0); } static int test(void) -- 2.25.1
[PATCH v2 2/5] powerpc: Allow 4224 bytes of stack expansion for the signal frame
We have powerpc specific logic in our page fault handling to decide if an access to an unmapped address below the stack pointer should expand the stack VMA. The code was originally added in 2004 "ported from 2.4". The rough logic is that the stack is allowed to grow to 1MB with no extra checking. Over 1MB the access must be within 2048 bytes of the stack pointer, or be from a user instruction that updates the stack pointer. The 2048 byte allowance below the stack pointer is there to cover the 288 byte "red zone" as well as the "about 1.5kB" needed by the signal delivery code. Unfortunately since then the signal frame has expanded, and is now 4224 bytes on 64-bit kernels with transactional memory enabled. This means if a process has consumed more than 1MB of stack, and its stack pointer lies less than 4224 bytes from the next page boundary, signal delivery will fault when trying to expand the stack and the process will see a SEGV. The total size of the signal frame is the size of struct rt_sigframe (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on 64-bit). 
The 2048 byte allowance was correct until 2008, as the signal frame was:

struct rt_sigframe {
	struct ucontext uc;                   /*     0  1440 */
	/* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
	long unsigned int _unused[2];         /*  1440    16 */
	unsigned int tramp[6];                /*  1456    24 */
	struct siginfo * pinfo;               /*  1480     8 */
	void * puc;                           /*  1488     8 */
	struct siginfo info;                  /*  1496   128 */
	/* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
	char abigap[288];                     /*  1624   288 */

	/* size: 1920, cachelines: 15, members: 7 */
	/* padding: 8 */
};

1920 + 128 = 2048

Then in commit ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") (Jul 2008) the signal frame expanded to 2304 bytes:

struct rt_sigframe {
	struct ucontext uc;                   /*     0  1696 */ <--
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	long unsigned int _unused[2];         /*  1696    16 */
	unsigned int tramp[6];                /*  1712    24 */
	struct siginfo * pinfo;               /*  1736     8 */
	void * puc;                           /*  1744     8 */
	struct siginfo info;                  /*  1752   128 */
	/* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
	char abigap[288];                     /*  1880   288 */

	/* size: 2176, cachelines: 17, members: 7 */
	/* padding: 8 */
};

2176 + 128 = 2304

At this point we should have been exposed to the bug, though as far as I know it was never reported. I no longer have a system old enough to easily test on.

Then in 2010, commit 320b2b8de126 ("mm: keep a guard page below a grow-down stack segment") caused our stack expansion code to never trigger, as there was always a VMA found for a write up to PAGE_SIZE below r1.

That meant the bug was hidden as we continued to expand the signal frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") (Feb 2013):

struct rt_sigframe {
	struct ucontext uc;                   /*     0  1696 */
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	struct ucontext uc_transact;          /*  1696  1696 */ <--
	/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
	long unsigned int _unused[2];         /*  3392    16 */
	unsigned int tramp[6];                /*  3408    24 */
	struct siginfo * pinfo;               /*  3432     8 */
	void * puc;                           /*  3440     8 */
	struct siginfo info;                  /*  3448   128 */
	/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
	char abigap[288];                     /*  3576   288 */

	/* size: 3872, cachelines: 31, members: 8 */
	/* padding: 8 */
	/* last cacheline: 32 bytes */
};

3872 + 128 = 4000

And commit 573ebfa6601f ("powerpc: Increase stack redzone for 64-bit userspace to 512 bytes") (Feb 2014):

struct rt_sigframe {
	struct ucontext uc;                   /*     0  1696 */
	/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
	struct ucontext uc_transact;          /*  1696  1696 */
	/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
	long unsigned int _unused[2];         /*  3392    16 */
	unsigned int
[PATCH v2 1/5] selftests/powerpc: Add test of stack expansion logic
We have custom stack expansion checks that it turns out are extremely badly tested and contain bugs, surprise. So add some tests that exercise the code and capture the current boundary conditions.

The signal test currently fails on 64-bit kernels because the 2048 byte allowance for the signal frame is too small, we will fix that in a subsequent patch.

Signed-off-by: Michael Ellerman
---
v2:
 - Concentrate on used stack around the 1MB size, as that's where our custom logic kicks in.
 - Increment the used stack size by 64 so we can exercise the case where we overflow the page by less than 128 (__SIGNAL_FRAMESIZE).
---
 tools/testing/selftests/powerpc/mm/.gitignore |   2 +
 tools/testing/selftests/powerpc/mm/Makefile   |   9 +-
 .../powerpc/mm/stack_expansion_ldst.c         | 233 ++
 .../powerpc/mm/stack_expansion_signal.c       | 118 +
 tools/testing/selftests/powerpc/pmu/lib.h     |   1 +
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
 create mode 100644 tools/testing/selftests/powerpc/mm/stack_expansion_signal.c

diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore
index 8d041f508a51..52308f42b7de 100644
--- a/tools/testing/selftests/powerpc/mm/.gitignore
+++ b/tools/testing/selftests/powerpc/mm/.gitignore
@@ -8,3 +8,5 @@
 large_vm_fork_separation
 bad_accesses
 tlbie_test
 pkey_exec_prot
+stack_expansion_ldst
+stack_expansion_signal
diff --git a/tools/testing/selftests/powerpc/mm/Makefile b/tools/testing/selftests/powerpc/mm/Makefile
index 5a86d59441dc..6cd772e0e374 100644
--- a/tools/testing/selftests/powerpc/mm/Makefile
+++ b/tools/testing/selftests/powerpc/mm/Makefile
@@ -3,7 +3,9 @@
 	$(MAKE) -C ../
 TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot segv_errors wild_bctr \
-		  large_vm_fork_separation bad_accesses pkey_exec_prot
+		  large_vm_fork_separation bad_accesses pkey_exec_prot stack_expansion_signal \
+		  stack_expansion_ldst
+
 TEST_GEN_PROGS_EXTENDED := tlbie_test
 TEST_GEN_FILES := tempfile

@@ -17,6 +19,11 @@
 $(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64
 $(OUTPUT)/bad_accesses: CFLAGS += -m64
 $(OUTPUT)/pkey_exec_prot: CFLAGS += -m64

+$(OUTPUT)/stack_expansion_signal: ../utils.c ../pmu/lib.c
+
+$(OUTPUT)/stack_expansion_ldst: CFLAGS += -fno-stack-protector
+$(OUTPUT)/stack_expansion_ldst: ../utils.c
+
 $(OUTPUT)/tempfile:
 	dd if=/dev/zero of=$@ bs=64k count=1
diff --git a/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
new file mode 100644
index ..0587e11437f5
--- /dev/null
+++ b/tools/testing/selftests/powerpc/mm/stack_expansion_ldst.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test that loads/stores expand the stack segment, or trigger a SEGV, in
+ * various conditions.
+ *
+ * Based on test code by Tom Lane.
+ */
+
+#undef NDEBUG
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define _KB (1024)
+#define _MB (1024 * 1024)
+
+volatile char *stack_top_ptr;
+volatile unsigned long stack_top_sp;
+volatile char c;
+
+enum access_type {
+	LOAD,
+	STORE,
+};
+
+/*
+ * Consume stack until the stack pointer is below @target_sp, then do an access
+ * (load or store) at offset @delta from either the base of the stack or the
+ * current stack pointer.
+ */
+__attribute__ ((noinline))
+int consume_stack(unsigned long target_sp, unsigned long stack_high, int delta, enum access_type type)
+{
+	unsigned long target;
+	char stack_cur;
+
+	if ((unsigned long)&stack_cur > target_sp)
+		return consume_stack(target_sp, stack_high, delta, type);
+	else {
+		// We don't really need this, but without it GCC might not
+		// generate a recursive call above.
+		stack_top_ptr = &stack_cur;
+
+#ifdef __powerpc__
+		asm volatile ("mr %[sp], %%r1" : [sp] "=r" (stack_top_sp));
+#else
+		asm volatile ("mov %%rsp, %[sp]" : [sp] "=r" (stack_top_sp));
+#endif
+
+		// Kludge, delta < 0 indicates relative to SP
+		if (delta < 0)
+			target = stack_top_sp + delta;
+		else
+			target = stack_high - delta + 1;
+
+		volatile char *p = (char *)target;
+
+		if (type == STORE)
+			*p = c;
+		else
+			c = *p;
+
+		// Do something to prevent the stack frame being popped prior to
+		// our access above.
+		getpid();
+	}
+
+	return 0;
+}
+
+static int search_proc_maps(char *needle, unsigned long *low, unsigned long *high)
+{
+	unsigned long start, end;
+	static char buf[4096];
+
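The body of search_proc_maps() is truncated above. As a hedged sketch of what such a helper typically does (names and exact parsing are illustrative, not the selftest's actual code), it scans /proc/self/maps for a line containing the needle and parses the "start-end" address range from it:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Illustrative userspace sketch: find the mapping whose /proc/self/maps
 * line contains @needle and return its [low, high) range.
 */
static int find_mapping(const char *needle, unsigned long *low,
			unsigned long *high)
{
	char line[4096];
	FILE *f = fopen("/proc/self/maps", "r");
	int rc = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (!strstr(line, needle))
			continue;
		/* each maps line starts with "start-end perms offset ..." */
		if (sscanf(line, "%lx-%lx", low, high) == 2) {
			rc = 0;
			break;
		}
	}

	fclose(f);
	return rc;
}
```

For the stack tests above, the needle would be something like "[stack]", giving the VMA boundaries the test needs to position accesses relative to the stack base.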
Re: [PATCH 2/5] powerpc: Allow 4096 bytes of stack expansion for the signal frame
Daniel Axtens writes: > Hi Michael, > > Unfortunately, this patch doesn't completely solve the problem. > > Trying the original reproducer, I'm still able to trigger the crash even > with this patch, although not 100% of the time. (If I turn ASLR off > outside of tmux it reliably crashes, if I turn ASLR off _inside_ of tmux > it reliably succeeds; all of this is on a serial console.) > > ./foo 1241000 & sleep 1; killall -USR1 foo; echo ok > > If I add some debugging information, I see that I'm getting > address + 4096 = 7fed0fa0 > gpr1 = 7fed1020 > > So address + 4096 is 0x80 bytes below the 4k window. I haven't been able > to figure out why, gdb gives me a NIP in __kernel_sigtramp_rt64 but I > don't know what to make of that. Thanks for testing. I looked at it again this morning and it's fairly obvious when it's not 11pm :) We need space for struct rt_sigframe as well as another 128 bytes, which is __SIGNAL_FRAMESIZE. It's actually mentioned in the comment above struct rt_sigframe. I'll send a v2. > P.S. I don't know what your policy on linking to kernel bugzilla is, but > if you want: > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=205183 In general I prefer to keep things clean with just a single Link: tag pointing to the archive of the patch submission. That can then contain further links and other info, and has the advantage that people can reply to the patch submission in the future to add information to the thread that wasn't known at the time of the commit. cheers
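The arithmetic behind the fix can be sketched as follows. The numbers are assumptions for illustration: __SIGNAL_FRAMESIZE is 128 on ppc64, and the signal frame footprint is taken as a rounded 4096 bytes, which matches Daniel's observation that the faulting address was 0x80 bytes below the 4k window:

```c
#include <assert.h>

/* Illustrative values, not taken verbatim from kernel headers. */
#define __SIGNAL_FRAMESIZE	128	/* caller's back-chain frame on ppc64 */
#define RT_SIGFRAME_SPACE	4096	/* assumed rounded rt_sigframe footprint */

/*
 * The stack expansion allowance must cover the whole signal frame
 * *plus* __SIGNAL_FRAMESIZE below it, so a plain 4096 is 128 short.
 */
static unsigned long sigframe_allowance(void)
{
	return RT_SIGFRAME_SPACE + __SIGNAL_FRAMESIZE;
}
```

With these numbers the required allowance comes out 0x80 bytes larger than the 4096 used in the patch under discussion, which is exactly the gap Daniel measured.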
[PATCH] KVM: PPC: Book3S HV: rework secure mem slot dropping
When a secure memslot is dropped, all the pages backed in the secure device (aka really backed by secure memory by the Ultravisor) should be paged out to a normal page. Previously, this was achieved by triggering the page fault mechanism, which calls kvmppc_svm_page_out() on each page.

This can't work when hot unplugging a memory slot because the memory slot is flagged as invalid and gfn_to_pfn() then doesn't try to access the page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out(), it seems simpler to call it directly instead of triggering such a mechanism. This way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock, the call is made to __kvmppc_svm_page_out(). As __kvmppc_svm_page_out() needs the vma pointer to migrate the pages, the VMA is fetched in a lazy way, so as not to trigger find_vma() all the time. In addition, the mmap_sem is held in read mode during that time, not in write mode, since the virtual memory layout is not impacted, and kvm->arch.uvmem_lock prevents concurrent operations on the secure device.

Cc: Ram Pai
Cc: Bharata B Rao
Cc: Paul Mackerras
Signed-off-by: Ram Pai
	[modified the changelog description]
Signed-off-by: Laurent Dufour
	[modified check on the VMA in kvmppc_uvmem_drop_pages]
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 53 --
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index c772e921f769..5dd3e9acdcab 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -632,35 +632,54 @@
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 			     struct kvm *kvm, bool skip_page_out)
 {
 	int i;
 	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn, uvmem_pfn;
-	unsigned long gfn = free->base_gfn;
+	struct page *uvmem_page;
+	struct vm_area_struct *vma = NULL;
+	unsigned long uvmem_pfn, gfn;
+	unsigned long addr, end;
+
+	mmap_read_lock(kvm->mm);
+
+	addr = slot->userspace_addr;
+	end = addr + (slot->npages * PAGE_SIZE);

-	for (i = free->npages; i; --i, ++gfn) {
-		struct page *uvmem_page;
+	gfn = slot->base_gfn;
+	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+		/* Fetch the VMA if addr is not in the latest fetched one */
+		if (!vma || addr >= vma->vm_end) {
+			vma = find_vma_intersection(kvm->mm, addr, addr+1);
+			if (!vma) {
+				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+				break;
+			}
+		}

 		mutex_lock(&kvm->arch.uvmem_lock);
-		if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+			uvmem_page = pfn_to_page(uvmem_pfn);
+			pvt = uvmem_page->zone_device_data;
+			pvt->skip_page_out = skip_page_out;
+			pvt->remove_gfn = true;
+
+			if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+						  PAGE_SHIFT, kvm, pvt->gpa))
+				pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+				       pvt->gpa, addr);
+		} else {
+			/* Remove the shared flag if any */
 			kvmppc_gfn_remove(gfn, kvm);
-			mutex_unlock(&kvm->arch.uvmem_lock);
-			continue;
 		}

-		uvmem_page = pfn_to_page(uvmem_pfn);
-		pvt = uvmem_page->zone_device_data;
-		pvt->skip_page_out = skip_page_out;
-		pvt->remove_gfn = true;
 		mutex_unlock(&kvm->arch.uvmem_lock);
-
-		pfn = gfn_to_pfn(kvm, gfn);
-		if (is_error_noslot_pfn(pfn))
-			continue;
-		kvm_release_pfn_clean(pfn);
 	}
+
+	mmap_read_unlock(kvm->mm);
 }

 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
--
2.27.0
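The lazy VMA fetch in the patch is a general pattern: re-resolve the range only when the address walks past the end of the last one found, instead of doing a lookup per page. A minimal userspace sketch of the same pattern, using a toy range table in place of real VMAs (all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; };

/* Linear search standing in for find_vma_intersection(). */
static const struct range *lookup_range(const struct range *tbl, int n,
					unsigned long addr)
{
	for (int i = 0; i < n; i++)
		if (addr >= tbl[i].start && addr < tbl[i].end)
			return &tbl[i];
	return NULL;
}

/*
 * Walk [lo, hi) page by page, caching the current range; returns how
 * many lookups were needed (one per range crossed, not one per page).
 */
static int walk_with_cache(const struct range *tbl, int n,
			   unsigned long lo, unsigned long hi,
			   unsigned long page)
{
	const struct range *cur = NULL;
	int lookups = 0;

	for (unsigned long a = lo; a < hi; a += page) {
		if (!cur || a >= cur->end) {
			cur = lookup_range(tbl, n, a);
			lookups++;
			if (!cur)
				break;	/* hole: bail, as the patch does */
		}
		/* ... process one page of cur here ... */
	}
	return lookups;
}
```

Walking two adjacent 16KB ranges in 4KB steps costs two lookups instead of eight, which is the whole point of caching the VMA across loop iterations.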
Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
+++ Jarkko Sakkinen [24/07/20 10:36 +0300]: On Thu, Jul 23, 2020 at 03:42:09PM +0300, Ard Biesheuvel wrote: On Thu, 23 Jul 2020 at 04:52, Jarkko Sakkinen wrote: > > On Thu, Jul 16, 2020 at 06:49:09PM +0200, Christophe Leroy wrote: > > Jarkko Sakkinen a écrit : > > > > > Rename module_alloc() to text_alloc() and module_memfree() to > > > text_memfree(), and move them to kernel/text.c, which is unconditionally > > > compiled to the kernel proper. This allows kprobes, ftrace and bpf to > > > allocate space for executable code without requiring to compile the modules > > > support (CONFIG_MODULES=y) in. > > > > You are not changing enough in powerpc to have this work. > > On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the vmalloc > > space is set to NX (no exec) at segment level (ie by 256Mbytes zone) unless > > CONFIG_MODULES is selected. > > > > Christophe > > This has been deduced down to: > > https://lore.kernel.org/lkml/20200717030422.679972-1-jarkko.sakki...@linux.intel.com/ > > I.e. not intruding PPC anymore :-) > Ok, so after the elaborate discussion we had between Jessica, Russell, Peter, Will, Mark, you and myself, where we pointed out that a) a single text_alloc() abstraction for bpf, kprobes and ftrace does not fit other architectures very well, and b) that module_alloc() is not suitable as a default to base text_alloc() on, In the latest iteration (v5) it is conditionally available only if arch defines and fallback has been removed. you went ahead and implemented that anyway, but only cc'ing Peter, akpm, Masami and the mm list this time? No problems with that. Actually each patch gets everything that get_maintainer.pl gives with a cc cmd script, not just the ones explicitly listed in the patch. Should I explicitly CC you to the next version? I'm happy to grow the list when requested. Yes, please CC everybody that was part of the discussion last time especially during v2, and please use a consistent CC list for the whole patchset. 
It is difficult to review when you only receive patch 1 out of 6 with no mention of text_alloc() anywhere and without being CC'd on the cover letter. Jessica
Re: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
On Thu, Jul 23, 2020 at 03:42:09PM +0300, Ard Biesheuvel wrote: > On Thu, 23 Jul 2020 at 04:52, Jarkko Sakkinen > wrote: > > > > On Thu, Jul 16, 2020 at 06:49:09PM +0200, Christophe Leroy wrote: > > > Jarkko Sakkinen a écrit : > > > > > > > Rename module_alloc() to text_alloc() and module_memfree() to > > > > text_memfree(), and move them to kernel/text.c, which is unconditionally > > > > compiled to the kernel proper. This allows kprobes, ftrace and bpf to > > > > allocate space for executable code without requiring to compile the > > > > modules > > > > support (CONFIG_MODULES=y) in. > > > > > > You are not changing enough in powerpc to have this work. > > > On powerpc 32 bits (6xx), when STRICT_KERNEL_RWX is selected, the vmalloc > > > space is set to NX (no exec) at segment level (ie by 256Mbytes zone) > > > unless > > > CONFIG_MODULES is selected. > > > > > > Christophe > > > > This has been deduced down to: > > > > https://lore.kernel.org/lkml/20200717030422.679972-1-jarkko.sakki...@linux.intel.com/ > > > > I.e. not intruding PPC anymore :-) > > > > Ok, so after the elaborate discussion we had between Jessica, Russell, > Peter, Will, Mark, you and myself, where we pointed out that > a) a single text_alloc() abstraction for bpf, kprobes and ftrace does > not fit other architectures very well, and > b) that module_alloc() is not suitable as a default to base text_alloc() on, In the latest iteration (v5) it is conditionally available only if arch defines and fallback has been removed. > you went ahead and implemented that anyway, but only cc'ing Peter, > akpm, Masami and the mm list this time? No problems with that. Actually each patch gets everything that get_maintainer.pl gives with a cc cmd script, not just the ones explicitly listed in the patch. Should I explicitly CC you to the next version? I'm happy to grow the list when requested. > Sorry, but that is not how it works. 
> Once people get pulled into a discussion, you cannot dismiss them or
> their feedback like that and go off and do your own thing anyway.
> Generic features like this are tricky to get right, and it will likely
> take many iterations and input from many different people.

Sure. I'm not expecting this to move quickly. I don't think I've at least purposely done that. As you said, it's tricky to get this right.

/Jarkko
Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
> On 23-Jul-2020, at 8:26 PM, Arnaldo Carvalho de Melo wrote:
>
> Em Thu, Jul 23, 2020 at 11:14:16AM +0530, kajoljain escreveu:
>> On 7/21/20 11:32 AM, kajoljain wrote:
>>> On 7/17/20 8:08 PM, Athira Rajeev wrote:

 From: Anju T Sudhakar

 Add support for the perf extended register capability in powerpc. The
 capability flag PERF_PMU_CAP_EXTENDED_REGS is used to indicate the PMUs
 which support extended registers. The generic code defines the mask of
 extended registers as 0 for non-supported architectures.

 The patch adds extended regs support for the power9 platform by exposing
 MMCR0, MMCR1 and MMCR2 registers. The REG_RESERVED mask needs an update
 to include the extended regs. PERF_REG_EXTENDED_MASK, which contains the
 mask value of the supported registers, is defined at runtime in the
 kernel based on the platform, since the supported registers may differ
 from one processor version to another, and hence the MASK value.

 With the patch:

 available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14
 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30 r31 nip
 msr orig_r3 ctr link xer ccr softe trap dar dsisr sier mmcra mmcr0
 mmcr1 mmcr2

 PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
 ... intr regs: mask 0x ABI 64-bit
 r0    0xc012b77c        r1    0xc03fe5e03930    r2    0xc1b0e000        r3    0xc03fdcddf800
 r4    0xc03fc788        r5    0x9c422724be      r6    0xc03fe5e03908    r7    0xff63bddc8706
 r8    0x9e4             r9    0x0               r10   0x1               r11   0x0
 r12   0xc01299c0        r13   0xc03c4800        r14   0x0               r15   0x7fffdd8b8b00
 r16   0x0               r17   0x7fffdd8be6b8    r18   0x7e7076607730    r19   0x2f
 r20   0xc0001fc26c68    r21   0xc0002041e4227e00 r22  0xc0002018fb60    r23   0x1
 r24   0xc03ffec4d900    r25   0x8000            r26   0x0               r27   0x1
 r28   0x1               r29   0xc1be1260        r30   0x6008010         r31   0xc03ffebb7218
 nip   0xc012b910        msr   0x90009033        orig_r3 0xc012b86c      ctr   0xc01299c0
 link  0xc012b77c        xer   0x0               ccr   0x2800            softe 0x1
 trap  0xf00             dar   0x0               dsisr 0x800             sier  0x0
 mmcra 0x800             mmcr0 0x82008090        mmcr1 0x1e00            mmcr2 0x0
 ...
thread: perf:4784 Signed-off-by: Anju T Sudhakar [Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms ] Signed-off-by: Athira Rajeev Reviewed-by: Madhavan Srinivasan --- >>> >>> Patch looks good to me. >>> >>> Reviewed-by: Kajol Jain >> >> Hi Arnaldo and Jiri, >> Please let me know if you have any comments on these patches. Can you >> pull/ack these >> patches if they seems fine to you. > > Can you please clarify something here, I think I saw a kernel build bot > complaint followed by a fix, in these cases I think, for reviewer's > sake, that this would entail a v4 patchkit? One that has no such build > issues? > > Or have I got something wrong? Hi Arnaldo, yes you are right, I will send version 4 as a new series with changes to add support for extended regs and including fix for the build issue. Thanks for your response. Athira > > - Arnaldo
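The key design point in the patch above is that PERF_REG_EXTENDED_MASK is computed at runtime per platform rather than being a compile-time constant, because the set of exposed PMU registers differs between processor versions. A hedged sketch of that idea (the register numbering here is made up for illustration; the real PERF_REG_POWERPC_* values live in the uapi headers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative extended-register indices, not the real ABI numbers. */
enum { EXT_REG_MMCR0, EXT_REG_MMCR1, EXT_REG_MMCR2 };

/*
 * Build the extended-regs mask for the running platform; generic code
 * treats a zero mask as "no extended registers supported".
 */
static uint64_t perf_reg_extended_mask(int platform_has_ext_regs)
{
	if (!platform_has_ext_regs)
		return 0;

	return (1ull << EXT_REG_MMCR0) | (1ull << EXT_REG_MMCR1) |
	       (1ull << EXT_REG_MMCR2);
}
```

A later processor version could return a different mask from the same hook, which is exactly why the value cannot be baked in at build time.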
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote: > On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: > > BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? > > I will have to update the patch to fix the reported 0-day test problem, but > > I want to collect other feedback before sending out v3. > > I want to say I hate it all, it adds instructions to a path we spend an > aweful lot of time optimizing without really getting anything back for > it. > > Will, how do you feel about it? I can see it potentially being useful for debugging, but I hate the limitation to 256 CPUs. Even arm64 is hitting that now. Also, you're talking ~1% gains here. I think our collective time would be better spent off reviewing the CNA series and trying to make it more deterministic. Will
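The bit-packing being debated can be made concrete with a small sketch. This is NOT the actual kernel qspinlock layout, just an illustration of the idea from the thread: the 16-bit locked_pending halfword, minus 1 bit for locked and 1 bit for pending, leaves 14 bits, enough for the holder's CPU number while NR_CPUS < 16k. (As Waiman notes above, the catch is that these extra bits would also have to be cleared on unlock, which is where the performance cost comes in.)

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative field layout, not the kernel's. */
#define Q_LOCKED_BIT	(1u << 0)
#define Q_PENDING_BIT	(1u << 1)
#define Q_CPU_SHIFT	2
#define Q_CPU_BITS	14
#define Q_CPU_MASK	(((1u << Q_CPU_BITS) - 1) << Q_CPU_SHIFT)

/* Encode locked/pending state plus the holder's CPU number. */
static inline uint16_t lp_encode(unsigned int holder_cpu, int pending)
{
	return (uint16_t)(Q_LOCKED_BIT |
			  (pending ? Q_PENDING_BIT : 0) |
			  ((holder_cpu << Q_CPU_SHIFT) & Q_CPU_MASK));
}

/* Recover the holder CPU from a locked_pending value. */
static inline unsigned int lp_holder_cpu(uint16_t locked_pending)
{
	return (locked_pending & Q_CPU_MASK) >> Q_CPU_SHIFT;
}
```

With 14 bits the largest representable CPU number is 16383, which is where the NR_CPUS < 16k requirement in the thread comes from.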
[PATCH] powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE
This adds a kernel command line option that can be used to disable GTSE support. Disabling GTSE implies the kernel will make hcalls to invalidate TLB entries.

This was done so that we can do VM migration between configs that enable/disable GTSE support via the hypervisor. To migrate a VM from a system that supports GTSE to a system that doesn't, we can boot the guest with radix_gtse=off, thereby forcing the guest to use hcalls for TLB invalidates.

The check for hcall availability is done in pSeries_setup_arch so that the panic message appears on the console. This should only happen on a hypervisor that doesn't force the guest to hash translation even though it can't handle the radix GTSE=0 request via CAS. With radix_gtse=off, if the hypervisor doesn't support the hcall_rpt_invalidate hcall, it should force the LPAR to hash translation.

Signed-off-by: Aneesh Kumar K.V
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 arch/powerpc/include/asm/firmware.h             |  4 +++-
 arch/powerpc/kernel/prom_init.c                 | 13 +
 arch/powerpc/platforms/pseries/firmware.c       |  1 +
 arch/powerpc/platforms/pseries/setup.c          |  5 +
 5 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad81c79..df20c98a8920 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -896,6 +896,9 @@
 	disable_radix	[PPC]
 			Disable RADIX MMU mode on POWER9

+	radix_gtse=off	[PPC/PSERIES]
+			Disable RADIX GTSE feature.
+
 	disable_tlbie	[PPC]
 			Disable TLBIE instruction. Currently does not work
 			with KVM, with HASH MMU, or with coherent accelerators.
diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCM	ASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR	ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE	ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE	ASM_CONST(0x0100)

 #ifndef __ASSEMBLY__

@@ -71,7 +72,8 @@
 		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
 		FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
 		FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+		FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+		FW_FEATURE_RPT_INVALIDATE,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
 	FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index cbc605cfdec0..e7e91965fe6c 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -169,6 +169,7 @@
 static unsigned long __prombss prom_tce_alloc_end;

 #ifdef CONFIG_PPC_PSERIES
 static bool __prombss prom_radix_disable;
+static bool __prombss prom_radix_gtse_disable;
 static bool __prombss prom_xive_disable;
 #endif

@@ -823,6 +824,12 @@ static void __init early_cmdline_parse(void)
 	if (prom_radix_disable)
 		prom_debug("Radix disabled from cmdline\n");

+	opt = prom_strstr(prom_cmd_line, "radix_gtse=off");
+	if (opt) {
+		prom_radix_gtse_disable = true;
+		prom_debug("Radix GTSE disabled from cmdline\n");
+	}
+
 	opt = prom_strstr(prom_cmd_line, "xive=off");
 	if (opt) {
 		prom_xive_disable = true;
@@ -1285,10 +1292,8 @@ static void __init prom_parse_platform_support(u8 index, u8 val,
 		prom_parse_mmu_model(val & OV5_FEAT(OV5_MMU_SUPPORT), support);
 		break;
 	case OV5_INDX(OV5_RADIX_GTSE):	/* Radix Extensions */
-		if (val & OV5_FEAT(OV5_RADIX_GTSE)) {
-			prom_debug("Radix - GTSE supported\n");
-			support->radix_gtse = true;
-		}
+		if (val & OV5_FEAT(OV5_RADIX_GTSE))
+			support->radix_gtse = !prom_radix_gtse_disable;
 		break;
 	case OV5_INDX(OV5_XIVE_SUPPORT):	/* Interrupt mode */
 		prom_parse_xive_model(val & OV5_FEAT(OV5_XIVE_SUPPORT),
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 3e49cc23a97a..4c7b7f5a2ebc 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -65,6 +65,7 @@
 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
 	{FW_FEATURE_BLOCK_REMOVE,	"hcall-block-remove"},
 	{FW_FEATURE_PAPR_SCM
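The resulting decision logic is small: prom_init detects "radix_gtse=off" with a plain substring search (prom_strstr), and the GTSE platform-support bit is then only set when firmware reports it *and* the user did not disable it. A userspace sketch of that logic (the helper names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for prom_strstr()-based early cmdline flag detection. */
static int cmdline_has(const char *cmdline, const char *opt)
{
	return strstr(cmdline, opt) != NULL;
}

/*
 * GTSE is usable only when firmware offers it via CAS and the user
 * has not passed radix_gtse=off.
 */
static int radix_gtse_supported(const char *cmdline, int fw_reports_gtse)
{
	return fw_reports_gtse && !cmdline_has(cmdline, "radix_gtse=off");
}
```

When the result is false, the guest falls back to hcalls for TLB invalidation, which is what makes migration between GTSE and non-GTSE hosts possible.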
Re: [PATCH v5 7/7] KVM: PPC: Book3S HV: rework secure mem slot dropping
Le 24/07/2020 à 05:03, Bharata B Rao a écrit : On Thu, Jul 23, 2020 at 01:07:24PM -0700, Ram Pai wrote: From: Laurent Dufour When a secure memslot is dropped, all the pages backed in the secure device (aka really backed by secure memory by the Ultravisor) should be paged out to a normal page. Previously, this was achieved by triggering the page fault mechanism which is calling kvmppc_svm_page_out() on each pages. This can't work when hot unplugging a memory slot because the memory slot is flagged as invalid and gfn_to_pfn() is then not trying to access the page, so the page fault mechanism is not triggered. Since the final goal is to make a call to kvmppc_svm_page_out() it seems simpler to call directly instead of triggering such a mechanism. This way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a memslot. Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock, the call to __kvmppc_svm_page_out() is made. As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the VMA is fetched in a lazy way, to not trigger find_vma() all the time. In addition, the mmap_sem is held in read mode during that time, not in write mode since the virual memory layout is not impacted, and kvm->arch.uvmem_lock prevents concurrent operation on the secure device. Cc: Ram Pai Cc: Bharata B Rao Cc: Paul Mackerras Signed-off-by: Ram Pai [modified the changelog description] Signed-off-by: Laurent Dufour --- arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++ 1 file changed, 37 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index c772e92..daffa6e 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -632,35 +632,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma, * fault on them, do fault time migration to replace the device PTEs in * QEMU page table with normal PTEs from newly allocated pages. 
*/ -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free, +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot, struct kvm *kvm, bool skip_page_out) { int i; struct kvmppc_uvmem_page_pvt *pvt; - unsigned long pfn, uvmem_pfn; - unsigned long gfn = free->base_gfn; + struct page *uvmem_page; + struct vm_area_struct *vma = NULL; + unsigned long uvmem_pfn, gfn; + unsigned long addr, end; + + mmap_read_lock(kvm->mm); + + addr = slot->userspace_addr; + end = addr + (slot->npages * PAGE_SIZE); - for (i = free->npages; i; --i, ++gfn) { - struct page *uvmem_page; + gfn = slot->base_gfn; + for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) { + + /* Fetch the VMA if addr is not in the latest fetched one */ + if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) { + vma = find_vma_intersection(kvm->mm, addr, end); + if (!vma || + vma->vm_start > addr || vma->vm_end < end) { + pr_err("Can't find VMA for gfn:0x%lx\n", gfn); + break; + } There is a potential issue with the boundary condition check here which I discussed with Laurent yesterday. Guess he hasn't gotten around to look at it yet. Right, I'm working on that..
Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
On Wed, Jul 22, 2020 at 11:06 PM Atish Patra wrote:
>
> On Wed, Jul 22, 2020 at 1:23 PM Arnd Bergmann wrote:
> >
> > I just noticed that rv32 allows 2GB of lowmem rather than just the usual
> > 768MB or 1GB, at the expense of addressable user memory. This seems
> > like an unusual choice, but I also don't see any reason to change this
> > or make it more flexible unless actual users appear.
>
> I am a bit confused here. As per my understanding, RV32 supports 1GB
> of lowmem only as the page offset is set to 0xC0000000. The config
> option MAXPHYSMEM_2GB is misleading as RV32 actually allows 1GB of
> physical memory only.

Ok, in that case I was apparently misled by the Kconfig option name. I just tried building a kernel to see what the boundaries actually are, as this is not the only confusing bit. Here is what I see:

0x9dc00000 TASK_SIZE/FIXADDR_START  /* code comment says 0x9fc00000 */
0x9e000000 FIXADDR_TOP/PCI_IO_START
0x9f000000 PCI_IO_END/VMEMMAP_START
0xa0000000 VMEMMAP_END/VMALLOC_START
0xc0000000 VMALLOC_END/PAGE_OFFSET

Having exactly 1GB of linear map does make a lot of sense. Having PCI I/O, vmemmap and fixmap come out of the user range means you get slightly different behavior in user space if there are any changes to that set, but that is probably fine as well, if you want the flexibility to go to a 2GB linear map and expect user space to deal with that as well.

There is one common trick from arm32 however that you might want to consider: if vmalloc was moved above the linear map rather than below, the size of the vmalloc area can dynamically depend on the amount of RAM that is actually present rather than be set to a fixed value. On arm32, there is around 240MB of vmalloc space if the linear map is fully populated with RAM, but it can grow to use all of the available address space if less RAM was detected at boot time (up to 3GB depending on CONFIG_VMSPLIT).

> Any memory blocks beyond DRAM + 1GB are removed in setup_bootmem. IMHO, the
> current config should clarify that.
> > Moreover, we should add 2G split under a separate configuration if we > want to support that. Right. It's probably not needed immediately, but can't hurt either. Arnd
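The region sizes in the layout above follow directly from the boundaries. A small sketch that checks the arithmetic; note the hex values are taken from the discussion with their trailing zeros restored, which is an assumption since the archived message shows them truncated:

```c
#include <assert.h>

/* Assumed rv32 virtual layout from the thread above (illustrative). */
#define VMALLOC_START	0xa0000000ull	/* also VMEMMAP_END */
#define VMALLOC_END	0xc0000000ull	/* also PAGE_OFFSET */
#define ADDR_TOP	0x100000000ull	/* top of the 32-bit address space */

static unsigned long long vmalloc_size(void)
{
	return VMALLOC_END - VMALLOC_START;
}

static unsigned long long linear_map_size(void)
{
	/* the linear map runs from PAGE_OFFSET to the top of the space */
	return ADDR_TOP - VMALLOC_END;
}
```

This is what the reply means by "exactly 1GB of linear map": PAGE_OFFSET at 0xc0000000 leaves precisely 1GB above it, with a fixed 512MB vmalloc area below, the fixed size being the thing the arm32-style trick would relax.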
Re: [PATCH v3 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
On Thu, Jul 23, 2020 at 02:21:11PM +0530, Srikar Dronamraju wrote: > Current code assumes that cpumask of cpus sharing a l2-cache mask will > always be a superset of cpu_sibling_mask. > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > cpu_sibling_mask if and only if shared_caches is set. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju Reviewed-by: Gautham R. Shenoy > --- > Changelog v1 -> v2: > Set cpumask after verifying l2-cache. (Gautham) > > arch/powerpc/kernel/smp.c | 28 +++- > 1 file changed, 15 insertions(+), 13 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index da27f6909be1..d997c7411664 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1194,6 +1194,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask > *(*mask_fn)(int)) > if (!l2_cache) > return false; > > + cpumask_set_cpu(cpu, mask_fn(cpu)); > for_each_cpu(i, cpu_online_mask) { > /* >* when updating the marks the current CPU has not been marked > @@ -1276,29 +1277,30 @@ static void add_cpu_to_masks(int cpu) >* add it to it's own thread sibling mask. >*/ > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); > + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); > > for (i = first_thread; i < first_thread + threads_per_core; i++) > if (cpu_online(i)) > set_cpus_related(i, cpu, cpu_sibling_mask); > > add_cpu_to_smallcore_masks(cpu); > - /* > - * Copy the thread sibling mask into the cache sibling mask > - * and mark any CPUs that share an L2 with this CPU. 
> - */ > - for_each_cpu(i, cpu_sibling_mask(cpu)) > - set_cpus_related(cpu, i, cpu_l2_cache_mask); > update_mask_by_l2(cpu, cpu_l2_cache_mask); > > - /* > - * Copy the cache sibling mask into core sibling mask and mark > - * any CPUs on the same chip as this CPU. > - */ > - for_each_cpu(i, cpu_l2_cache_mask(cpu)) > - set_cpus_related(cpu, i, cpu_core_mask); > + if (pkg_id == -1) { > + struct cpumask *(*mask)(int) = cpu_sibling_mask; > + > + /* > + * Copy the sibling mask into core sibling mask and > + * mark any CPUs on the same chip as this CPU. > + */ > + if (shared_caches) > + mask = cpu_l2_cache_mask; > + > + for_each_cpu(i, mask(cpu)) > + set_cpus_related(cpu, i, cpu_core_mask); > > - if (pkg_id == -1) > return; > + } > > for_each_cpu(i, cpu_online_mask) > if (get_physical_package_id(i) == pkg_id) > -- > 2.18.2 >
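The invariant this patch stops assuming can be modelled in a few lines. In this toy sketch CPU masks are plain 64-bit bitmaps and the names are illustrative, not the kernel's cpumask API: the l2-cache mask is a superset of the sibling mask only when shared_caches is set, so the core mask has to be seeded from the right one.

```c
#include <assert.h>
#include <stdint.h>

/* True if every bit of @sub is also set in @super. */
static int is_superset(uint64_t super, uint64_t sub)
{
	return (sub & ~super) == 0;
}

/*
 * Pick the mask to build the core mask from: with shared caches the
 * l2 mask covers the siblings too; without it, only the sibling mask
 * is guaranteed to be correct.
 */
static uint64_t core_mask_seed(uint64_t sibling_mask, uint64_t l2_mask,
			       int shared_caches)
{
	return shared_caches ? l2_mask : sibling_mask;
}
```

Seeding from the l2 mask when it is *not* a superset of the sibling mask is exactly the "broken topology" case Srikar describes in the v2 discussion below.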
Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
On Wed, Jul 22, 2020 at 12:27:47PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy [2020-07-22 11:51:14]:
> 
> > Hi Srikar,
> > 
> > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > > index 72f16dc0cb26..57468877499a 100644
> > > --- a/arch/powerpc/kernel/smp.c
> > > +++ b/arch/powerpc/kernel/smp.c
> > > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
> > > 	if (!l2_cache)
> > > 		return false;
> > > 
> > > +	cpumask_set_cpu(cpu, mask_fn(cpu));
> > 
> > Ok, we need to do this because "cpu" is not yet set in the
> > cpu_online_mask. Prior to your patch the "cpu" was getting set in
> > cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
> > the patch.
> 
> Right.
> 
> > > 	for_each_cpu(i, cpu_online_mask) {
> > > 		/*
> > > 		 * when updating the marks the current CPU has not been marked
> > > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
> > > 	 * add it to it's own thread sibling mask.
> > > 	 */
> > > 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > > +	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
> Note: Above, we are explicitly setting the cpu_core_mask.

You are right. I missed this.

> > > 	for (i = first_thread; i < first_thread + threads_per_core; i++)
> > > 		if (cpu_online(i))
> > > 			set_cpus_related(i, cpu, cpu_sibling_mask);
> > > 
> > > 	add_cpu_to_smallcore_masks(cpu);
> > > -	/*
> > > -	 * Copy the thread sibling mask into the cache sibling mask
> > > -	 * and mark any CPUs that share an L2 with this CPU.
> > > -	 */
> > > -	for_each_cpu(i, cpu_sibling_mask(cpu))
> > > -		set_cpus_related(cpu, i, cpu_l2_cache_mask);
> > > 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
> > > 
> > > -	/*
> > > -	 * Copy the cache sibling mask into core sibling mask and mark
> > > -	 * any CPUs on the same chip as this CPU.
> > > -	 */
> > > -	for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > > -		set_cpus_related(cpu, i, cpu_core_mask);
> > > +	if (pkg_id == -1) {
> > 
> > I suppose this "if" condition is an optimization, since if pkg_id != -1,
> > we anyway set these CPUs in the cpu_core_mask below.
> > 
> > However...
> 
> This is not just an optimization.
> The hunk removed would only work if cpu_l2_cache_mask is bigger than
> cpu_sibling_mask. (this was the previous assumption that we want to break)
> If the cpu_sibling_mask is bigger than cpu_l2_cache_mask and pkg_id is -1,
> then setting only cpu_l2_cache_mask in cpu_core_mask will result in a broken
> topology.
> 
> > > +		struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > > +
> > > +		/*
> > > +		 * Copy the sibling mask into core sibling mask and
> > > +		 * mark any CPUs on the same chip as this CPU.
> > > +		 */
> > > +		if (shared_caches)
> > > +			mask = cpu_l2_cache_mask;
> > > +
> > > +		for_each_cpu(i, mask(cpu))
> > > +			set_cpus_related(cpu, i, cpu_core_mask);
> > > 
> > > -	if (pkg_id == -1)
> > > 		return;
> > > +	}
> > 
> > ... since "cpu" is not yet set in the cpu_online_mask, do we not miss
> > setting "cpu" in the cpu_core_mask(cpu) in the for-loop below ?
> 
> As noted above, we are setting it before, so we don't miss the cpu and
> hence it is no different from before.

Fair enough.

> > --
> > Thanks and Regards
> > gautham.
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju
Re: [PATCH v3 04/10] powerpc/smp: Move topology fixups into a new function
On Thu, Jul 23, 2020 at 02:21:10PM +0530, Srikar Dronamraju wrote:
> Move topology fixup based on the platform attributes into its own
> function which is called just before set_sched_topology.
> 
> Cc: linuxppc-dev
> Cc: LKML
> Cc: Michael Ellerman
> Cc: Nicholas Piggin
> Cc: Anton Blanchard
> Cc: Oliver O'Halloran
> Cc: Nathan Lynch
> Cc: Michael Neuling
> Cc: Gautham R Shenoy
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Valentin Schneider
> Cc: Jordan Niethe
> Signed-off-by: Srikar Dronamraju

Reviewed-by: Gautham R. Shenoy

> ---
> Changelog v2 -> v3:
> 	Rewrote changelog (Gautham)
> 	Renamed to powerpc/smp: Move topology fixups into a new function
> 
>  arch/powerpc/kernel/smp.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index a685915e5941..da27f6909be1 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1368,6 +1368,16 @@ int setup_profiling_timer(unsigned int multiplier)
> 	return 0;
> }
> 
> +static void fixup_topology(void)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	if (has_big_cores) {
> +		pr_info("Big cores detected but using small core scheduling\n");
> +		powerpc_topology[0].mask = smallcore_smt_mask;
> +	}
> +#endif
> +}
> +
> void __init smp_cpus_done(unsigned int max_cpus)
> {
> 	/*
> @@ -1381,12 +1391,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> 
> 	dump_numa_cpu_topology();
> 
> -#ifdef CONFIG_SCHED_SMT
> -	if (has_big_cores) {
> -		pr_info("Big cores detected but using small core scheduling\n");
> -		powerpc_topology[0].mask = smallcore_smt_mask;
> -	}
> -#endif
> +	fixup_topology();
> 	set_sched_topology(powerpc_topology);
> }
> 
> -- 
> 2.18.2
> 
Re: [PATCH v3 02/10] powerpc/smp: Merge Power9 topology with Power topology
On Thu, Jul 23, 2020 at 02:21:08PM +0530, Srikar Dronamraju wrote:
> A new sched_domain_topology_level was added just for Power9. However the
> same can be achieved by merging powerpc_topology with power9_topology,
> which makes the code simpler, especially when adding a new sched domain.
> 
> Cc: linuxppc-dev
> Cc: LKML
> Cc: Michael Ellerman
> Cc: Nicholas Piggin
> Cc: Anton Blanchard
> Cc: Oliver O'Halloran
> Cc: Nathan Lynch
> Cc: Michael Neuling
> Cc: Gautham R Shenoy
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Valentin Schneider
> Cc: Jordan Niethe
> Signed-off-by: Srikar Dronamraju

LGTM.

Reviewed-by: Gautham R. Shenoy

> ---
> Changelog v1 -> v2:
> 	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
> 	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT
> 
>  arch/powerpc/kernel/smp.c | 33 ++++++++++-----------------------
>  1 file changed, 10 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index edf94ca64eea..283a04e54f52 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier)
> }
> 
> #ifdef CONFIG_SCHED_SMT
> -/* cpumask of CPUs with asymetric SMT dependancy */
> +/* cpumask of CPUs with asymmetric SMT dependency */
> static int powerpc_smt_flags(void)
> {
> 	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
> @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void)
> }
> #endif
> 
> -static struct sched_domain_topology_level powerpc_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> -	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> -#endif
> -	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> -	{ NULL, },
> -};
> -
> /*
>  * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
>  * This topology makes it *much* cheaper to migrate tasks between adjacent cores
> @@ -1351,7 +1343,13 @@ static int powerpc_shared_cache_flags(void)
>  */
> static const struct cpumask *shared_cache_mask(int cpu)
> {
> -	return cpu_l2_cache_mask(cpu);
> +	if (shared_caches)
> +		return cpu_l2_cache_mask(cpu);
> +
> +	if (has_big_cores)
> +		return cpu_smallcore_mask(cpu);
> +
> +	return per_cpu(cpu_sibling_map, cpu);
> }
> 
> #ifdef CONFIG_SCHED_SMT
> @@ -1361,7 +1359,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
> }
> #endif
> 
> -static struct sched_domain_topology_level power9_topology[] = {
> +static struct sched_domain_topology_level powerpc_topology[] = {
> #ifdef CONFIG_SCHED_SMT
> 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> #endif
> @@ -1386,21 +1384,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
> #ifdef CONFIG_SCHED_SMT
> 	if (has_big_cores) {
> 		pr_info("Big cores detected but using small core scheduling\n");
> -		power9_topology[0].mask = smallcore_smt_mask;
> +		powerpc_topology[0].mask = smallcore_smt_mask;
> 	}
> #endif
> -	/*
> -	 * If any CPU detects that it's sharing a cache with another CPU then
> -	 * use the deeper topology that is aware of this sharing.
> -	 */
> -	if (shared_caches) {
> -		pr_info("Using shared cache scheduler topology\n");
> -		set_sched_topology(power9_topology);
> -	} else {
> -		pr_info("Using standard scheduler topology\n");
> -		set_sched_topology(powerpc_topology);
> -	}
> +	set_sched_topology(powerpc_topology);
> }
> 
> #ifdef CONFIG_HOTPLUG_CPU
> -- 
> 2.18.2
> 