Re: [PATCH 17/17] arch: rename copy_thread_tls() back to copy_thread()
Christian Brauner wrote on Tue, 23 Jun 2020 at 7:58 AM:
>
> Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
> back simply to copy_thread(). It's a simpler name, and doesn't imply that
> only tls is copied here. This finishes an outstanding chunk of internal
> process creation work since we've added clone3().
>
> Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
> Russell King, Catalin Marinas, Will Deacon, Mark Salter,
> Aurelien Jacquiot, Guo Ren, Yoshinori Sato, Brian Cain, Tony Luck,
> Fenghua Yu, Geert Uytterhoeven, Michal Simek, Thomas Bogendoerfer,
> Nick Hu, Greentime Hu, Vincent Chen, Ley Foon Tan, Jonas Bonn,
> Stefan Kristiansson, Stafford Horne, "James E.J. Bottomley",
> Helge Deller, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
> Paul Walmsley, Palmer Dabbelt, Albert Ou, Christian Borntraeger,
> Rich Felker, "David S. Miller", Jeff Dike, Richard Weinberger,
> Anton Ivanov, Guan Xuetao, Thomas Gleixner, Ingo Molnar,
> x...@kernel.org, "H. Peter Anvin", Chris Zankel, Max Filippov,
> Kees Cook, "Peter Zijlstra (Intel)", "Matthew Wilcox (Oracle)", Al Viro
> Cc: linux-al...@vger.kernel.org, linux-ker...@vger.kernel.org,
> linux-snps-...@lists.infradead.org, linux-arm-ker...@lists.infradead.org,
> linux-c6x-...@linux-c6x.org, linux-c...@vger.kernel.org,
> uclinux-h8-de...@lists.sourceforge.jp, linux-hexa...@vger.kernel.org,
> linux-i...@vger.kernel.org, linux-m...@lists.linux-m68k.org,
> linux-m...@vger.kernel.org, openr...@lists.librecores.org,
> linux-par...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
> linux-ri...@lists.infradead.org, linux-s...@vger.kernel.org,
> linux...@vger.kernel.org, sparcli...@vger.kernel.org,
> linux...@lists.infradead.org, linux-xte...@linux-xtensa.org
> Signed-off-by: Christian Brauner
> ---
>  arch/alpha/kernel/process.c      | 2 +-
>  arch/arc/kernel/process.c        | 2 +-
>  arch/arm/kernel/process.c        | 2 +-
>  arch/arm64/kernel/process.c      | 2 +-
>  arch/c6x/kernel/process.c        | 2 +-
>  arch/csky/kernel/process.c       | 2 +-
>  arch/h8300/kernel/process.c      | 2 +-
>  arch/hexagon/kernel/process.c    | 2 +-
>  arch/ia64/kernel/process.c       | 4 ++--
>  arch/m68k/kernel/process.c       | 2 +-
>  arch/microblaze/kernel/process.c | 2 +-
>  arch/mips/kernel/process.c       | 2 +-
>  arch/nds32/kernel/process.c      | 2 +-
>  arch/nios2/kernel/process.c      | 2 +-
>  arch/openrisc/kernel/process.c   | 4 ++--
>  arch/parisc/kernel/process.c     | 2 +-
>  arch/powerpc/kernel/process.c    | 2 +-
>  arch/riscv/kernel/process.c      | 2 +-
>  arch/s390/kernel/process.c       | 2 +-
>  arch/sh/kernel/process_32.c      | 2 +-
>  arch/sparc/kernel/process.c      | 6 +++---
>  arch/sparc/kernel/process_32.c   | 2 +-
>  arch/sparc/kernel/process_64.c   | 2 +-
>  arch/um/kernel/process.c         | 2 +-
>  arch/unicore32/kernel/process.c  | 2 +-
>  arch/x86/kernel/process.c        | 2 +-
>  arch/x86/kernel/unwind_frame.c   | 2 +-
>  arch/xtensa/kernel/process.c     | 2 +-
>  include/linux/sched/task.h       | 2 +-
>  kernel/fork.c                    | 2 +-
>  30 files changed, 34 insertions(+), 34 deletions(-)
>
> [...]
>
> diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c
> index 7dbb1bf64165..1020e2c6dcd8 100644
> --- a/arch/nds32/kernel/process.c
> +++ b/arch/nds32/kernel/process.c
> @@ -149,7 +149,7 @@ void flush_thread(void)
>  DEFINE_PER_CPU(struct task_struct *, __entry_task);
>
>  asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
> -int copy_thread_tls(unsigned long clone_flags, unsigned long stack_start,
> +int copy_thread(unsigned long clone_flags, unsigned long stack_start,
>  		    unsigned long stk_sz, struct task_struct *p,
>  		    unsigned long tls)
>  {

Hi Christian,

Thank you.

Acked-by: Greentime Hu
Re: [PATCH 16/17] arch: remove HAVE_COPY_THREAD_TLS
Christian Brauner wrote on Tue, 23 Jun 2020 at 7:46 AM:
>
> All architectures support copy_thread_tls() now, so remove the legacy
> copy_thread() function and the HAVE_COPY_THREAD_TLS config option.
> Everyone uses the same process creation calling convention based on
> copy_thread_tls() and struct kernel_clone_args. This will make it
> easier to maintain the core process creation code under kernel/,
> simplifies the callpaths and makes them identical for all architectures.
>
> Cc: Richard Henderson, Ivan Kokshaysky, Matt Turner, Vineet Gupta,
> Russell King, Catalin Marinas, Will Deacon, Mark Salter,
> Aurelien Jacquiot, Guo Ren, Yoshinori Sato, Brian Cain, Tony Luck,
> Fenghua Yu, Geert Uytterhoeven, Michal Simek, Thomas Bogendoerfer,
> Nick Hu, Greentime Hu, Vincent Chen, Ley Foon Tan, Jonas Bonn,
> Stefan Kristiansson, Stafford Horne, "James E.J. Bottomley",
> Helge Deller, Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
> Paul Walmsley, Palmer Dabbelt, Albert Ou, Heiko Carstens,
> Vasily Gorbik, Christian Borntraeger, Rich Felker, "David S. Miller",
> Jeff Dike, Richard Weinberger, Anton Ivanov, Guan Xuetao,
> Thomas Gleixner, Ingo Molnar, x...@kernel.org, Peter Zijlstra,
> Kees Cook, Mike Rapoport, "Matthew Wilcox, Al Viro
> Cc: linux-ker...@vger.kernel.org, linux-al...@vger.kernel.org,
> linux-snps-...@lists.infradead.org, linux-arm-ker...@lists.infradead.org,
> linux-c6x-...@linux-c6x.org, linux-c...@vger.kernel.org,
> uclinux-h8-de...@lists.sourceforge.jp, linux-hexa...@vger.kernel.org,
> linux-i...@vger.kernel.org, linux-m...@lists.linux-m68k.org,
> linux-m...@vger.kernel.org, openr...@lists.librecores.org,
> linux-par...@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
> linux-ri...@lists.infradead.org, linux-s...@vger.kernel.org,
> linux...@vger.kernel.org, sparcli...@vger.kernel.org,
> linux...@lists.infradead.org, linux-xte...@linux-xtensa.org
> Signed-off-by: Christian Brauner
>
> [...]
>
> diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
> index 7b6eaca81cce..e30298e99e1b 100644
> --- a/arch/nds32/Kconfig
> +++ b/arch/nds32/Kconfig
> @@ -48,7 +48,6 @@ config NDS32
>  	select HAVE_FUNCTION_GRAPH_TRACER
>  	select HAVE_FTRACE_MCOUNT_RECORD
>  	select HAVE_DYNAMIC_FTRACE
> -	select HAVE_COPY_THREAD_TLS
>  	help
>  	  Andes(nds32) Linux support.

Hi Christian,

Thank you.

Acked-by: Greentime Hu
Re: [PATCH v5 06/12] arch: xtensa: add linker section for KUnit test suites
On Fri, Jun 26, 2020 at 02:09:11PM -0700, Brendan Higgins wrote:
> Add a linker section to xtensa where KUnit can put references to its
> test suites. This patch is an early step in transitioning to dispatching
> all KUnit tests from a centralized executor rather than having each as
> its own separate late_initcall.
>
> Signed-off-by: Brendan Higgins
> ---
>  arch/xtensa/kernel/vmlinux.lds.S | 4 ++++

If you ever find yourself modifying multiple arch linker scripts for a
series, something has gone wrong. ;)

-- 
Kees Cook
Re: [PATCH v5 01/12] vmlinux.lds.h: add linker section for KUnit test suites
On Fri, Jun 26, 2020 at 02:09:06PM -0700, Brendan Higgins wrote:
> Add a linker section where KUnit can put references to its test suites.
> This patch is the first step in transitioning to dispatching all KUnit
> tests from a centralized executor rather than having each as its own
> separate late_initcall.
>
> Co-developed-by: Iurii Zaikin
> Signed-off-by: Iurii Zaikin
> Signed-off-by: Brendan Higgins
> Reviewed-by: Stephen Boyd
> ---
>  include/asm-generic/vmlinux.lds.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index db600ef218d7d..4f9b036fc9616 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -881,6 +881,13 @@
>  	KEEP(*(.con_initcall.init))					\
>  	__con_initcall_end = .;
>
> +/* Alignment must be consistent with (kunit_suite *) in include/kunit/test.h */

Nit on naming:

> +#define KUNIT_TEST_SUITES						\

I would call this KUNIT_TABLE to maintain the same names as other things
of this nature.

> +	. = ALIGN(8);							\
> +	__kunit_suites_start = .;					\
> +	KEEP(*(.kunit_test_suites))					\
> +	__kunit_suites_end = .;
> +
>  #ifdef CONFIG_BLK_DEV_INITRD
>  #define INIT_RAM_FS							\
>  	. = ALIGN(4);							\
> @@ -1056,6 +1063,7 @@
>  	INIT_CALLS							\
>  	CON_INITCALL							\
>  	INIT_RAM_FS							\
> +	KUNIT_TEST_SUITES						\
>  }

Nack: this must be in INIT_DATA, not in INIT_DATA_SECTION. Not all
architectures use the INIT_DATA_SECTION macro (e.g. arm64), but
everything uses INIT_DATA.

-- 
Kees Cook
[PATCH 11/11] ppc64/kexec_file: add appropriate regions for memory reserve map
While initrd, elfcorehdr and backup regions are already added to the
reserve map, there are a few missing regions that need to be added to
the memory reserve map. Add them here. And now that all the changes
needed to load the panic kernel are in place, claim support for it.

Signed-off-by: Hari Bathini
---
 arch/powerpc/kexec/file_load_64.c | 61 ++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 58fc2d8..813453d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -185,6 +185,38 @@ static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
 }
 
 /**
+ * get_reserved_memory_ranges - Get reserve memory ranges. This list includes
+ *                              memory regions that should be added to the
+ *                              memory reserve map to ensure the region is
+ *                              protected from any mischief.
+ * @mem_ranges:                 Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_reserved_memory_ranges(struct crash_mem **mem_ranges)
+{
+	int ret;
+
+	ret = add_rtas_mem_range(mem_ranges, false);
+	if (ret)
+		goto out;
+
+	ret = add_opal_mem_range(mem_ranges, false);
+	if (ret)
+		goto out;
+
+	ret = add_tce_mem_ranges(mem_ranges);
+	if (ret)
+		goto out;
+
+	ret = add_reserved_ranges(mem_ranges);
+out:
+	if (ret)
+		pr_err("Failed to setup reserved memory ranges\n");
+	return ret;
+}
+
+/**
  * __locate_mem_hole_ppc64 - Tests if the memory hole between buf_min & buf_max
  *                           is large enough for the buffer. If true, sets
  *                           kbuf->mem to the buffer.
@@ -1182,8 +1214,8 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
 			unsigned long initrd_load_addr,
 			unsigned long initrd_len, const char *cmdline)
 {
-	struct crash_mem *umem = NULL;
-	int chosen_node, ret;
+	struct crash_mem *umem = NULL, *rmem = NULL;
+	int i, chosen_node, ret;
 
 	/* Remove memory reservation for the current device tree. */
 	ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
@@ -1229,6 +1261,24 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
 		}
 	}
 
+	/* Update memory reserve map */
+	ret = get_reserved_memory_ranges(&rmem);
+	if (ret)
+		goto out;
+
+	for (i = 0; i < rmem->nr_ranges; i++) {
+		u64 base, size;
+
+		base = rmem->ranges[i].start;
+		size = rmem->ranges[i].end - base + 1;
+		ret = fdt_add_mem_rsv(fdt, base, size);
+		if (ret) {
+			pr_err("Error updating memory reserve map: %s\n",
+			       fdt_strerror(ret));
+			goto out;
+		}
+	}
+
 	ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline,
 			    &chosen_node);
 	if (ret)
@@ -1239,6 +1289,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
 		pr_err("Failed to update device-tree with linux,booted-from-kexec\n");
 out:
 	kfree(umem);
+	kfree(rmem);
 	return ret;
 }
 
@@ -1378,10 +1429,10 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
 
 		/* Get exclude memory ranges needed for setting up kdump segments */
 		ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-		if (ret)
+		if (ret) {
 			pr_err("Failed to setup exclude memory ranges for buffer lookup\n");
-		/* Return this until all changes for panic kernel are in */
-		return -EOPNOTSUPP;
+			return ret;
+		}
 	}
 
 	return kexec_image_probe_default(image, buf, buf_len);
[PATCH 09/11] ppc64/kexec_file: setup backup region for kdump kernel
Though the kdump kernel boots from the loaded address, the first 64K
bytes of it is copied down to real 0. So, set up a backup region to copy
the first 64K bytes of the crashed kernel, in purgatory, before booting
into the kdump kernel. Also, update the reserve map with the backup
region and the crashed kernel's memory to avoid the kdump kernel
accidentally using that memory.

Signed-off-by: Hari Bathini
---
 arch/powerpc/include/asm/crashdump-ppc64.h |  5 +
 arch/powerpc/include/asm/kexec.h           |  7 ++
 arch/powerpc/kexec/elf_64.c                |  9 +++
 arch/powerpc/kexec/file_load_64.c          | 96 ++++++++++++++++++++++
 arch/powerpc/purgatory/Makefile            | 28 ++++--
 arch/powerpc/purgatory/purgatory_64.c      | 35 ++++++++
 arch/powerpc/purgatory/trampoline_64.S     | 23 ++++-
 7 files changed, 195 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/purgatory/purgatory_64.c

diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h b/arch/powerpc/include/asm/crashdump-ppc64.h
index 3596c25..504a579 100644
--- a/arch/powerpc/include/asm/crashdump-ppc64.h
+++ b/arch/powerpc/include/asm/crashdump-ppc64.h
@@ -2,6 +2,11 @@
 #ifndef _ARCH_POWERPC_KEXEC_CRASHDUMP_PPC64_H
 #define _ARCH_POWERPC_KEXEC_CRASHDUMP_PPC64_H
 
+/* Backup region - first 64K bytes of System RAM. */
+#define BACKUP_SRC_START	0
+#define BACKUP_SRC_END		0xffff
+#define BACKUP_SRC_SIZE		(BACKUP_SRC_END - BACKUP_SRC_START + 1)
+
 /* min & max addresses for kdump load segments */
 #define KDUMP_BUF_MIN		(crashk_res.start)
 #define KDUMP_BUF_MAX		((crashk_res.end < ppc64_rma_size) ? \
[...]
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index e78cd0a..037cf2b 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -109,6 +109,9 @@ extern const struct kexec_file_ops kexec_elf64_ops;
 struct kimage_arch {
 	struct crash_mem *exclude_ranges;
 
+	unsigned long backup_start;
+	void *backup_buf;
+
 #ifdef CONFIG_IMA_KEXEC
 	phys_addr_t ima_buffer_addr;
 	size_t ima_buffer_size;
@@ -124,6 +127,10 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
+struct kexec_buf;
+
+int load_crashdump_segments_ppc64(struct kimage *image,
+				  struct kexec_buf *kbuf);
 int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
 			  const void *fdt, unsigned long kernel_load_addr,
 			  unsigned long fdt_load_addr);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index c695f94..4838b42 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -67,6 +67,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 
 	pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
+	/* Setup additional segments needed for panic kernel */
+	if (image->type == KEXEC_TYPE_CRASH) {
+		ret = load_crashdump_segments_ppc64(image, &kbuf);
+		if (ret) {
+			pr_err("Failed to load kdump kernel segments\n");
+			goto out;
+		}
+	}
+
 	if (initrd != NULL) {
 		kbuf.buffer = initrd;
 		kbuf.bufsz = kbuf.memsz = initrd_len;
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 8e66c28..d7d3841 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -19,6 +19,7 @@
[...]
@@ -863,6 +864,70 @@ static int kexec_do_relocs_ppc64(unsigned long my_r2, const Elf_Sym *sym,
 }
 
 /**
+ * load_backup_segment - Initialize backup segment of crashing kernel.
+ * @image: Kexec image.
+ * @kbuf:  Buffer contents and memory parameters.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int load_backup_segment(struct kimage *image, struct kexec_buf *kbuf)
+{
+	void *buf;
+	int ret;
+
+	/* Setup a segment for backup region */
+	buf = vzalloc(BACKUP_SRC_SIZE);
+	if (!buf)
+		return -ENOMEM;
+
+	/*
+	 * A source buffer has no meaning for backup region as data will
+	 * be copied from backup source, after crash, in the purgatory.
+	 * But as load segment code doesn't recognize such segments,
+	 * setup a dummy source buffer to keep it happy for now.
+	 */
+	kbuf->buffer = buf;
+	kbuf->mem = KEXEC_BUF_MEM_UNKNOWN;
+	kbuf->bufsz = kbuf->memsz = BACKUP_SRC_SIZE;
+	kbuf->top_down = false;
+
+	ret = kexec_add_buffer(kbuf);
+	if (ret) {
+		vfree(buf);
+		return ret;
+	}
+
+	image->arch.backup_buf = buf;
+	image->arch.backup_start = kbuf->mem;
+	return 0;
[...]
[PATCH 08/11] ppc64/kexec_file: setup the stack for purgatory
To avoid any weird errors, the purgatory should run with its own stack.
Set one up by adding the stack buffer to the .data section of the
purgatory.

Signed-off-by: Hari Bathini
---
 arch/powerpc/include/asm/kexec.h       |  4 ++++
 arch/powerpc/kexec/file_load_64.c      | 14 +++++++++++++-
 arch/powerpc/purgatory/trampoline_64.S | 15 +++++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index bf47a01..e78cd0a 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -45,6 +45,10 @@
 #define KEXEC_ARCH KEXEC_ARCH_PPC
 #endif
 
+#ifdef CONFIG_KEXEC_FILE
+#define KEXEC_PURGATORY_STACK_SIZE	16384	/* 16KB stack size */
+#endif
+
 #define KEXEC_STATE_NONE 0
 #define KEXEC_STATE_IRQS_OFF 1
 #define KEXEC_STATE_REAL_MODE 2
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 4430336..8e66c28 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -878,7 +878,8 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
 			  const void *fdt, unsigned long kernel_load_addr,
 			  unsigned long fdt_load_addr)
 {
-	uint64_t toc_ptr;
+	uint64_t toc_ptr, stack_top;
+	void *stack_buf;
 	int ret;
 
 	ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
@@ -901,6 +902,17 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
 		goto out;
 	}
 
+	/* Setup the stack top */
+	stack_buf = kexec_purgatory_get_symbol_addr(image, "stack_buf");
+	if (!stack_buf)
+		goto out;
+
+	stack_top = (u64)stack_buf + KEXEC_PURGATORY_STACK_SIZE;
+	ret = kexec_purgatory_get_set_symbol(image, "stack", &stack_top,
+					     sizeof(stack_top), false);
+	if (ret)
+		goto out;
+
 	/* Setup the TOC pointer */
 	toc_ptr = get_toc_ptr(image->purgatory_info.ehdr);
 	ret = kexec_purgatory_get_set_symbol(image, "my_toc", &toc_ptr,
diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
index 7b4a5f7..80615b4 100644
--- a/arch/powerpc/purgatory/trampoline_64.S
+++ b/arch/powerpc/purgatory/trampoline_64.S
@@ -9,6 +9,7 @@
  * Copyright (C) 2013, Anton Blanchard, IBM Corporation
  */
 
[...]
 
 	.machine ppc64
@@ -53,6 +54,8 @@ master:
 
 	ld	%r2,(my_toc - 0b)(%r18)	/* setup toc */
 
+	ld	%r1,(stack - 0b)(%r18)	/* setup stack */
+
 	/* load device-tree address */
 	ld	%r3, (dt_offset - 0b)(%r18)
 	mr	%r16,%r3	/* save dt address in reg16 */
@@ -111,6 +114,12 @@ my_toc:
 	.8byte	0x0
 	.size my_toc, . - my_toc
 
+	.balign 8
+	.globl stack
+stack:
+	.8byte	0x0
+	.size stack, . - stack
+
 	.data
 	.balign 8
 	.globl purgatory_sha256_digest
@@ -123,3 +132,9 @@ purgatory_sha256_digest:
 purgatory_sha_regions:
 	.skip	8 * 2 * 16
 	.size purgatory_sha_regions, . - purgatory_sha_regions
+
+	.balign 8
+	.globl stack_buf
+stack_buf:
+	.skip	KEXEC_PURGATORY_STACK_SIZE
+	.size stack_buf, . - stack_buf
[PATCH 07/11] ppc64/kexec_file: add support to relocate purgatory
Right now the purgatory implementation is only minimal. But if purgatory
code is to be enhanced to copy memory to the backup region and verify
the sha256 digest, relocations may have to be applied to the purgatory.
So, add support to relocate the purgatory in the kexec_file_load system
call by setting up the TOC pointer and applying RELA relocations as
needed.

Signed-off-by: Hari Bathini
---
 arch/powerpc/kexec/file_load_64.c      | 338 +++++++++++++++++++++++++
 arch/powerpc/purgatory/trampoline_64.S |   8 +
 2 files changed, 346 insertions(+)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index d85cba4d..4430336 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -19,6 +19,7 @@
[...]
@@ -626,6 +627,242 @@ static int update_usable_mem_fdt(void *fdt, struct crash_mem *usable_mem)
 }
 
 /**
+ * get_toc_section - Look for ".toc" symbol and return the corresponding section
+ * @ehdr: ELF header.
+ *
+ * Returns TOC section on success, NULL otherwise.
+ */
+static const Elf_Shdr *get_toc_section(const Elf_Ehdr *ehdr)
+{
+	const Elf_Shdr *sechdrs;
+	const char *secstrings;
+	int i;
+
+	if (!ehdr) {
+		pr_err("Purgatory elf load info missing?\n");
+		return NULL;
+	}
+
+	sechdrs = (void *)ehdr + ehdr->e_shoff;
+	secstrings = (void *)ehdr + sechdrs[ehdr->e_shstrndx].sh_offset;
+
+	for (i = 0; i < ehdr->e_shnum; i++) {
+		if ((sechdrs[i].sh_size != 0) &&
+		    (strcmp(secstrings + sechdrs[i].sh_name, ".toc") == 0)) {
+			/* Return the ".toc" section */
+			pr_debug("TOC section number is %d\n", i);
+			return &sechdrs[i];
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * get_toc_ptr - r2 is the TOC pointer: it points 0x8000 into the TOC
+ * @ehdr: ELF header.
+ *
+ * Returns r2 on success, 0 otherwise.
+ */
+static unsigned long get_toc_ptr(const Elf_Ehdr *ehdr)
+{
+	const Elf_Shdr *sechdr;
+
+	sechdr = get_toc_section(ehdr);
+	if (!sechdr) {
+		pr_err("Could not get the TOC section!\n");
+		return 0;
+	}
+
+	return sechdr->sh_addr + 0x8000;
+}
+
+/* Helper functions to apply relocations */
+static int do_relative_toc(unsigned long val, uint16_t *loc,
+			   unsigned long mask, int complain_signed)
+{
+	if (complain_signed && (val + 0x8000 > 0xffff)) {
+		pr_err("TOC16 relocation overflows (%lu)\n", val);
+		return -ENOEXEC;
+	}
+
+	if ((~mask & 0xffff) & val) {
+		pr_err("Bad TOC16 relocation (%lu)\n", val);
+		return -ENOEXEC;
+	}
+
+	*loc = (*loc & ~mask) | (val & mask);
+	return 0;
+}
+
+#ifdef PPC64_ELF_ABI_v2
+/* PowerPC64 specific values for the Elf64_Sym st_other field. */
+#define STO_PPC64_LOCAL_BIT	5
+#define STO_PPC64_LOCAL_MASK	(7 << STO_PPC64_LOCAL_BIT)
+#define PPC64_LOCAL_ENTRY_OFFSET(other)					\
+	(((1 << (((other) & STO_PPC64_LOCAL_MASK) >> STO_PPC64_LOCAL_BIT)) \
+	  >> 2) << 2)
+
+static unsigned int local_entry_offset(const Elf64_Sym *sym)
+{
+	/* If this symbol has a local entry point, use it. */
+	return PPC64_LOCAL_ENTRY_OFFSET(sym->st_other);
+}
+#else
+static unsigned int local_entry_offset(struct mem_sym *UNUSED(sym))
+{
+	return 0;
+}
+#endif
+
+/**
+ * kexec_do_relocs_ppc64 - Apply relocations based on relocation type.
+ * @my_r2:  TOC pointer.
+ * @sym:    Symbol to relocate.
+ * @r_type: Relocation type.
+ * @loc:    Location to modify.
+ * @val:    Relocated symbol value.
+ * @addr:   Final location after relocation.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int kexec_do_relocs_ppc64(unsigned long my_r2, const Elf_Sym *sym,
+				 int r_type, void *loc, unsigned long val,
+				 unsigned long addr)
+{
+	int ret = 0;
+
+	switch (r_type) {
+	case R_PPC64_ADDR32:
+		/* Simply set it */
+		*(uint32_t *)loc = val;
+		break;
+
+	case R_PPC64_ADDR64:
+		/* Simply set it */
+		*(uint64_t *)loc = val;
+		break;
+
+	case R_PPC64_REL64:
+		*(uint64_t *)loc = val - (uint64_t)loc;
+		break;
+
+	case R_PPC64_REL32:
+		/* Convert value to relative */
+		val -= (unsigned long)loc;
+		if (val + 0x80000000 > 0xffffffff) {
+			pr_err("REL32 %li out of range!\n", val);
+			return -ENOEXEC;
+		}
+
+		*(uint32_t *)loc = val;
+		break;
[...]
[PATCH 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel
The kdump kernel, used for capturing the kernel core image, is supposed
to use only specific memory regions to avoid corrupting the image to be
captured. The regions are the crashkernel range - the memory reserved
explicitly for the kdump kernel, memory used for the tce-table, the
OPAL region and the RTAS region as applicable. Restrict kdump kernel
memory to use only these regions by setting up the usable-memory DT
property. Also, tell the kdump kernel to run at the loaded address by
setting the magic word at 0x5c.

Signed-off-by: Hari Bathini
---
 arch/powerpc/kexec/file_load_64.c | 400 +++++++++++++++++++++++++++++-
 1 file changed, 398 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index f1d7160..d85cba4d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,9 +17,21 @@
[...]
 
+struct umem_info {
+	uint64_t *buf;		/* data buffer for usable-memory property */
+	uint32_t idx;		/* current index */
+	uint32_t size;		/* size allocated for the data buffer */
+
+	/* usable memory ranges to look up */
+	const struct crash_mem *umrngs;
+};
+
 const struct kexec_file_ops * const kexec_file_loaders[] = {
 	&kexec_elf64_ops,
 	NULL
 };
 
@@ -75,6 +87,38 @@ static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
 }
 
 /**
+ * get_usable_memory_ranges - Get usable memory ranges. This list includes
+ *                            regions like crashkernel, opal/rtas & tce-table,
+ *                            that kdump kernel could use.
+ * @mem_ranges:               Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
+{
+	int ret;
+
+	/* First memory block & crashkernel region */
+	ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
+	if (ret)
+		goto out;
+
+	ret = add_rtas_mem_range(mem_ranges, false);
+	if (ret)
+		goto out;
+
+	ret = add_opal_mem_range(mem_ranges, false);
+	if (ret)
+		goto out;
+
+	ret = add_tce_mem_ranges(mem_ranges);
+out:
+	if (ret)
+		pr_err("Failed to setup usable memory ranges\n");
+	return ret;
+}
+
+/**
  * __locate_mem_hole_ppc64 - Tests if the memory hole between buf_min & buf_max
  *                           is large enough for the buffer. If true, sets
  *                           kbuf->mem to the buffer.
@@ -267,6 +311,321 @@ static int kexec_locate_mem_hole_ppc64(struct kexec_buf *kbuf)
 }
 
 /**
+ * check_realloc_usable_mem - Reallocate buffer if it can't accommodate entries
+ * @um_info: Usable memory buffer and ranges info.
+ * @cnt:     No. of entries to accommodate.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static uint64_t *check_realloc_usable_mem(struct umem_info *um_info, int cnt)
+{
+	void *tbuf;
+
+	if (um_info->size >=
+	    ((um_info->idx + cnt) * sizeof(*(um_info->buf))))
+		return um_info->buf;
+
+	um_info->size += MEM_RANGE_CHUNK_SZ;
+	tbuf = krealloc(um_info->buf, um_info->size, GFP_KERNEL);
+	if (!tbuf) {
+		um_info->size -= MEM_RANGE_CHUNK_SZ;
+		return NULL;
+	}
+
+	memset(tbuf + um_info->idx, 0, MEM_RANGE_CHUNK_SZ);
+	return tbuf;
+}
+
+/**
+ * add_usable_mem - Add the usable memory ranges within the given memory range
+ *                  to the buffer
+ * @um_info: Usable memory buffer and ranges info.
+ * @base:    Base address of memory range to look for.
+ * @end:     End address of memory range to look for.
+ * @cnt:     No. of usable memory ranges added to buffer.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int add_usable_mem(struct umem_info *um_info, uint64_t base,
+			  uint64_t end, int *cnt)
+{
+	uint64_t loc_base, loc_end, *buf;
+	const struct crash_mem *umrngs;
+	int i, add;
+
+	*cnt = 0;
+	umrngs = um_info->umrngs;
+	for (i = 0; i < umrngs->nr_ranges; i++) {
+		add = 0;
+		loc_base = umrngs->ranges[i].start;
+		loc_end = umrngs->ranges[i].end;
+		if (loc_base >= base && loc_end <= end)
+			add = 1;
+		else if (base < loc_end && end > loc_base) {
+			if (loc_base < base)
+				loc_base = base;
+			if (loc_end > end)
+				loc_end = end;
+			add = 1;
+		}
+
+		if (add) {
+			buf = check_realloc_usable_mem(um_info, 2);
+			if (!buf)
+				return -ENOMEM;
+
[...]
[PATCH 05/11] powerpc/drmem: make lmb walk a bit more flexible
Currently, numa & prom are the users of the drmem LMB walk code. Loading
kdump with kexec_file also needs to walk the drmem LMBs to set up the
usable memory ranges for the kdump kernel. But there are a couple of
issues in using the code as is. One, walk_drmem_lmb() code is built into
the .init section currently, while kexec_file needs it later. Two, there
is no scope to pass data to the callback function for processing and/or
erroring out on certain conditions.

Fix that by moving the drmem LMB walk code out of the .init section,
adding scope to pass data to the callback function and bailing out when
an error is encountered in the callback function.

Signed-off-by: Hari Bathini
---
 arch/powerpc/include/asm/drmem.h |  9 ++--
 arch/powerpc/kernel/prom.c       | 13 +++---
 arch/powerpc/mm/drmem.c          | 87 ++++++++++++++++++++----------
 arch/powerpc/mm/numa.c           | 13 +++---
 4 files changed, 78 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209..17ccc64 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -90,13 +90,14 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 }
 
 u64 drmem_lmb_memory_max(void);
-void __init walk_drmem_lmbs(struct device_node *dn,
-			void (*func)(struct drmem_lmb *, const __be32 **));
+int walk_drmem_lmbs(struct device_node *dn, void *data,
+		    int (*func)(struct drmem_lmb *, const __be32 **, void *));
 int drmem_update_dt(void);
 
 #ifdef CONFIG_PPC_PSERIES
-void __init walk_drmem_lmbs_early(unsigned long node,
-			void (*func)(struct drmem_lmb *, const __be32 **));
+int __init
+walk_drmem_lmbs_early(unsigned long node, void *data,
+		      int (*func)(struct drmem_lmb *, const __be32 **, void *));
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f2..7df78de 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -468,8 +468,9 @@ static bool validate_mem_limit(u64 base, u64 *size)
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
-					const __be32 **usm)
+static int __init early_init_drmem_lmb(struct drmem_lmb *lmb,
+				       const __be32 **usm,
+				       void *data)
 {
 	u64 base, size;
 	int is_kexec_kdump = 0, rngs;
@@ -484,7 +485,7 @@ static int __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 	 */
 	if ((lmb->flags & DRCONF_MEM_RESERVED) ||
 	    !(lmb->flags & DRCONF_MEM_ASSIGNED))
-		return;
+		return 0;
 
 	if (*usm)
 		is_kexec_kdump = 1;
@@ -499,7 +500,7 @@ static int __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 		 */
 		rngs = dt_mem_next_cell(dt_root_size_cells, usm);
 		if (!rngs) /* there are no (base, size) duple */
-			return;
+			return 0;
 	}
 
 	do {
@@ -524,6 +525,8 @@ static int __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 		if (lmb->flags & DRCONF_MEM_HOTREMOVABLE)
 			memblock_mark_hotplug(base, size);
 	} while (--rngs);
+
+	return 0;
 }
 #endif /* CONFIG_PPC_PSERIES */
 
@@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
 #ifdef CONFIG_PPC_PSERIES
 	if (depth == 1 &&
 	    strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
-		walk_drmem_lmbs_early(node, early_init_drmem_lmb);
+		walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
 		return 0;
 	}
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327ce..b2eeea3 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -14,6 +14,8 @@
[...]
 
+static int n_root_addr_cells, n_root_size_cells;
+
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -189,12 +191,13 @@ int drmem_update_dt(void)
 	return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
 				       const __be32 **prop)
 {
 	const __be32 *p = *prop;
 
-	lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
+	lmb->base_addr = of_read_number(p, n_root_addr_cells);
+	p += n_root_addr_cells;
 	lmb->drc_index = of_read_number(p++, 1);
 
 	p++; /* skip reserved field */
@@ -205,29 +208,33 @@ static void read_drconf_v1_cell(struct drmem_lmb *lmb,
 	*prop = p;
 }
 
-static void __init __walk_drmem_v1_lmbs(const __be32
[...]
[PATCH 00/11] ppc64: enable kdump support for kexec_file_load syscall
This patch series enables kdump support for the kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired from kexec-tools
code but heavily modified for kernel consumption. There is scope to
expand the purgatory to verify the sha digest, but purgatory changes
are kept minimal in the interest of this series.

The first patch adds a weak arch_kexec_add_buffer function to override
the locate memory hole logic to suit arch needs. There are some special
regions in ppc64 which should be avoided while loading the buffer, and
there are multiple callers to kexec_add_buffer, making it too
complicated to maintain range sanity and use the generic lookup at the
same time.

The second patch marks ppc64 specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easy to
understand.

The next patch adds helper functions to set up the different memory
ranges needed for loading the kdump kernel, booting into it and
exporting the crashing kernel's elfcore.

The fourth patch overrides arch_kexec_add_buffer to locate a memory
hole for kdump segments by accounting for the special memory regions,
referred to as excluded memory ranges, and calls __kexec_add_buffer
with kbuf->mem set to skip the generic locate memory hole lookup.

The fifth patch moves walk_drmem_lmbs() out of the .init section with a
few changes to reuse it for setting up the kdump kernel's usable memory
ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs and
set the linux,drconf-usable-memory & linux,usable-memory properties in
order to restrict the kdump kernel's memory usage.

The seventh patch adds relocation support for the purgatory. Patch 8
sets up the stack for the purgatory. The next patch sets up the backup
region as a segment while loading the kdump kernel and teaches the
purgatory to copy it from source to destination. Patch 10 builds the
elfcore header for the running kernel & passes the info to the kdump
kernel via the "elfcorehdr=" parameter to export as the /proc/vmcore
file. The last patch sets up the memory reserve map for the kexec
kernel and also claims kdump support as all the necessary changes are
added.

Tested the changes successfully on P8, P9 lpars & an OpenPOWER box.

---

Hari Bathini (11):
      kexec_file: allow archs to handle special regions while locating memory hole
      powerpc/kexec_file: mark PPC64 specific code
      powerpc/kexec_file: add helper functions for getting memory ranges
      ppc64/kexec_file: avoid stomping memory used by special regions
      powerpc/drmem: make lmb walk a bit more flexible
      ppc64/kexec_file: restrict memory usage of kdump kernel
      ppc64/kexec_file: add support to relocate purgatory
      ppc64/kexec_file: setup the stack for purgatory
      ppc64/kexec_file: setup backup region for kdump kernel
      ppc64/kexec_file: prepare elfcore header for crashing kernel
      ppc64/kexec_file: add appropriate regions for memory reserve map

 arch/powerpc/include/asm/crashdump-ppc64.h |   15 +
 arch/powerpc/include/asm/drmem.h           |    9 
 arch/powerpc/include/asm/kexec.h           |   35 +
 arch/powerpc/include/asm/kexec_ranges.h    |   18 
 arch/powerpc/kernel/prom.c                 |   13 
 arch/powerpc/kexec/Makefile                |    2 
 arch/powerpc/kexec/elf_64.c                |   35 +
 arch/powerpc/kexec/file_load.c             |   78 +
 arch/powerpc/kexec/file_load_64.c          | 1461 ++++++++++++++++++++
 arch/powerpc/kexec/ranges.c                |  387 +++++
 arch/powerpc/mm/drmem.c                    |   87 +-
 arch/powerpc/mm/numa.c                     |   13 
 arch/powerpc/purgatory/Makefile            |   28 -
 arch/powerpc/purgatory/purgatory_64.c      |   35 +
 arch/powerpc/purgatory/trampoline.S        |  117 --
 arch/powerpc/purgatory/trampoline_64.S     |  153 ++
 include/linux/kexec.h                      |    5 
 kernel/kexec_file.c                        |   37 +
 18 files changed, 2327 insertions(+), 201 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h
 create mode 100644 arch/powerpc/include/asm/kexec_ranges.h
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 create mode 100644 arch/powerpc/kexec/ranges.c
 create mode 100644 arch/powerpc/purgatory/purgatory_64.c
 delete mode 100644 arch/powerpc/purgatory/trampoline.S
 create mode 100644 arch/powerpc/purgatory/trampoline_64.S
Re: [PATCH 0/8 v2] PCI: Align return values of PCIe capability and PCI accessors
On Mon, Jun 15, 2020 at 09:32:17AM +0200, refactormys...@gmail.com wrote: > From: Bolarinwa Olayemi Saheed > > > PATCH 1/8 to 7/8: > PCIBIOS_ error codes have positive values and they are passed down the > call hierarchy from accessors. For functions which are meant to return > only a negative value on failure, passing on this value is a bug. > To mitigate this, call pcibios_err_to_errno() before passing on the return > value from the PCIe capability accessors' call hierarchy. This function > converts any positive PCIBIOS_ error codes to negative generic error > values. > > PATCH 8/8: > The PCIe capability accessors can return 0, -EINVAL, or any PCIBIOS_ error > code. The PCI accessors, on the other hand, can only return 0 or a PCIBIOS_ > error code. This inconsistency among these accessors makes it harder for > callers to check for errors. > Return PCIBIOS_BAD_REGISTER_NUMBER instead of -EINVAL in all PCIe > capability accessors. > > MERGING: > These may all be merged via the PCI tree, since it is a collection of > similar fixes. This way they all get merged at once. 
> > Version 2: > * cc to maintainers and mailing lists > * Edit the Subject to conform with previous style > * reorder "Signed by" and "Suggested by" > * made spelling corrections > * fixed redundant initialisation in PATCH 3/8 > * include missing call to pcibios_err_to_errno() in PATCH 6/8 and 7/8 > > > Bolarinwa Olayemi Saheed (8): > dmaengine: ioatdma: Convert PCIBIOS_* errors to generic -E* errors > IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors > IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors > PCI: Convert PCIBIOS_* errors to generic -E* errors > scsi: smartpqi: Convert PCIBIOS_* errors to generic -E* errors > PCI/AER: Convert PCIBIOS_* errors to generic -E* errors > PCI/AER: Convert PCIBIOS_* errors to generic -E* errors > PCI: Align return values of PCIe capability and PCI accessors > > drivers/dma/ioat/init.c | 4 ++-- > drivers/infiniband/hw/hfi1/pcie.c | 18 +- > drivers/pci/access.c | 8 > drivers/pci/pci.c | 10 -- > drivers/pci/pcie/aer.c| 12 ++-- > drivers/scsi/smartpqi/smartpqi_init.c | 6 +- > 6 files changed, 42 insertions(+), 16 deletions(-) Since these are really fixing a single PCI API problem, not individual driver-related problems, I squashed the pcibios_err_to_errno() patches together (except IB/hfi1, since Jason will take those separately) and applied them to pci/misc, thanks! The squashed patch as applied is: commit d20df83b66cc ("PCI: Convert PCIe capability PCIBIOS errors to errno") Author: Bolarinwa Olayemi Saheed Date: Mon Jun 15 09:32:18 2020 +0200 PCI: Convert PCIe capability PCIBIOS errors to errno The PCI config accessors (pci_read_config_word(), et al) return PCIBIOS_SUCCESSFUL (zero) or positive error values like PCIBIOS_FUNC_NOT_SUPPORTED. The PCIe capability accessors (pcie_capability_read_word(), et al) similarly return PCIBIOS errors, but some callers assume they return generic errno values like -EINVAL. 
For example, the Myri-10G probe function returns a positive PCIBIOS error if the pcie_capability_clear_and_set_word() in pcie_set_readrq() fails: myri10ge_probe status = pcie_set_readrq return pcie_capability_clear_and_set_word if (status) return status A positive return from a PCI driver probe function would cause a "Driver probe function unexpectedly returned" warning from local_pci_probe() instead of the desired probe failure. Convert PCIBIOS errors to generic errno for all callers of: pcie_capability_read_word pcie_capability_read_dword pcie_capability_write_word pcie_capability_write_dword pcie_capability_set_word pcie_capability_set_dword pcie_capability_clear_word pcie_capability_clear_dword pcie_capability_clear_and_set_word pcie_capability_clear_and_set_dword that check the return code for anything other than zero. [bhelgaas: commit log, squash together] Suggested-by: Bjorn Helgaas Link: https://lore.kernel.org/r/20200615073225.24061-1-refactormys...@gmail.com Signed-off-by: Bolarinwa Olayemi Saheed Signed-off-by: Bjorn Helgaas diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c index 58d13564f88b..9a6a9ec3cf48 100644 --- a/drivers/dma/ioat/init.c +++ b/drivers/dma/ioat/init.c @@ -1195,13 +1195,13 @@ static int ioat3_dma_probe(struct ioatdma_device *ioat_dma, int dca) /* disable relaxed ordering */ err = pcie_capability_read_word(pdev, IOAT_DEVCTRL_OFFSET, ); if (err) - return err; + return pcibios_err_to_errno(err); /* clear relaxed ordering enable */ val16 &= ~IOAT_DEVCTRL_ROE; err = pcie_capability_write_word(pdev,
Re: [PATCH v2 5/6] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
On Fri, 2020-06-26 at 12:23 -0300, Leonardo Bras wrote: > On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote: > > As of today, if a DDW is created and can't map the whole partition, it's > > removed and the default DMA window "ibm,dma-window" is used instead. > > > > Usually this DDW is bigger than the default DMA window, so it would be > > better to make use of it instead. > > > > Signed-off-by: Leonardo Bras > > --- > > I tested this change with a 256GB DDW which did not map the whole > partition, with a MT27700 Family [ConnectX-4 Virtual Function]. > > I noticed the performance improvement is about the same as using DDW > with IOMMU bypass. > > 64 thread write throughput: +203.0% > 64 thread read throughput: +17.5% > 1 thread write throughput: +20.5% > 1 thread read throughput: +3.43% > Average write latency: -23.0% > Average read latency: -2.26% The above improvements are based on the default DMA window, which is currently used if DDW can't map the whole partition. Those values are an average of 20 tests for each environment, 30 seconds each test. I also did some intense testing, for 5 hour each: 64 thread write throughput 64 thread read throughput The throughput values are stable in the whole test, and I noticed no error on dmesg / journalctl.
Re: [PATCH v2 6/6] powerpc/pseries/iommu: Avoid errors when DDW starts at 0x00
On Wed, 2020-06-24 at 03:24 -0300, Leonardo Bras wrote: > As of today, enable_ddw() will return a non-null DMA address if the > created DDW maps the whole partition. If the address is valid, > iommu_bypass_supported_pSeriesLP() will consider iommu bypass enabled. > > This can cause some trouble if the DDW happens to start at 0x00. > > Instead of checking whether the address is non-null, check directly whether > the DDW maps the whole partition, so it can bypass iommu. > > Signed-off-by: Leonardo Bras This patch has a bug in it. I will rework it soon. Please keep reviewing patches 1-5. Best regards, Leonardo
[PATCH] selftests/powerpc: Purge extra count_pmc() calls of ebb selftests
An extra count on ebb_state.stats.pmc_count[PMC_INDEX(pmc)] is being performed when count_pmc() is used to reset PMCs on a few selftests. This extra pmc_count can occasionally invalidate results, such as the ones from cycles_test shown hereafter. ebb_check_count() failed with an above-the-upper-limit error due to the extra value on ebb_state.stats.pmc_count. Furthermore, this extra count is also indicated by an extra PMC1 trace_log on the output of the cycles test (as well as on pmc56_overflow_test): == ... [21]: counter = 8 [22]: register SPRN_MMCR0 = 0x8080 [23]: register SPRN_PMC1 = 0x8004 [24]: counter = 9 [25]: register SPRN_MMCR0 = 0x8080 [26]: register SPRN_PMC1 = 0x8004 [27]: counter = 10 [28]: register SPRN_MMCR0 = 0x8080 [29]: register SPRN_PMC1 = 0x8004 >> [30]: register SPRN_PMC1 = 0x451e PMC1 count (0x28546) above upper limit 0x283e8 (+0x15e) [FAIL] Test FAILED on line 52 failure: cycles == Signed-off-by: Desnes A. Nunes do Rosario --- .../selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c | 2 -- tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c | 2 -- .../selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c| 2 -- .../selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c | 2 -- tools/testing/selftests/powerpc/pmu/ebb/ebb.c | 2 -- .../selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c | 2 -- .../selftests/powerpc/pmu/ebb/lost_exception_test.c| 1 - .../testing/selftests/powerpc/pmu/ebb/multi_counter_test.c | 7 --- .../selftests/powerpc/pmu/ebb/multi_ebb_procs_test.c | 2 -- .../testing/selftests/powerpc/pmu/ebb/pmae_handling_test.c | 2 -- .../selftests/powerpc/pmu/ebb/pmc56_overflow_test.c| 2 -- 11 files changed, 26 deletions(-) diff --git a/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c b/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c index a2d7b0e3dca9..a26ac122c759 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/back_to_back_ebbs_test.c 
@@ -91,8 +91,6 @@ int back_to_back_ebbs(void) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); - dump_ebb_state(); event_close(); diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c index bc893813483e..bb9f587fa76e 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_test.c @@ -42,8 +42,6 @@ int cycles(void) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); - dump_ebb_state(); event_close(); diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c index dcd351d20328..9ae795ce314e 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_freeze_test.c @@ -99,8 +99,6 @@ int cycles_with_freeze(void) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); - dump_ebb_state(); printf("EBBs while frozen %d\n", ebbs_while_frozen); diff --git a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c index 94c99c12c0f2..4b45a2e70f62 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/cycles_with_mmcr2_test.c @@ -71,8 +71,6 @@ int cycles_with_mmcr2(void) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); - dump_ebb_state(); event_close(); diff --git a/tools/testing/selftests/powerpc/pmu/ebb/ebb.c b/tools/testing/selftests/powerpc/pmu/ebb/ebb.c index dfbc5c3ad52d..21537d6eb6b7 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/ebb.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/ebb.c @@ -396,8 +396,6 @@ int ebb_child(union pipe read_pipe, union pipe write_pipe) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); - 
dump_ebb_state(); event_close(); diff --git a/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c b/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c index ca2f7d729155..b208bf6ad58d 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c +++ b/tools/testing/selftests/powerpc/pmu/ebb/ebb_on_willing_child_test.c @@ -38,8 +38,6 @@ static int victim_child(union pipe read_pipe, union pipe write_pipe) ebb_global_disable(); ebb_freeze_pmcs(); - count_pmc(1, sample_period); -
Re: [PATCH v2 00/15] Documentation fixes
On Tue, 23 Jun 2020 09:08:56 +0200 Mauro Carvalho Chehab wrote: > As requested, this is a rebase of a previous series posted on Jan, 15. > > Since then, several patches got merged via other trees or became > obsolete. There were also 2 patches before that fits better at the > ReST conversion patchset. So, I'll be sending it on another patch > series together with the remaining ReST conversions. > > I also added reviews/acks received. > > So, the series reduced from 29 to 15 patches. > > Let's hope b4 would be able to properly handle this one. Nope. I don't know what it is about your patch series, but b4 is never able to put them together. I've applied the series except for #1, which already went through the -mm tree. Thanks, jon
[PATCH v2 1/3] powerpc/mm: Enable radix GTSE only if supported.
Make GTSE an MMU feature and enable it by default for radix. However for guest, conditionally enable it if hypervisor supports it via OV5 vector. Let prom_init ask for radix GTSE only if the support exists. Having GTSE as an MMU feature will make it easy to enable radix without GTSE. Currently radix assumes GTSE is enabled by default. Signed-off-by: Bharata B Rao --- arch/powerpc/include/asm/mmu.h| 4 arch/powerpc/kernel/dt_cpu_ftrs.c | 1 + arch/powerpc/kernel/prom_init.c | 13 - arch/powerpc/mm/init_64.c | 5 - 4 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h index f4ac25d4df05..884d51995934 100644 --- a/arch/powerpc/include/asm/mmu.h +++ b/arch/powerpc/include/asm/mmu.h @@ -28,6 +28,9 @@ * Individual features below. */ +/* Guest Translation Shootdown Enable */ +#define MMU_FTR_GTSE ASM_CONST(0x1000) + /* * Support for 68 bit VA space. We added that from ISA 2.05 */ @@ -173,6 +176,7 @@ enum { #endif #ifdef CONFIG_PPC_RADIX_MMU MMU_FTR_TYPE_RADIX | + MMU_FTR_GTSE | #ifdef CONFIG_PPC_KUAP MMU_FTR_RADIX_KUAP | #endif /* CONFIG_PPC_KUAP */ diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c index a0edeb391e3e..ac650c233cd9 100644 --- a/arch/powerpc/kernel/dt_cpu_ftrs.c +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c @@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct dt_cpu_feature *f) #ifdef CONFIG_PPC_RADIX_MMU cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX; cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE; + cur_cpu_spec->mmu_features |= MMU_FTR_GTSE; cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU; return 1; diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c index 90c604d00b7d..cbc605cfdec0 100644 --- a/arch/powerpc/kernel/prom_init.c +++ b/arch/powerpc/kernel/prom_init.c @@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void) } } - if (supported.radix_mmu && supported.radix_gtse && - 
IS_ENABLED(CONFIG_PPC_RADIX_MMU)) { - /* Radix preferred - but we require GTSE for now */ - prom_debug("Asking for radix with GTSE\n"); + if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) { + /* Radix preferred - Check if GTSE is also supported */ + prom_debug("Asking for radix\n"); ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX); - ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE); + if (supported.radix_gtse) + ibm_architecture_vec.vec5.radix_ext = + OV5_FEAT(OV5_RADIX_GTSE); + else + prom_debug("Radix GTSE isn't supported\n"); } else if (supported.hash_mmu) { /* Default to hash mmu (if we can) */ prom_debug("Asking for hash\n"); diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index bc73abf0bc25..152aa0200cef 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -407,12 +407,15 @@ static void __init early_check_vec5(void) if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] & OV5_FEAT(OV5_RADIX_GTSE))) { pr_warn("WARNING: Hypervisor doesn't support RADIX with GTSE\n"); - } + cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE; + } else + cur_cpu_spec->mmu_features |= MMU_FTR_GTSE; /* Do radix anyway - the hypervisor said we had to */ cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX; } else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) { /* Hypervisor only supports hash - disable radix */ cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX; + cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE; } } -- 2.21.3
[PATCH v2 2/3] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled
H_REGISTER_PROC_TBL asks for GTSE by default. GTSE flag bit should be set only when GTSE is supported. Signed-off-by: Bharata B Rao --- arch/powerpc/platforms/pseries/lpar.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index fd26f3d21d7b..f82569a505f1 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned long base, if (table_size) flags |= PROC_TABLE_NEW; - if (radix_enabled()) - flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE; - else + if (radix_enabled()) { + flags |= PROC_TABLE_RADIX; + if (mmu_has_feature(MMU_FTR_GTSE)) + flags |= PROC_TABLE_GTSE; + } else flags |= PROC_TABLE_HPT_SLB; for (;;) { rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base, -- 2.21.3
Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available
Adding David, On 6/25/20 3:11 AM, Michael Ellerman wrote: > Nicholas Piggin writes: >> KVM supports msgsndp in guests by trapping and emulating the >> instruction, so it was decided to always use XIVE for IPIs if it is >> available. However on PowerVM systems, msgsndp can be used and gives >> better performance. On large systems, high XIVE interrupt rates can >> have sub-linear scaling, and using msgsndp can reduce the load on >> the interrupt controller. >> >> So switch to using core local doorbells even if XIVE is available. >> This reduces performance for KVM guests with an SMT topology by >> about 50% for ping-pong context switching between SMT vCPUs. > > You have to take explicit steps to configure KVM in that way with qemu. > eg. "qemu .. -smp 8" will give you 8 SMT1 CPUs by default. > >> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp >> to get KVM performance back. An option vector would require a PAPR change. Unless the architecture reserves some bits for the implementation, but I don't think so. Same for CAS. > Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at > that. Though adding PowerVM/KVM specific hacks is obviously a very > slippery slope. QEMU could advertise a property "emulated-msgsndp", or something similar, which would be interpreted by Linux as a CPU feature and taken into account when doing the IPIs. The CPU setup for XIVE needs a cleanup also. There is no need to allocate interrupts for IPIs anymore in that case. Thanks, C. 
> >> diff --git a/arch/powerpc/platforms/pseries/smp.c >> b/arch/powerpc/platforms/pseries/smp.c >> index 6891710833be..a737a2f87c67 100644 >> --- a/arch/powerpc/platforms/pseries/smp.c >> +++ b/arch/powerpc/platforms/pseries/smp.c >> @@ -188,13 +188,14 @@ static int pseries_smp_prepare_cpu(int cpu) >> return 0; >> } >> >> +static void (*cause_ipi_offcore)(int cpu) __ro_after_init; >> + >> static void smp_pseries_cause_ipi(int cpu) > > This is static so the name could be more descriptive, it doesn't need > the "smp_pseries" prefix. > >> { >> -/* POWER9 should not use this handler */ >> if (doorbell_try_core_ipi(cpu)) >> return; > > Seems like it would be worth making that static inline so we can avoid > the function call overhead. > >> -icp_ops->cause_ipi(cpu); >> +cause_ipi_offcore(cpu); >> } >> >> static int pseries_cause_nmi_ipi(int cpu) >> @@ -222,10 +223,7 @@ static __init void pSeries_smp_probe_xics(void) >> { >> xics_smp_probe(); >> >> -if (cpu_has_feature(CPU_FTR_DBELL) && !is_secure_guest()) >> -smp_ops->cause_ipi = smp_pseries_cause_ipi; >> -else >> -smp_ops->cause_ipi = icp_ops->cause_ipi; >> +smp_ops->cause_ipi = icp_ops->cause_ipi; >> } >> >> static __init void pSeries_smp_probe(void) >> @@ -238,6 +236,18 @@ static __init void pSeries_smp_probe(void) > > The comment just above here says: > > /* >* Don't use P9 doorbells when XIVE is enabled. IPIs >* using MMIOs should be faster >*/ >> xive_smp_probe(); > > Which is no longer true. > >> else >> pSeries_smp_probe_xics(); > > I think you should just fold this in, it would make the logic slightly > easier to follow. > >> +/* >> + * KVM emulates doorbells by reading the instruction, which >> + * can't be done if the guest is secure. If a secure guest >> + * runs under PowerVM, it could use msgsndp but would need a >> + * way to distinguish. >> + */ > > It's not clear what it needs to distinguish: That it's running under > PowerVM and therefore *can* use msgsndp even though it's secure. 
> > Also the comment just talks about the is_secure_guest() test, which is > not obvious on first reading. > >> +if (cpu_has_feature(CPU_FTR_DBELL) && >> +cpu_has_feature(CPU_FTR_SMT) && !is_secure_guest()) { >> +cause_ipi_offcore = smp_ops->cause_ipi; >> +smp_ops->cause_ipi = smp_pseries_cause_ipi; >> +} > > Because we're at the tail of the function I think this would be clearer > if it used early returns, eg: > > // If the CPU doesn't have doorbells then we must use xics/xive > if (!cpu_has_feature(CPU_FTR_DBELL)) > return; > > // If the CPU doesn't have SMT then doorbells don't help us > if (!cpu_has_feature(CPU_FTR_SMT)) > return; > > // Secure guests can't use doorbells because ... > if (!is_secure_guest() > return; > > /* > * Otherwise we want to use doorbells for sibling threads and > * xics/xive for IPIs off the core, because it performs better > * on large systems ... > */ > cause_ipi_offcore = smp_ops->cause_ipi; > smp_ops->cause_ipi = smp_pseries_cause_ipi; > } > > > cheers >
Re: [PATCH 09/13] x86: Remove dev->archdata.iommu pointer
On Thu, Jun 25, 2020 at 03:08:32PM +0200, Joerg Roedel wrote: > From: Joerg Roedel > > There are no users left, all drivers have been converted to use the > per-device private pointer offered by IOMMU core. > > Signed-off-by: Joerg Roedel > --- > arch/x86/include/asm/device.h | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/arch/x86/include/asm/device.h b/arch/x86/include/asm/device.h > index 49bd6cf3eec9..7c0a52ca2f4d 100644 > --- a/arch/x86/include/asm/device.h > +++ b/arch/x86/include/asm/device.h > @@ -3,9 +3,6 @@ > #define _ASM_X86_DEVICE_H > > struct dev_archdata { > -#ifdef CONFIG_IOMMU_API > - void *iommu; /* hook for IOMMU specific extension */ > -#endif > }; > > struct pdev_archdata { > -- Acked-by: Borislav Petkov -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
[PATCH v3 0/2] Add cpu hotplug support for powerpc/perf/hv-24x7
This patchset adds CPU hotplug support for the hv_24x7 driver by adding online/offline cpu hotplug functions. It also adds a sysfs file, "cpumask", to expose the current online CPU that can be used for hv_24x7 event counting. Changelog: v2 -> v3 - Corrected some typos and updated the commit message as suggested by Gautham R Shenoy. - Added Reviewed-by tag for the first patch in the patchset. v1 -> v2 - Changed the function to pick an active cpu in case of offline from "cpumask_any_but" to "cpumask_last", as the cpumask_any_but function picks the very next online cpu, and in a case where we are sequentially offlining multiple cpus, "pmu_migrate_context" can add extra latency. - Suggested by: Gautham R Shenoy. - Changed documentation for cpumask and, rather than hardcode the initialization for cpumask_attr_group, added a loop to get the very first NULL, as suggested by Gautham R Shenoy. Kajol Jain (2): powerpc/perf/hv-24x7: Add cpu hotplug support powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask .../sysfs-bus-event_source-devices-hv_24x7| 7 ++ arch/powerpc/perf/hv-24x7.c | 79 ++- include/linux/cpuhotplug.h| 1 + 3 files changed, 86 insertions(+), 1 deletion(-) -- 2.18.2
[PATCH v3 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask
This patch adds a cpumask attribute to the hv_24x7 PMU, along with ABI documentation. The primary use of exposing the cpumask is for the perf tool, which has the capability to parse the driver sysfs folder and understand the cpumask file. Having the cpumask file will reduce the number of perf command-line parameters (it avoids the "-C" option in the perf tool command line). It can also tell the user which CPU is currently used to retrieve the counter data. command:# cat /sys/devices/hv_24x7/cpumask 0 Signed-off-by: Kajol Jain --- .../sysfs-bus-event_source-devices-hv_24x7| 7 arch/powerpc/perf/hv-24x7.c | 36 +-- 2 files changed, 41 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 index e8698afcd952..f9dd3755b049 100644 --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 @@ -43,6 +43,13 @@ Description: read only This sysfs interface exposes the number of cores per chip present in the system. +What: /sys/devices/hv_24x7/cpumask +Date: June 2020 +Contact: Linux on PowerPC Developer List +Description: read only + This sysfs file exposes the cpumask which is designated to make + HCALLs to retrieve hv-24x7 pmu event counter data. 
+ What: /sys/bus/event_source/devices/hv_24x7/event_descs/ Date: February 2014 Contact: Linux on PowerPC Developer List diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c index ce4739e2b407..3c699612d29f 100644 --- a/arch/powerpc/perf/hv-24x7.c +++ b/arch/powerpc/perf/hv-24x7.c @@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev, return sprintf(buf, "%s\n", (char *)d->var); } +static ssize_t cpumask_get_attr(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return cpumap_print_to_pagebuf(true, buf, _24x7_cpumask); +} + static ssize_t sockets_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -1116,6 +1122,17 @@ static DEVICE_ATTR_RO(sockets); static DEVICE_ATTR_RO(chipspersocket); static DEVICE_ATTR_RO(coresperchip); +static DEVICE_ATTR(cpumask, S_IRUGO, cpumask_get_attr, NULL); + +static struct attribute *cpumask_attrs[] = { + _attr_cpumask.attr, + NULL, +}; + +static struct attribute_group cpumask_attr_group = { + .attrs = cpumask_attrs, +}; + static struct bin_attribute *if_bin_attrs[] = { _attr_catalog, NULL, @@ -1143,6 +1160,11 @@ static const struct attribute_group *attr_groups[] = { _desc_group, _long_desc_group, _group, + /* +* This NULL is a placeholder for the cpumask attr which will update +* onlyif cpuhotplug registration is successful +*/ + NULL, NULL, }; @@ -1683,7 +1705,7 @@ static int hv_24x7_cpu_hotplug_init(void) static int hv_24x7_init(void) { - int r; + int r, i = -1; unsigned long hret; struct hv_perf_caps caps; @@ -1727,8 +1749,18 @@ static int hv_24x7_init(void) /* init cpuhotplug */ r = hv_24x7_cpu_hotplug_init(); - if (r) + if (r) { pr_err("hv_24x7: CPU hotplug init failed\n"); + } else { + /* +* Cpu hotplug init is successful, add the +* cpumask file as part of pmu attr group and +* assign it to very first NULL location. 
+*/ + while (attr_groups[++i]) + /* nothing */; + attr_groups[i] = _attr_group; + } r = perf_pmu_register(_24x7_pmu, h_24x7_pmu.name, -1); if (r) -- 2.18.2
[PATCH v3 1/2] powerpc/perf/hv-24x7: Add cpu hotplug support
This patch adds cpu hotplug functions to the hv_24x7 PMU. A new cpuhp_state "CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE" enum is added. The online callback function updates the cpumask only if it is empty, as the primary intention of adding hotplug support is to designate a CPU to make the HCALL to collect the counter data. The offline function tests and clears the corresponding cpu in the cpumask and updates the cpumask to any other active cpu. Signed-off-by: Kajol Jain Reviewed-by: Gautham R. Shenoy --- arch/powerpc/perf/hv-24x7.c | 45 + include/linux/cpuhotplug.h | 1 + 2 files changed, 46 insertions(+) diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c index db213eb7cb02..ce4739e2b407 100644 --- a/arch/powerpc/perf/hv-24x7.c +++ b/arch/powerpc/perf/hv-24x7.c @@ -31,6 +31,8 @@ static int interface_version; /* Whether we have to aggregate result data for some domains. */ static bool aggregate_result_elements; +static cpumask_t hv_24x7_cpumask; + static bool domain_is_valid(unsigned domain) { switch (domain) { @@ -1641,6 +1643,44 @@ static struct pmu h_24x7_pmu = { .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; +static int ppc_hv_24x7_cpu_online(unsigned int cpu) +{ + /* Make this CPU the designated target for counter collection */ + if (cpumask_empty(_24x7_cpumask)) + cpumask_set_cpu(cpu, _24x7_cpumask); + + return 0; +} + +static int ppc_hv_24x7_cpu_offline(unsigned int cpu) +{ + int target = -1; + + /* Check if exiting cpu is used for collecting 24x7 events */ + if (!cpumask_test_and_clear_cpu(cpu, _24x7_cpumask)) + return 0; + + /* Find a new cpu to collect 24x7 events */ + target = cpumask_last(cpu_active_mask); + + if (target < 0 || target >= nr_cpu_ids) + return -1; + + /* Migrate 24x7 events to the new target */ + cpumask_set_cpu(target, _24x7_cpumask); + perf_pmu_migrate_context(_24x7_pmu, cpu, target); + + return 0; +} + +static int hv_24x7_cpu_hotplug_init(void) +{ + return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE, + "perf/powerpc/hv_24x7:online", + 
ppc_hv_24x7_cpu_online, + ppc_hv_24x7_cpu_offline); +} + static int hv_24x7_init(void) { int r; @@ -1685,6 +1725,11 @@ static int hv_24x7_init(void) if (r) return r; + /* init cpuhotplug */ + r = hv_24x7_cpu_hotplug_init(); + if (r) + pr_err("hv_24x7: CPU hotplug init failed\n"); + r = perf_pmu_register(_24x7_pmu, h_24x7_pmu.name, -1); if (r) return r; diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 191772d4a4d7..a2710e654b64 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -181,6 +181,7 @@ enum cpuhp_state { CPUHP_AP_PERF_POWERPC_CORE_IMC_ONLINE, CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE, CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE, + CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE, CPUHP_AP_WATCHDOG_ONLINE, CPUHP_AP_WORKQUEUE_ONLINE, CPUHP_AP_RCUTREE_ONLINE, -- 2.18.2
Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier
On Fri, May 22, 2020 at 09:01:17AM -0400, Mikulas Patocka wrote: > > > On Fri, 22 May 2020, Aneesh Kumar K.V wrote: > > > On 5/22/20 3:01 PM, Michal Suchánek wrote: > > > On Thu, May 21, 2020 at 02:52:30PM -0400, Mikulas Patocka wrote: > > > > > > > > > > > > On Thu, 21 May 2020, Dan Williams wrote: > > > > > > > > > On Thu, May 21, 2020 at 10:03 AM Aneesh Kumar K.V > > > > > wrote: > > > > > > > > > > > > > Moving on to the patch itself--Aneesh, have you audited other > > > > > > > persistent > > > > > > > memory users in the kernel? For example, > > > > > > > drivers/md/dm-writecache.c > > > > > > > does > > > > > > > this: > > > > > > > > > > > > > > static void writecache_commit_flushed(struct dm_writecache *wc, > > > > > > > bool > > > > > > > wait_for_ios) > > > > > > > { > > > > > > >if (WC_MODE_PMEM(wc)) > > > > > > >wmb(); <== > > > > > > > else > > > > > > > ssd_commit_flushed(wc, wait_for_ios); > > > > > > > } > > > > > > > > > > > > > > I believe you'll need to make modifications there. > > > > > > > > > > > > > > > > > > > Correct. Thanks for catching that. > > > > > > > > > > > > > > > > > > I don't understand dm much, wondering how this will work with > > > > > > non-synchronous DAX device? > > > > > > > > > > That's a good point. DM-writecache needs to be cognizant of things > > > > > like virtio-pmem that violate the rule that persisent memory writes > > > > > can be flushed by CPU functions rather than calling back into the > > > > > driver. It seems we need to always make the flush case a dax_operation > > > > > callback to account for this. > > > > > > > > dm-writecache is normally sitting on the top of dm-linear, so it would > > > > need to pass the wmb() call through the dm core and dm-linear target ... > > > > that would slow it down ... I remember that you already did it this way > > > > some times ago and then removed it. > > > > > > > > What's the exact problem with POWER? 
Could the POWER system have two > > > > types > > > > of persistent memory that need two different ways of flushing? > > > As far as I understand the discussion so far > > > > > > - on POWER $oldhardware uses $oldinstruction to ensure pmem consistency > > > - on POWER $newhardware uses $newinstruction to ensure pmem consistency > > > (compatible with $oldinstruction on $oldhardware) > > > > Correct. > > > > > - on some platforms instead of barrier instruction a callback into the > > > driver is issued to ensure consistency > > > > This is virtio-pmem only at this point IIUC. > > > > -aneesh > > And does the virtio-pmem driver track which pages are dirty? Or does it > need to specify the range of pages to flush in the flush function? > > > > None of this is reflected by the dm driver. > > We could make a new dax method: > void *(dax_get_flush_function)(void); > > This would return a pointer to "wmb()" on x86 and something else on Power. > > The method "dax_get_flush_function" would be called only once when > initializing the writecache driver (because the call would be slow because > it would have to go through the DM stack) and then, the returned function > would be called each time we need write ordering. The returned function > would do just "sfence; ret".

Hello,

as far as I understand the code, virtio_pmem has a flush function defined which indeed can make use of the region properties, such as memory range. If such a function exists, you need the equivalent of sync() - a call into the device in question. If it does not, calling arch_pmem_flush_barrier() instead of wmb() should suffice. I am not aware of an interface to determine if the flush function exists for a particular region.

Thanks

Michal
[PATCH v2 3/4] powerpc sstep: introduce macros to retrieve Prefix instruction operands
Retrieve the prefix instruction operands RA and the PC-relative bit R using macros, and adopt them in sstep.c and test_emulate_step.c.

Signed-off-by: Balamuruhan S
---
 arch/powerpc/include/asm/sstep.h |  4 ++++
 arch/powerpc/lib/sstep.c         | 12 ++++++------
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/sstep.h b/arch/powerpc/include/asm/sstep.h
index 3b01c69a44aa..325975b4ef30 100644
--- a/arch/powerpc/include/asm/sstep.h
+++ b/arch/powerpc/include/asm/sstep.h
@@ -104,6 +104,10 @@ enum instruction_type {

 #define MKOP(t, f, s)	((t) | (f) | SIZE(s))

+/* Prefix instruction operands */
+#define GET_PREFIX_RA(i)	(((i) >> 16) & 0x1f)
+#define GET_PREFIX_R(i)		((i) & (1ul << 20))
+
 struct instruction_op {
 	int type;
 	int reg;
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 5abe98216dc2..fb4c5767663d 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -200,8 +200,8 @@ static nokprobe_inline unsigned long mlsd_8lsd_ea(unsigned int instr,
 	unsigned int dd;
 	unsigned long ea, d0, d1, d;

-	prefix_r = instr & (1ul << 20);
-	ra = (suffix >> 16) & 0x1f;
+	prefix_r = GET_PREFIX_R(instr);
+	ra = GET_PREFIX_RA(suffix);

 	d0 = instr & 0x3ffff;
 	d1 = suffix & 0xffff;
@@ -1339,8 +1339,8 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 	switch (opcode) {
 #ifdef __powerpc64__
 	case 1:
-		prefix_r = word & (1ul << 20);
-		ra = (suffix >> 16) & 0x1f;
+		prefix_r = GET_PREFIX_R(word);
+		ra = GET_PREFIX_RA(suffix);
 		rd = (suffix >> 21) & 0x1f;
 		op->reg = rd;
 		op->val = regs->gpr[rd];
@@ -2715,8 +2715,8 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 		}
 		break;
 	case 1: /* Prefixed instructions */
-		prefix_r = word & (1ul << 20);
-		ra = (suffix >> 16) & 0x1f;
+		prefix_r = GET_PREFIX_R(word);
+		ra = GET_PREFIX_RA(suffix);
 		op->update_reg = ra;
 		rd = (suffix >> 21) & 0x1f;
 		op->reg = rd;
-- 
2.24.1
[PATCH v2 1/4] powerpc test_emulate_step: enhancement to test negative scenarios
Add a provision to declare a test as a negative scenario, verify that emulation fails, and avoid executing it.

Signed-off-by: Balamuruhan S
---
 arch/powerpc/lib/test_emulate_step.c | 30 +++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
index 0ca2b7cc8d8c..7c30a69c174f 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -118,6 +118,7 @@
 #define IGNORE_GPR(n)	(0x1UL << (n))
 #define IGNORE_XER	(0x1UL << 32)
 #define IGNORE_CCR	(0x1UL << 33)
+#define NEGATIVE_TEST	(0x1UL << 63)

 static void __init init_pt_regs(struct pt_regs *regs)
 {
@@ -1202,8 +1203,10 @@ static struct compute_test compute_tests[] = {
 };

 static int __init emulate_compute_instr(struct pt_regs *regs,
-					struct ppc_inst instr)
+					struct ppc_inst instr,
+					bool negative)
 {
+	int analysed;
 	extern s32 patch__exec_instr;
 	struct instruction_op op;

@@ -1212,13 +1215,17 @@ static int __init emulate_compute_instr(struct pt_regs *regs,

 	regs->nip = patch_site_addr(&patch__exec_instr);

-	if (analyse_instr(&op, regs, instr) != 1 ||
-	    GETTYPE(op.type) != COMPUTE) {
-		pr_info("execution failed, instruction = %s\n", ppc_inst_as_str(instr));
+	analysed = analyse_instr(&op, regs, instr);
+	if (analysed != 1 || GETTYPE(op.type) != COMPUTE) {
+		if (negative)
+			return -EFAULT;
+		pr_info("emulation failed, instruction = %s\n", ppc_inst_as_str(instr));
 		return -EFAULT;
 	}
-
-	emulate_update_regs(regs, &op);
+	if (analysed == 1 && negative)
+		pr_info("negative test failed, instruction = %s\n", ppc_inst_as_str(instr));
+	if (!negative)
+		emulate_update_regs(regs, &op);
 	return 0;
 }

@@ -1256,7 +1263,7 @@ static void __init run_tests_compute(void)
 	struct pt_regs *regs, exp, got;
 	unsigned int i, j, k;
 	struct ppc_inst instr;
-	bool ignore_gpr, ignore_xer, ignore_ccr, passed;
+	bool ignore_gpr, ignore_xer, ignore_ccr, passed, rc, negative;

 	for (i = 0; i < ARRAY_SIZE(compute_tests); i++) {
 		test = &compute_tests[i];
@@ -1270,6 +1277,7 @@ static void __init run_tests_compute(void)
 			instr = test->subtests[j].instr;
 			flags = test->subtests[j].flags;
 			regs = &test->subtests[j].regs;
+			negative = flags & NEGATIVE_TEST;
 			ignore_xer = flags & IGNORE_XER;
 			ignore_ccr = flags & IGNORE_CCR;
 			passed = true;
@@ -1284,8 +1292,12 @@ static void __init run_tests_compute(void)
 			exp.msr = MSR_KERNEL;
 			got.msr = MSR_KERNEL;

-			if (emulate_compute_instr(&exp, instr) ||
-			    execute_compute_instr(&got, instr)) {
+			rc = emulate_compute_instr(&exp, instr, negative) != 0;
+			if (negative) {
+				/* skip executing instruction */
+				passed = rc;
+				goto print;
+			} else if (rc || execute_compute_instr(&got, instr)) {
 				passed = false;
 				goto print;
 			}
-- 
2.24.1
Re: [bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")
On 6/26/20 1:17 PM, Aneesh Kumar K.V wrote:

Hi Jan,

On 6/26/20 12:29 PM, Jan Stancek wrote:
Hi,

LTP mmap03 is getting stuck in page fault loop after commit c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
	int page_sz = getpagesize();
	int fildes;
	char *addr;

	fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
	write(fildes, &fildes, sizeof(fildes));
	close(fildes);

	fildes = open("tempfile", O_RDONLY);
	unlink("tempfile");
	addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);
	printf("%d\n", *addr);
	return 0;
}
- >8 -

Thanks for the report. This is an execute-only key where the vma has implied read permission, so the patch does break this case. I will see how best we can handle PROT_EXEC and the multi-threaded test that required the change in the patch.

Can you check with this change? While checking for access permission, we are checking against the UAMOR value, which I think is wrong. We just need to look at the AMR and IAMR values to check whether access is permitted or not. Even if UAMOR denies userspace management of the key, we should still do the correct access check.

modified   arch/powerpc/mm/book3s64/pkeys.c
@@ -353,9 +353,6 @@ static bool pkey_access_permitted(int pkey, bool write, bool execute)
 	int pkey_shift;
 	u64 amr;

-	if (!is_pkey_enabled(pkey))
-		return true;
-
 	pkey_shift = pkeyshift(pkey);
 	if (execute && !(read_iamr() & (IAMR_EX_BIT << pkey_shift)))
 		return true;
Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available
>>> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp >>> to get KVM performance back. > > An option vector would require a PAPR change. Unless the architecture > reserves some bits for the implementation, but I don't think so. Same > for CAS. > >> Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at >> that. Though adding PowerVM/KVM specific hacks is obviously a very >> slippery slope. > > QEMU could advertise a property "emulated-msgsndp", or something similar, > which would be interpreted by Linux as a CPU feature and taken into account > when doing the IPIs. Could we remove msgsndp support from HFSCR in KVM and test it in pseries ? C.
Re: [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask
On 6/26/20 1:15 PM, Gautham R Shenoy wrote: > On Wed, Jun 24, 2020 at 05:58:31PM +0530, Madhavan Srinivasan wrote: >> >> >> On 6/24/20 4:26 PM, Gautham R Shenoy wrote: >>> Hi Kajol, >>> >>> On Wed, Jun 24, 2020 at 03:47:54PM +0530, Kajol Jain wrote: Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation. command:# cat /sys/devices/hv_24x7/cpumask 0 >>> Since this sysfs interface is read-only, and the user cannot change >>> the CPU which will be making the HCALLs to obtain the 24x7 counts, >>> does the user even need to know if currently CPU X is the one which is >>> going to make HCALLs to retrieve the 24x7 counts ? Does it help in any >>> kind of trouble-shooting ? >> Primary use to expose the cpumask is for the perf tool. >> Which has the capability to parse the driver sysfs folder >> and understand the cpumask file. Having cpumask >> file will reduce the number of perf commandline >> parameters (will avoid "-C" option in the perf tool >> command line). I can also notify the user which is >> the current cpu used to retrieve the counter data. > > Fair enough. Can we include this in the patch description ? Sure, will update in the next version of the patchset. Thanks, Kajol Jain > >> >>> It would have made sense if the interface was read-write, since a user >>> can set this to a CPU which is not running user applications. This >>> would help in minimising jitter on those active CPUs running the user >>> applications. >> >> With cpumask backed by hotplug >> notifiers, enabling user write access to it will >> complicate the code with more additional check. >> CPU will come to play only if the user request for >> counter data. If not, then there will be no HCALLs made >> using the CPU. > > Well, I was wondering if you could make the interface writable because > I couldn't think of the use of a read-only interface. With the > perf-use case you have provided, I guess it makes sense. I am ok with > it being a read-only interface. 
> >> >> Maddy > > -- > Thanks and Regards > gautham. >
Re: [bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")
Hi Jan,

On 6/26/20 12:29 PM, Jan Stancek wrote:
Hi,

LTP mmap03 is getting stuck in page fault loop after commit c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
	int page_sz = getpagesize();
	int fildes;
	char *addr;

	fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
	write(fildes, &fildes, sizeof(fildes));
	close(fildes);

	fildes = open("tempfile", O_RDONLY);
	unlink("tempfile");
	addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);
	printf("%d\n", *addr);
	return 0;
}
- >8 -

Thanks for the report. This is an execute-only key where the vma has implied read permission, so the patch does break this case. I will see how best we can handle PROT_EXEC and the multi-threaded test that required the change in the patch.

-aneesh
Re: [PATCH v2 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask
On Wed, Jun 24, 2020 at 05:58:31PM +0530, Madhavan Srinivasan wrote: > > > On 6/24/20 4:26 PM, Gautham R Shenoy wrote: > >Hi Kajol, > > > >On Wed, Jun 24, 2020 at 03:47:54PM +0530, Kajol Jain wrote: > >>Patch here adds a cpumask attr to hv_24x7 pmu along with ABI documentation. > >> > >>command:# cat /sys/devices/hv_24x7/cpumask > >>0 > >Since this sysfs interface is read-only, and the user cannot change > >the CPU which will be making the HCALLs to obtain the 24x7 counts, > >does the user even need to know if currently CPU X is the one which is > >going to make HCALLs to retrieve the 24x7 counts ? Does it help in any > >kind of trouble-shooting ? > Primary use to expose the cpumask is for the perf tool. > Which has the capability to parse the driver sysfs folder > and understand the cpumask file. Having cpumask > file will reduce the number of perf commandline > parameters (will avoid "-C" option in the perf tool > command line). I can also notify the user which is > the current cpu used to retrieve the counter data. Fair enough. Can we include this in the patch description ? > > >It would have made sense if the interface was read-write, since a user > >can set this to a CPU which is not running user applications. This > >would help in minimising jitter on those active CPUs running the user > >applications. > > With cpumask backed by hotplug > notifiers, enabling user write access to it will > complicate the code with more additional check. > CPU will come to play only if the user request for > counter data. If not, then there will be no HCALLs made > using the CPU. Well, I was wondering if you could make the interface writable because I couldn't think of the use of a read-only interface. With the perf-use case you have provided, I guess it makes sense. I am ok with it being a read-only interface. > > Maddy -- Thanks and Regards gautham.
Re: [PATCH] powerpc/pseries: Use doorbells even if XIVE is available
[ ... ] >>> An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp >>> to get KVM performance back. > > An option vector would require a PAPR change. Unless the architecture > reserves some bits for the implementation, but I don't think so. Same > for CAS. > >> Qemu/KVM populates /proc/device-tree/hypervisor, so we *could* look at >> that. Though adding PowerVM/KVM specific hacks is obviously a very >> slippery slope. > > QEMU could advertise a property "emulated-msgsndp", or something similar, > which would be interpreted by Linux as a CPU feature and taken into account > when doing the IPIs. This is really a PowerVM optimization. > The CPU setup for XIVE needs a cleanup also. There is no need to allocate > interrupts for IPIs anymore in that case. We need to keep these for the other cores. The XIVE layer is unchanged. C.
[bug] LTP mmap03 stuck in page fault loop after c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")
Hi,

LTP mmap03 is getting stuck in page fault loop after commit c46241a370a6 ("powerpc/pkeys: Check vma before returning key fault error to the user")

System is ppc64le P9 lpar [1] running v5.8-rc2-34-g3e08a95294a4.

Here's a minimized reproducer:
- 8< -
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int ac, char **av)
{
	int page_sz = getpagesize();
	int fildes;
	char *addr;

	fildes = open("tempfile", O_WRONLY | O_CREAT, 0666);
	write(fildes, &fildes, sizeof(fildes));
	close(fildes);

	fildes = open("tempfile", O_RDONLY);
	unlink("tempfile");
	addr = mmap(0, page_sz, PROT_EXEC, MAP_FILE | MAP_PRIVATE, fildes, 0);
	printf("%d\n", *addr);
	return 0;
}
- >8 -

This would previously end quickly with a segmentation fault; after commit c46241a370a6 the test is stuck:

# perf stat timeout 5 ./a.out

 Performance counter stats for 'timeout 5 ./a.out':

          5,001.74 msec task-clock                #    1.000 CPUs utilized
                 9      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
         3,094,893      page-faults               #    0.619 M/sec
    18,940,869,512      cycles                    #    3.787 GHz                     (33.39%)
     1,377,005,087      stalled-cycles-frontend   #    7.27% frontend cycles idle    (50.19%)
    10,949,936,056      stalled-cycles-backend    #   57.81% backend cycles idle     (16.62%)
    21,133,828,748      instructions              #    1.12  insn per cycle
                                                  #    0.52  stalled cycles per insn (33.22%)
     4,395,016,137      branches                  #  878.698 M/sec                   (49.81%)
       164,499,002      branch-misses             #    3.74% of all branches         (16.60%)

       5.001237248 seconds time elapsed

       0.321276000 seconds user
       4.680772000 seconds sys

access_pkey_error() in the page fault handler now always seems to return false:

__do_page_fault
  access_pkey_error(is_pkey: 1, is_exec: 0, is_write: 0)
    arch_vma_access_permitted
      pkey_access_permitted
        if (!is_pkey_enabled(pkey))
            return true
  return false

Regards,
Jan

[1]
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32 KiB
L1i cache:           32 KiB
NUMA node0 CPU(s):
NUMA node1 CPU(s):   0-7
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 8
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Mitigation; RFI Flush, L1D private per thread
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush, L1D private per thread
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
Vulnerability Spectre v2:        Mitigation; Indirect branch cache disabled, Software link stack flush
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
[PATCH] crypto: af_alg - Fix regression on empty requests
On Tue, Jun 23, 2020 at 10:02:17AM -0700, Eric Biggers wrote: > > The source code for the two failing AF_ALG tests is here: > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/crypto/af_alg02.c > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/crypto/af_alg05.c > > They use read() and write(), not send() and recv(). > > af_alg02 uses read() to read from a "salsa20" request socket without writing > anything to it. It is expected that this returns 0, i.e. that behaves like > encrypting an empty message. > > af_alg05 uses write() to write 15 bytes to a "cbc(aes-generic)" request > socket, > then read() to read 15 bytes. It is expected that this fails with EINVAL, > since > the length is not aligned to the AES block size (16 bytes). This patch should fix the regression: ---8<--- Some user-space programs rely on crypto requests that have no control metadata. This broke when a check was added to require the presence of control metadata with the ctx->init flag. This patch fixes the regression by removing the ctx->init flag. This means that we do not distinguish the case of no metadata as opposed to an empty request. IOW it is always assumed that if you call recv(2) before sending metadata that you are working with an empty request. 
Reported-by: Sachin Sant
Reported-by: Naresh Kamboju
Fixes: f3c802a1f300 ("crypto: algif_aead - Only wake up when...")
Signed-off-by: Herbert Xu

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 9fcb91ea10c4..2d391117c020 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -635,7 +635,6 @@ void af_alg_pull_tsgl(struct sock *sk, size_t used, struct scatterlist *dst,

 	if (!ctx->used)
 		ctx->merge = 0;
-	ctx->init = ctx->more;
 }
 EXPORT_SYMBOL_GPL(af_alg_pull_tsgl);

@@ -757,8 +756,7 @@ int af_alg_wait_for_data(struct sock *sk, unsigned flags, unsigned min)
 			break;
 		timeout = MAX_SCHEDULE_TIMEOUT;
 		if (sk_wait_event(sk, &timeout,
-				  ctx->init && (!ctx->more ||
-						(min && ctx->used >= min)),
+				  !ctx->more || (min && ctx->used >= min),
 				  &wait)) {
 			err = 0;
 			break;
@@ -847,7 +845,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	}

 	lock_sock(sk);
-	if (ctx->init && (init || !ctx->more)) {
+	if (!ctx->more && ctx->used) {
 		err = -EINVAL;
 		goto unlock;
 	}
@@ -858,7 +856,6 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			memcpy(ctx->iv, con.iv->iv, ivsize);

 		ctx->aead_assoclen = con.aead_assoclen;
-		ctx->init = true;
 	}

 	while (size) {
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index d48d2156e621..749fe42315be 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -106,7 +106,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
 	size_t usedpages = 0;		/* [in] RX bufs to be used from user */
 	size_t processed = 0;		/* [in] TX bufs to be consumed */

-	if (!ctx->init || ctx->more) {
+	if (ctx->more) {
 		err = af_alg_wait_for_data(sk, flags, 0);
 		if (err)
 			return err;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index a51ba22fef58..5b6fa5e8c00d 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -61,7 +61,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
 	int err = 0;
 	size_t len = 0;

-	if (!ctx->init || (ctx->more && ctx->used < bs)) {
+	if (ctx->more && ctx->used < bs) {
 		err = af_alg_wait_for_data(sk, flags, bs);
 		if (err)
 			return err;
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index ee6412314f8f..08c087cc89d6 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -135,7 +135,6 @@ struct af_alg_async_req {
  *		SG?
  * @enc:	Cryptographic operation to be performed when
  *		recvmsg is invoked.
- * @init:	True if metadata has been sent.
  * @len:	Length of memory allocated for this data structure.
  */
 struct af_alg_ctx {
@@ -152,7 +151,6 @@ struct af_alg_ctx {
 	bool more;
 	bool merge;
 	bool enc;
-	bool init;

 	unsigned int len;
 };
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt