[PATCH v9 6/6] riscv: Use --emit-relocs in order to move .rela.dyn in init
To circumvent an issue where placing the relocations inside the init sections
produces empty relocations, use --emit-relocs. But to avoid carrying those
relocations in vmlinux, use an intermediate vmlinux.relocs file which is a
copy of vmlinux *before* stripping its relocations.

Suggested-by: Björn Töpel
Suggested-by: Nick Desaulniers
Signed-off-by: Alexandre Ghiti
---
 arch/riscv/Makefile          |  2 +-
 arch/riscv/Makefile.postlink | 13 +++++++++++++
 arch/riscv/boot/Makefile     |  7 +++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 860b09e409c7..7dc6904a6836 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -8,7 +8,7 @@
 OBJCOPYFLAGS	:= -O binary
 ifeq ($(CONFIG_RELOCATABLE),y)
-	LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+	LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro --emit-relocs
 	KBUILD_CFLAGS += -fPIE
 endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
index d5de8d520d3e..a46fc578b30b 100644
--- a/arch/riscv/Makefile.postlink
+++ b/arch/riscv/Makefile.postlink
@@ -15,12 +15,25 @@
 quiet_cmd_relocs_check = CHKREL  $@
 cmd_relocs_check = \
 	$(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
 
+ifdef CONFIG_RELOCATABLE
+quiet_cmd_cp_vmlinux_relocs = CPREL   vmlinux.relocs
+cmd_cp_vmlinux_relocs = cp vmlinux vmlinux.relocs
+
+quiet_cmd_relocs_strip = STRIPREL $@
+cmd_relocs_strip = $(OBJCOPY) --remove-section='.rel.*' \
+			      --remove-section='.rel__*' \
+			      --remove-section='.rela.*' \
+			      --remove-section='.rela__*' $@
+endif
+
 # `@true` prevents complaint when there is nothing to be done
 
 vmlinux: FORCE
 	@true
 ifdef CONFIG_RELOCATABLE
 	$(call if_changed,relocs_check)
+	$(call if_changed,cp_vmlinux_relocs)
+	$(call if_changed,relocs_strip)
 endif
 
 %.ko: FORCE
diff --git a/arch/riscv/boot/Makefile b/arch/riscv/boot/Makefile
index c72de7232abb..22b13947bd13 100644
--- a/arch/riscv/boot/Makefile
+++ b/arch/riscv/boot/Makefile
@@ -33,7 +33,14 @@
 $(obj)/xipImage: vmlinux FORCE
 endif
 
+ifdef CONFIG_RELOCATABLE
+vmlinux.relocs: vmlinux
+	@ (! [ -f vmlinux.relocs ] && echo "vmlinux.relocs can't be found, please remove vmlinux and try again") || true
+
+$(obj)/Image: vmlinux.relocs FORCE
+else
 $(obj)/Image: vmlinux FORCE
+endif
 	$(call if_changed,objcopy)
 
 $(obj)/Image.gz: $(obj)/Image FORCE
-- 
2.37.2
[PATCH v9 5/6] riscv: Check relocations at compile time
From: Alexandre Ghiti

Relocating the kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected. There exists a script in scripts/ that extracts
the relocations from vmlinux; it is then used at postlink to check the
relocations.

Signed-off-by: Alexandre Ghiti
Reviewed-by: Anup Patel
---
 arch/riscv/Makefile.postlink     | 36 ++++++++++++++++++++++++++++++++
 arch/riscv/tools/relocs_check.sh | 26 +++++++++++++++++++++++
 2 files changed, 62 insertions(+)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh

diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index ..d5de8d520d3e
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===
+# Post-link riscv pass
+# ===
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include $(srctree)/scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = \
+	$(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+	@true
+ifdef CONFIG_RELOCATABLE
+	$(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+	@true
+
+clean:
+	@true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index ..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+	echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+	exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+	# These relocations are okay
+	# R_RISCV_RELATIVE
+	grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+	exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
-- 
2.37.2
[PATCH v9 4/6] powerpc: Move script to check relocations at compile time in scripts/
From: Alexandre Ghiti

Relocating the kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected. The powerpc architecture has a script that
allows checking at compile time for such unexpected relocations: extract the
common logic to scripts/ so that other architectures can take advantage of
it.

Signed-off-by: Alexandre Ghiti
Reviewed-by: Anup Patel
Acked-by: Michael Ellerman (powerpc)
---
 arch/powerpc/tools/relocs_check.sh | 18 ++----------------
 scripts/relocs_check.sh            | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 16 deletions(-)
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
index 63792af00417..6b350e75014c 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@
 if [ $# -lt 3 ]; then
 	exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-	# Only look at relocation lines.
-	grep -E '\
[PATCH v9 3/6] riscv: Introduce CONFIG_RELOCATABLE
This config allows compiling the 64b kernel as PIE and relocating it at any
virtual address at runtime: this paves the way to KASLR. Runtime relocation
is possible since relocation metadata are embedded into the kernel.

Note that relocating at runtime introduces an overhead even if the kernel is
loaded at the same address it was linked at, and that the compiler options
are those used in arm64, which uses the same RELA relocation format.

Signed-off-by: Alexandre Ghiti
---
 arch/riscv/Kconfig              | 14 ++++++++++
 arch/riscv/Makefile             |  7 +++--
 arch/riscv/kernel/vmlinux.lds.S | 17 ++++++++++--
 arch/riscv/mm/Makefile          |  4 +++
 arch/riscv/mm/init.c            | 54 ++++++++++++++++++++++++++++++++-
 5 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 3c5907431081..6ff9f574195d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -562,6 +562,20 @@ config COMPAT
 
 	  If you want to execute 32-bit userspace applications, say Y.
 
+config RELOCATABLE
+	bool "Build a relocatable kernel"
+	depends on MMU && 64BIT && !XIP_KERNEL
+	help
+	  This builds a kernel as a Position Independent Executable (PIE),
+	  which retains all relocation metadata required to relocate the
+	  kernel binary at runtime to a different virtual address than the
+	  address it was linked at.
+	  Since RISCV uses the RELA relocation format, this requires a
+	  relocation pass at runtime even if the kernel is loaded at the
+	  same address it was linked at.
+
+	  If unsure, say N.
+
 endmenu # "Kernel features"
 
 menu "Boot options"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 6203c3378922..860b09e409c7 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -7,9 +7,12 @@
 #
 
 OBJCOPYFLAGS	:= -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+	LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+	KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
-	LDFLAGS_vmlinux := --no-relax
+	LDFLAGS_vmlinux += --no-relax
 	KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
 ifeq ($(CONFIG_RISCV_ISA_C),y)
 	CC_FLAGS_FTRACE := -fpatchable-function-entry=4
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index e05e6df44225..615ff5842690 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -122,10 +122,23 @@
 		*(.sdata*)
 	}
 
-	.rela.dyn : {
-		*(.rela*)
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
 	}
 
+#ifdef CONFIG_RELOCATABLE
+	.data.rel : { *(.data.rel*) }
+	.got : { *(.got*) }
+	.plt : { *(.plt) }
+	.dynamic : { *(.dynamic) }
+	.dynsym : { *(.dynsym) }
+	.dynstr : { *(.dynstr) }
+	.hash : { *(.hash) }
+	.gnu.hash : { *(.gnu.hash) }
+#endif
+
 #ifdef CONFIG_EFI
 	.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
 	__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 2ac177c05352..b85e9e82f082 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_cacheflush.o = $(CC_FLAGS_FTRACE)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f803671d18b2..bce899b180cd 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -20,6 +20,9 @@
 #include
 #include
 #include
+#ifdef CONFIG_RELOCATABLE
+#include
+#endif
 
 #include
 #include
@@ -146,7 +149,7 @@
 		print_ml("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
 #endif
 
-		print_ml("kernel", (unsigned long)KERNEL_LINK_ADDR,
+		print_ml("kernel", (unsigned long)kernel_map.virt_addr,
 			 (unsigned long)ADDRESS_SPACE_END);
 	}
 }
@@ -831,6 +834,44 @@
 #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+static void __init relocate_kernel(void)
+{
+	Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start;
+	/*
+	 * This holds the offset between the linked virtual address and the
+	 * relocated virtual address.
+	 */
+	uintptr_t reloc_offset = kernel_map.virt_addr - KERNEL_LINK_ADDR;
+	/*
+	 * This holds the offset between kernel linked virtual address and
+	 * physical address.
+	 */
+
[PATCH v9 2/6] riscv: Move .rela.dyn outside of init to avoid empty relocations
This is a preparatory patch for relocatable kernels: .rela.dyn should be in
.init but doing so actually produces empty relocations, so this should be a
temporary commit until we find a solution. This issue was reported here [1].

[1] https://lore.kernel.org/all/4a6fc7a3-9697-a49b-0941-97f32194b...@ghiti.fr/

Signed-off-by: Alexandre Ghiti
---
 arch/riscv/kernel/vmlinux.lds.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 1c38294580c0..e05e6df44225 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -96,10 +96,6 @@
 		*(.rel.dyn*)
 	}
 
-	.rela.dyn : {
-		*(.rela*)
-	}
-
 	__init_data_end = .;
 
 	. = ALIGN(8);
@@ -126,6 +122,10 @@
 		*(.sdata*)
 	}
 
+	.rela.dyn : {
+		*(.rela*)
+	}
+
 #ifdef CONFIG_EFI
 	.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
 	__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
-- 
2.37.2
[PATCH v9 1/6] riscv: Prepare EFI header for relocatable kernels
ld does not handle relocations correctly, as explained here [1]; a fix was
proposed there by Nelson, but we have to support older toolchains and so must
provide this fix. Note that llvm does not need this fix and is therefore
excluded.

[1] https://sourceware.org/pipermail/binutils/2023-March/126690.html

Signed-off-by: Alexandre Ghiti
---
 arch/riscv/include/asm/set_memory.h |  3 +++
 arch/riscv/kernel/efi-header.S      | 19 ++++++++++++++-----
 arch/riscv/kernel/vmlinux.lds.S     |  5 ++---
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index a2c14d4b3993..ec11001c3fe0 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -56,4 +56,7 @@
 #define SECTION_ALIGN L1_CACHE_BYTES
 #endif /* CONFIG_STRICT_KERNEL_RWX */
 
+#define PECOFF_SECTION_ALIGNMENT	0x1000
+#define PECOFF_FILE_ALIGNMENT		0x200
+
 #endif /* _ASM_RISCV_SET_MEMORY_H */
diff --git a/arch/riscv/kernel/efi-header.S b/arch/riscv/kernel/efi-header.S
index 8e733aa48ba6..515b2dfbca75 100644
--- a/arch/riscv/kernel/efi-header.S
+++ b/arch/riscv/kernel/efi-header.S
@@ -6,6 +6,7 @@
 
 #include
 #include
+#include
 
 .macro	__EFI_PE_HEADER
 	.long	PE_MAGIC
@@ -33,7 +34,11 @@
 optional_header:
 	.byte	0x02					// MajorLinkerVersion
 	.byte	0x14					// MinorLinkerVersion
 	.long	__pecoff_text_end - efi_header_end	// SizeOfCode
-	.long	__pecoff_data_virt_size			// SizeOfInitializedData
+#ifdef __clang__
+	.long	__pecoff_data_virt_size			// SizeOfInitializedData
+#else
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// SizeOfInitializedData
+#endif
 	.long	0					// SizeOfUninitializedData
 	.long	__efistub_efi_pe_entry - _start		// AddressOfEntryPoint
 	.long	efi_header_end - _start			// BaseOfCode
@@ -91,9 +96,17 @@
 		IMAGE_SCN_MEM_EXECUTE			// Characteristics
 
 	.ascii	".data\0\0\0"
-	.long	__pecoff_data_virt_size			// VirtualSize
+#ifdef __clang__
+	.long	__pecoff_data_virt_size			// VirtualSize
+#else
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// VirtualSize
+#endif
 	.long	__pecoff_text_end - _start		// VirtualAddress
-	.long	__pecoff_data_raw_size			// SizeOfRawData
+#ifdef __clang__
+	.long	__pecoff_data_raw_size			// SizeOfRawData
+#else
+	.long	__pecoff_data_raw_end - __pecoff_text_end	// SizeOfRawData
+#endif
 	.long	__pecoff_text_end - _start		// PointerToRawData
 
 	.long	0					// PointerToRelocations
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 53a8ad65b255..1c38294580c0 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -27,9 +27,6 @@
 ENTRY(_start)
 
 jiffies = jiffies_64;
 
-PECOFF_SECTION_ALIGNMENT = 0x1000;
-PECOFF_FILE_ALIGNMENT = 0x200;
-
 SECTIONS
 {
 	/* Beginning of code and text segment */
@@ -132,6 +129,7 @@
 #ifdef CONFIG_EFI
 	.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
 	__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
+	__pecoff_data_raw_end = ABSOLUTE(.);
 #endif
 
 	/* End of data section */
@@ -142,6 +140,7 @@
 #ifdef CONFIG_EFI
 	. = ALIGN(PECOFF_SECTION_ALIGNMENT);
 	__pecoff_data_virt_size = ABSOLUTE(. - __pecoff_text_end);
+	__pecoff_data_virt_end = ABSOLUTE(.);
 #endif
 
 	_end = .;
-- 
2.37.2
[PATCH v9 0/6] Introduce 64b relocatable kernel
After multiple attempts, this patchset is now based on the fact that the 64b
kernel mapping was moved outside the linear mapping.

The first patch allows to build relocatable kernels but is not selected by
default. That patch is a requirement for KASLR. The second and third patches
take advantage of an already existing powerpc script that checks relocations
at compile-time, and uses it for riscv.

This patchset is rebased on top of:
riscv: Use PUD/P4D/PGD pages for the linear mapping
(https://patchwork.kernel.org/project/linux-riscv/list/?series=733603)

base-commit-tag: v6.3-rc1

Changes in v9:
  * Fix gcc/llvm compilation errors by adding patch 1, thanks to Bjorn
  * Move a patch to move rela.dyn outside of init (patch 2): it is a
    separate patch to clearly explain why
  * To effectively move rela.dyn to init, we need to add patch 6: separate
    patch since we may be able at some point to revert (along with patch 2)
  * Add a lot of orphan sections to the linker script

Changes in v8:
  * Fix UEFI boot by moving rela.dyn section into the data so that PE/COFF
    loader actually copies the relocations too
  * Fix check that used PGDIR instead of PUD which was not correct for
    sv48 and sv57
  * Fix PE/COFF header data size definition as it led to size of 0

Changes in v7:
  * Rebase on top of v5.15
  * Fix LDFLAGS_vmlinux which was overriden when CONFIG_DYNAMIC_FTRACE was set
  * Make relocate_kernel static
  * Add Ack from Michael

Changes in v6:
  * Remove the kernel move to vmalloc zone
  * Rebased on top of for-next
  * Remove relocatable property from 32b kernel as the kernel is mapped in
    the linear mapping and would then need to be copied physically too
  * CONFIG_RELOCATABLE depends on !XIP_KERNEL
  * Remove Reviewed-by from first patch as it changed a bit

Changes in v5:
  * Add "static __init" to create_kernel_page_table function as reported by
    Kbuild test robot
  * Add reviewed-by from Zong
  * Rebase onto v5.7

Changes in v4:
  * Fix BPF region that overlapped with kernel's as suggested by Zong
  * Fix end of module region that could be larger than 2GB as suggested by Zong
  * Fix the size of the vm area reserved for the kernel as we could lose
    PMD_SIZE if the size was already aligned on PMD_SIZE
  * Split compile time relocations check patch into 2 patches as suggested by Anup
  * Applied Reviewed-by from Zong and Anup

Changes in v3:
  * Move kernel mapping to vmalloc

Changes in v2:
  * Make RELOCATABLE depend on MMU as suggested by Anup
  * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup
  * Use __pa_symbol instead of __pa, as suggested by Zong
  * Rebased on top of v5.6-rc3
  * Tested with sv48 patchset
  * Add Reviewed/Tested-by from Zong and Anup

Alexandre Ghiti (6):
  riscv: Prepare EFI header for relocatable kernels
  riscv: Move .rela.dyn outside of init to avoid empty relocations
  riscv: Introduce CONFIG_RELOCATABLE
  powerpc: Move script to check relocations at compile time in scripts/
  riscv: Check relocations at compile time
  riscv: Use --emit-relocs in order to move .rela.dyn in init

 arch/powerpc/tools/relocs_check.sh  | 18 ++--------
 arch/riscv/Kconfig                  | 14 ++++++++
 arch/riscv/Makefile                 |  7 ++--
 arch/riscv/Makefile.postlink        | 49 +++++++++++++++++++++++++++
 arch/riscv/boot/Makefile            |  7 ++++
 arch/riscv/include/asm/set_memory.h |  3 ++
 arch/riscv/kernel/efi-header.S      | 19 +++++++----
 arch/riscv/kernel/vmlinux.lds.S     | 26 ++++++++++-----
 arch/riscv/mm/Makefile              |  4 +++
 arch/riscv/mm/init.c                | 54 ++++++++++++++++++++++++++++-
 arch/riscv/tools/relocs_check.sh    | 26 +++++++++++++++
 scripts/relocs_check.sh             | 20 +++++++++++
 12 files changed, 218 insertions(+), 29 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh
-- 
2.37.2
Re: [powerpc:next-test] BUILD REGRESSION c827c932c00ccd231a73da9816a51ce2c2b557d6
kernel test robot writes:
> tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
> branch HEAD: c827c932c00ccd231a73da9816a51ce2c2b557d6  powerpc: Use of_address_to_resource()
>
> Error/Warning reports:
>
> https://lore.kernel.org/oe-kbuild-all/202303240411.tq2tzkig-...@intel.com
>
> Error/Warning: (recently discovered and may have been fixed)
>
> arch/powerpc/platforms/embedded6xx/ls_uart.c:128:15: error: implicit declaration of function 'of_address_to_resource' [-Werror=implicit-function-declaration]
>
> Error/Warning ids grouped by kconfigs:
>
> gcc_recent_errors
> `-- powerpc-linkstation_defconfig
>     `-- arch-powerpc-platforms-embedded6xx-ls_uart.c:error:implicit-declaration-of-function-of_address_to_resource

I've dropped that patch for now.

cheers
[PATCH v8 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Anshuman Khandual

The entire scheme of deferred TLB flush in the reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPI to remote CPUs. But architectures can have
different ways to evaluate that. Hence, apart from checking TTU_BATCH_FLUSH
in the TTU flags, the rest of the decision should be architecture specific.

Signed-off-by: Anshuman Khandual
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang
Reviewed-by: Anshuman Khandual
Reviewed-by: Barry Song
Reviewed-by: Xin Hao
Tested-by: Punit Agrawal
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..38ccb700748c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,17 +686,10 @@
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0
[PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Barry Song

On x86, batched and deferred tlb shootdown has led to a 90% performance
increase on tlb shootdown. On arm64, HW can do tlb shootdown without
software IPI. But sync tlbi is still quite expensive.

Even running the simplest program which requires swapout can prove this is
true:

 #include
 #include
 #include
 #include

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
 	volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
 					 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

 	memset(p, 0x88, SIZE);

 	for (int k = 0; k < 1; k++) {
 		/* swap in */
 		for (int i = 0; i < SIZE; i += 4096) {
 			(void)p[i];
 		}

 		/* swap out */
 		madvise(p, SIZE, MADV_PAGEOUT);
 	}
 }

Perf result on snapdragon 888 with 8 cores by using zRAM as the swap block
device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
 #
    21.07%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out  [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out  [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out  [kernel.kallsyms]  [k] memset64
     1.63%  a.out  [kernel.kallsyms]  [k] clear_page
     1.42%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out  [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out  [kernel.kallsyms]  [k] xas_load
     1.15%  a.out  [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out a
page mapped by only one process. If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page. Plus,
tlb flush overhead will increase with the number of CPU cores due to the bad
scalability of tlb shootdown in HW, so those ARM64 servers should expect
much higher overhead.

Further perf annotate shows 95% of the cpu time of ptep_clear_flush is
actually used by the final dsb() to wait for the completion of tlb flush.
This provides us a very good chance to leverage the existing batched tlb in
kernel. The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed

w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.

 Tested with benchmark in the commit on Kunpeng920 arm64 server,
 observed an improvement around 12.5% with command `time ./swap_bench`.
         w/o             w/
 real    0m13.460s       0m11.771s
 user    0m0.248s        0m0.279s
 sys     0m12.039s       0m11.458s

 Originally it's noticed a 16.99% overhead of ptep_clear_flush()
 which has been eliminated by this patch:

 [root@localhost yang]# perf record -- ./swap_bench && perf report
 [...]
 16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4, 8 and 128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on a 4 CPU
platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT
for this stage and make this disabled on systems with less than 8 CPUs. User
can modify this threshold according to their own platforms by
CONFIG_NR_CPUS_FOR_BATCHED_TLB.

Also this patch improves the performance of page migration. Using pmbench
and trying to migrate the pages of pmbench between node 0 and node 1 for 20
times, this patch decreases the time used by more than 50% and saves the
time used by ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take an address of the target
page to support the feature on arm64. Also rename it to
arch_tlbbatch_add_pending() to better match its function since we don't need
to handle the mm on arm64 and add_mm is not proper. add_pending will
[PATCH v8 0/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Yicong Yang

Though ARM64 has the hardware to do tlb shootdown, the hardware broadcasting
is not free. The simplest micro benchmark shows even on snapdragon 888 with
only 8 cores, the overhead for ptep_clear_flush is huge even for paging out
one page mapped by only one process:

 5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush

While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown. The
same benchmark can result in 16.99% CPU consumption on an ARM64 server with
around 100 cores according to Yicong's test on patch 4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage - arch_tlbbatch_add_mm()
2. waiting for the completion of tlbi by dsb while doing tlbbatch sync in
   arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset. The micro benchmark becomes 5% faster even for one page mapped
by a single process on snapdragon 888.

This support also optimizes page migration by more than 50% with support of
batched TLB flushing [*].

[*] https://lore.kernel.org/linux-mm/20230213123444.155149-1-ying.hu...@intel.com/

-v8:
  1. Rebase on 6.3-rc4
  2. Tested the optimization on page migration and mentioned it in the commit
  3. Thanks for the review from Anshuman.
Link: https://lore.kernel.org/linux-mm/20221117082648.47526-1-yangyic...@huawei.com/

-v7:
  1. rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as
     suggested, since it takes an extra address for arm64, per Nadav and
     Anshuman. Also mentioned in the commit.
  2. add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyic...@huawei.com/

-v6:
  1. comment we don't defer TLB flush on platforms affected by
     ARM64_WORKAROUND_REPEAT_TLBI
  2. use cpus_have_const_cap() instead of this_cpu_has_cap()
  3. add tags from Punit, thanks.
  4. default enable the feature when cpus >= 8 rather than > 8, since the
     original improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyic...@huawei.com/

-v5:
  1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this stage
     on arm64.
  2. Make a threshold of CPU numbers for enabling batched TLB flush on arm64
Link: https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyic...@huawei.com/T/

-v4:
  1. Add tags from Kefeng and Anshuman, thanks.
  2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
  3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/

-v3:
  1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer()
     instead of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
  2. Add Tested-by from Xin Hao
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/

-v2:
  1. Collected Yicong's test result on kunpeng920 ARM64 server;
  2. Removed the redundant vma parameter in arch_tlbbatch_add_mm() according
     to the comments of Peter Zijlstra and Dave Hansen
  3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask is empty
     according to the comments of Nadav Amit
Thanks, Peter, Dave and Nadav for your testing or reviewing, and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  6 +++
 arch/arm64/include/asm/tlbbatch.h    | 12 ++++++
 arch/arm64/include/asm/tlbflush.h    | 52 +++++++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h      | 17 +++++++-
 include/linux/mm_types_task.h        |  4 +-
 mm/rmap.c                            | 21 ++++++----
 7 files changed, 94 insertions(+), 20 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h
-- 
2.24.0
Re: [PATCH v3 4/4] of: address: Always use dma_default_coherent for default coherency
Jiaxun Yang writes:
> As for now all arches have dma_default_coherent reflecting default
> DMA coherency for of devices, so there is no need to have a standalone
> config option.
>
> Signed-off-by: Jiaxun Yang
> ---
> v3: Squash setting ARCH_DMA_DEFAULT_COHERENT into this patch.
> ---
>  arch/powerpc/Kconfig |  2 +-
>  arch/riscv/Kconfig   |  2 +-
>  drivers/of/Kconfig   |  4 ----
>  drivers/of/address.c | 10 +---------
>  4 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 57f5d2f53d06..824e00a1277b 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -113,6 +113,7 @@ config PPC
>  	#
>  	select ARCH_32BIT_OFF_T			if PPC32
>  	select ARCH_DISABLE_KASAN_INLINE	if PPC_RADIX_MMU
> +	select ARCH_DMA_DEFAULT_COHERENT	if !NOT_COHERENT_CACHE
>  	select ARCH_ENABLE_MEMORY_HOTPLUG
>  	select ARCH_ENABLE_MEMORY_HOTREMOVE
>  	select ARCH_HAS_COPY_MC			if PPC64
> @@ -273,7 +274,6 @@ config PPC
>  	select NEED_PER_CPU_PAGE_FIRST_CHUNK	if PPC64
>  	select NEED_SG_DMA_LENGTH
>  	select OF
> -	select OF_DMA_DEFAULT_COHERENT		if !NOT_COHERENT_CACHE
>  	select OF_EARLY_FLATTREE
>  	select OLD_SIGACTION			if PPC32
>  	select OLD_SIGSUSPEND

Acked-by: Michael Ellerman (powerpc)

cheers
[powerpc:next-test] BUILD REGRESSION c827c932c00ccd231a73da9816a51ce2c2b557d6
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: c827c932c00ccd231a73da9816a51ce2c2b557d6  powerpc: Use of_address_to_resource()

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202303240411.tq2tzkig-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

arch/powerpc/platforms/embedded6xx/ls_uart.c:128:15: error: implicit declaration of function 'of_address_to_resource' [-Werror=implicit-function-declaration]

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
`-- powerpc-linkstation_defconfig
    `-- arch-powerpc-platforms-embedded6xx-ls_uart.c:error:implicit-declaration-of-function-of_address_to_resource

elapsed time: 9412m

configs tested: 354
configs skipped: 35

tested configs:
    alpha     allyesconfig                        gcc
    alpha     buildonly-randconfig-r005-20230322  gcc
    alpha     defconfig                           gcc
    alpha     randconfig-r012-20230323            gcc
    alpha     randconfig-r014-20230322            gcc
    alpha     randconfig-r014-20230323            gcc
    alpha     randconfig-r021-20230322            gcc
    alpha     randconfig-r025-20230322            gcc
    arc       allyesconfig                        gcc
    arc       buildonly-randconfig-r004-20230323  gcc
    arc       defconfig                           gcc
    arc       nsim_700_defconfig                  gcc
    arc       randconfig-r016-20230322            gcc
    arc       randconfig-r023-20230322            gcc
    arc       randconfig-r024-20230322            gcc
    arc       randconfig-r031-20230322            gcc
    arc       randconfig-r034-20230322            gcc
    arc       randconfig-r036-20230322            gcc
    arc       randconfig-r043-20230322            gcc
    arc       vdk_hs38_defconfig                  gcc
    arm       allmodconfig                        gcc
    arm       allyesconfig                        gcc
    arm       assabet_defconfig                   gcc
    arm       buildonly-randconfig-r001-20230322  clang
    arm       defconfig                           gcc
    arm       dove_defconfig                      clang
    arm       footbridge_defconfig                gcc
    arm       gemini_defconfig                    gcc
    arm       imx_v4_v5_defconfig                 clang
    arm       imx_v6_v7_defconfig                 gcc
    arm       integrator_defconfig                gcc
    arm       jornada720_defconfig                gcc
    arm       lpc18xx_defconfig                   gcc
    arm       mv78xx0_defconfig                   clang
    arm       netwinder_defconfig                 clang
    arm       pxa_defconfig                       gcc
    arm       randconfig-c002-20230322            clang
    arm       randconfig-c002-20230322            gcc
    arm       randconfig-c002-20230324            gcc
    arm       randconfig-r002-20230322            gcc
    arm       randconfig-r013-20230322            clang
    arm       randconfig-r014-20230322            clang
    arm       randconfig-r022-20230322            clang
    arm       randconfig-r023-20230322            clang
    arm       randconfig-r026-20230322            clang
    arm       randconfig-r035-20230322            gcc
    arm       stm32_defconfig                     gcc
    arm       sunxi_defconfig                     gcc
    arm       vf610m4_defconfig                   gcc
    arm64     allyesconfig                        gcc
    arm64     defconfig                           gcc
    arm64     randconfig-r001-20230322            clang
    arm64     randconfig-r011-20230322            gcc
    arm64     randconfig-r023-20230322            gcc
    arm64     randconfig-r025-20230322            gcc
    arm64     randconfig-r031-20230322            clang
    arm64     randconfig-r032-20230322            clang
    arm64     randconfig-r036-20230322            clang
    csky      buildonly-randconfig-r001-20230322  gcc
    csky      defconfig                           gcc
    csky      randconfig-r003-20230322            gcc
    csky      randconfig-r012-20230322            gcc
    csky      randconfig-r013-20230322            gcc
    csky      randconfig-r023-20230322            gcc
    csky      randconfig-r024-20230322            gcc
    csky      randconfig-r026-20230322            gcc
    csky      randconfig-r034-20230322            gcc
    hexagon   buildonly-randconfig-r003-20230322  clang
    hexagon   buildonly-randconfig-r004-20230322  clang
    hexagon   buildonly-randconfig-r006-20230323  clang
    hexagon   randconfig-r003-20230322            clang
    hexagon   randconfig-r004-20230322            clang
    hexagon   randconfig-r015-20230322            clang
    hexagon   randconfig-r034-20230322            clang
    hexagon   randconfig-r035-20230322            clang
    i386      alldefconfig                        gcc
    i386      allyesconfig                        gcc
    i386
[powerpc:merge] BUILD SUCCESS 0f98241d5ef5c3bb4c5ca07ceee3a825d79999fd
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge branch HEAD: 0f98241d5ef5c3bb4c5ca07ceee3a825d79999fd Automatic merge of 'fixes' into merge (2023-03-28 23:32) elapsed time: 741m configs tested: 43 configs skipped: 3 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alpha allyesconfig gcc alpha defconfig gcc arc allyesconfig gcc arc defconfig gcc arm allmodconfig gcc arm allyesconfig gcc arm defconfig gcc arm64 allyesconfig gcc arm64 defconfig gcc csky defconfig gcc i386 allyesconfig gcc i386 debian-10.3 gcc i386 defconfig gcc ia64 allmodconfig gcc ia64 defconfig gcc loongarch allmodconfig gcc loongarch allnoconfig gcc loongarch defconfig gcc m68k allmodconfig gcc m68k defconfig gcc mips allmodconfig gcc mips allyesconfig gcc nios2 defconfig gcc parisc defconfig gcc parisc64 defconfig gcc powerpc allmodconfig gcc powerpc allnoconfig gcc riscv allmodconfig gcc riscv allnoconfig gcc riscv defconfig gcc riscv rv32_defconfig gcc s390 allmodconfig gcc s390 allyesconfig gcc s390 defconfig gcc sh allmodconfig gcc sparc defconfig gcc um i386_defconfig gcc um x86_64_defconfig gcc x86_64 allnoconfig gcc x86_64 allyesconfig gcc x86_64 defconfig gcc x86_64 kexec gcc x86_64 rhel-8.3 gcc -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests
[powerpc:topic/ppc-kvm] BUILD SUCCESS 5f4f53d28cde2cc7be96f657229c8603da578500
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git topic/ppc-kvm branch HEAD: 5f4f53d28cde2cc7be96f657229c8603da578500 KVM: PPC: Book3S HV: kvmppc_hv_entry: remove .global scope elapsed time: 741m configs tested: 2 configs skipped: 163 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: powerpc allmodconfig gcc powerpc allnoconfig gcc -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests
[powerpc:fixes-test] BUILD SUCCESS fd7276189450110ed835eb0a334e62d2f1c4e3be
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test branch HEAD: fd7276189450110ed835eb0a334e62d2f1c4e3be powerpc: Don't try to copy PPR for task with NULL pt_regs elapsed time: 742m configs tested: 7 configs skipped: 163 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: nios2 defconfig gcc parisc defconfig gcc parisc64 defconfig gcc powerpc allmodconfig gcc powerpc allnoconfig gcc sh allmodconfig gcc x86_64 randconfig-k001 clang -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests
[PATCH 5/5] of/address: Add of_property_read_reg() helper
Add a helper, of_property_read_reg(), to read a "reg" entry's untranslated address and size. This function is intended mainly for cases with an untranslatable "reg" address (i.e. not MMIO). There are also a few translatable cases such as address cells containing a bus chip-select number. Signed-off-by: Rob Herring --- drivers/of/address.c | 23 +++ drivers/of/unittest.c | 22 ++ include/linux/of_address.h | 7 +++ 3 files changed, 52 insertions(+) diff --git a/drivers/of/address.c b/drivers/of/address.c index 8cfae24148e0..fdb15c6fb3b1 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -760,6 +760,29 @@ const __be32 *__of_get_address(struct device_node *dev, int index, int bar_no, } EXPORT_SYMBOL(__of_get_address); +/** + * of_property_read_reg - Retrieve the specified "reg" entry index without translating + * @np: device tree node for which to retrieve "reg" from + * @idx: "reg" entry index to read + * @addr: return value for the untranslated address + * @size: return value for the entry size + * + * Returns -EINVAL if "reg" is not found. Returns 0 on success with addr and + * size values filled in.
+ */ +int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size) +{ + const __be32 *prop = of_get_address(np, idx, size, NULL); + + if (!prop) + return -EINVAL; + + *addr = of_read_number(prop, of_n_addr_cells(np)); + + return 0; +} +EXPORT_SYMBOL(of_property_read_reg); + static int parser_init(struct of_pci_range_parser *parser, struct device_node *node, const char *name) { diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index eaeb58065acc..e73ecbef977b 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -1134,6 +1134,27 @@ static void __init of_unittest_bus_3cell_ranges(void) of_node_put(np); } +static void __init of_unittest_reg(void) +{ + struct device_node *np; + int ret; + u64 addr, size; + + np = of_find_node_by_path("/testcase-data/address-tests/bus@80000000/device@1000"); + if (!np) { + pr_err("missing testcase data\n"); + return; + } + + ret = of_property_read_reg(np, 0, &addr, &size); + unittest(!ret, "of_property_read_reg(%pOF) returned error %d\n", + np, ret); + unittest(addr == 0x1000, "of_property_read_reg(%pOF) untranslated address (%llx) incorrect\n", + np, addr); + + of_node_put(np); +} + static void __init of_unittest_parse_interrupts(void) { struct device_node *np; @@ -3772,6 +3793,7 @@ static int __init of_unittest(void) of_unittest_pci_dma_ranges(); of_unittest_bus_ranges(); of_unittest_bus_3cell_ranges(); + of_unittest_reg(); of_unittest_match_node(); of_unittest_platform_populate(); of_unittest_overlay(); diff --git a/include/linux/of_address.h b/include/linux/of_address.h index 5292f62c1baa..95cb6c5c2d67 100644 --- a/include/linux/of_address.h +++ b/include/linux/of_address.h @@ -72,6 +72,8 @@ void __iomem *of_io_request_and_map(struct device_node *device, extern const __be32 *__of_get_address(struct device_node *dev, int index, int bar_no, u64 *size, unsigned int *flags); +int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size); + extern int of_pci_range_parser_init(struct 
of_pci_range_parser *parser, struct device_node *node); extern int of_pci_dma_range_parser_init(struct of_pci_range_parser *parser, @@ -106,6 +108,11 @@ static inline const __be32 *__of_get_address(struct device_node *dev, int index, return NULL; } +static inline int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size) +{ + return -ENOSYS; +} + static inline int of_pci_range_parser_init(struct of_pci_range_parser *parser, struct device_node *node) { -- 2.39.2
[PATCH 1/5] of: unittest: Add bus address range parsing tests
While there are tests for "dma-ranges" helpers, "ranges" is missing any tests. It's the same underlying code, but for completeness add a test for "ranges" parsing iterators. This is in preparation to add some additional "ranges" helpers. Signed-off-by: Rob Herring --- drivers/of/unittest.c | 53 +++ 1 file changed, 53 insertions(+) diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index b5a7a31d8bd2..1a45df1f354a 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -1008,6 +1008,58 @@ static void __init of_unittest_pci_dma_ranges(void) of_node_put(np); } +static void __init of_unittest_bus_ranges(void) +{ + struct device_node *np; + struct of_range range; + struct of_range_parser parser; + int i = 0; + + np = of_find_node_by_path("/testcase-data/address-tests"); + if (!np) { + pr_err("missing testcase data\n"); + return; + } + + if (of_range_parser_init(&parser, np)) { + pr_err("missing ranges property\n"); + return; + } + + /* +* Get the "ranges" from the device tree +*/ + for_each_of_range(&parser, &range) { + unittest(range.flags == IORESOURCE_MEM, + "for_each_of_range wrong flags on node %pOF flags=%x (expected %x)\n", + np, range.flags, IORESOURCE_MEM); + if (!i) { + unittest(range.size == 0x40000000, +"for_each_of_range wrong size on node %pOF size=%llx\n", +np, range.size); + unittest(range.cpu_addr == 0x70000000, +"for_each_of_range wrong CPU addr (%llx) on node %pOF", +range.cpu_addr, np); + unittest(range.bus_addr == 0x70000000, +"for_each_of_range wrong bus addr (%llx) on node %pOF", +range.pci_addr, np); + } else { + unittest(range.size == 0x20000000, +"for_each_of_range wrong size on node %pOF size=%llx\n", +np, range.size); + unittest(range.cpu_addr == 0xd0000000, +"for_each_of_range wrong CPU addr (%llx) on node %pOF", +range.cpu_addr, np); + unittest(range.bus_addr == 0x00000000, +"for_each_of_range wrong bus addr (%llx) on node %pOF", +range.pci_addr, np); + } + i++; + } + + of_node_put(np); +} + static void __init of_unittest_parse_interrupts(void) { struct device_node *np; 
@@ -3644,6 +3696,7 @@ static int __init of_unittest(void) of_unittest_dma_get_max_cpu_address(); of_unittest_parse_dma_ranges(); of_unittest_pci_dma_ranges(); + of_unittest_bus_ranges(); of_unittest_match_node(); of_unittest_platform_populate(); of_unittest_overlay(); -- 2.39.2
[PATCH 3/5] of/address: Add support for 3 address cell bus
There's a few custom bus bindings (e.g. fsl,qoriq-mc) which use a 3 cell format with custom flags in the high cell. We can match these buses as a fallback if we didn't match on PCI bus which is the only standard bus binding with 3 address cells. Signed-off-by: Rob Herring --- drivers/of/address.c| 22 +++ drivers/of/unittest-data/tests-address.dtsi | 9 - drivers/of/unittest.c | 58 - 3 files changed, 87 insertions(+), 2 deletions(-) diff --git a/drivers/of/address.c b/drivers/of/address.c index b79f005834fc..8cfae24148e0 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -95,11 +95,17 @@ static int of_bus_default_translate(__be32 *addr, u64 offset, int na) return 0; } +static unsigned int of_bus_default_flags_get_flags(const __be32 *addr) +{ + return of_read_number(addr, 1); +} + static unsigned int of_bus_default_get_flags(const __be32 *addr) { return IORESOURCE_MEM; } + #ifdef CONFIG_PCI static unsigned int of_bus_pci_get_flags(const __be32 *addr) { @@ -344,6 +350,11 @@ static unsigned int of_bus_isa_get_flags(const __be32 *addr) return flags; } +static int of_bus_default_flags_match(struct device_node *np) +{ + return of_bus_n_addr_cells(np) == 3; +} + /* * Array of bus specific translators */ @@ -373,6 +384,17 @@ static struct of_bus of_busses[] = { .has_flags = true, .get_flags = of_bus_isa_get_flags, }, + /* Default with flags cell */ + { + .name = "default-flags", + .addresses = "reg", + .match = of_bus_default_flags_match, + .count_cells = of_bus_default_count_cells, + .map = of_bus_default_map, + .translate = of_bus_default_translate, + .has_flags = true, + .get_flags = of_bus_default_flags_get_flags, + }, /* Default */ { .name = "default", diff --git a/drivers/of/unittest-data/tests-address.dtsi b/drivers/of/unittest-data/tests-address.dtsi index 6604a52bf6cb..bc0029cbf8ea 100644 --- a/drivers/of/unittest-data/tests-address.dtsi +++ b/drivers/of/unittest-data/tests-address.dtsi @@ -14,7 +14,7 @@ address-tests { #size-cells = <1>; /* ranges here 
is to make sure we don't use it for * dma-ranges translation */ - ranges = <0x70000000 0x70000000 0x40000000>, + ranges = <0x70000000 0x70000000 0x50000000>, <0x00000000 0xd0000000 0x20000000>; dma-ranges = <0x0 0x20000000 0x40000000>; @@ -43,6 +43,13 @@ pci@90000000 { <0x42000000 0x0 0xc0000000 0x20000000 0x0 0x10000000>; }; + bus@a0000000 { + #address-cells = <3>; + #size-cells = <2>; + ranges = <0xf00baa 0x0 0x0 0xa0000000 0x0 0x10>, +<0xf00bee 0x1 0x0 0xb0000000 0x0 0x20>; + }; + }; }; }; diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index 945288d5672f..29066ecbed47 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -1048,7 +1048,7 @@ static void __init of_unittest_bus_ranges(void) "for_each_of_range wrong flags on node %pOF flags=%x (expected %x)\n", np, range.flags, IORESOURCE_MEM); if (!i) { - unittest(range.size == 0x40000000, + unittest(range.size == 0x50000000, "for_each_of_range wrong size on node %pOF size=%llx\n", np, range.size); unittest(range.cpu_addr == 0x70000000, @@ -1074,6 +1074,61 @@ of_node_put(np); } +static void __init of_unittest_bus_3cell_ranges(void) +{ + struct device_node *np; + struct of_range range; + struct of_range_parser parser; + int i = 0; + + np = of_find_node_by_path("/testcase-data/address-tests/bus@a0000000"); + if (!np) { + pr_err("missing testcase data\n"); + return; + } + + if (of_range_parser_init(&parser, np)) { + pr_err("missing ranges property\n"); + return; + } + + /* +* Get the "ranges" from the device tree +*/ + for_each_of_range(&parser, &range) { + if (!i) { + unittest(range.flags == 0xf00baa, +"for_each_of_range wrong flags on node %pOF flags=%x\n", +np, range.flags); +
[PATCH 2/5] of/address: Add of_range_to_resource() helper
A few users need to convert a specific "ranges" entry into a struct resource. Add a helper similar to of_address_to_resource(). The existing of_pci_range_to_resource() helper isn't really PCI specific, so it can be used with the CONFIG_PCI check dropped. Signed-off-by: Rob Herring --- drivers/of/address.c | 31 --- drivers/of/unittest.c | 16 +++- include/linux/of_address.h | 8 3 files changed, 51 insertions(+), 4 deletions(-) diff --git a/drivers/of/address.c b/drivers/of/address.c index 4c0b169ef9bf..b79f005834fc 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -229,9 +229,6 @@ int of_pci_range_to_resource(struct of_pci_range *range, res->parent = res->child = res->sibling = NULL; res->name = np->full_name; - if (!IS_ENABLED(CONFIG_PCI)) - return -ENOSYS; - if (res->flags & IORESOURCE_IO) { unsigned long port; err = pci_register_io_range(&np->fwnode, range->cpu_addr, @@ -263,6 +260,34 @@ int of_pci_range_to_resource(struct of_pci_range *range, } EXPORT_SYMBOL(of_pci_range_to_resource); +/* + * of_range_to_resource - Create a resource from a ranges entry + * @np:device node where the range belongs to + * @index: the 'ranges' index to convert to a resource + * @res: pointer to a valid resource that will be updated to + * reflect the values contained in the range. + * + * Returns -ENOENT if the entry is not found or -EINVAL if the range cannot be + * converted to resource. 
+ */ +int of_range_to_resource(struct device_node *np, int index, struct resource *res) +{ + int ret, i = 0; + struct of_range_parser parser; + struct of_range range; + + ret = of_range_parser_init(&parser, np); + if (ret) + return ret; + + for_each_of_range(&parser, &range) + if (i++ == index) + return of_pci_range_to_resource(&range, np, res); + + return -ENOENT; +} +EXPORT_SYMBOL(of_range_to_resource); + /* * ISA bus specific translator */ diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index 1a45df1f354a..945288d5672f 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -1013,7 +1013,8 @@ static void __init of_unittest_bus_ranges(void) struct device_node *np; struct of_range range; struct of_range_parser parser; - int i = 0; + struct resource res; + int ret, i = 0; np = of_find_node_by_path("/testcase-data/address-tests"); if (!np) { @@ -1026,6 +1027,19 @@ return; } + ret = of_range_to_resource(np, 1, &res); + unittest(!ret, "of_range_to_resource returned error (%d) node %pOF\n", + ret, np); + unittest(resource_type(&res) == IORESOURCE_MEM, + "of_range_to_resource wrong resource type on node %pOF res=%pR\n", + np, &res); + unittest(res.start == 0xd0000000, + "of_range_to_resource wrong resource start address on node %pOF res=%pR\n", + np, &res); + unittest(resource_size(&res) == 0x20000000, + "of_range_to_resource wrong resource size on node %pOF res=%pR\n", + np, &res); + /* * Get the "ranges" from the device tree */ diff --git a/include/linux/of_address.h b/include/linux/of_address.h index 376671594746..1d005439f026 100644 --- a/include/linux/of_address.h +++ b/include/linux/of_address.h @@ -68,6 +68,8 @@ extern int of_pci_address_to_resource(struct device_node *dev, int bar, extern int of_pci_range_to_resource(struct of_pci_range *range, struct device_node *np, struct resource *res); +extern int of_range_to_resource(struct device_node *np, int index, + struct resource *res); extern bool of_dma_is_coherent(struct device_node *np); #else 
/* CONFIG_OF_ADDRESS */ static inline void __iomem *of_io_request_and_map(struct device_node *device, @@ -120,6 +122,12 @@ static inline int of_pci_range_to_resource(struct of_pci_range *range, return -ENOSYS; } +static inline int of_range_to_resource(struct device_node *np, int index, + struct resource *res) +{ + return -ENOSYS; +} + static inline bool of_dma_is_coherent(struct device_node *np) { return false; -- 2.39.2
[PATCH 0/5] of: More address parsing helpers
This series is part of some clean-ups to reduce the open coded parsing of "reg" and "ranges" in the kernel. As those are standard properties, the common DT code should be able to handle parsing them. However, there are a few gaps in the API for what some drivers need which this series addresses (pun intended). I intend to add these helpers for v6.4 and then convert the users in v6.5 to avoid any dependency issues. This series and the WIP conversions are on this branch[1]. [1] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git dt/address-helpers Signed-off-by: Rob Herring --- Rob Herring (5): of: unittest: Add bus address range parsing tests of/address: Add of_range_to_resource() helper of/address: Add support for 3 address cell bus of/address: Add of_range_count() helper of/address: Add of_property_read_reg() helper drivers/of/address.c| 76 +- drivers/of/unittest-data/tests-address.dtsi | 9 +- drivers/of/unittest.c | 150 include/linux/of_address.h | 31 ++ 4 files changed, 262 insertions(+), 4 deletions(-) --- base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6 change-id: 20230328-dt-address-helpers-2f00c5c1eba4 Best regards, -- Rob Herring
[PATCH 4/5] of/address: Add of_range_count() helper
Some users need a count of the number of ranges entries before iterating over the entries. Typically this is for allocating some data structure based on the size. Add a helper, of_range_count(), to get the count. The helper must be called with a struct of_range_parser initialized by of_range_parser_init(). Signed-off-by: Rob Herring --- drivers/of/unittest.c | 7 ++- include/linux/of_address.h | 16 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c index 29066ecbed47..eaeb58065acc 100644 --- a/drivers/of/unittest.c +++ b/drivers/of/unittest.c @@ -1014,7 +1014,7 @@ static void __init of_unittest_bus_ranges(void) struct of_range range; struct of_range_parser parser; struct resource res; - int ret, i = 0; + int ret, count, i = 0; np = of_find_node_by_path("/testcase-data/address-tests"); if (!np) { @@ -1040,6 +1040,11 @@ "of_range_to_resource wrong resource start address on node %pOF res=%pR\n", np, &res); + count = of_range_count(&parser); + unittest(count == 2, + "of_range_count wrong size on node %pOF count=%d\n", + np, count); + /* * Get the "ranges" from the device tree */ diff --git a/include/linux/of_address.h b/include/linux/of_address.h index 1d005439f026..5292f62c1baa 100644 --- a/include/linux/of_address.h +++ b/include/linux/of_address.h @@ -35,6 +35,22 @@ struct of_pci_range { for (; of_pci_range_parser_one(parser, range);) #define for_each_of_range for_each_of_pci_range +/* + * of_range_count - Get the number of "ranges" or "dma-ranges" entries + * @parser:Parser state initialized by of_range_parser_init() + * + * Returns the number of entries or 0 if none. + * + * Note that calling this within or after the for_each_of_range() iterator will + * be inaccurate, as it gives the number of entries remaining. 
+ */ +static inline int of_range_count(const struct of_range_parser *parser) +{ + if (!parser || !parser->node || !parser->range || parser->range == parser->end) + return 0; + return (parser->end - parser->range) / (parser->na + parser->pna + parser->ns); +} + /* Translate a DMA address from device space to CPU space */ extern u64 of_translate_dma_address(struct device_node *dev, const __be32 *in_addr); -- 2.39.2
Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler
[+cc linux-pci, more error handling folks; beginning of thread at https://lore.kernel.org/all/20230323213808.398039-1-terry.bow...@amd.com/] On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote: > On 24.03.23 17:36:56, Bjorn Helgaas wrote: > > > The CXL device driver is then responsible to > > > enable error reporting in the RCEC's AER cap > > > > I don't know exactly what you mean by "error reporting in the RCEC's > > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/ > > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control > > register and should already be enabled by pci_aer_init(). > > > > Maybe you mean setting AER mask/severity specifically for Internal > > Errors? I'm hoping to get as much of AER management as we can in the > > Right, this is implemented in patch #5 in function > rcec_enable_aer_ints(). I think we should add a PCI core interface for this so we can enforce the AER ownership question (all the crud like pcie_aer_is_native()) in one place. > > PCI core and out of drivers, so maybe we need a new PCI interface to > > do that. > > > > In any event, I assume this sort of configuration would be an > > enumeration-time thing, while *this* patch is a run-time thing, so > > maybe this information belongs with a different patch? > > Do you mean once a Restricted CXL host (RCH) is detected, the internal > errors should be enabled in the device mask, all this done during > device enumeration? But wouldn't interrupts be enabled then before > the CXL device is ready? I'm not sure what you mean by "before the CXL device is ready." What makes a CXL device ready, and how do we know when it is ready? pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc as soon as we enumerate the device, before any driver claims the device. I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN fiddling around the same time? 
> > I haven't worked all the way through this, but I thought Sean Kelley's > > and Qiuxu Zhuo's work was along the same line and might cover this, > > e.g., > > > > a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors") > > 579086225502 ("PCI/ERR: Recover from RCiEP AER errors") > > af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling") > > > > But I guess maybe it's not quite the same case? > > Actually, we use this code to handle errors that are reported to the > RCEC and only implement here the CXL specifics. That is, checking if > the RCEC receives something from a CXL downstream port and forwarding > that to a CXL handler (this patch). The handler then checks the AER > err cap in the RCRB of all CXL downstream ports associated to the RCEC > (not visible in the PCI hierarchy), but discovered through the :00.0 > RCiEP (patch #5). There are two calls to pcie_walk_rcec(): 1) The existing one in find_source_device() 2) The one you add in handle_cxl_error() Does the call in handle_cxl_error() look at devices that the existing call in find_source_device() does not? I'm trying to understand why we need both calls. > > > +static bool is_internal_error(struct aer_err_info *info) > > > +{ > > > + if (info->severity == AER_CORRECTABLE) > > > + return info->status & PCI_ERR_COR_INTERNAL; > > > + > > > + return info->status & PCI_ERR_UNC_INTN; > > > +} > > > + > > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info > > > *info) > > > +{ > > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC && > > > + is_internal_error(info)) > > > > What's unique about Internal Errors? I'm trying to figure out why you > > wouldn't do this for *all* CXL errors. > > Per CXL specification downstream port errors are signaled using > internal errors. Maybe a spec reference here to explain is_internal_error()? Is the point of the check to *exclude* non-internal errors? Or is basically documentation that there shouldn't ever *be* any non-internal errors? 
I guess the latter wouldn't make sense because at this point we don't know whether this is a CXL hierarchy. > All other errors would be device specific, we cannot > handle that in a generic CXL driver. I'm missing the point here. We don't have any device-specific error handling in aer.c; it only connects the generic *reporting* mechanism (AER log registers and Root Port interrupts) to the drivers that do the device-specific things via err_handler hooks. I assume we want a similar model for CXL. Bjorn
Re: Memory coherency issue with IO thread offloading?
On 3/28/23 6:51 AM, Michael Ellerman wrote: > Jens Axboe writes: Can the queueing cause the creation of an IO thread (if one does not exist, or all blocked?) >>> >>> Yep >>> >>> Since writing this email, I've gone through a lot of different tests. >>> Here's a rough listing of what I found: >>> >>> - Like using the hack patch, if I just limit the number of IO thread >>> workers to 1, it seems to pass. At least longer than before, does 1000 >>> iterations. >>> >>> - If I pin each IO worker to a single CPU, it also passes. >>> >>> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails. >>> I've added one before queueing the work item, and after. One before >>> the io-wq worker grabs a work item and one after. Eg full hammer >>> approach. This still fails. >>> >>> Puzzling... For the "pin each IO worker to a single CPU" I added some >>> basic code around trying to ensure that a work item queued on CPU X >>> would be processed by a worker on CPU X, and to a large degree, this >>> does happen. But since the work list is a normal list, it's quite >>> possible that some other worker finishes its work on CPU Y just in time >>> to grab the one from cpu X. I checked and this does happen in the test >>> case, yet it still passes. This may be because I got a bit lucky, but >>> seems suspect with thousands of passes of the test case. >>> >>> Another theory there is that it's perhaps related to an io-wq worker >>> being rescheduled on a different CPU. Though again puzzled as to why the >>> smp_mb sprinkling didn't fix that then. I'm going to try and run the >>> test case with JUST the io-wq worker pinning and not caring about where >>> the work is processed to see if that does anything. >> >> Just pinning each worker to whatever CPU they got created on seemingly >> fixes the issue too. This does not mean that each worker will process >> work on the CPU on which it was queued, just that each worker will >> remain on whatever CPU it originally got created on. 
>> >> Puzzling... >> >> Note that it is indeed quite possible that this isn't a ppc issue at >> all, just shows on ppc. It could be page cache related, or it could even >> be a bug in mariadb itself. > > I tried binary patching every lwsync to hwsync (read/write to full > barrier) in mariadbd and all the libaries it links. It didn't fix the > problem. > > I also tried switching all the kernel barriers/spin locks to using a > hwsync, but that also didn't fix it. > > It's still possible there's somewhere that currently has no barrier at > all that needs one, the above would only fix the problem if we have a > read/write barrier that actually needs to be a full barrier. > > > I also looked at making all TLB invalidates broadcast, regardless of > whether we think the thread has only been on a single CPU. That didn't > help, but I'm not sure I got all places where we do TLB invalidates, so > I'll look at that some more tomorrow. Thanks, appreciate your testing! I have no new data points since yesterday, but the key point from then still seems to be that if an io worker never reschedules onto a different CPU, then the problem doesn't occur. This could very well be a page cache issue, if it isn't an issue on the powerpc side... -- Jens Axboe
Re: [PATCH] perf vendor events power9: Remove UTF-8 characters from json files
On Tue, Mar 28, 2023 at 4:30 AM Kajol Jain wrote: > > Commit 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events") > added and updated power9 pmu json events. However some of the json > events which are part of other.json and pipeline.json files, > contains UTF-8 characters in their brief description. > Having UTF-8 character could brakes the perf build on some distros. nit: s/bakes/break/ > Fix this issue by removing the UTF-8 characters from other.json and > pipeline.json files. > > Result without the fix patch: > [command]# file -i pmu-events/arch/powerpc/power9/* > pmu-events/arch/powerpc/power9/cache.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/floating-point.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/frontend.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/marked.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/memory.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/metrics.json:application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/nest_metrics.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/other.json: application/json; > charset=utf-8 > pmu-events/arch/powerpc/power9/pipeline.json: application/json; > charset=utf-8 > pmu-events/arch/powerpc/power9/pmc.json:application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/translation.json:application/json; > charset=us-ascii > > Result with the fix patch: > > [command]# file -i pmu-events/arch/powerpc/power9/* > pmu-events/arch/powerpc/power9/cache.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/floating-point.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/frontend.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/marked.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/memory.json: 
application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/metrics.json:application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/nest_metrics.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/other.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/pipeline.json: application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/pmc.json:application/json; > charset=us-ascii > pmu-events/arch/powerpc/power9/translation.json:application/json; > charset=us-ascii > > Fixes: 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events") > Reported-by: Arnaldo Carvalho de Melo > Link: https://lore.kernel.org/lkml/zbxp77deq7ikt...@kernel.org/ > Signed-off-by: Kajol Jain Acked-by: Ian Rogers Thanks, Ian > --- > tools/perf/pmu-events/arch/powerpc/power9/other.json| 4 ++-- > tools/perf/pmu-events/arch/powerpc/power9/pipeline.json | 2 +- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/tools/perf/pmu-events/arch/powerpc/power9/other.json > b/tools/perf/pmu-events/arch/powerpc/power9/other.json > index 3f69422c21f9..f10bd554521a 100644 > --- a/tools/perf/pmu-events/arch/powerpc/power9/other.json > +++ b/tools/perf/pmu-events/arch/powerpc/power9/other.json > @@ -1417,7 +1417,7 @@ >{ > "EventCode": "0x45054", > "EventName": "PM_FMA_CMPL", > -"BriefDescription": "two flops operation completed (fmadd, fnmadd, > fmsub, fnmsub) Scalar instructions only. " > +"BriefDescription": "two flops operation completed (fmadd, fnmadd, > fmsub, fnmsub) Scalar instructions only." 
>}, >{ > "EventCode": "0x201E8", > @@ -2017,7 +2017,7 @@ >{ > "EventCode": "0xC0BC", > "EventName": "PM_LSU_FLUSH_OTHER", > -"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 > caused search of LRQ for oldest snooped load, This will either signal a > Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid > Flush Next (several cases of this, one example is store and reload are lined > up such that a store-hit-reload scenario exists and the CDF has already > launched and has gotten bad/stale data); Bad Data Valid Flush Next (might be > a few cases of this, one example is a larxa (D$ hit) return data and dval but > can't allocate to LMQ (LMQ full or other reason). Already gave dval but can't > watch it for snoop_hit_larx. Need to take the “bad dval” back and flush all > younger ops)" > +"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 > caused search of LRQ for oldest snooped load, This will either signal a > Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid > Flush Next (several cases of this, one example is store and reload are lined > up such that a store-hit-reload scenario exists
Re: [PATCH v12 13/13] arm64: dts: ls1088ardb: Add serdes descriptions
On 3/28/23 05:25, Ioana Ciornei wrote: > On Mon, Mar 27, 2023 at 02:15:47PM -0400, Sean Anderson wrote: >> On 3/24/23 09:17, Ioana Ciornei wrote: >> > On Tue, Mar 21, 2023 at 04:13:12PM -0400, Sean Anderson wrote: >> >> This adds serdes support to the LS1088ARDB. I have tested the QSGMII >> >> ports as well as the two 10G ports. The SFP slot is now fully supported, >> >> instead of being modeled as a fixed-link. >> >> >> >> Linux hangs around when the serdes is initialized if the si5341 is >> >> enabled with the in-tree driver, so I have modeled it as two fixed >> >> clocks instead. There are a few registers in the QIXIS FPGA which >> >> control the SFP GPIOs; I have modeled them as discrete GPIO controllers >> >> for now. I never saw the AQR105 interrupt fire; not sure what was going >> >> on, but I have removed it to force polling. >> > >> > So you didn't see the interrupt fire even without these patches? >> >> Not sure. I went to check this, and discovered I could no longer get the >> link to come up in Linux, even on v6.0 (before the rate adaptation >> stuff). I see the LEDs blinking in U-Boot, so presumably it's some >> configuration problem. I'm going to look into this further when I have >> more time. >> >> > I just tested this on a LS1088ARDB and it works. >> > >> >root@localhost:~# cat /proc/interrupts | grep extirq >> > 99: 5 ls-extirq 2 Level 0x08b97000:00 >> >root@localhost:~# ip link set dev endpmac2 up >> >root@localhost:~# cat /proc/interrupts | grep extirq >> > 99: 6 ls-extirq 2 Level 0x08b97000:00 >> >root@localhost:~# ip link set dev endpmac2 down >> >root@localhost:~# cat /proc/interrupts | grep extirq >> > 99: 7 ls-extirq 2 Level 0x08b97000:00 >> > >> > Please don't just remove things. >> >> Well, polling isn't the worst thing for a single interface... I do >> remember having a problem with the interrupt. If this series works >> with interrupts enabled, I can leave it in. 
>> >> Did you have a chance to look at the core (patches 7 and 8) of this >> series? Does it make sense to you? Am I missing something which would >> allow switching from 1G->10G? >> > > For a bit of context, I also attempted dynamic switching from 1G to 10G > on my own even before this patch set but I did not get a link up on the > PCS (CDR lock was there though). Pretty much the same state as you. > > What I propose is to take this whole endeavor step by step. > I am also interested in getting this feature to actually work but I > just didn't have the time to investigate in depth what is missing. > And without the dynamic switching I cannot say that I find the addition > of the SerDes PHY driver useful. Well, it's still useful for supporting 1G and 10G. I touched on this in the cover letter, but there are conflicting PLL defaults on the LS1046A. If you use SRDS_PRCTL 1133, the PLL mapping is 2211. But for it's . This means PLL2 is used for both 1G and 10G, but no reference frequency can work for both. This will cause the PBL to enter a reset loop, since it wants the PLLs to lock before booting, and this happens before any user configuration. To get around this, we disconnected RESET_REQ_B on our board, and we use this driver to configure the PLLs correctly for whatever SRDS_PRCTL we boot up with. This way we can have two RCWs for 1G and 10G configuration. > I have the Lynx 10G on my TODO list but I still have some other tasks > on the Lynx 28G for the next 2-3 weeks. Once I get those done, I will > look closer at the patches. OK, thanks. > In the meantime, some small things from this patch set can be submitted > separately. For example, describing the SFP cage on the LS1088ARDB. I'll have a look at this. I suppose I will also split off the IRQ thing. > I still have some small questions on the DTS implementation for the gpio > controllers but I would be able to submit this myself if you do not find > the time (with your authorship of course). 
I am going to do another revision to address the GPIO binding problem, so please ask away. --Sean
Re: [PATCH v3 4/4] of: address: Always use dma_default_coherent for default coherency
On Tue, Mar 21, 2023 at 6:08 AM Jiaxun Yang wrote: > > As for now all arches have dma_default_coherent reflecting default > DMA coherency for of devices, so there is no need to have a standalone > config option. > > Signed-off-by: Jiaxun Yang > --- > v3: Squash setting ARCH_DMA_DEFAULT_COHERENT into this patch. > --- > arch/powerpc/Kconfig | 2 +- > arch/riscv/Kconfig | 2 +- > drivers/of/Kconfig | 4 > drivers/of/address.c | 10 +- > 4 files changed, 3 insertions(+), 15 deletions(-) Reviewed-by: Rob Herring
Re: [PATCH v3 1/4] of: address: Fix default coherency for MIPS
On Tue, Mar 21, 2023 at 6:08 AM Jiaxun Yang wrote: > > DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but > might override the system-wide default at runtime. > > Use dma_default_coherent to override default coherence for > MIPS. > I assume you want this tagged for stable? Otherwise, I don't understand why you add this here and then remove in patch 4. > Signed-off-by: Jiaxun Yang > --- > drivers/of/address.c | 8 > 1 file changed, 8 insertions(+) > > diff --git a/drivers/of/address.c b/drivers/of/address.c > index 4c0b169ef9bf..c105d66a1fa4 100644 > --- a/drivers/of/address.c > +++ b/drivers/of/address.c > @@ -1105,6 +1105,14 @@ bool of_dma_is_coherent(struct device_node *np) > struct device_node *node; > bool is_coherent = IS_ENABLED(CONFIG_OF_DMA_DEFAULT_COHERENT); > > + /* > +* DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but > +* might override the system-wide default at runtime. > +*/ > +#if defined(CONFIG_MIPS) && defined(CONFIG_DMA_NONCOHERENT) > + is_coherent = dma_default_coherent; > +#endif > + > node = of_node_get(np); > > while (node) { > -- > 2.37.1 (Apple Git-137.1) >
Re: [PATCH v3 0/4] Use dma_default_coherent for devicetree default coherency
> On Mar 28, 2023, at 08:45, Thomas Bogendoerfer wrote: > > On Tue, Mar 28, 2023 at 03:18:12AM +0200, Christoph Hellwig wrote: >> On Fri, Mar 24, 2023 at 09:17:38AM +, Jiaxun Yang wrote: Is patch a 6.3 candidate or should all of it go into 6.4? >>> >>> Please leave it for 6.4, as corresponding MIPS arch part will be a part of >>> 6.4. >> >> Ok. I'll really need review from the MIPS and drivers/of/ maintainers, >> though. +cc devicetree folks. > > I don't see any MIPS changes in the series besides the ifdef CONFIG_MIPS > part in patch 1, which gets removed again in patch 4 (chance to drop > that completely ?). It was suggested by DMA folks to have that patch. > I've merged the corresponding MIPS patches into mips-next last week. Thanks - Jiaxun > > Thomas. > > -- > Crap can work. Given enough thrust pigs will fly, but it's not necessarily a > good idea.[ RFC1925, 2.3 ]
Re: Memory coherency issue with IO thread offloading?
Jens Axboe writes: >>> Can the queueing cause the creation of an IO thread (if one does not >>> exist, or all blocked?) >> >> Yep >> >> Since writing this email, I've gone through a lot of different tests. >> Here's a rough listing of what I found: >> >> - Like using the hack patch, if I just limit the number of IO thread >> workers to 1, it seems to pass. At least longer than before, does 1000 >> iterations. >> >> - If I pin each IO worker to a single CPU, it also passes. >> >> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails. >> I've added one before queueing the work item, and after. One before >> the io-wq worker grabs a work item and one after. Eg full hammer >> approach. This still fails. >> >> Puzzling... For the "pin each IO worker to a single CPU" I added some >> basic code around trying to ensure that a work item queued on CPU X >> would be processed by a worker on CPU X, and to a large degree, this >> does happen. But since the work list is a normal list, it's quite >> possible that some other worker finishes its work on CPU Y just in time >> to grab the one from cpu X. I checked and this does happen in the test >> case, yet it still passes. This may be because I got a bit lucky, but >> seems suspect with thousands of passes of the test case. >> >> Another theory there is that it's perhaps related to an io-wq worker >> being rescheduled on a different CPU. Though again puzzled as to why the >> smp_mb sprinkling didn't fix that then. I'm going to try and run the >> test case with JUST the io-wq worker pinning and not caring about where >> the work is processed to see if that does anything. > > Just pinning each worker to whatever CPU they got created on seemingly > fixes the issue too. This does not mean that each worker will process > work on the CPU on which it was queued, just that each worker will > remain on whatever CPU it originally got created on. > > Puzzling... 
> > Note that it is indeed quite possible that this isn't a ppc issue at > all, just shows on ppc. It could be page cache related, or it could even > be a bug in mariadb itself. I tried binary patching every lwsync to hwsync (read/write to full barrier) in mariadbd and all the libraries it links. It didn't fix the problem. I also tried switching all the kernel barriers/spin locks to using a hwsync, but that also didn't fix it. It's still possible there's somewhere that currently has no barrier at all that needs one; the above would only fix the problem if we have a read/write barrier that actually needs to be a full barrier. I also looked at making all TLB invalidates broadcast, regardless of whether we think the thread has only been on a single CPU. That didn't help, but I'm not sure I got all places where we do TLB invalidates, so I'll look at that some more tomorrow. cheers
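The lwsync-vs-hwsync distinction being probed above is the classic store-buffering case: on powerpc, lwsync orders every combination except a store followed by a load, and only a full hwsync (a seq_cst fence in C11 terms) orders that. A hedged sketch of the pattern, with illustrative names (the calls below run sequentially, which only shows the data flow; the reordering itself appears only under real concurrency):

```c
#include <stdatomic.h>

/* Store-buffering pattern: each side stores one flag, then loads the
 * other. With only release/acquire ordering (lwsync-strength on ppc),
 * both loads may observe 0 when the two sides race; a seq_cst fence
 * (hwsync-strength) between the store and the load rules that out. */
_Atomic int sb_x, sb_y;

int sb_side_a(void)
{
    atomic_store_explicit(&sb_x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier (hwsync) */
    return atomic_load_explicit(&sb_y, memory_order_relaxed);
}

int sb_side_b(void)
{
    atomic_store_explicit(&sb_y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier (hwsync) */
    return atomic_load_explicit(&sb_x, memory_order_relaxed);
}
```

Binary-patching every lwsync to hwsync, as tried in the message above, corresponds to upgrading every such fence to seq_cst throughout; since that did not fix the hang, the missing ordering is presumably at a spot that currently has no barrier at all.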
Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs
On 3/28/23 5:32 AM, Michael Ellerman wrote: > Jens Axboe writes: >> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which >> from my (arguably very short) checking is not commonly done for other >> archs. This is fine, except when PF_IO_WORKER's have been created and >> the task does something that causes a coredump to be generated. > > Do kthread's ever core dump? I didn't think they did, but I can't find > any logic to prevent it. kthreads aren't associated with the original task, they just exist by themselves. They also can't take signals. Eg they cannot core dump, just oops :-) This is different than io workers that do show up as threads, but they still don't exit to userspace. That is why it ended up being a problem. > As Nick said we should probably have a non-NULL regs for PF_IO_WORKERS, > but I'll still take this as a nice backportable fix for the immediate > crash. > > I tagged it as Fixes: pointing back at the commit that added ppr_get(), > even though I don't know for sure the bug was triggerable back then > (v4.8). Thanks! -- Jens Axboe
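The shape of the backportable fix discussed here is a guard at the top of the register-copy path: if the task has no saved user registers (a PF_KTHREAD or PF_IO_WORKER task on powerpc), bail out before dereferencing them rather than crashing in the coredump path. A simplified stand-in, not the actual kernel code or types:

```c
#include <string.h>

#define TOY_EIO 5

/* Toy stand-ins for pt_regs and task_struct; names are illustrative. */
struct toy_regs { unsigned long gpr[4]; };
struct toy_task { struct toy_regs *regs; };

/* Guard first: a kernel thread or IO worker with regs == NULL must not
 * be dereferenced; report an error instead of oopsing during coredump. */
int toy_copy_regs(const struct toy_task *t, struct toy_regs *out)
{
    if (!t->regs)
        return -TOY_EIO;
    memcpy(out, t->regs, sizeof(*out));
    return 0;
}
```

The longer-term cleanup Nick and Michael mention (always giving such tasks a non-NULL, possibly dummy, regs area) would make guards like this unnecessary at each call site.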
Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure
On 2023-03-28 23:02:09, Michael Ellerman wrote: > Kautuk Consul writes: > > On 2023-03-28 15:44:02, Kautuk Consul wrote: > >> On 2023-03-28 20:44:48, Michael Ellerman wrote: > >> > Kautuk Consul writes: > >> > > kvmppc_vcore_create() might not be able to allocate memory through > >> > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be > >> > > incremented. > >> > > >> > I agree that looks wrong. > >> > > >> > Have you tried to test what goes wrong if it fails? It looks like it > >> > will break the LPCR update, which likely will cause the guest to crash > >> > horribly. > > Also, are you referring to the code in kvmppc_update_lpcr()? > > That code will not crash as it checks for the vc before trying to > > dereference it. > > Yeah that's what I was looking at. I didn't mean it would crash, but > that it would bail out early when it sees a NULL vcore, leaving other > vcores with the wrong LPCR value. > > But as you say it doesn't happen because qemu quits on the first ENOMEM. > > And regardless if qemu does something that means the guest is broken > that's just a qemu bug, no big deal as far as the kernel is concerned. But there could be another user-mode application other than qemu that actually tries to create a vcpu after it gets a -ENOMEM for another vcpu. Shouldn't the kernel be independent of qemu? > > > But the following 2 places that utilize the arch.online_vcores will have > > problems in logic if the usermode test-case doesn't pull down the > > kvm context after the -ENOMEM vcpu allocation failure: > > book3s_hv.c:3030: if (!kvm->arch.online_vcores) { > > book3s_hv_rm_mmu.c:44: if (kvm->arch.online_vcores == 1 && > > local_paca->kvm_hstate.kvm_vcpu) > > OK. Both of those look harmless to the host. Harmless to the host in terms of a crash, not in terms of behavior. 
For example in the case of kvmhv_set_smt_mode: If we got a kzalloc failure once (and online_vcores was wrongly incremented), then if kvmhv_set_smt_mode() is called after that then it would be not be setting the arch.smt_mode and arch.emul_smt_mode correctly and it would be wrongly returning with -EBUSY instead of 0. Isn't that incorrect with respect to the intent of the code ? I agree that applications like qemu might not do that but don't we need to have some integrity with respect to the intent and value of variable use ? What about good code and logic quality ? > > If we find a case where a misbehaving qemu can crash the host then we > need to be a bit more careful and treat it at least as a > denial-of-service bug. But looks like this is not one of those. > > cheers beers
Re: Memory coherency issue with IO thread offloading?
Daniel Black writes: > Thanks Jens, Nick, Christophe and Michael for your work so far. > > Apologies for the out of thread email. > > Confirming MariaDB 10.6+ is required (when we added liburing), and > previous versions used libaio (which tested without incident as mpe > retested). Thanks! I was trying to follow the threads from the mariadb/server github to find the CI but didn't have much luck. No .travis.yml these days :) > We were (we're now back on the old good kernel Jens indicated) getting > failures like https://buildbot.mariadb.org/#/builders/231/builds/16857 > in a container (of various distro userspaces) on bare metal. > > bare metal end of /proc/cpuinfo > > processor: 127 > cpu: POWER9, altivec supported > clock: 3283.00MHz > revision: 2.2 (pvr 004e 1202) > > timebase: 51200 > platform: PowerNV > model: 9006-22P > machine: PowerNV 9006-22P > firmware: OPAL > MMU: Radix Thanks, looks like that's a Boston, which at least has the same CPU as the machine I'm testing on. I have also reproduced it on bare metal now, but it seems much easier to hit it in an LPAR (on P10). Not sure that's an actual data point, could just be due to other activity on the box. cheers
Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure
Kautuk Consul writes: > On 2023-03-28 15:44:02, Kautuk Consul wrote: >> On 2023-03-28 20:44:48, Michael Ellerman wrote: >> > Kautuk Consul writes: >> > > kvmppc_vcore_create() might not be able to allocate memory through >> > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be >> > > incremented. >> > >> > I agree that looks wrong. >> > >> > Have you tried to test what goes wrong if it fails? It looks like it >> > will break the LPCR update, which likely will cause the guest to crash >> > horribly. > Also, are you referring to the code in kvmppc_update_lpcr()? > That code will not crash as it checks for the vc before trying to > dereference it. Yeah that's what I was looking at. I didn't mean it would crash, but that it would bail out early when it sees a NULL vcore, leaving other vcores with the wrong LPCR value. But as you say it doesn't happen because qemu quits on the first ENOMEM. And regardless if qemu does something that means the guest is broken that's just a qemu bug, no big deal as far as the kernel is concerned. > But the following 2 places that utilize the arch.online_vcores will have > problems in logic if the usermode test-case doesn't pull down the > kvm context after the -ENOMEM vcpu allocation failure: > book3s_hv.c:3030: if (!kvm->arch.online_vcores) { > book3s_hv_rm_mmu.c:44: if (kvm->arch.online_vcores == 1 && > local_paca->kvm_hstate.kvm_vcpu) OK. Both of those look harmless to the host. If we find a case where a misbehaving qemu can crash the host then we need to be a bit more careful and treat it at least as a denial-of-service bug. But looks like this is not one of those. cheers
Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs
"Nicholas Piggin" writes: > On Mon Mar 27, 2023 at 8:26 PM AEST, Christophe Leroy wrote: ... >> >> Now that thread.regs doesn't change anymore at each interrupt, it would >> probably be worth dropping it and falling back to task_pt_regs() as >> defined on most architectures. >> Knowing whether a thread is a kernel or user thread can for instance be >> known with PF_KTHREAD flag, so no need of thread.regs for that. > > That would be nice if we can define regs that way, I agree. We should > look into doing that. Yeah it's on the long-list of things that need cleaning up. I think there's some complication in working out which sites are OK to use/give-out the value in pt_regs that's potentially a dummy value, vs cases that actually want to check PF_KTHREAD and do something different. But that's just my hunch; I haven't looked through all the sites. The thread.regs = NULL for kernel threads goes back to arch/ppc, about 2002 by the looks: https://github.com/mpe/linux-fullhistory/commit/2a8e186c384c0c911f91cd12367658eabdc820d8#diff-939b705cff722ee75595fad30d56bb1175dfdce49a69adb4d5656f354be076c6 There's no change log of course :) Still maybe it doesn't matter why it was originally done that way, if we can do it differently now. cheers
Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs
Jens Axboe writes: > Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which > from my (arguably very short) checking is not commonly done for other > archs. This is fine, except when PF_IO_WORKER's have been created and > the task does something that causes a coredump to be generated. Do kthread's ever core dump? I didn't think they did, but I can't find any logic to prevent it. Maybe it's always been possible but just never happened due to luck. As Nick said we should probably have a non-NULL regs for PF_IO_WORKERS, but I'll still take this as a nice backportable fix for the immediate crash. I tagged it as Fixes: pointing back at the commit that added ppr_get(), even though I don't know for sure the bug was triggerable back then (v4.8). cheers
[PATCH] perf vendor events power9: Remove UTF-8 characters from json files
Commit 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events") added and updated power9 pmu json events. However some of the json events which are part of other.json and pipeline.json files, contains UTF-8 characters in their brief description. Having UTF-8 character could brakes the perf build on some distros. Fix this issue by removing the UTF-8 characters from other.json and pipeline.json files. Result without the fix patch: [command]# file -i pmu-events/arch/powerpc/power9/* pmu-events/arch/powerpc/power9/cache.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/floating-point.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/frontend.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/marked.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/memory.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/metrics.json:application/json; charset=us-ascii pmu-events/arch/powerpc/power9/nest_metrics.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/other.json: application/json; charset=utf-8 pmu-events/arch/powerpc/power9/pipeline.json: application/json; charset=utf-8 pmu-events/arch/powerpc/power9/pmc.json:application/json; charset=us-ascii pmu-events/arch/powerpc/power9/translation.json:application/json; charset=us-ascii Result with the fix patch: [command]# file -i pmu-events/arch/powerpc/power9/* pmu-events/arch/powerpc/power9/cache.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/floating-point.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/frontend.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/marked.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/memory.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/metrics.json:application/json; charset=us-ascii pmu-events/arch/powerpc/power9/nest_metrics.json: 
application/json; charset=us-ascii pmu-events/arch/powerpc/power9/other.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/pipeline.json: application/json; charset=us-ascii pmu-events/arch/powerpc/power9/pmc.json:application/json; charset=us-ascii pmu-events/arch/powerpc/power9/translation.json:application/json; charset=us-ascii Fixes: 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events") Reported-by: Arnaldo Carvalho de Melo Link: https://lore.kernel.org/lkml/zbxp77deq7ikt...@kernel.org/ Signed-off-by: Kajol Jain --- tools/perf/pmu-events/arch/powerpc/power9/other.json| 4 ++-- tools/perf/pmu-events/arch/powerpc/power9/pipeline.json | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/perf/pmu-events/arch/powerpc/power9/other.json b/tools/perf/pmu-events/arch/powerpc/power9/other.json index 3f69422c21f9..f10bd554521a 100644 --- a/tools/perf/pmu-events/arch/powerpc/power9/other.json +++ b/tools/perf/pmu-events/arch/powerpc/power9/other.json @@ -1417,7 +1417,7 @@ { "EventCode": "0x45054", "EventName": "PM_FMA_CMPL", -"BriefDescription": "two flops operation completed (fmadd, fnmadd, fmsub, fnmsub) Scalar instructions only. " +"BriefDescription": "two flops operation completed (fmadd, fnmadd, fmsub, fnmsub) Scalar instructions only." }, { "EventCode": "0x201E8", @@ -2017,7 +2017,7 @@ { "EventCode": "0xC0BC", "EventName": "PM_LSU_FLUSH_OTHER", -"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 caused search of LRQ for oldest snooped load, This will either signal a Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid Flush Next (several cases of this, one example is store and reload are lined up such that a store-hit-reload scenario exists and the CDF has already launched and has gotten bad/stale data); Bad Data Valid Flush Next (might be a few cases of this, one example is a larxa (D$ hit) return data and dval but can't allocate to LMQ (LMQ full or other reason). 
Already gave dval but can't watch it for snoop_hit_larx. Need to take the “bad dval” back and flush all younger ops)" +"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 caused search of LRQ for oldest snooped load, This will either signal a Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid Flush Next (several cases of this, one example is store and reload are lined up such that a store-hit-reload scenario exists and the CDF has already launched and has gotten bad/stale data); Bad Data Valid Flush Next (might be a few cases of this, one example is a larxa (D$ hit) return data and dval but can't allocate to LMQ (LMQ full or other reason). Already gave dval but can't watch it for snoop_hit_larx. Need to take the 'bad
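The `file -i` checks in the commit message above boil down to scanning each file for bytes outside 7-bit ASCII. A minimal standalone sketch of that check (not part of the perf tooling; the helper name is illustrative):

```c
#include <stddef.h>

/* Return the offset of the first byte outside 7-bit ASCII, or -1 when
 * the buffer is pure us-ascii (what `file -i` reports for the fixed
 * json files). Multi-byte UTF-8 sequences always start with a byte
 * >= 0x80, so this catches characters like the curly quotes that were
 * around "bad dval" in PM_LSU_FLUSH_OTHER's description. */
long first_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] & 0x80)
            return (long)i;
    return -1;
}
```

Running such a scan over other.json before the fix would flag the UTF-8 quotation marks; after the fix, every power9 json file is plain ASCII.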
Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure
On 2023-03-28 15:44:02, Kautuk Consul wrote: > On 2023-03-28 20:44:48, Michael Ellerman wrote: > > Kautuk Consul writes: > > > kvmppc_vcore_create() might not be able to allocate memory through > > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be > > > incremented. > > > > I agree that looks wrong. > > > > Have you tried to test what goes wrong if it fails? It looks like it > > will break the LPCR update, which likely will cause the guest to crash > > horribly. Also, are you referring to the code in kvmppc_update_lpcr()? That code will not crash as it checks for the vc before trying to dereference it. But the following 2 places that utilize the arch.online_vcores will have problems in logic if the usermode test-case doesn't pull down the kvm context after the -ENOMEM vcpu allocation failure: book3s_hv.c:3030: if (!kvm->arch.online_vcores) { book3s_hv_rm_mmu.c:44: if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcpu) > Not sure about LPCR update, but with and without the patch qemu exits > and so the kvm context is pulled down fine. > > > > You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one > > allocation for a guest. Or probably easier to just hack the code to fail > > the 4th time it's called using a static counter. > I am using live debug and I set the r3 return value to 0x0 after the > call to kzalloc. > > > > Doesn't really matter but could be interesting. > With and without this patch qemu quits with: > qemu-system-ppc64: kvm_init_vcpu: kvm_get_vcpu failed (0): Cannot allocate > memory > > That's because qemu will shut down when any vcpu is not able > to be allocated. > > > > > Add a check for kzalloc failure and return with -ENOMEM from > > > kvmppc_core_vcpu_create_hv(). 
> > > > > > Signed-off-by: Kautuk Consul > > > --- > > > arch/powerpc/kvm/book3s_hv.c | 10 +++--- > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > > > index 6ba68dd6190b..e29ee755c920 100644 > > > --- a/arch/powerpc/kvm/book3s_hv.c > > > +++ b/arch/powerpc/kvm/book3s_hv.c > > > @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct > > > kvm_vcpu *vcpu) > > > pr_devel("KVM: collision on id %u", id); > > > vcore = NULL; > > > } else if (!vcore) { > > > + vcore = kvmppc_vcore_create(kvm, > > > + id & ~(kvm->arch.smt_mode - 1)); > > > > That line doesn't need to be wrapped, we allow 90 columns. > > > > > + if (unlikely(!vcore)) { > > > + mutex_unlock(&kvm->lock); > > > + return -ENOMEM; > > > + } > > > > Rather than introducing a new return point here, I think it would be > > preferable to use the existing !vcore case below. > > > > > /* > > >* Take mmu_setup_lock for mutual exclusion > > >* with kvmppc_update_lpcr(). > > >*/ > > > - err = -ENOMEM; > > > - vcore = kvmppc_vcore_create(kvm, > > > - id & ~(kvm->arch.smt_mode - 1)); > > > > So leave that as is (maybe move the comment down). > > > > And wrap the below in: > > > > + if (vcore) { > > > > > mutex_lock(&kvm->arch.mmu_setup_lock); > > > kvm->arch.vcores[core] = vcore; > > > kvm->arch.online_vcores++; > > > > mutex_unlock(&kvm->arch.mmu_setup_lock); > + } > > } > > } > > > > Meaning the vcore == NULL case will fall through to here and return via > > this existing path: > > > > mutex_unlock(&kvm->lock); > > > > if (!vcore) > > return err; > > > > > > cheers
Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure
On 2023-03-28 20:44:48, Michael Ellerman wrote: > Kautuk Consul writes: > > kvmppc_vcore_create() might not be able to allocate memory through > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be > > incremented. > > I agree that looks wrong. > > Have you tried to test what goes wrong if it fails? It looks like it > will break the LPCR update, which likely will cause the guest to crash > horribly. Not sure about LPCR update, but with and without the patch qemu exits and so the kvm context is pulled down fine. > > You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one > allocation for a guest. Or probably easier to just hack the code to fail > the 4th time it's called using a static counter. I am using live debug and I set the r3 return value to 0x0 after the call to kzalloc. > > Doesn't really matter but could be interesting. With and without this patch qemu quits with: qemu-system-ppc64: kvm_init_vcpu: kvm_get_vcpu failed (0): Cannot allocate memory That's because qemu will shut down when any vcpu is not able to be allocated. > > > Add a check for kzalloc failure and return with -ENOMEM from > > kvmppc_core_vcpu_create_hv(). > > > > Signed-off-by: Kautuk Consul > > --- > > arch/powerpc/kvm/book3s_hv.c | 10 +++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > > index 6ba68dd6190b..e29ee755c920 100644 > > --- a/arch/powerpc/kvm/book3s_hv.c > > +++ b/arch/powerpc/kvm/book3s_hv.c > > @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct > > kvm_vcpu *vcpu) > > pr_devel("KVM: collision on id %u", id); > > vcore = NULL; > > } else if (!vcore) { > > + vcore = kvmppc_vcore_create(kvm, > > + id & ~(kvm->arch.smt_mode - 1)); > > That line doesn't need to be wrapped, we allow 90 columns. 
> > > + if (unlikely(!vcore)) { > > + mutex_unlock(&kvm->lock); > > + return -ENOMEM; > > + } > > Rather than introducing a new return point here, I think it would be > preferable to use the existing !vcore case below. > > > /* > > * Take mmu_setup_lock for mutual exclusion > > * with kvmppc_update_lpcr(). > > */ > > - err = -ENOMEM; > > - vcore = kvmppc_vcore_create(kvm, > > - id & ~(kvm->arch.smt_mode - 1)); > > So leave that as is (maybe move the comment down). > > And wrap the below in: > > + if (vcore) { > > > mutex_lock(&kvm->arch.mmu_setup_lock); > > kvm->arch.vcores[core] = vcore; > > kvm->arch.online_vcores++; > > mutex_unlock(&kvm->arch.mmu_setup_lock); > + } > } > } > > Meaning the vcore == NULL case will fall through to here and return via > this existing path: > > mutex_unlock(&kvm->lock); > > if (!vcore) > return err; > > > cheers
Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure
Kautuk Consul writes: > kvmppc_vcore_create() might not be able to allocate memory through > kzalloc. In that case the kvm->arch.online_vcores shouldn't be > incremented. I agree that looks wrong. Have you tried to test what goes wrong if it fails? It looks like it will break the LPCR update, which likely will cause the guest to crash horribly. You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one allocation for a guest. Or probably easier to just hack the code to fail the 4th time it's called using a static counter. Doesn't really matter but could be interesting. > Add a check for kzalloc failure and return with -ENOMEM from > kvmppc_core_vcpu_create_hv(). > > Signed-off-by: Kautuk Consul > --- > arch/powerpc/kvm/book3s_hv.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 6ba68dd6190b..e29ee755c920 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu > *vcpu) > pr_devel("KVM: collision on id %u", id); > vcore = NULL; > } else if (!vcore) { > + vcore = kvmppc_vcore_create(kvm, > + id & ~(kvm->arch.smt_mode - 1)); That line doesn't need to be wrapped, we allow 90 columns. > + if (unlikely(!vcore)) { > + mutex_unlock(&kvm->lock); > + return -ENOMEM; > + } Rather than introducing a new return point here, I think it would be preferable to use the existing !vcore case below. > /* >* Take mmu_setup_lock for mutual exclusion >* with kvmppc_update_lpcr(). >*/ > - err = -ENOMEM; > - vcore = kvmppc_vcore_create(kvm, > - id & ~(kvm->arch.smt_mode - 1)); So leave that as is (maybe move the comment down). 
And wrap the below in: + if (vcore) { > mutex_lock(&kvm->arch.mmu_setup_lock); > kvm->arch.vcores[core] = vcore; > kvm->arch.online_vcores++; > mutex_unlock(&kvm->arch.mmu_setup_lock); + } } } Meaning the vcore == NULL case will fall through to here and return via this existing path: mutex_unlock(&kvm->lock); if (!vcore) return err; cheers
Re: [PATCH v12 13/13] arm64: dts: ls1088ardb: Add serdes descriptions
On Mon, Mar 27, 2023 at 02:15:47PM -0400, Sean Anderson wrote: > On 3/24/23 09:17, Ioana Ciornei wrote: > > On Tue, Mar 21, 2023 at 04:13:12PM -0400, Sean Anderson wrote: > >> This adds serdes support to the LS1088ARDB. I have tested the QSGMII > >> ports as well as the two 10G ports. The SFP slot is now fully supported, > >> instead of being modeled as a fixed-link. > >> > >> Linux hangs around when the serdes is initialized if the si5341 is > >> enabled with the in-tree driver, so I have modeled it as a two fixed > >> clocks instead. There are a few registers in the QIXIS FPGA which > >> control the SFP GPIOs; I have modeled them as discrete GPIO controllers > >> for now. I never saw the AQR105 interrupt fire; not sure what was going > >> on, but I have removed it to force polling. > > > > So you didn't see the interrupt fire even without these patches? > > Not sure. I went to check this, and discovered I could no longer get the > link to come up in Linux, even on v6.0 (before the rate adaptation > stuff). I see the LEDs blinking in U-Boot, so presumably it's some > configuration problem. I'm going to look into this further when I have > more time. > > > I just tested this on a LS1088ARDB and it works. > > > > root@localhost:~# cat /proc/interrupts | grep extirq > > 99: 5 ls-extirq 2 Level 0x08b97000:00 > > root@localhost:~# ip link set dev endpmac2 up > > root@localhost:~# cat /proc/interrupts | grep extirq > > 99: 6 ls-extirq 2 Level 0x08b97000:00 > > root@localhost:~# ip link set dev endpmac2 down > > root@localhost:~# cat /proc/interrupts | grep extirq > > 99: 7 ls-extirq 2 Level 0x08b97000:00 > > > > Please don't just remove things. > > Well, polling isn't the worst thing for a single interface... I do > remember having a problem with the interrupt. If this series works > with interrupts enabled, I can leave it in. > > Did you have a chance to look at the core (patches 7 and 8) of this > series? Does it make sense to you?
> Am I missing something which would allow switching from 1G->10G? For a bit of context, I also attempted dynamic switching from 1G to 10G on my own even before this patch set but I did not get a link up on the PCS (CDR lock was there though). Pretty much the same state as you. What I propose is to take this whole endeavor step by step. I am also interested in getting this feature to actually work but I just didn't have the time to investigate in depth what is missing. And without the dynamic switching I cannot say that I find the addition of the SerDes PHY driver useful. I have the Lynx 10G on my TODO list but I still have some other tasks on the Lynx 28G for the next 2-3 weeks. Once I get those done, I will look closer at the patches. In the meantime, some small things from this patch set can be submitted separately. For example, describing the SFP cage on the LS1088ARDB. I still have some small questions on the DTS implementation for the gpio controllers but I would be able to submit this myself if you do not find the time (with your authorship of course). Ioana
Re: [PATCH 0/2] KVM: PPC: support kvm selftests
"Nicholas Piggin" writes: > On Tue Mar 28, 2023 at 3:43 AM AEST, Sean Christopherson wrote: >> On Mon, Mar 27, 2023, Nicholas Piggin wrote: >> > On Thu Mar 23, 2023 at 3:41 AM AEST, Sean Christopherson wrote: ... >> > > >> > > What is the long term plan for KVM PPC maintenance? I was under the >> > > impression >> > > that KVM PPC was trending toward "bug fixes only", but the addition of >> > > selftests >> > > support suggests otherwise. ... > >> and ideally there would be one or more M: (and R:) entries as well. I'm not >> all that concerned about the selftests support being abandoned, but the lack >> of >> specific contacts makes it look like KVM PPC is in maintenance-only mode, >> and it >> sounds like that's not the case. > > Yeah, I guess the intention was to bring it a bit more under general > arch/powerpc maintainership but it does look a bit odd having a top > level entry and no contacts. We'll reconsider what to do with that. Yeah I agree it ends up looking a bit weird. The intention was just to make it clear that Paul's tree was no longer where patches were being handled, and that they'd be handled as regular powerpc patches. At the time I hoped we'd relatively quickly be able to add someone as at least a KVM "R:"eviewer, but circumstances intervened. Hopefully we can fix that soon. cheers
Re: [PATCH v3 0/4] Use dma_default_coherent for devicetree default coherency
On Tue, Mar 28, 2023 at 03:18:12AM +0200, Christoph Hellwig wrote: > On Fri, Mar 24, 2023 at 09:17:38AM +, Jiaxun Yang wrote: > > > > > > Is patch a 6.3 candidate or should all of it go into 6.4? > > > > Please leave it for 6.4, as corresponding MIPS arch part will be a part of > > 6.4. > > Ok. I'll really need review from the MIPS and drivers/of/ maintainers, > though. I don't see any MIPS changes in the series besides the ifdef CONFIG_MIPS part in patch 1, which gets removed again in patch 4 (chance to drop that completely ?). I've merged the corresponding MIPS patches into mips-next last week. Thomas. -- Crap can work. Given enough thrust pigs will fly, but it's not necessarily a good idea.[ RFC1925, 2.3 ]
Re: [kvm-unit-tests v3 00/13] powerpc: updates, P10, PNV support
On 3/28/23 09:15, Nicholas Piggin wrote:
> On Tue Mar 28, 2023 at 2:09 AM AEST, Cédric Le Goater wrote:
>> On 3/27/23 14:45, Nicholas Piggin wrote:
>>> This series is growing a bit I'm sorry. v2 series added extra interrupt
>>> vectors support which was actually wrong because interrupt handling
>>> code can only cope with 0x100-size vectors and new ones are 0x80 and
>>> 0x20. It managed to work because those alias to the 0x100 boundary, but
>>> if more than one handler were installed in the same 0x100-aligned
>>> block it would crash. So a couple of patches added to cope with that.
>>
>> I gave them a try on P9 box
>
> Thanks!
>
>> $ ./run_tests.sh
>> PASS selftest-setup (2 tests)
>> PASS spapr_hcall (9 tests, 1 skipped)
>> PASS spapr_vpa (13 tests)
>> PASS rtas-get-time-of-day (10 tests)
>> PASS rtas-get-time-of-day-base (10 tests)
>> PASS rtas-set-time-of-day (5 tests)
>> PASS emulator (4 tests)
>> PASS h_cede_tm (2 tests)
>> FAIL sprs (75 tests, 1 unexpected failures)
>
> Oh you have a SPR failure too? I'll check that on a P9.

I think it was the WORT SPR.

>> FAIL sprs-migration (75 tests, 5 unexpected failures)
>>
>> And with TCG:
>>
>> $ ACCEL=tcg ./run_tests.sh
>> PASS selftest-setup (2 tests)
>> PASS spapr_hcall (9 tests, 1 skipped)
>> FAIL spapr_vpa (13 tests, 1 unexpected failures)
>>
>> The dispatch count seems bogus after unregister
>
> Yeah, that dispatch count after unregister test may be bogus actually.
> PAPR doesn't specify what should happen in that case. It was working
> here for me though so interesting it's different for you. I'll
> investigate it and maybe just remove that test for now.

It would be nice to keep it and skip it until the emulation is fixed.

>> PASS rtas-get-time-of-day (10 tests)
>> PASS rtas-get-time-of-day-base (10 tests)
>> PASS rtas-set-time-of-day (5 tests)
>> PASS emulator (4 tests)
>> SKIP h_cede_tm (qemu-system-ppc64: TCG cannot support more than 1
>> thread/core on a pseries machine)
>> FAIL sprs (75 tests, 16 unexpected failures)
>
> These should be TCG errors. I have it passing them all with patches
> posted to qemu lists. Very simple but effective way to catch a few
> classes of errors.

Ah I didn't try with your QEMU patches. Makes sense then.

Thanks,

C.
Re: [kvm-unit-tests v3 00/13] powerpc: updates, P10, PNV support
On Tue Mar 28, 2023 at 2:09 AM AEST, Cédric Le Goater wrote: > On 3/27/23 14:45, Nicholas Piggin wrote: > > This series is growing a bit I'm sorry. v2 series added extra interrupt > > vectors support which was actually wrong because interrupt handling > > code can only cope with 0x100-size vectors and new ones are 0x80 and > > 0x20. It managed to work because those alias to the 0x100 boundary, but > > if more than one handler were installed in the same 0x100-aligned > > block it would crash. So a couple of patches added to cope with that. > > > > I gave them a try on P9 box Thanks! > > $ ./run_tests.sh > PASS selftest-setup (2 tests) > PASS spapr_hcall (9 tests, 1 skipped) > PASS spapr_vpa (13 tests) > PASS rtas-get-time-of-day (10 tests) > PASS rtas-get-time-of-day-base (10 tests) > PASS rtas-set-time-of-day (5 tests) > PASS emulator (4 tests) > PASS h_cede_tm (2 tests) > FAIL sprs (75 tests, 1 unexpected failures) Oh you have a SPR failure too? I'll check that on a P9. > FAIL sprs-migration (75 tests, 5 unexpected failures) > > And with TCG: > > $ ACCEL=tcg ./run_tests.sh > PASS selftest-setup (2 tests) > PASS spapr_hcall (9 tests, 1 skipped) > FAIL spapr_vpa (13 tests, 1 unexpected failures) > > The dispatch count seems bogus after unregister Yeah, that dispatch count after unregister test may be bogus actually. PAPR doesn't specify what should happen in that case. It was working here for me though so interesting it's different for you. I'll investigate it and maybe just remove that test for now. > > PASS rtas-get-time-of-day (10 tests) > PASS rtas-get-time-of-day-base (10 tests) > PASS rtas-set-time-of-day (5 tests) > PASS emulator (4 tests) > SKIP h_cede_tm (qemu-system-ppc64: TCG cannot support more than 1 thread/core > on a pseries machine) > FAIL sprs (75 tests, 16 unexpected failures) These should be TCG errors. I have it passing them all with patches posted to qemu lists. Very simple but effective way to catch a few classes of errors. Thanks, Nick
Re: [PATCH 2/2] powerpc/64: Rename entry_64.S to prom_entry.S
Le 28/03/2023 à 08:51, Nicholas Piggin a écrit : > On Tue Mar 28, 2023 at 3:48 AM AEST, Christophe Leroy wrote: >> >> >> Le 25/03/2023 à 14:06, Nicholas Piggin a écrit : >>> This file contains only the enter_prom implementation now. >>> Trim includes and update header comment while we're here. >>> >>> Signed-off-by: Nicholas Piggin >>> --- >>>arch/powerpc/kernel/Makefile | 8 +++-- >>>.../kernel/{entry_64.S => prom_entry.S} | 30 ++- >>>scripts/head-object-list.txt | 2 +- >>>3 files changed, 9 insertions(+), 31 deletions(-) >>>rename arch/powerpc/kernel/{entry_64.S => prom_entry.S} (73%) >>> >>> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile >>> index ec70a1748506..ebba0896998a 100644 >>> --- a/arch/powerpc/kernel/Makefile >>> +++ b/arch/powerpc/kernel/Makefile >>> @@ -209,10 +209,12 @@ CFLAGS_paca.o += -fno-stack-protector >>> >>>obj-$(CONFIG_PPC_FPU)+= fpu.o >>>obj-$(CONFIG_ALTIVEC)+= vector.o >>> -obj-$(CONFIG_PPC64)+= entry_64.o >>> -obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init.o >>> >>> -extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check >>> +ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE >> >> You don't need that ifdef construct, you can do: >> >> obj64-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_entry.o > > Nice. So that could have been obj64-y from the start? 
Yes, although it is not used that way in ppc/kernel/Makefile: $ git grep -e obj64 -e obj32 arch/powerpc/kernel/Makefile arch/powerpc/kernel/Makefile:obj64-$(CONFIG_HIBERNATION)+= swsusp_asm64.o arch/powerpc/kernel/Makefile:obj64-$(CONFIG_AUDIT) += compat_audit.o arch/powerpc/kernel/Makefile:obj64-$(CONFIG_PPC_TRANSACTIONAL_MEM) += tm.o arch/powerpc/kernel/Makefile:obj-$(CONFIG_PPC64)+= $(obj64-y) arch/powerpc/kernel/Makefile:obj-$(CONFIG_PPC32)+= $(obj32-y) But it is in ppc/lib/Makefile: $ git grep -e obj64 -e obj32 arch/powerpc/lib/Makefile arch/powerpc/lib/Makefile:obj64-y += copypage_64.o copyuser_64.o mem_64.o hweight_64.o \ arch/powerpc/lib/Makefile:obj64-$(CONFIG_SMP) += locks.o arch/powerpc/lib/Makefile:obj64-$(CONFIG_ALTIVEC) += vmx-helper.o arch/powerpc/lib/Makefile:obj64-$(CONFIG_KPROBES_SANITY_TEST) += test_emulate_step.o \ arch/powerpc/lib/Makefile:obj64-y += quad.o arch/powerpc/lib/Makefile:obj-$(CONFIG_PPC64) += $(obj64-y) Christophe
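Putting the pieces of this exchange together, the ifdef block from the patch could collapse to something like the fragment below. This is only a sketch of the suggested direction, relying on the obj64-y fold line that the grep output above shows is already present in arch/powerpc/kernel/Makefile:

```make
# Accumulate 64-bit-only objects into obj64-y instead of an ifdef block.
obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)   += prom_init.o
obj64-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_entry.o
extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check

# Already in the Makefile: fold the per-word-size lists into obj-y.
obj-$(CONFIG_PPC64) += $(obj64-y)
obj-$(CONFIG_PPC32) += $(obj32-y)
```

The advantage is that every conditional stays a single line, matching the style the rest of the Makefile already uses.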
Re: [kvm-unit-tests v3 03/13] powerpc: Add some checking to exception handler install
On Tue Mar 28, 2023 at 12:39 AM AEST, Thomas Huth wrote: > On 27/03/2023 14.45, Nicholas Piggin wrote: > > Check to ensure exception handlers are not being overwritten or > > invalid exception numbers are used. > > > > Signed-off-by: Nicholas Piggin > > --- > > Since v2: > > - New patch > > > > lib/powerpc/processor.c | 12 > > 1 file changed, 12 insertions(+) > > > > diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c > > index ec85b9d..70391aa 100644 > > --- a/lib/powerpc/processor.c > > +++ b/lib/powerpc/processor.c > > @@ -19,11 +19,23 @@ static struct { > > void handle_exception(int trap, void (*func)(struct pt_regs *, void *), > > void * data) > > { > > + if (trap & 0xff) { > > You could check for the other "invalid exception handler" condition here > already, i.e. if (trap & ~0xf00) ... > > I'd maybe simply do an "assert(!(trap & ~0xf00))" here. > > > + printf("invalid exception handler %#x\n", trap); > > + abort(); > > + } > > + > > trap >>= 8; > > > > if (trap < 16) { > > ... then you could get rid of the if-statement here and remove one level of > indentation in the code below. Yes that's the way to do it. I feel embarrassed for not thinking of it :) Thanks, Nick > > > + if (func && handlers[trap].func) { > > + printf("exception handler installed twice %#x\n", trap); > > + abort(); > > + } > > handlers[trap].func = func; > > handlers[trap].data = data; > > + } else { > > + printf("invalid exception handler %#x\n", trap); > > + abort(); > > } > > } > > > > Thomas
Re: [PATCH 2/2] powerpc/64: Rename entry_64.S to prom_entry.S
On Tue Mar 28, 2023 at 3:48 AM AEST, Christophe Leroy wrote: > > > Le 25/03/2023 à 14:06, Nicholas Piggin a écrit : > > This file contains only the enter_prom implementation now. > > Trim includes and update header comment while we're here. > > > > Signed-off-by: Nicholas Piggin > > --- > > arch/powerpc/kernel/Makefile | 8 +++-- > > .../kernel/{entry_64.S => prom_entry.S} | 30 ++- > > scripts/head-object-list.txt | 2 +- > > 3 files changed, 9 insertions(+), 31 deletions(-) > > rename arch/powerpc/kernel/{entry_64.S => prom_entry.S} (73%) > > > > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile > > index ec70a1748506..ebba0896998a 100644 > > --- a/arch/powerpc/kernel/Makefile > > +++ b/arch/powerpc/kernel/Makefile > > @@ -209,10 +209,12 @@ CFLAGS_paca.o += -fno-stack-protector > > > > obj-$(CONFIG_PPC_FPU) += fpu.o > > obj-$(CONFIG_ALTIVEC) += vector.o > > -obj-$(CONFIG_PPC64)+= entry_64.o > > -obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init.o > > > > -extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check > > +ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE > > You don't need that ifdef construct, you can do: > > obj64-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_entry.o Nice. So that could have been obj64-y from the start? Thanks, Nick > > > +obj-y += prom_init.o > > +obj-$(CONFIG_PPC64)+= prom_entry.o > > +extra-y+= prom_init_check > > +endif > > > > quiet_cmd_prom_init_check = PROMCHK $@ > > cmd_prom_init_check = $(CONFIG_SHELL) $< "$(NM)" > > $(obj)/prom_init.o; touch $@ > > > Christophe
Re: [PATCH 0/2] KVM: PPC: support kvm selftests
On Tue Mar 28, 2023 at 3:43 AM AEST, Sean Christopherson wrote: > On Mon, Mar 27, 2023, Nicholas Piggin wrote: > > On Thu Mar 23, 2023 at 3:41 AM AEST, Sean Christopherson wrote: > > > On Thu, Mar 16, 2023, Michael Ellerman wrote: > > > > Nicholas Piggin writes: > > > > > Hi, > > > > > > > > > > This series adds initial KVM selftests support for powerpc > > > > > (64-bit, BookS). > > > > > > > > Awesome. > > > > > > > > > It spans 3 maintainers but it does not really > > > > > affect arch/powerpc, and it is well contained in selftests > > > > > code, just touches some makefiles and a tiny bit headers so > > > > > conflicts should be unlikely and trivial. > > > > > > > > > > I guess Paolo is the best point to merge these, if no comments > > > > > or objections? > > > > > > > > Yeah. If it helps: > > > > > > > > Acked-by: Michael Ellerman (powerpc) > > > > > > What is the long term plan for KVM PPC maintenance? I was under the > > > impression > > > that KVM PPC was trending toward "bug fixes only", but the addition of > > > selftests > > > support suggests otherwise. > > > > We plan to continue maintaining it. New support and features has been a > > bit low in the past couple of years, hopefully that will pick up a bit > > though. > > Partly out of curiosity, but also to get a general feel for what types of > changes > we might see, what are the main use cases for KVM PPC these days? E.g. is it > mainly > a vehicle for developing and testing, hosting VMs in the cloud, something > else? I didn't have too much specific in mind, just generally a bit more development activity. As for use cases, I think broadly KVM is increasingly seen as a Linux feature and expected to be there, even in so-called enterprise world. For software and workflows designed around Linux virt, it can be difficult or impossible to substitute the proprietary partitioning layer in our firmware. So there is always someone who wants KVM for something.
It could be all of the above, and even just as a traditional hypervisor hosting VMs in-house, there are people who would like to use KVM. > > > > I ask primarily because routing KVM PPC patches through the PPC tree is > > > going to > > > be problematic if KVM PPC sees significant development. The current > > > situation is > > > ok because the volume of patches is low and KVM PPC isn't trying to drive > > > anything > > > substantial into common KVM code, but if that changes... > > > > Michael has done KVM topic branches to pull from a few times when such > > conflicts came up (at smaller scale). If we end up with larger changes > > or regular conflicts we might start up a kvm-ppc tree again I guess. > > A wait-and-see approach works for me. I don't have any complaints with the > current > process, I was just caught off guard. Okay. Complaints and suggestions to improve the process are always welcome, so if something could be done better, feel free to ping the list or powerpc maintainers offline. > > > My other concern is that for selftests specifically, us KVM folks are > > > taking on > > > more maintenance burden by supporting PPC. AFAIK, none of the people > > > that focus > > > on KVM selftests in any meaningful capacity have access to PPC hardware, > > > let alone > > > know enough about the architecture to make intelligent code changes. > > > > > > Don't get me wrong, I'm very much in favor of more testing, I just don't > > > want KVM > > > to get left holding the bag. > > > > Understood. I'll be happy to maintain powerpc part of kvm selftests and > > do any fixes that are needed for core code changes. If support fell away > > you could leave it broken (or rm -rf it if you prefer) -- I wouldn't ask > > anybody to work on archs they don't know or aren't paid to. > > > > Not sure if anything more can be done to help your process or ease your > > mind. It (KVM and kvm-selftests) can run in QEMU at least.
> > Updating the KVM/powerpc to include selftests would be very helpful, e.g > > F: tools/testing/selftests/kvm/*/powerpc/ > F: tools/testing/selftests/kvm/powerpc/ Oh good point, I should have included that. I'll send a patch. > and ideally there would be one or more M: (and R:) entries as well. I'm not > all that concerned about the selftests support being abandoned, but the lack > of > specific contacts makes it look like KVM PPC is in maintenance-only mode, and > it > sounds like that's not the case. Yeah, I guess the intention was to bring it a bit more under general arch/powerpc maintainership but it does look a bit odd having a top level entry and no contacts. We'll reconsider what to do with that. Thanks, Nick
Memory coherency issue with IO thread offloading?
Thanks Jens, Nick, Christophe and Michael for your work so far. Apologies for the out-of-thread email. Confirming MariaDB 10.6+ is required (that's when we added liburing); previous versions used libaio (which tested without incident, as mpe retested). We were getting failures like https://buildbot.mariadb.org/#/builders/231/builds/16857 in a container (of various distro userspaces) on bare metal (we're now back on the known-good kernel Jens indicated).

Bare metal, end of /proc/cpuinfo:

processor	: 127
cpu		: POWER9, altivec supported
clock		: 3283.00MHz
revision	: 2.2 (pvr 004e 1202)

timebase	: 51200
platform	: PowerNV
model		: 9006-22P
machine		: PowerNV 9006-22P
firmware	: OPAL
MMU		: Radix
Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs
On Mon Mar 27, 2023 at 8:26 PM AEST, Christophe Leroy wrote: > > > Le 27/03/2023 à 08:36, Nicholas Piggin a écrit : > > On Mon Mar 27, 2023 at 8:15 AM AEST, Jens Axboe wrote: > >> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which > >> from my (arguably very short) checking is not commonly done for other > >> archs. This is fine, except when PF_IO_WORKER's have been created and > >> the task does something that causes a coredump to be generated. Then we > >> get this crash: > > > > Hey Jens, > > > > Thanks for the testing and the patch. > > > > I think your patch would work, but I'd be inclined to give the IO worker > > a pt_regs so it looks more like other archs and a regular user thread. > > > > Your IO worker bug reminded me to resurrect some copy_thread patches I > > had and I think they should do that > > > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2023-March/256271.html > > > > I wouldn't ask you to test it until I've at least tried, do you have a > > test case that triggers this? > > In fact, most architectures don't have thread.regs, but rely on something > like: > > #define task_pt_regs(p) \ > ((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1) > > > In powerpc, thread.regs was there because of the regs not being at the > same place in the stack depending on which interrupt it was. > > However with the two commits below, we now have stable fixed location > for the regs, so thread.regs shouldn't be needed anymore: > - db297c3b07af ("powerpc/32: Don't save thread.regs on interrupt entry") > - b5cfc9cd7b04 ("powerpc/32: Fix critical and debug interrupts on BOOKE") > > But in the meantime, thread.regs started to be used for additional > purpose, like knowing if it is a user thread or a kernel thread by using > thread.regs nullity. > > > Now that thread.regs doesn't change anymore at each interrupt, it would > probably be worth dropping it and falling back to task_pt_regs() as > defined on most architecture.
> Knowing whether a thread is a kernel or user thread can for instance be > known with PF_KTHREAD flag, so no need of thread.regs for that. That would be nice if we can define regs that way, I agree. We should look into doing that. Thanks, Nick