[PATCH v9 6/6] riscv: Use --emit-relocs in order to move .rela.dyn in init

2023-03-28 Thread Alexandre Ghiti
To circumvent an issue where placing the relocations inside the init
sections produces empty relocations, use --emit-relocs. But to avoid
carrying those relocations in vmlinux, use an intermediate
vmlinux.relocs file which is a copy of vmlinux *before* stripping its
relocations.

Suggested-by: Björn Töpel 
Suggested-by: Nick Desaulniers 
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Makefile  |  2 +-
 arch/riscv/Makefile.postlink | 13 +
 arch/riscv/boot/Makefile |  7 +++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 860b09e409c7..7dc6904a6836 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -8,7 +8,7 @@
 
 OBJCOPYFLAGS:= -O binary
 ifeq ($(CONFIG_RELOCATABLE),y)
-   LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+   LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro --emit-relocs
KBUILD_CFLAGS += -fPIE
 endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
index d5de8d520d3e..a46fc578b30b 100644
--- a/arch/riscv/Makefile.postlink
+++ b/arch/riscv/Makefile.postlink
@@ -15,12 +15,25 @@ quiet_cmd_relocs_check = CHKREL  $@
 cmd_relocs_check = \
	$(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
 
+ifdef CONFIG_RELOCATABLE
+quiet_cmd_cp_vmlinux_relocs = CPREL   vmlinux.relocs
+cmd_cp_vmlinux_relocs = cp vmlinux vmlinux.relocs
+
+quiet_cmd_relocs_strip = STRIPREL $@
+cmd_relocs_strip = $(OBJCOPY)   --remove-section='.rel.*'   \
+--remove-section='.rel__*'  \
+--remove-section='.rela.*'  \
+--remove-section='.rela__*' $@
+endif
+
 # `@true` prevents complaint when there is nothing to be done
 
 vmlinux: FORCE
@true
 ifdef CONFIG_RELOCATABLE
$(call if_changed,relocs_check)
+   $(call if_changed,cp_vmlinux_relocs)
+   $(call if_changed,relocs_strip)
 endif
 
 %.ko: FORCE
diff --git a/arch/riscv/boot/Makefile b/arch/riscv/boot/Makefile
index c72de7232abb..22b13947bd13 100644
--- a/arch/riscv/boot/Makefile
+++ b/arch/riscv/boot/Makefile
@@ -33,7 +33,14 @@ $(obj)/xipImage: vmlinux FORCE
 
 endif
 
+ifdef CONFIG_RELOCATABLE
+vmlinux.relocs: vmlinux
+   @ (! [ -f vmlinux.relocs ] && echo "vmlinux.relocs can't be found, please remove vmlinux and try again") || true
+
+$(obj)/Image: vmlinux.relocs FORCE
+else
 $(obj)/Image: vmlinux FORCE
+endif
$(call if_changed,objcopy)
 
 $(obj)/Image.gz: $(obj)/Image FORCE
-- 
2.37.2



[PATCH v9 5/6] riscv: Check relocations at compile time

2023-03-28 Thread Alexandre Ghiti
From: Alexandre Ghiti 

Relocating the kernel at runtime is done very early in the boot process,
so it is not convenient to check for relocations there and react when a
relocation was not expected.

A script in scripts/ extracts the relocations from vmlinux; it is then
used at postlink time to check the relocations.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Makefile.postlink | 36 
 arch/riscv/tools/relocs_check.sh | 26 +++
 2 files changed, 62 insertions(+)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh

diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index ..d5de8d520d3e
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===
+# Post-link riscv pass
+# ===
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include $(srctree)/scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = \
+   $(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh "$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+   @true
+ifdef CONFIG_RELOCATABLE
+   $(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+   @true
+
+clean:
+   @true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index ..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+   # These relocations are okay
+   #   R_RISCV_RELATIVE
+   grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+   exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
-- 
2.37.2



[PATCH v9 4/6] powerpc: Move script to check relocations at compile time in scripts/

2023-03-28 Thread Alexandre Ghiti
From: Alexandre Ghiti 

Relocating the kernel at runtime is done very early in the boot process,
so it is not convenient to check for relocations there and react when a
relocation was not expected.

The powerpc architecture has a script that checks at compile time for
such unexpected relocations: extract the common logic to scripts/ so
that other architectures can take advantage of it.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
Acked-by: Michael Ellerman  (powerpc)
---
 arch/powerpc/tools/relocs_check.sh | 18 ++
 scripts/relocs_check.sh| 20 
 2 files changed, 22 insertions(+), 16 deletions(-)
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh b/arch/powerpc/tools/relocs_check.sh
index 63792af00417..6b350e75014c 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-   # Only look at relocation lines.
-   grep -E '\

[PATCH v9 3/6] riscv: Introduce CONFIG_RELOCATABLE

2023-03-28 Thread Alexandre Ghiti
This config allows the 64-bit kernel to be compiled as PIE and relocated
to any virtual address at runtime: this paves the way to KASLR.
Runtime relocation is possible since the relocation metadata are embedded
into the kernel.

Note that relocating at runtime introduces an overhead even if the kernel
is loaded at the same address it was linked at, and that the compiler
options are those used by arm64, which uses the same RELA relocation
format.
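
For orientation, the runtime pass that this option enables boils down to
walking the .rela.dyn entries and adding the load offset to every
R_RISCV_RELATIVE target. The sketch below is illustrative only; the
function name and the simplified address handling are not taken from the
relocate_kernel() hunk further down (shown truncated), which additionally
has to cope with running before the final page tables are in place:

  #include <linux/elf.h>

  /* Section boundary symbols provided by the linker script changes below. */
  extern unsigned long __rela_dyn_start, __rela_dyn_end;

  static void apply_rela_sketch(uintptr_t reloc_offset)
  {
          Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start;
          Elf64_Rela *end  = (Elf64_Rela *)&__rela_dyn_end;

          for (; rela < end; rela++) {
                  if (ELF64_R_TYPE(rela->r_info) != R_RISCV_RELATIVE)
                          continue;
                  /* Patch the location to its link-time value plus the load offset. */
                  *(Elf64_Addr *)rela->r_offset = rela->r_addend + reloc_offset;
          }
  }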

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig  | 14 +
 arch/riscv/Makefile |  7 +++--
 arch/riscv/kernel/vmlinux.lds.S | 17 +--
 arch/riscv/mm/Makefile  |  4 +++
 arch/riscv/mm/init.c| 54 -
 5 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 3c5907431081..6ff9f574195d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -562,6 +562,20 @@ config COMPAT
 
  If you want to execute 32-bit userspace applications, say Y.
 
+config RELOCATABLE
+   bool "Build a relocatable kernel"
+   depends on MMU && 64BIT && !XIP_KERNEL
+   help
+  This builds a kernel as a Position Independent Executable (PIE),
+  which retains all relocation metadata required to relocate the
+  kernel binary at runtime to a different virtual address than the
+  address it was linked at.
+  Since RISCV uses the RELA relocation format, this requires a
+  relocation pass at runtime even if the kernel is loaded at the
+  same address it was linked at.
+
+  If unsure, say N.
+
 endmenu # "Kernel features"
 
 menu "Boot options"
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 6203c3378922..860b09e409c7 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -7,9 +7,12 @@
 #
 
 OBJCOPYFLAGS:= -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+   LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+   KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
-   LDFLAGS_vmlinux := --no-relax
+   LDFLAGS_vmlinux += --no-relax
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
 ifeq ($(CONFIG_RISCV_ISA_C),y)
CC_FLAGS_FTRACE := -fpatchable-function-entry=4
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index e05e6df44225..615ff5842690 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -122,10 +122,23 @@ SECTIONS
*(.sdata*)
}
 
-   .rela.dyn : {
-   *(.rela*)
+   .rela.dyn : ALIGN(8) {
+   __rela_dyn_start = .;
+   *(.rela .rela*)
+   __rela_dyn_end = .;
}
 
+#ifdef CONFIG_RELOCATABLE
+   .data.rel : { *(.data.rel*) }
+   .got : { *(.got*) }
+   .plt : { *(.plt) }
+   .dynamic : { *(.dynamic) }
+   .dynsym : { *(.dynsym) }
+   .dynstr : { *(.dynstr) }
+   .hash : { *(.hash) }
+   .gnu.hash : { *(.gnu.hash) }
+#endif
+
 #ifdef CONFIG_EFI
.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 2ac177c05352..b85e9e82f082 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_cacheflush.o = $(CC_FLAGS_FTRACE)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f803671d18b2..bce899b180cd 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_RELOCATABLE
+#include 
+#endif
 
 #include 
 #include 
@@ -146,7 +149,7 @@ static void __init print_vm_layout(void)
print_ml("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
 #endif
 
-   print_ml("kernel", (unsigned long)KERNEL_LINK_ADDR,
+   print_ml("kernel", (unsigned long)kernel_map.virt_addr,
 (unsigned long)ADDRESS_SPACE_END);
}
 }
@@ -831,6 +834,44 @@ static __init void set_satp_mode(void)
 #error "setup_vm() is called from head.S before relocate so it should not use 
absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+static void __init relocate_kernel(void)
+{
+   Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start;
+   /*
+* This holds the offset between the linked virtual address and the
+* relocated virtual address.
+*/
+   uintptr_t reloc_offset = kernel_map.virt_addr - KERNEL_LINK_ADDR;
+   /*
+* This holds the offset between kernel linked virtual address and
+* physical address.
+*/
+  

[PATCH v9 2/6] riscv: Move .rela.dyn outside of init to avoid empty relocations

2023-03-28 Thread Alexandre Ghiti
This is a preparatory patch for relocatable kernels: .rela.dyn should be
in .init but doing so actually produces empty relocations, so this should
be a temporary commit until we find a solution.

This issue was reported here [1].

[1] https://lore.kernel.org/all/4a6fc7a3-9697-a49b-0941-97f32194b...@ghiti.fr/.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/kernel/vmlinux.lds.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 1c38294580c0..e05e6df44225 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -96,10 +96,6 @@ SECTIONS
*(.rel.dyn*)
}
 
-   .rela.dyn : {
-   *(.rela*)
-   }
-
__init_data_end = .;
 
. = ALIGN(8);
@@ -126,6 +122,10 @@ SECTIONS
*(.sdata*)
}
 
+   .rela.dyn : {
+   *(.rela*)
+   }
+
 #ifdef CONFIG_EFI
.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
-- 
2.37.2



[PATCH v9 1/6] riscv: Prepare EFI header for relocatable kernels

2023-03-28 Thread Alexandre Ghiti
ld does not handle relocations correctly, as explained in [1]. A fix was
proposed there by Nelson, but we have to support older toolchains and must
therefore provide this workaround.

Note that llvm does not need this fix and is therefore excluded.

[1] https://sourceware.org/pipermail/binutils/2023-March/126690.html

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/set_memory.h |  3 +++
 arch/riscv/kernel/efi-header.S  | 19 ---
 arch/riscv/kernel/vmlinux.lds.S |  5 ++---
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index a2c14d4b3993..ec11001c3fe0 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -56,4 +56,7 @@ bool kernel_page_present(struct page *page);
 #define SECTION_ALIGN L1_CACHE_BYTES
 #endif /* CONFIG_STRICT_KERNEL_RWX */
 
+#define PECOFF_SECTION_ALIGNMENT0x1000
+#define PECOFF_FILE_ALIGNMENT   0x200
+
 #endif /* _ASM_RISCV_SET_MEMORY_H */
diff --git a/arch/riscv/kernel/efi-header.S b/arch/riscv/kernel/efi-header.S
index 8e733aa48ba6..515b2dfbca75 100644
--- a/arch/riscv/kernel/efi-header.S
+++ b/arch/riscv/kernel/efi-header.S
@@ -6,6 +6,7 @@
 
 #include 
 #include 
+#include 
 
.macro  __EFI_PE_HEADER
.long   PE_MAGIC
@@ -33,7 +34,11 @@ optional_header:
.byte   0x02// MajorLinkerVersion
.byte   0x14// MinorLinkerVersion
.long   __pecoff_text_end - efi_header_end  // SizeOfCode
-   .long   __pecoff_data_virt_size // SizeOfInitializedData
+#ifdef __clang__
+   .long   __pecoff_data_virt_size // SizeOfInitializedData
+#else
+   .long   __pecoff_data_virt_end - __pecoff_text_end  // SizeOfInitializedData
+#endif
    .long   0   // SizeOfUninitializedData
.long   __efistub_efi_pe_entry - _start // AddressOfEntryPoint
.long   efi_header_end - _start // BaseOfCode
@@ -91,9 +96,17 @@ section_table:
IMAGE_SCN_MEM_EXECUTE   // Characteristics
 
.ascii  ".data\0\0\0"
-   .long   __pecoff_data_virt_size // VirtualSize
+#ifdef __clang__
+   .long   __pecoff_data_virt_size // VirtualSize
+#else
+   .long   __pecoff_data_virt_end - __pecoff_text_end  // VirtualSize
+#endif
.long   __pecoff_text_end - _start  // VirtualAddress
-   .long   __pecoff_data_raw_size  // SizeOfRawData
+#ifdef __clang__
+   .long   __pecoff_data_raw_size  // SizeOfRawData
+#else
+   .long   __pecoff_data_raw_end - __pecoff_text_end   // SizeOfRawData
+#endif
.long   __pecoff_text_end - _start  // PointerToRawData
 
.long   0   // PointerToRelocations
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 53a8ad65b255..1c38294580c0 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -27,9 +27,6 @@ ENTRY(_start)
 
 jiffies = jiffies_64;
 
-PECOFF_SECTION_ALIGNMENT = 0x1000;
-PECOFF_FILE_ALIGNMENT = 0x200;
-
 SECTIONS
 {
/* Beginning of code and text segment */
@@ -132,6 +129,7 @@ SECTIONS
 #ifdef CONFIG_EFI
.pecoff_edata_padding : { BYTE(0); . = ALIGN(PECOFF_FILE_ALIGNMENT); }
__pecoff_data_raw_size = ABSOLUTE(. - __pecoff_text_end);
+   __pecoff_data_raw_end = ABSOLUTE(.);
 #endif
 
/* End of data section */
@@ -142,6 +140,7 @@ SECTIONS
 #ifdef CONFIG_EFI
. = ALIGN(PECOFF_SECTION_ALIGNMENT);
__pecoff_data_virt_size = ABSOLUTE(. - __pecoff_text_end);
+   __pecoff_data_virt_end = ABSOLUTE(.);
 #endif
_end = .;
 
-- 
2.37.2



[PATCH v9 0/6] Introduce 64b relocatable kernel

2023-03-28 Thread Alexandre Ghiti
After multiple attempts, this patchset is now based on the fact that the
64b kernel mapping was moved outside the linear mapping.

The first patch allows building relocatable kernels but is not selected
by default. That patch is a requirement for KASLR.
The second and third patches take advantage of an already existing powerpc
script that checks relocations at compile time, and use it for riscv.

This patchset is rebased on top of:

riscv: Use PUD/P4D/PGD pages for the linear mapping
(https://patchwork.kernel.org/project/linux-riscv/list/?series=733603)
base-commit-tag: v6.3-rc1

Changes in v9:
  * Fix gcc/llvm compilation errors by adding patch 1, thanks to Bjorn
  * Move a patch to move rela.dyn outside of init (patch 2): it is a
    separate patch to clearly explain why
  * To effectively move rela.dyn to init, we need to add patch 6: separate
    patch since we may be able at some point to revert (along with patch 2).
  * Add a lot of orphan sections to the linker script

Changes in v8:
  * Fix UEFI boot by moving rela.dyn section into the data so that PE/COFF
loader actually copies the relocations too
  * Fix check that used PGDIR instead of PUD which was not correct
for sv48 and sv57
  * Fix PE/COFF header data size definition as it led to size of 0

Changes in v7:
  * Rebase on top of v5.15
  * Fix LDFLAGS_vmlinux which was overridden when CONFIG_DYNAMIC_FTRACE was
    set
  * Make relocate_kernel static
  * Add Ack from Michael

Changes in v6:
  * Remove the kernel move to vmalloc zone
  * Rebased on top of for-next
  * Remove relocatable property from 32b kernel as the kernel is mapped in
the linear mapping and would then need to be copied physically too
  * CONFIG_RELOCATABLE depends on !XIP_KERNEL
  * Remove Reviewed-by from first patch as it changed a bit

Changes in v5:
  * Add "static __init" to create_kernel_page_table function as reported by
Kbuild test robot
  * Add reviewed-by from Zong
  * Rebase onto v5.7

Changes in v4:
  * Fix BPF region that overlapped with kernel's as suggested by Zong
  * Fix end of module region that could be larger than 2GB as suggested by Zong
  * Fix the size of the vm area reserved for the kernel as we could lose
PMD_SIZE if the size was already aligned on PMD_SIZE
  * Split compile time relocations check patch into 2 patches as suggested by 
Anup
  * Applied Reviewed-by from Zong and Anup

Changes in v3:
  * Move kernel mapping to vmalloc

Changes in v2:
  * Make RELOCATABLE depend on MMU as suggested by Anup
  * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup
  * Use __pa_symbol instead of __pa, as suggested by Zong
  * Rebased on top of v5.6-rc3
  * Tested with sv48 patchset
  * Add Reviewed/Tested-by from Zong and Anup

Alexandre Ghiti (6):
  riscv: Prepare EFI header for relocatable kernels
  riscv: Move .rela.dyn outside of init to avoid empty relocations
  riscv: Introduce CONFIG_RELOCATABLE
  powerpc: Move script to check relocations at compile time in scripts/
  riscv: Check relocations at compile time
  riscv: Use --emit-relocs in order to move .rela.dyn in init

 arch/powerpc/tools/relocs_check.sh  | 18 ++
 arch/riscv/Kconfig  | 14 
 arch/riscv/Makefile |  7 ++--
 arch/riscv/Makefile.postlink| 49 ++
 arch/riscv/boot/Makefile|  7 
 arch/riscv/include/asm/set_memory.h |  3 ++
 arch/riscv/kernel/efi-header.S  | 19 --
 arch/riscv/kernel/vmlinux.lds.S | 26 ++
 arch/riscv/mm/Makefile  |  4 +++
 arch/riscv/mm/init.c| 54 -
 arch/riscv/tools/relocs_check.sh| 26 ++
 scripts/relocs_check.sh | 20 +++
 12 files changed, 218 insertions(+), 29 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

-- 
2.37.2



Re: [powerpc:next-test] BUILD REGRESSION c827c932c00ccd231a73da9816a51ce2c2b557d6

2023-03-28 Thread Michael Ellerman
kernel test robot  writes:
> tree/branch: 
> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
> branch HEAD: c827c932c00ccd231a73da9816a51ce2c2b557d6  powerpc: Use 
> of_address_to_resource()
>
> Error/Warning reports:
>
> https://lore.kernel.org/oe-kbuild-all/202303240411.tq2tzkig-...@intel.com
>
> Error/Warning: (recently discovered and may have been fixed)
>
> arch/powerpc/platforms/embedded6xx/ls_uart.c:128:15: error: implicit 
> declaration of function 'of_address_to_resource' 
> [-Werror=implicit-function-declaration]
>
> Error/Warning ids grouped by kconfigs:
>
> gcc_recent_errors
> `-- powerpc-linkstation_defconfig
> `-- 
> arch-powerpc-platforms-embedded6xx-ls_uart.c:error:implicit-declaration-of-function-of_address_to_resource

I've dropped that patch for now.

cheers


[PATCH v8 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2023-03-28 Thread Yicong Yang
From: Anshuman Khandual 

The entire scheme of deferred TLB flush in the reclaim path rests on the
fact that the cost to refill TLB entries is less than that of flushing out
individual entries by sending IPIs to remote CPUs. But architectures can
have different ways to evaluate that. Hence, apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual 
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang 
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Barry Song 
Reviewed-by: Xin Hao 
Tested-by: Punit Agrawal 
---
 arch/x86/include/asm/tlbflush.h | 12 
 mm/rmap.c   |  9 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+   bool should_defer = false;
+
+   /* If remote CPUs need to be flushed then defer batch the flush */
+   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+   should_defer = true;
+   put_cpu();
+
+   return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..38ccb700748c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,17 +686,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-   bool should_defer = false;
-
if (!(flags & TTU_BATCH_FLUSH))
return false;
 
-   /* If remote CPUs need to be flushed then defer batch the flush */
-   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-   should_defer = true;
-   put_cpu();
-
-   return should_defer;
+   return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0



[PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-03-28 Thread Yicong Yang
From: Barry Song 

On x86, batched and deferred TLB shootdown has led to a 90%
performance increase on TLB shootdown. On arm64, the HW can do
TLB shootdown without a software IPI, but a synchronous tlbi is
still quite expensive.

Even running the simplest program which requires swapout can
prove this is true:
 #include 
 #include 
 #include 
 #include 

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
 volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

 memset(p, 0x88, SIZE);

 for (int k = 0; k < 1; k++) {
 /* swap in */
 for (int i = 0; i < SIZE; i += 4096) {
 (void)p[i];
 }

 /* swap out */
 madvise(p, SIZE, MADV_PAGEOUT);
 }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only 
options.
 # To display the perf.data header info, please use --header/--header-only 
options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object  Symbol
 #   ...  .  
.
 #
21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
 8.23%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
 6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
 5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
 3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
 3.49%  a.out[kernel.kallsyms]  [k] memset64
 1.63%  a.out[kernel.kallsyms]  [k] clear_page
 1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
 1.26%  a.out[kernel.kallsyms]  [k] 
mod_zone_state.llvm.8525150236079521930
 1.23%  a.out[kernel.kallsyms]  [k] xas_load
 1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically, like more
than 100 on a phone, the overhead would be much higher as
we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annotate shows 95% of the CPU time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of the TLB flush. This provides us a very good chance to leverage
the existing batched TLB machinery in the kernel. The minimum
modification is that we only send an async tlbi in the first stage
and we send a dsb when we have to sync in the second stage.
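
As a minimal sketch of the two-stage hook pair on arm64 (not the exact
hunks from this patch; the helper name __flush_tlb_page_nosync() and the
unused batch argument are assumptions here):

  /* Stage 1: issue a broadcast TLBI for the page, but do not wait for it. */
  static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                               struct mm_struct *mm,
                                               unsigned long uaddr)
  {
          __flush_tlb_page_nosync(mm, uaddr);
  }

  /* Stage 2: a single DSB waits for all previously issued TLBIs. */
  static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  {
          dsb(ish);
  }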

With the above simplest micro benchmark, collapsed time to
finish the program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
Tested with benchmark in the commit on Kunpeng920 arm64 server,
observed an improvement around 12.5% with command
`time ./swap_bench`.
w/o w/
real0m13.460s   0m11.771s
user0m0.248s0m0.279s
sys 0m12.039s   0m11.458s

Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It was tested on 4, 8 and 128 CPU platforms and shown to be beneficial on
large systems, but may not bring an improvement on small systems such as
a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
on CONFIG_EXPERT for this stage and keep it disabled on systems
with fewer than 8 CPUs. Users can modify this threshold according to
their own platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.

This patch also improves the performance of page migration. Using pmbench
and trying to migrate the pages of pmbench between node 0 and node 1
20 times, this patch decreases the time used by more than 50% and saves
the time used by ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take the address of the
target page to support the feature on arm64. It is also renamed to
arch_tlbbatch_add_pending() to better match its function, since we
don't need to handle the mm on arm64 and add_mm is not a proper name.
add_pending will 

[PATCH v8 0/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-03-28 Thread Yicong Yang
From: Yicong Yang 

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free.
A simplest micro benchmark shows even on snapdragon 888 with only
8 cores, the overhead for ptep_clear_flush is huge even for paging
out one page mapped by only one process:
5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush

While pages are mapped by multiple processes or HW has more CPUs,
the cost should become even higher due to the bad scalability of
tlb shootdown.

The same benchmark can result in 16.99% CPU consumption on an ARM64
server with around 100 cores, according to Yicong's test on patch
4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
sync in arch_tlbbatch_flush()
Testing on snapdragon shows the overhead of ptep_clear_flush
is removed by the patchset. The micro benchmark becomes 5% faster
even for one page mapped by single process on snapdragon 888.

This support also optimizes page migration by more than 50% with support
for batched TLB flushing [*].

[*] 
https://lore.kernel.org/linux-mm/20230213123444.155149-1-ying.hu...@intel.com/

-v8:
1. Rebase on 6.3-rc4
2. Tested the optimization on page migration and mentioned it in the commit
3. Thanks the review from Anshuman.
Link: 
https://lore.kernel.org/linux-mm/20221117082648.47526-1-yangyic...@huawei.com/

-v7:
1. rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as suggested,
   since it takes an extra address for arm64, per Nadav and Anshuman. Also
   mentioned in the commit.
2. add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyic...@huawei.com/

-v6:
1. comment we don't defer TLB flush on platforms affected by 
ARM64_WORKAROUND_REPEAT_TLBI
2. use cpus_have_const_cap() instead of this_cpu_has_cap()
3. add tags from Punit, Thanks.
4. default enable the feature when cpus >= 8 rather than > 8, since the original
   improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyic...@huawei.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends on EXPERT for this stage on 
arm64.
2. Make a threshold of CPU numbers for enabling batched TLP flush on arm64
Link: 
https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyic...@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, Thanks.
2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: 
https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
   of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: 
https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks, Peter, Dave and Nadav for your testing or reviewing
, and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt  |  2 +-
 arch/arm64/Kconfig|  6 +++
 arch/arm64/include/asm/tlbbatch.h | 12 +
 arch/arm64/include/asm/tlbflush.h | 52 ++-
 arch/x86/include/asm/tlbflush.h   | 17 +-
 include/linux/mm_types_task.h |  4 +-
 mm/rmap.c | 21 +++-
 7 files changed, 94 insertions(+), 20 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0



Re: [PATCH v3 4/4] of: address: Always use dma_default_coherent for default coherency

2023-03-28 Thread Michael Ellerman
Jiaxun Yang  writes:
> As for now all arches have dma_default_coherent reflecting default
> DMA coherency for of devices, so there is no need to have a standalone
> config option.
>
> Signed-off-by: Jiaxun Yang 
> ---
> v3: Squash setting ARCH_DMA_DEFAULT_COHERENT into this patch.
> ---
>  arch/powerpc/Kconfig |  2 +-
>  arch/riscv/Kconfig   |  2 +-
>  drivers/of/Kconfig   |  4 
>  drivers/of/address.c | 10 +-
>  4 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 57f5d2f53d06..824e00a1277b 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -113,6 +113,7 @@ config PPC
>   #
>   select ARCH_32BIT_OFF_T if PPC32
>   select ARCH_DISABLE_KASAN_INLINEif PPC_RADIX_MMU
> + select ARCH_DMA_DEFAULT_COHERENTif !NOT_COHERENT_CACHE
>   select ARCH_ENABLE_MEMORY_HOTPLUG
>   select ARCH_ENABLE_MEMORY_HOTREMOVE
>   select ARCH_HAS_COPY_MC if PPC64
> @@ -273,7 +274,6 @@ config PPC
>   select NEED_PER_CPU_PAGE_FIRST_CHUNKif PPC64
>   select NEED_SG_DMA_LENGTH
>   select OF
> - select OF_DMA_DEFAULT_COHERENT  if !NOT_COHERENT_CACHE
>   select OF_EARLY_FLATTREE
>   select OLD_SIGACTIONif PPC32
>   select OLD_SIGSUSPEND

Acked-by: Michael Ellerman  (powerpc)

cheers


[powerpc:next-test] BUILD REGRESSION c827c932c00ccd231a73da9816a51ce2c2b557d6

2023-03-28 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next-test
branch HEAD: c827c932c00ccd231a73da9816a51ce2c2b557d6  powerpc: Use 
of_address_to_resource()

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202303240411.tq2tzkig-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

arch/powerpc/platforms/embedded6xx/ls_uart.c:128:15: error: implicit 
declaration of function 'of_address_to_resource' 
[-Werror=implicit-function-declaration]

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
`-- powerpc-linkstation_defconfig
`-- 
arch-powerpc-platforms-embedded6xx-ls_uart.c:error:implicit-declaration-of-function-of_address_to_resource

elapsed time: 9412m

configs tested: 354
configs skipped: 35

tested configs:
alphaallyesconfig   gcc  
alphabuildonly-randconfig-r005-20230322   gcc  
alpha   defconfig   gcc  
alpharandconfig-r012-20230323   gcc  
alpharandconfig-r014-20230322   gcc  
alpharandconfig-r014-20230323   gcc  
alpharandconfig-r021-20230322   gcc  
alpharandconfig-r025-20230322   gcc  
arc  allyesconfig   gcc  
arc  buildonly-randconfig-r004-20230323   gcc  
arc defconfig   gcc  
arcnsim_700_defconfig   gcc  
arc  randconfig-r016-20230322   gcc  
arc  randconfig-r023-20230322   gcc  
arc  randconfig-r024-20230322   gcc  
arc  randconfig-r031-20230322   gcc  
arc  randconfig-r034-20230322   gcc  
arc  randconfig-r036-20230322   gcc  
arc  randconfig-r043-20230322   gcc  
arcvdk_hs38_defconfig   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm assabet_defconfig   gcc  
arm  buildonly-randconfig-r001-20230322   clang
arm defconfig   gcc  
armdove_defconfig   clang
arm  footbridge_defconfig   gcc  
arm  gemini_defconfig   gcc  
arm   imx_v4_v5_defconfig   clang
arm   imx_v6_v7_defconfig   gcc  
arm  integrator_defconfig   gcc  
arm  jornada720_defconfig   gcc  
arm lpc18xx_defconfig   gcc  
arm mv78xx0_defconfig   clang
arm   netwinder_defconfig   clang
arm pxa_defconfig   gcc  
arm  randconfig-c002-20230322   clang
arm  randconfig-c002-20230322   gcc  
arm  randconfig-c002-20230324   gcc  
arm  randconfig-r002-20230322   gcc  
arm  randconfig-r013-20230322   clang
arm  randconfig-r014-20230322   clang
arm  randconfig-r022-20230322   clang
arm  randconfig-r023-20230322   clang
arm  randconfig-r026-20230322   clang
arm  randconfig-r035-20230322   gcc  
arm   stm32_defconfig   gcc  
arm   sunxi_defconfig   gcc  
arm vf610m4_defconfig   gcc  
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r001-20230322   clang
arm64randconfig-r011-20230322   gcc  
arm64randconfig-r023-20230322   gcc  
arm64randconfig-r025-20230322   gcc  
arm64randconfig-r031-20230322   clang
arm64randconfig-r032-20230322   clang
arm64randconfig-r036-20230322   clang
csky buildonly-randconfig-r001-20230322   gcc  
cskydefconfig   gcc  
csky randconfig-r003-20230322   gcc  
csky randconfig-r012-20230322   gcc  
csky randconfig-r013-20230322   gcc  
csky randconfig-r023-20230322   gcc  
csky randconfig-r024-20230322   gcc  
csky randconfig-r026-20230322   gcc  
csky randconfig-r034-20230322   gcc  
hexagon  buildonly-randconfig-r003-20230322   clang
hexagon  buildonly-randconfig-r004-20230322   clang
hexagon  buildonly-randconfig-r006-20230323   clang
hexagon  randconfig-r003-20230322   clang
hexagon  randconfig-r004-20230322   clang
hexagon  randconfig-r015-20230322   clang
hexagon  randconfig-r034-20230322   clang
hexagon  randconfig-r035-20230322   clang
i386 alldefconfig   gcc  
i386 allyesconfig   gcc  
i386 

[powerpc:merge] BUILD SUCCESS 0f98241d5ef5c3bb4c5ca07ceee3a825d79999fd

2023-03-28 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: 0f98241d5ef5c3bb4c5ca07ceee3a825d79999fd  Automatic merge of 'fixes' into merge (2023-03-28 23:32)

elapsed time: 741m

configs tested: 43
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
arc  allyesconfig   gcc  
arc defconfig   gcc  
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm defconfig   gcc  
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
cskydefconfig   gcc  
i386 allyesconfig   gcc  
i386  debian-10.3   gcc  
i386defconfig   gcc  
ia64 allmodconfig   gcc  
ia64defconfig   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarch   defconfig   gcc  
m68k allmodconfig   gcc  
m68kdefconfig   gcc  
mips allmodconfig   gcc  
mips allyesconfig   gcc  
nios2   defconfig   gcc  
parisc  defconfig   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
riscvallmodconfig   gcc  
riscv allnoconfig   gcc  
riscv   defconfig   gcc  
riscv  rv32_defconfig   gcc  
s390 allmodconfig   gcc  
s390 allyesconfig   gcc  
s390defconfig   gcc  
sh   allmodconfig   gcc  
sparc   defconfig   gcc  
um i386_defconfig   gcc  
um   x86_64_defconfig   gcc  
x86_64allnoconfig   gcc  
x86_64   allyesconfig   gcc  
x86_64  defconfig   gcc  
x86_64  kexec   gcc  
x86_64   rhel-8.3   gcc  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


[powerpc:topic/ppc-kvm] BUILD SUCCESS 5f4f53d28cde2cc7be96f657229c8603da578500

2023-03-28 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
topic/ppc-kvm
branch HEAD: 5f4f53d28cde2cc7be96f657229c8603da578500  KVM: PPC: Book3S HV: 
kvmppc_hv_entry: remove .global scope

elapsed time: 741m

configs tested: 2
configs skipped: 163

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


[powerpc:fixes-test] BUILD SUCCESS fd7276189450110ed835eb0a334e62d2f1c4e3be

2023-03-28 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: fd7276189450110ed835eb0a334e62d2f1c4e3be  powerpc: Don't try to 
copy PPR for task with NULL pt_regs

elapsed time: 742m

configs tested: 7
configs skipped: 163

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
nios2   defconfig   gcc  
parisc  defconfig   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
sh   allmodconfig   gcc  
x86_64randconfig-k001   clang

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


[PATCH 5/5] of/address: Add of_property_read_reg() helper

2023-03-28 Thread Rob Herring
Add a helper, of_property_read_reg(), to read a "reg" entry's untranslated
address and size. This function is intended mainly for cases with an
untranslatable "reg" address (i.e. not MMIO). There are also a few
translatable cases such as address cells containing a bus chip-select
number.
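
As a usage illustration (hypothetical caller, not part of this series), a
driver sitting behind a chip-select style bus could fetch the raw first
"reg" entry like this:

  #include <linux/of_address.h>

  static int example_get_chip_select(struct device_node *child, u64 *cs)
  {
          u64 size;

          /* First "reg" entry, returned untranslated. */
          return of_property_read_reg(child, 0, cs, &size);
  }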

Signed-off-by: Rob Herring 
---
 drivers/of/address.c   | 23 +++
 drivers/of/unittest.c  | 22 ++
 include/linux/of_address.h |  7 +++
 3 files changed, 52 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 8cfae24148e0..fdb15c6fb3b1 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -760,6 +760,29 @@ const __be32 *__of_get_address(struct device_node *dev, 
int index, int bar_no,
 }
 EXPORT_SYMBOL(__of_get_address);
 
+/**
+ * of_property_read_reg - Retrieve the specified "reg" entry index without translating
+ * @np: device tree node for which to retrieve "reg" from
+ * @idx: "reg" entry index to read
+ * @addr: return value for the untranslated address
+ * @size: return value for the entry size
+ *
+ * Returns -EINVAL if "reg" is not found. Returns 0 on success with addr and
+ * size values filled in.
+ */
+int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size)
+{
+   const __be32 *prop = of_get_address(np, idx, size, NULL);
+
+   if (!prop)
+   return -EINVAL;
+
+   *addr = of_read_number(prop, of_n_addr_cells(np));
+
+   return 0;
+}
+EXPORT_SYMBOL(of_property_read_reg);
+
 static int parser_init(struct of_pci_range_parser *parser,
struct device_node *node, const char *name)
 {
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index eaeb58065acc..e73ecbef977b 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1134,6 +1134,27 @@ static void __init of_unittest_bus_3cell_ranges(void)
of_node_put(np);
 }
 
+static void __init of_unittest_reg(void)
+{
+   struct device_node *np;
+   int ret;
+   u64 addr, size;
+
+   np = of_find_node_by_path("/testcase-data/address-tests/bus@8000/device@1000");
+   if (!np) {
+   pr_err("missing testcase data\n");
+   return;
+   }
+
+   ret = of_property_read_reg(np, 0, &addr, &size);
+   unittest(!ret, "of_property_read_reg(%pOF) returned error %d\n",
+   np, ret);
+   unittest(addr == 0x1000, "of_property_read_reg(%pOF) untranslated address (%llx) incorrect\n",
+   np, addr);
+
+   of_node_put(np);
+}
+
 static void __init of_unittest_parse_interrupts(void)
 {
struct device_node *np;
@@ -3772,6 +3793,7 @@ static int __init of_unittest(void)
of_unittest_pci_dma_ranges();
of_unittest_bus_ranges();
of_unittest_bus_3cell_ranges();
+   of_unittest_reg();
of_unittest_match_node();
of_unittest_platform_populate();
of_unittest_overlay();
diff --git a/include/linux/of_address.h b/include/linux/of_address.h
index 5292f62c1baa..95cb6c5c2d67 100644
--- a/include/linux/of_address.h
+++ b/include/linux/of_address.h
@@ -72,6 +72,8 @@ void __iomem *of_io_request_and_map(struct device_node 
*device,
 extern const __be32 *__of_get_address(struct device_node *dev, int index, int 
bar_no,
  u64 *size, unsigned int *flags);
 
+int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size);
+
 extern int of_pci_range_parser_init(struct of_pci_range_parser *parser,
struct device_node *node);
 extern int of_pci_dma_range_parser_init(struct of_pci_range_parser *parser,
@@ -106,6 +108,11 @@ static inline const __be32 *__of_get_address(struct 
device_node *dev, int index,
return NULL;
 }
 
+static int of_property_read_reg(struct device_node *np, int idx, u64 *addr, u64 *size)
+{
+   return -ENOSYS;
+}
+
 static inline int of_pci_range_parser_init(struct of_pci_range_parser *parser,
struct device_node *node)
 {

-- 
2.39.2



[PATCH 1/5] of: unittest: Add bus address range parsing tests

2023-03-28 Thread Rob Herring
While there are tests for the "dma-ranges" helpers, "ranges" is missing any
tests. It's the same underlying code, but for completeness add a test
for the "ranges" parsing iterators. This is in preparation for adding some
additional "ranges" helpers.

Signed-off-by: Rob Herring 
---
 drivers/of/unittest.c | 53 +++
 1 file changed, 53 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index b5a7a31d8bd2..1a45df1f354a 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1008,6 +1008,58 @@ static void __init of_unittest_pci_dma_ranges(void)
of_node_put(np);
 }
 
+static void __init of_unittest_bus_ranges(void)
+{
+   struct device_node *np;
+   struct of_range range;
+   struct of_range_parser parser;
+   int i = 0;
+
+   np = of_find_node_by_path("/testcase-data/address-tests");
+   if (!np) {
+   pr_err("missing testcase data\n");
+   return;
+   }
+
+   if (of_range_parser_init(&parser, np)) {
+   pr_err("missing ranges property\n");
+   return;
+   }
+
+   /*
+* Get the "ranges" from the device tree
+*/
+   for_each_of_range(&parser, &range) {
+   unittest(range.flags == IORESOURCE_MEM,
+   "for_each_of_range wrong flags on node %pOF flags=%x 
(expected %x)\n",
+   np, range.flags, IORESOURCE_MEM);
+   if (!i) {
+   unittest(range.size == 0x4000,
+"for_each_of_range wrong size on node %pOF 
size=%llx\n",
+np, range.size);
+   unittest(range.cpu_addr == 0x7000,
+"for_each_of_range wrong CPU addr (%llx) on 
node %pOF",
+range.cpu_addr, np);
+   unittest(range.bus_addr == 0x7000,
+"for_each_of_range wrong bus addr (%llx) on 
node %pOF",
+range.pci_addr, np);
+   } else {
+   unittest(range.size == 0x2000,
+"for_each_of_range wrong size on node %pOF 
size=%llx\n",
+np, range.size);
+   unittest(range.cpu_addr == 0xd000,
+"for_each_of_range wrong CPU addr (%llx) on 
node %pOF",
+range.cpu_addr, np);
+   unittest(range.bus_addr == 0x,
+"for_each_of_range wrong bus addr (%llx) on 
node %pOF",
+range.pci_addr, np);
+   }
+   i++;
+   }
+
+   of_node_put(np);
+}
+
 static void __init of_unittest_parse_interrupts(void)
 {
struct device_node *np;
@@ -3644,6 +3696,7 @@ static int __init of_unittest(void)
of_unittest_dma_get_max_cpu_address();
of_unittest_parse_dma_ranges();
of_unittest_pci_dma_ranges();
+   of_unittest_bus_ranges();
of_unittest_match_node();
of_unittest_platform_populate();
of_unittest_overlay();

-- 
2.39.2



[PATCH 3/5] of/address: Add support for 3 address cell bus

2023-03-28 Thread Rob Herring
There are a few custom bus bindings (e.g. fsl,qoriq-mc) which use a
3 cell format with custom flags in the high cell. We can match these
buses as a fallback if we didn't match on the PCI bus, which is the only
standard bus binding with 3 address cells.
Signed-off-by: Rob Herring 
---
 drivers/of/address.c| 22 +++
 drivers/of/unittest-data/tests-address.dtsi |  9 -
 drivers/of/unittest.c   | 58 -
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index b79f005834fc..8cfae24148e0 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -95,11 +95,17 @@ static int of_bus_default_translate(__be32 *addr, u64 
offset, int na)
return 0;
 }
 
+static unsigned int of_bus_default_flags_get_flags(const __be32 *addr)
+{
+   return of_read_number(addr, 1);
+}
+
 static unsigned int of_bus_default_get_flags(const __be32 *addr)
 {
return IORESOURCE_MEM;
 }
 
+
 #ifdef CONFIG_PCI
 static unsigned int of_bus_pci_get_flags(const __be32 *addr)
 {
@@ -344,6 +350,11 @@ static unsigned int of_bus_isa_get_flags(const __be32 
*addr)
return flags;
 }
 
+static int of_bus_default_flags_match(struct device_node *np)
+{
+   return of_bus_n_addr_cells(np) == 3;
+}
+
 /*
  * Array of bus specific translators
  */
@@ -373,6 +384,17 @@ static struct of_bus of_busses[] = {
.has_flags = true,
.get_flags = of_bus_isa_get_flags,
},
+   /* Default with flags cell */
+   {
+   .name = "default-flags",
+   .addresses = "reg",
+   .match = of_bus_default_flags_match,
+   .count_cells = of_bus_default_count_cells,
+   .map = of_bus_default_map,
+   .translate = of_bus_default_translate,
+   .has_flags = true,
+   .get_flags = of_bus_default_flags_get_flags,
+   },
/* Default */
{
.name = "default",
diff --git a/drivers/of/unittest-data/tests-address.dtsi 
b/drivers/of/unittest-data/tests-address.dtsi
index 6604a52bf6cb..bc0029cbf8ea 100644
--- a/drivers/of/unittest-data/tests-address.dtsi
+++ b/drivers/of/unittest-data/tests-address.dtsi
@@ -14,7 +14,7 @@ address-tests {
#size-cells = <1>;
/* ranges here is to make sure we don't use it for
 * dma-ranges translation */
-   ranges = <0x7000 0x7000 0x4000>,
+   ranges = <0x7000 0x7000 0x5000>,
 <0x 0xd000 0x2000>;
dma-ranges = <0x0 0x2000 0x4000>;
 
@@ -43,6 +43,13 @@ pci@9000 {
 <0x4200 0x0 0xc000 
0x2000 0x0 0x1000>;
};
 
+   bus@a000 {
+   #address-cells = <3>;
+   #size-cells = <2>;
+   ranges = <0xf00baa 0x0 0x0 0xa000 0x0 
0x10>,
+<0xf00bee 0x1 0x0 0xb000 0x0 
0x20>;
+   };
+
};
};
 };
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 945288d5672f..29066ecbed47 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1048,7 +1048,7 @@ static void __init of_unittest_bus_ranges(void)
"for_each_of_range wrong flags on node %pOF flags=%x 
(expected %x)\n",
np, range.flags, IORESOURCE_MEM);
if (!i) {
-   unittest(range.size == 0x4000,
+   unittest(range.size == 0x5000,
 "for_each_of_range wrong size on node %pOF 
size=%llx\n",
 np, range.size);
unittest(range.cpu_addr == 0x7000,
@@ -1074,6 +1074,61 @@ static void __init of_unittest_bus_ranges(void)
of_node_put(np);
 }
 
+static void __init of_unittest_bus_3cell_ranges(void)
+{
+   struct device_node *np;
+   struct of_range range;
+   struct of_range_parser parser;
+   int i = 0;
+
+   np = of_find_node_by_path("/testcase-data/address-tests/bus@a000");
+   if (!np) {
+   pr_err("missing testcase data\n");
+   return;
+   }
+
+   if (of_range_parser_init(&parser, np)) {
+   pr_err("missing ranges property\n");
+   return;
+   }
+
+   /*
+* Get the "ranges" from the device tree
+*/
+   for_each_of_range(&parser, &range) {
+   if (!i) {
+   unittest(range.flags == 0xf00baa,
+"for_each_of_range wrong flags on node %pOF 
flags=%x\n",
+np, range.flags);
+   

[PATCH 2/5] of/address: Add of_range_to_resource() helper

2023-03-28 Thread Rob Herring
A few users need to convert a specific "ranges" entry into a struct
resource. Add a helper similar to of_address_to_resource(). The
existing of_pci_range_to_resource() helper isn't really PCI specific,
so it can be used with the CONFIG_PCI check dropped.
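
As a usage illustration (hypothetical caller and node names, not part of
this series), mapping the second "ranges" entry of a bridge node could
look like:

  #include <linux/err.h>
  #include <linux/io.h>
  #include <linux/of_address.h>

  static void __iomem *example_map_second_range(struct device *dev,
                                                struct device_node *np)
  {
          struct resource res;
          int ret;

          /* Convert the "ranges" entry at index 1 into a struct resource. */
          ret = of_range_to_resource(np, 1, &res);
          if (ret)
                  return ERR_PTR(ret);

          return devm_ioremap_resource(dev, &res);
  }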

Signed-off-by: Rob Herring 
---
 drivers/of/address.c   | 31 ---
 drivers/of/unittest.c  | 16 +++-
 include/linux/of_address.h |  8 
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 4c0b169ef9bf..b79f005834fc 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -229,9 +229,6 @@ int of_pci_range_to_resource(struct of_pci_range *range,
res->parent = res->child = res->sibling = NULL;
res->name = np->full_name;
 
-   if (!IS_ENABLED(CONFIG_PCI))
-   return -ENOSYS;
-
if (res->flags & IORESOURCE_IO) {
unsigned long port;
		err = pci_register_io_range(&np->fwnode, range->cpu_addr,
@@ -263,6 +260,34 @@ int of_pci_range_to_resource(struct of_pci_range *range,
 }
 EXPORT_SYMBOL(of_pci_range_to_resource);
 
+/*
+ * of_range_to_resource - Create a resource from a ranges entry
+ * @np:device node where the range belongs to
+ * @index: the 'ranges' index to convert to a resource
+ * @res:   pointer to a valid resource that will be updated to
+ *  reflect the values contained in the range.
+ *
+ * Returns ENOENT if the entry is not found or EINVAL if the range cannot be
+ * converted to resource.
+ */
+int of_range_to_resource(struct device_node *np, int index, struct resource *res)
+{
+   int ret, i = 0;
+   struct of_range_parser parser;
+   struct of_range range;
+
+   ret = of_range_parser_init(&parser, np);
+   if (ret)
+   return ret;
+
+   for_each_of_range(&parser, &range)
+   if (i++ == index)
+   return of_pci_range_to_resource(&range, np, res);
+
+   return -ENOENT;
+}
+EXPORT_SYMBOL(of_range_to_resource);
+
 /*
  * ISA bus specific translator
  */
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 1a45df1f354a..945288d5672f 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1013,7 +1013,8 @@ static void __init of_unittest_bus_ranges(void)
struct device_node *np;
struct of_range range;
struct of_range_parser parser;
-   int i = 0;
+   struct resource res;
+   int ret, i = 0;
 
np = of_find_node_by_path("/testcase-data/address-tests");
if (!np) {
@@ -1026,6 +1027,19 @@ static void __init of_unittest_bus_ranges(void)
return;
}
 
+   ret = of_range_to_resource(np, 1, &res);
+   unittest(!ret, "of_range_to_resource returned error (%d) node %pOF\n",
+   ret, np);
+   unittest(resource_type(&res) == IORESOURCE_MEM,
+   "of_range_to_resource wrong resource type on node %pOF res=%pR\n",
+   np, &res);
+   unittest(res.start == 0xd000,
+   "of_range_to_resource wrong resource start address on node %pOF res=%pR\n",
+   np, &res);
+   unittest(resource_size(&res) == 0x2000,
+   "of_range_to_resource wrong resource start address on node %pOF res=%pR\n",
+   np, &res);
+
/*
 * Get the "ranges" from the device tree
 */
diff --git a/include/linux/of_address.h b/include/linux/of_address.h
index 376671594746..1d005439f026 100644
--- a/include/linux/of_address.h
+++ b/include/linux/of_address.h
@@ -68,6 +68,8 @@ extern int of_pci_address_to_resource(struct device_node 
*dev, int bar,
 extern int of_pci_range_to_resource(struct of_pci_range *range,
struct device_node *np,
struct resource *res);
+extern int of_range_to_resource(struct device_node *np, int index,
+   struct resource *res);
 extern bool of_dma_is_coherent(struct device_node *np);
 #else /* CONFIG_OF_ADDRESS */
 static inline void __iomem *of_io_request_and_map(struct device_node *device,
@@ -120,6 +122,12 @@ static inline int of_pci_range_to_resource(struct 
of_pci_range *range,
return -ENOSYS;
 }
 
+static inline int of_range_to_resource(struct device_node *np, int index,
+  struct resource *res)
+{
+   return -ENOSYS;
+}
+
 static inline bool of_dma_is_coherent(struct device_node *np)
 {
return false;

-- 
2.39.2



[PATCH 0/5] of: More address parsing helpers

2023-03-28 Thread Rob Herring
This series is part of some clean-ups to reduce the open coded parsing 
of "reg" and "ranges" in the kernel. As those are standard properties, 
the common DT code should be able to handle parsing them. However, there 
are a few gaps in the API for what some drivers need which this series 
addresses (pun intended).

I intend to add these helpers for v6.4 and then convert the users in 
v6.5 to avoid any dependency issues. This series and the WIP conversions 
are on this branch[1].

[1] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git 
dt/address-helpers

Signed-off-by: Rob Herring 
---
Rob Herring (5):
  of: unittest: Add bus address range parsing tests
  of/address: Add of_range_to_resource() helper
  of/address: Add support for 3 address cell bus
  of/address: Add of_range_count() helper
  of/address: Add of_property_read_reg() helper

 drivers/of/address.c|  76 +-
 drivers/of/unittest-data/tests-address.dtsi |   9 +-
 drivers/of/unittest.c   | 150 
 include/linux/of_address.h  |  31 ++
 4 files changed, 262 insertions(+), 4 deletions(-)
---
base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
change-id: 20230328-dt-address-helpers-2f00c5c1eba4

Best regards,
-- 
Rob Herring 



[PATCH 4/5] of/address: Add of_range_count() helper

2023-03-28 Thread Rob Herring
Some users need a count of the number of "ranges" entries before
iterating over the entries. Typically this is for allocating some data
structure based on the size. Add a helper, of_range_count(), to get the
count. The helper must be called with a struct of_range_parser
initialized by of_range_parser_init().
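
As a usage illustration (hypothetical caller and struct, not part of this
series), the typical allocate-then-iterate pattern could look like:

  #include <linux/of_address.h>
  #include <linux/slab.h>

  struct example_window { u64 cpu_addr; u64 size; };   /* hypothetical */

  static int example_collect_ranges(struct device_node *np)
  {
          struct of_range_parser parser;
          struct of_range range;
          struct example_window *win;
          int num, i = 0;

          if (of_range_parser_init(&parser, np))
                  return -ENOENT;

          /* Count before iterating: the parser state is consumed by the loop. */
          num = of_range_count(&parser);
          win = kcalloc(num, sizeof(*win), GFP_KERNEL);
          if (!win)
                  return -ENOMEM;

          for_each_of_range(&parser, &range) {
                  win[i].cpu_addr = range.cpu_addr;
                  win[i].size = range.size;
                  i++;
          }

          /* Storage and cleanup of 'win' omitted for brevity. */
          return 0;
  }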

Signed-off-by: Rob Herring 
---
 drivers/of/unittest.c  |  7 ++-
 include/linux/of_address.h | 16 
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 29066ecbed47..eaeb58065acc 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1014,7 +1014,7 @@ static void __init of_unittest_bus_ranges(void)
struct of_range range;
struct of_range_parser parser;
struct resource res;
-   int ret, i = 0;
+   int ret, count, i = 0;
 
np = of_find_node_by_path("/testcase-data/address-tests");
if (!np) {
@@ -1040,6 +1040,11 @@ static void __init of_unittest_bus_ranges(void)
"of_range_to_resource wrong resource start address on node %pOF 
res=%pR\n",
np, );
 
+   count = of_range_count(&parser);
+   unittest(count == 2,
+   "of_range_count wrong size on node %pOF count=%d\n",
+   np, count);
+
/*
 * Get the "ranges" from the device tree
 */
diff --git a/include/linux/of_address.h b/include/linux/of_address.h
index 1d005439f026..5292f62c1baa 100644
--- a/include/linux/of_address.h
+++ b/include/linux/of_address.h
@@ -35,6 +35,22 @@ struct of_pci_range {
for (; of_pci_range_parser_one(parser, range);)
 #define for_each_of_range for_each_of_pci_range
 
+/*
+ * of_range_count - Get the number of "ranges" or "dma-ranges" entries
+ * @parser:Parser state initialized by of_range_parser_init()
+ *
+ * Returns the number of entries or 0 if none.
+ *
+ * Note that calling this within or after the for_each_of_range() iterator will
+ * be inaccurate giving the number of entries remaining.
+ */
+static inline int of_range_count(const struct of_range_parser *parser)
+{
+   if (!parser || !parser->node || !parser->range || parser->range == parser->end)
+   return 0;
+   return (parser->end - parser->range) / (parser->na + parser->pna + parser->ns);
+}
+
 /* Translate a DMA address from device space to CPU space */
 extern u64 of_translate_dma_address(struct device_node *dev,
const __be32 *in_addr);

-- 
2.39.2



Re: [PATCH v2 4/5] cxl/pci: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-03-28 Thread Bjorn Helgaas
[+cc linux-pci, more error handling folks; beginning of thread at
https://lore.kernel.org/all/20230323213808.398039-1-terry.bow...@amd.com/]

On Mon, Mar 27, 2023 at 11:51:39PM +0200, Robert Richter wrote:
> On 24.03.23 17:36:56, Bjorn Helgaas wrote:

> > > The CXL device driver is then responsible to
> > > enable error reporting in the RCEC's AER cap
> > 
> > I don't know exactly what you mean by "error reporting in the RCEC's
> > AER cap", but IIUC, for non-Root Port devices, generation of ERR_COR/
> > ERR_NONFATAL/ERR_FATAL messages is controlled by the Device Control
> > register and should already be enabled by pci_aer_init().
> > 
> > Maybe you mean setting AER mask/severity specifically for Internal
> > Errors?  I'm hoping to get as much of AER management as we can in the
> 
> Richt, this is implemented in patch #5 in function
> rcec_enable_aer_ints().

I think we should add a PCI core interface for this so we can enforce
the AER ownership question (all the crud like pcie_aer_is_native()) in
one place.

> > PCI core and out of drivers, so maybe we need a new PCI interface to
> > do that.
> > 
> > In any event, I assume this sort of configuration would be an
> > enumeration-time thing, while *this* patch is a run-time thing, so
> > maybe this information belongs with a different patch?
> 
> Do you mean once a Restricted CXL host (RCH) is detected, the internal
> errors should be enabled in the device mask, all this done during
> device enumeration? But wouldn't interrupts being enabled then before
> the CXL device is ready?

I'm not sure what you mean by "before the CXL device is ready."  What
makes a CXL device ready, and how do we know when it is ready?

pci_aer_init() turns on PCI_EXP_DEVCTL_CERE, PCI_EXP_DEVCTL_FERE, etc
as soon as we enumerate the device, before any driver claims the
device.  I'm wondering whether we can do this PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTN fiddling around the same time?

> > I haven't worked all the way through this, but I thought Sean Kelley's
> > and Qiuxu Zhuo's work was along the same line and might cover this,
> > e.g.,
> > 
> >   a175102b0a82 ("PCI/ERR: Recover from RCEC AER errors")
> >   579086225502 ("PCI/ERR: Recover from RCiEP AER errors")
> >   af113553d961 ("PCI/AER: Add pcie_walk_rcec() to RCEC AER handling")
> > 
> > But I guess maybe it's not quite the same case?
> 
> Actually, we use this code to handle errors that are reported to the
> RCEC and only implement here the CXL specifics. That is, checking if
> the RCEC receives something from a CXL downstream port and forwarding
> that to a CXL handler (this patch). The handler then checks the AER
> err cap in the RCRB of all CXL downstream ports associated to the RCEC
> (not visible in the PCI hierarchy), but discovered through the :00.0
> RCiEP (patch #5).

There are two calls to pcie_walk_rcec():

  1) The existing one in find_source_device()
  2) The one you add in handle_cxl_error()

Does the call in handle_cxl_error() look at devices that the existing
call in find_source_device() does not?  I'm trying to understand why
we need both calls.

> > > +static bool is_internal_error(struct aer_err_info *info)
> > > +{
> > > + if (info->severity == AER_CORRECTABLE)
> > > + return info->status & PCI_ERR_COR_INTERNAL;
> > > +
> > > + return info->status & PCI_ERR_UNC_INTN;
> > > +}
> > > +
> > > +static void handle_cxl_error(struct pci_dev *dev, struct aer_err_info 
> > > *info)
> > > +{
> > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > > + is_internal_error(info))
> > 
> > What's unique about Internal Errors?  I'm trying to figure out why you
> > wouldn't do this for *all* CXL errors.
> 
> Per CXL specification downstream port errors are signaled using
> internal errors. 

Maybe a spec reference here to explain is_internal_error()?  Is the
point of the check to *exclude* non-internal errors?  Or is basically
documentation that there shouldn't ever *be* any non-internal errors?
I guess the latter wouldn't make sense because at this point we don't
know whether this is a CXL hierarchy.

> All other errors would be device specific, we cannot
> handle that in a generic CXL driver.

I'm missing the point here.  We don't have any device-specific error
handling in aer.c; it only connects the generic *reporting* mechanism
(AER log registers and Root Port interrupts) to the drivers that do
the device-specific things via err_handler hooks.  I assume we want a
similar model for CXL.
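(For readers following along: the err_handler hooks mentioned here are the
struct pci_error_handlers callbacks a driver registers with the PCI core. A
minimal sketch with hypothetical driver names, not the code proposed in this
series:)

#include <linux/pci.h>

static pci_ers_result_t my_cxl_error_detected(struct pci_dev *pdev,
					      pci_channel_state_t state)
{
	/* Device-specific recovery (e.g. inspecting the RCRB AER registers)
	 * would live here rather than in drivers/pci/pcie/aer.c. */
	return PCI_ERS_RESULT_RECOVERED;
}

static void my_cxl_cor_error_detected(struct pci_dev *pdev)
{
	/* Correctable errors: log only, no recovery action needed. */
}

static const struct pci_error_handlers my_cxl_err_handler = {
	.error_detected		= my_cxl_error_detected,
	.cor_error_detected	= my_cxl_cor_error_detected,
};

static struct pci_driver my_cxl_driver = {
	.name		= "my_cxl",
	.err_handler	= &my_cxl_err_handler,
	/* .id_table, .probe, .remove omitted from this sketch */
};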

Bjorn


Re: Memory coherency issue with IO thread offloading?

2023-03-28 Thread Jens Axboe
On 3/28/23 6:51 AM, Michael Ellerman wrote:
> Jens Axboe  writes:
 Can the  queueing cause the creation of an IO thread (if one does not
 exist, or all blocked?)
>>>
>>> Yep
>>>
>>> Since writing this email, I've gone through a lot of different tests.
>>> Here's a rough listing of what I found:
>>>
>>> - Like using the hack patch, if I just limit the number of IO thread
>>>   workers to 1, it seems to pass. At least longer than before, does 1000
>>>   iterations.
>>>
>>> - If I pin each IO worker to a single CPU, it also passes.
>>>
>>> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails.
>>>   I've added one before queueing the work item, and after. One before
>>>   the io-wq worker grabs a work item and one after. Eg full hammer
>>>   approach. This still fails.
>>>
>>> Puzzling... For the "pin each IO worker to a single CPU" I added some
>>> basic code around trying to ensure that a work item queued on CPU X
>>> would be processed by a worker on CPU X, and to a large degree, this
>>> does happen. But since the work list is a normal list, it's quite
>>> possible that some other worker finishes its work on CPU Y just in time
>>> to grab the one from cpu X. I checked and this does happen in the test
>>> case, yet it still passes. This may be because I got a bit lucky, but
>>> seems suspect with thousands of passes of the test case.
>>>
>>> Another theory there is that it's perhaps related to an io-wq worker
>>> being rescheduled on a different CPU. Though again puzzled as to why the
>>> smp_mb sprinkling didn't fix that then. I'm going to try and run the
>>> test case with JUST the io-wq worker pinning and not caring about where
>>> the work is processed to see if that does anything.
>>
>> Just pinning each worker to whatever CPU they got created on seemingly
>> fixes the issue too. This does not mean that each worker will process
>> work on the CPU on which it was queued, just that each worker will
>> remain on whatever CPU it originally got created on.
>>
>> Puzzling...
>>
>> Note that it is indeed quite possible that this isn't a ppc issue at
>> all, just shows on ppc. It could be page cache related, or it could even
>> be a bug in mariadb itself.
> 
> I tried binary patching every lwsync to hwsync (read/write to full
> barrier) in mariadbd and all the libraries it links. It didn't fix the
> problem.
> 
> I also tried switching all the kernel barriers/spin locks to using a
> hwsync, but that also didn't fix it.
> 
> It's still possible there's somewhere that currently has no barrier at
> all that needs one, the above would only fix the problem if we have a
> read/write barrier that actually needs to be a full barrier.
> 
> 
> I also looked at making all TLB invalidates broadcast, regardless of
> whether we think the thread has only been on a single CPU. That didn't
> help, but I'm not sure I got all places where we do TLB invalidates, so
> I'll look at that some more tomorrow.

Thanks, appreciate your testing! I have no new data points since
yesterday, but the key point from then still seems to be that if an io
worker never reschedules onto a different CPU, then the problem doesn't
occur. This could very well be a page cache issue, if it isn't an issue
on the powerpc side...

-- 
Jens Axboe



Re: [PATCH] perf vendor events power9: Remove UTF-8 characters from json files

2023-03-28 Thread Ian Rogers
On Tue, Mar 28, 2023 at 4:30 AM Kajol Jain  wrote:
>
> Commit 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events")
> added and updated power9 pmu json events. However some of the json
> events which are part of other.json and pipeline.json files,
> contains UTF-8 characters in their brief description.
> Having UTF-8 character could brakes the perf build on some distros.

nit: s/brakes/break/

> Fix this issue by removing the UTF-8 characters from other.json and
> pipeline.json files.
>
> Result without the fix patch:
> [command]# file -i pmu-events/arch/powerpc/power9/*
> pmu-events/arch/powerpc/power9/cache.json:  application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/floating-point.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/frontend.json:   application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/marked.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/memory.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/metrics.json:application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/nest_metrics.json:   application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/other.json:  application/json; 
> charset=utf-8
> pmu-events/arch/powerpc/power9/pipeline.json:   application/json; 
> charset=utf-8
> pmu-events/arch/powerpc/power9/pmc.json:application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/translation.json:application/json; 
> charset=us-ascii
>
> Result with the fix patch:
>
> [command]# file -i pmu-events/arch/powerpc/power9/*
> pmu-events/arch/powerpc/power9/cache.json:  application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/floating-point.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/frontend.json:   application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/marked.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/memory.json: application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/metrics.json:application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/nest_metrics.json:   application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/other.json:  application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/pipeline.json:   application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/pmc.json:application/json; 
> charset=us-ascii
> pmu-events/arch/powerpc/power9/translation.json:application/json; 
> charset=us-ascii
>
> Fixes: 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events")
> Reported-by: Arnaldo Carvalho de Melo 
> Link: https://lore.kernel.org/lkml/zbxp77deq7ikt...@kernel.org/
> Signed-off-by: Kajol Jain 

Acked-by: Ian Rogers 

Thanks,
Ian

> ---
>  tools/perf/pmu-events/arch/powerpc/power9/other.json| 4 ++--
>  tools/perf/pmu-events/arch/powerpc/power9/pipeline.json | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/powerpc/power9/other.json 
> b/tools/perf/pmu-events/arch/powerpc/power9/other.json
> index 3f69422c21f9..f10bd554521a 100644
> --- a/tools/perf/pmu-events/arch/powerpc/power9/other.json
> +++ b/tools/perf/pmu-events/arch/powerpc/power9/other.json
> @@ -1417,7 +1417,7 @@
>{
>  "EventCode": "0x45054",
>  "EventName": "PM_FMA_CMPL",
> -"BriefDescription": "two flops operation completed (fmadd, fnmadd, 
> fmsub, fnmsub) Scalar instructions only. "
> +"BriefDescription": "two flops operation completed (fmadd, fnmadd, 
> fmsub, fnmsub) Scalar instructions only."
>},
>{
>  "EventCode": "0x201E8",
> @@ -2017,7 +2017,7 @@
>{
>  "EventCode": "0xC0BC",
>  "EventName": "PM_LSU_FLUSH_OTHER",
> -"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 
> caused search of LRQ for oldest snooped load, This will either signal a 
> Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid 
> Flush Next (several cases of this, one example is store and reload are lined 
> up such that a store-hit-reload scenario exists and the CDF has already 
> launched and has gotten bad/stale data); Bad Data Valid Flush Next (might be 
> a few cases of this, one example is a larxa (D$ hit) return data and dval but 
> can't allocate to LMQ (LMQ full or other reason). Already gave dval but can't 
> watch it for snoop_hit_larx. Need to take the “bad dval” back and flush all 
> younger ops)"
> +"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 
> caused search of LRQ for oldest snooped load, This will either signal a 
> Precise Flush of the oldest snooped loa or a Flush Next PPC); Data Valid 
> Flush Next (several cases of this, one example is store and reload are lined 
> up such that a store-hit-reload scenario exists 

Re: [PATCH v12 13/13] arm64: dts: ls1088ardb: Add serdes descriptions

2023-03-28 Thread Sean Anderson
On 3/28/23 05:25, Ioana Ciornei wrote:
> On Mon, Mar 27, 2023 at 02:15:47PM -0400, Sean Anderson wrote:
>> On 3/24/23 09:17, Ioana Ciornei wrote:
>> > On Tue, Mar 21, 2023 at 04:13:12PM -0400, Sean Anderson wrote:
>> >> This adds serdes support to the LS1088ARDB. I have tested the QSGMII
>> >> ports as well as the two 10G ports. The SFP slot is now fully supported,
>> >> instead of being modeled as a fixed-link.
>> >> 
>> >> Linux hangs around when the serdes is initialized if the si5341 is
>> >> enabled with the in-tree driver, so I have modeled it as a two fixed
>> >> clocks instead. There are a few registers in the QIXIS FPGA which
>> >> control the SFP GPIOs; I have modeled them as discrete GPIO controllers
>> >> for now. I never saw the AQR105 interrupt fire; not sure what was going
>> >> on, but I have removed it to force polling.
>> > 
>> > So you didn't see the interrupt fire even without these patches?
>> 
>> Not sure. I went to check this, and discovered I could no longer get the
>> link to come up in Linux, even on v6.0 (before the rate adaptation
>> tuff). I see the LEDs blinking in U-Boot, so presumably it's some
>> configuration problem. I'm going to look into this further when I have
>> more time.
>> 
>> > I just tested this on a LS1088ARDB and it works.
>> > 
>> >root@localhost:~# cat /proc/interrupts | grep extirq
>> > 99:  5  ls-extirq   2 Level 0x08b97000:00
>> >root@localhost:~# ip link set dev endpmac2 up
>> >root@localhost:~# cat /proc/interrupts | grep extirq
>> > 99:  6  ls-extirq   2 Level 0x08b97000:00
>> >root@localhost:~# ip link set dev endpmac2 down
>> >root@localhost:~# cat /proc/interrupts | grep extirq
>> > 99:  7  ls-extirq   2 Level 0x08b97000:00
>> > 
>> > Please don't just remove things.
>> 
>> Well, polling isn't the worst thing for a single interface... I do
>> remember having a problem with the interrupt. If this series works
>> with interrupts enabled, I can leave it in.
>> 
>> Did you have a chance to look at the core (patches 7 and 8) of this
>> series? Does it make sense to you? Am I missing something which would
>> allow switching from 1G->10G?
>> 
> 
> For a bit of context, I also attempted dynamic switching from 1G to 10G
> on my own even before this patch set but I did not get a link up on the
> PCS (CDR lock was there though). Pretty much the same state as you.
> 
> What I propose is to take this whole endeavor step by step.
> I am also interested in getting this feature to actually work but I
> just didn't have the time to investigate in depth what is missing.
> And without the dynamic switching I cannot say that I find the addition
> of the SerDes PHY driver useful.

Well, it's still useful for supporting 1G and 10G. I touched on this in
the cover letter, but there are conflicting PLL defaults on the LS1046A.
If you use SRDS_PRCTL 1133, the PLL mapping is 2211. But for  it's
. This means PLL2 is used for both 1G and 10G, but no reference
frequency can work for both. This will cause the PBL to enter a reset
loop, since it wants the PLLs to lock before booting, and this happens
before any user configuration. To get around this, we disconnected
RESET_REQ_B on our board, and we use this driver to configure the PLLs
correctly for whatever SRDS_PRCTL we boot up with.  This way we can have
two RCWs for 1G and 10G configuration.

> I have the Lynx 10G on my TODO list but I still have some other tasks
> on the Lynx 28G for the next 2-3 weeks. Once I get those done, I will
> look closer at the patches.

OK, thanks.

> In the meantime, some small things from this patch set can be submitted
> separately. For example, describing the SFP cage on the LS1088ARDB.

I'll have a look at this. I suppose I will also split off the IRQ thing.

> I still have some small questions on the DTS implementation for the gpio
> controllers but I would be able to submit this myself if you do not find
> the time (with your authorship of course).

I am going to do another revision to address the GPIO binding problem,
so please ask away.

--Sean


Re: [PATCH v3 4/4] of: address: Always use dma_default_coherent for default coherency

2023-03-28 Thread Rob Herring
On Tue, Mar 21, 2023 at 6:08 AM Jiaxun Yang  wrote:
>
> As for now all arches have dma_default_coherent reflecting default
> DMA coherency for of devices, so there is no need to have a standalone
> config option.
>
> Signed-off-by: Jiaxun Yang 
> ---
> v3: Squash setting ARCH_DMA_DEFAULT_COHERENT into this patch.
> ---
>  arch/powerpc/Kconfig |  2 +-
>  arch/riscv/Kconfig   |  2 +-
>  drivers/of/Kconfig   |  4 
>  drivers/of/address.c | 10 +-
>  4 files changed, 3 insertions(+), 15 deletions(-)

Reviewed-by: Rob Herring 


Re: [PATCH v3 1/4] of: address: Fix default coherency for MIPS

2023-03-28 Thread Rob Herring
On Tue, Mar 21, 2023 at 6:08 AM Jiaxun Yang  wrote:
>
> DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but
> might override the system-wide default at runtime.
>
> Use dma_default_coherent to override default coherence for
> MIPS.
>

I assume you want this tagged for stable? Otherwise, I don't
understand why you add this here and then remove in patch 4.

> Signed-off-by: Jiaxun Yang 
> ---
>  drivers/of/address.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/of/address.c b/drivers/of/address.c
> index 4c0b169ef9bf..c105d66a1fa4 100644
> --- a/drivers/of/address.c
> +++ b/drivers/of/address.c
> @@ -1105,6 +1105,14 @@ bool of_dma_is_coherent(struct device_node *np)
> struct device_node *node;
> bool is_coherent = IS_ENABLED(CONFIG_OF_DMA_DEFAULT_COHERENT);
>
> +   /*
> +* DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but
> +* might override the system-wide default at runtime.
> +*/
> +#if defined(CONFIG_MIPS) && defined(CONFIG_DMA_NONCOHERENT)
> +   is_coherent = dma_default_coherent;
> +#endif
> +
> node = of_node_get(np);
>
> while (node) {
> --
> 2.37.1 (Apple Git-137.1)
>


Re: [PATCH v3 0/4] Use dma_default_coherent for devicetree default coherency

2023-03-28 Thread Jiaxun Yang



> 2023年3月28日 08:45,Thomas Bogendoerfer  写道:
> 
> On Tue, Mar 28, 2023 at 03:18:12AM +0200, Christoph Hellwig wrote:
>> On Fri, Mar 24, 2023 at 09:17:38AM +, Jiaxun Yang wrote:
 
 Is patch a 6.3 candidate or should all of it go into 6.4?
>>> 
>>> Please leave it for 6.4, as corresponding MIPS arch part will be a part of 
>>> 6.4.
>> 
>> Ok.  I'll really need review from the MIPS and drivers/of/ maintainers,
>> through.

+cc devicetree foks.

> 
> I don't see any MIPS changes in the series besides the ifdef CONFIG_MIPS
> part in patch 1, which gets removed again in patch 4 (chance to drop
> that completely ?).

It was suggested by DMA folks to have that patch.

> I've merged the corresponding MIPS patches into mips-next last week.

Thanks
- Jiaxun

> 
> Thomas.
> 
> -- 
> Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
> good idea.[ RFC1925, 2.3 ]



Re: Memory coherency issue with IO thread offloading?

2023-03-28 Thread Michael Ellerman
Jens Axboe  writes:
>>> Can the  queueing cause the creation of an IO thread (if one does not
>>> exist, or all blocked?)
>> 
>> Yep
>> 
>> Since writing this email, I've gone through a lot of different tests.
>> Here's a rough listing of what I found:
>> 
>> - Like using the hack patch, if I just limit the number of IO thread
>>   workers to 1, it seems to pass. At least longer than before, does 1000
>>   iterations.
>> 
>> - If I pin each IO worker to a single CPU, it also passes.
>> 
>> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails.
>>   I've added one before queueing the work item, and after. One before
>>   the io-wq worker grabs a work item and one after. Eg full hammer
>>   approach. This still fails.
>> 
>> Puzzling... For the "pin each IO worker to a single CPU" I added some
>> basic code around trying to ensure that a work item queued on CPU X
>> would be processed by a worker on CPU X, and to a large degree, this
>> does happen. But since the work list is a normal list, it's quite
>> possible that some other worker finishes its work on CPU Y just in time
>> to grab the one from cpu X. I checked and this does happen in the test
>> case, yet it still passes. This may be because I got a bit lucky, but
>> seems suspect with thousands of passes of the test case.
>> 
>> Another theory there is that it's perhaps related to an io-wq worker
>> being rescheduled on a different CPU. Though again puzzled as to why the
>> smp_mb sprinkling didn't fix that then. I'm going to try and run the
>> test case with JUST the io-wq worker pinning and not caring about where
>> the work is processed to see if that does anything.
>
> Just pinning each worker to whatever CPU they got created on seemingly
> fixes the issue too. This does not mean that each worker will process
> work on the CPU on which it was queued, just that each worker will
> remain on whatever CPU it originally got created on.
>
> Puzzling...
>
> Note that it is indeed quite possible that this isn't a ppc issue at
> all, just shows on ppc. It could be page cache related, or it could even
> be a bug in mariadb itself.

I tried binary patching every lwsync to hwsync (read/write to full
barrier) in mariadbd and all the libraries it links. It didn't fix the
problem.

I also tried switching all the kernel barriers/spin locks to using a
hwsync, but that also didn't fix it.

It's still possible there's somewhere that currently has no barrier at
all that needs one, the above would only fix the problem if we have a
read/write barrier that actually needs to be a full barrier.


I also looked at making all TLB invalidates broadcast, regardless of
whether we think the thread has only been on a single CPU. That didn't
help, but I'm not sure I got all places where we do TLB invalidates, so
I'll look at that some more tomorrow.

cheers


Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-28 Thread Jens Axboe
On 3/28/23 5:32 AM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
>> from my (arguably very short) checking is not commonly done for other
>> archs. This is fine, except when PF_IO_WORKER's have been created and
>> the task does something that causes a coredump to be generated.
> 
> Do kthread's ever core dump? I didn't think they did, but I can't find
> any logic to prevent it.

kthreads aren't associated with the original task, they just exist by
themselves. They also can't take signals. Eg they cannot core dump, just
oops :-)

This is different than io workers that do show up as threads, but they
still don't exit to userspace. That is why it ended up being a problem.

> As Nick said we should probably have a non-NULL regs for PF_IO_WORKERS,
> but I'll still take this as a nice backportable fix for the immediate
> crash.
> 
> I tagged it as Fixes: pointing back at the commit that added ppr_get(),
> even though I don't know for sure the bug was triggerable back then
> (v4.8).

Thanks!

-- 
Jens Axboe



Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure

2023-03-28 Thread Kautuk Consul
On 2023-03-28 23:02:09, Michael Ellerman wrote:
> Kautuk Consul  writes:
> > On 2023-03-28 15:44:02, Kautuk Consul wrote:
> >> On 2023-03-28 20:44:48, Michael Ellerman wrote:
> >> > Kautuk Consul  writes:
> >> > > kvmppc_vcore_create() might not be able to allocate memory through
> >> > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be
> >> > > incremented.
> >> > 
> >> > I agree that looks wrong.
> >> > 
> >> > Have you tried to test what goes wrong if it fails? It looks like it
> >> > will break the LPCR update, which likely will cause the guest to crash
> >> > horribly.
> > Also, are you referring to the code in kvmppc_update_lpcr()?
> > That code will not crash as it checks for the vc before trying to
> > dereference it.
> 
> Yeah that's what I was looking at. I didn't mean it would crash, but
> that it would bail out early when it sees a NULL vcore, leaving other
> vcores with the wrong LPCR value.
> 
> But as you say it doesn't happen because qemu quits on the first ENOMEM.
> 
> And regardless if qemu does something that means the guest is broken
> that's just a qemu bug, no big deal as far as the kernel is concerned.
But there could be another user-mode application other than qemu that
actually tries to create a vcpu after it gets a -ENOMEM for another
vcpu. Shouldn't the kernel be independent of qemu?
> 
> > But the following 2 places that utilize the arch.online_vcores will have
> > problems in logic if the usermode test-case doesn't pull down the
> > kvm context after the -ENOMEM vcpu allocation failure:
> > book3s_hv.c:3030:   if (!kvm->arch.online_vcores) {
> > book3s_hv_rm_mmu.c:44:  if (kvm->arch.online_vcores == 1 && 
> > local_paca->kvm_hstate.kvm_vcpu)
> 
> OK. Both of those look harmless to the host.
Harmless to the host in terms of a crash, not in terms of behavior.
For example in the case of kvmhv_set_smt_mode:
If we got a kzalloc failure once (and online_vcores was wrongly incremented), 
then if kvmhv_set_smt_mode() is called after that, it would not be
setting arch.smt_mode and arch.emul_smt_mode correctly and it
would wrongly return -EBUSY instead of 0.
Isn't that incorrect with respect to the intent of the code?
I agree that applications like qemu might not do that, but don't we need
to have some integrity with respect to the intent and value of variable
use? What about good code and logic quality?
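(For reference, a simplified sketch of the check being discussed -- not the
exact book3s_hv.c code, locking simplified -- showing why a stale non-zero
online_vcores makes the call fail with -EBUSY:)

/* SMT mode may only be changed while no vcore is online. */
static int smt_mode_sketch(struct kvm *kvm, unsigned long smt_mode,
			   unsigned long esmt)
{
	int err = -EBUSY;

	mutex_lock(&kvm->lock);
	if (!kvm->arch.online_vcores) {	/* a leaked increment keeps this false */
		kvm->arch.smt_mode = smt_mode;
		kvm->arch.emul_smt_mode = esmt;
		err = 0;
	}
	mutex_unlock(&kvm->lock);

	return err;
}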
> 
> If we find a case where a misbehaving qemu can crash the host then we
> need to be a bit more careful and treat it at least as a
> denial-of-service bug. But looks like this is not one of those.
> 
> cheers
beers


Re: Memory coherency issue with IO thread offloading?

2023-03-28 Thread Michael Ellerman
Daniel Black  writes:
> Thanks Jens, Nick, Christophe and Michael for your work so far.
>
> Apologies for the out of thread email.
>
> Confirming MariaDB 10.6+ is required (when we added liburing), and
> previous versions used libaio (which tested without incident as mpe
> retested).

Thanks! I was trying to follow the threads from the mariadb/server
github to find the CI but didn't have much luck. No .travis.yml these
days :)

> We were (we're now back on the old good kernel Jens indicated) getting
> failures like https://buildbot.mariadb.org/#/builders/231/builds/16857
> in a container (of various distro userspaces) on bare metal.
>
> bare metal end of /proc/cpuinfo
>
> processor: 127
> cpu: POWER9, altivec supported
> clock: 3283.00MHz
> revision: 2.2 (pvr 004e 1202)
>
> timebase: 51200
> platform: PowerNV
> model: 9006-22P
> machine: PowerNV 9006-22P
> firmware: OPAL
> MMU: Radix

Thanks, looks like that's a Boston, which at least has the same CPU as
the machine I'm testing on.

I have also reproduced it on bare metal now, but it seems much easier to
hit it in an LPAR (on P10). Not sure that's an actual data point, could
just be due to other activity on the box.

cheers


Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure

2023-03-28 Thread Michael Ellerman
Kautuk Consul  writes:
> On 2023-03-28 15:44:02, Kautuk Consul wrote:
>> On 2023-03-28 20:44:48, Michael Ellerman wrote:
>> > Kautuk Consul  writes:
>> > > kvmppc_vcore_create() might not be able to allocate memory through
>> > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be
>> > > incremented.
>> > 
>> > I agree that looks wrong.
>> > 
>> > Have you tried to test what goes wrong if it fails? It looks like it
>> > will break the LPCR update, which likely will cause the guest to crash
>> > horribly.
> Also, are you referring to the code in kvmppc_update_lpcr()?
> That code will not crash as it checks for the vc before trying to
> dereference it.

Yeah that's what I was looking at. I didn't mean it would crash, but
that it would bail out early when it sees a NULL vcore, leaving other
vcores with the wrong LPCR value.

But as you say it doesn't happen because qemu quits on the first ENOMEM.

And regardless if qemu does something that means the guest is broken
that's just a qemu bug, no big deal as far as the kernel is concerned.

> But the following 2 places that utilize the arch.online_vcores will have
> problems in logic if the usermode test-case doesn't pull down the
> kvm context after the -ENOMEM vcpu allocation failure:
> book3s_hv.c:3030:   if (!kvm->arch.online_vcores) {
> book3s_hv_rm_mmu.c:44:  if (kvm->arch.online_vcores == 1 && 
> local_paca->kvm_hstate.kvm_vcpu)

OK. Both of those look harmless to the host.

If we find a case where a misbehaving qemu can crash the host then we
need to be a bit more careful and treat it at least as a
denial-of-service bug. But looks like this is not one of those.

cheers


Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-28 Thread Michael Ellerman
"Nicholas Piggin"  writes:
> On Mon Mar 27, 2023 at 8:26 PM AEST, Christophe Leroy wrote:
...
>>
>> Now that thread.regs doesn't change anymore at each interrupt, it would 
>> probably be worth dropping it and falling back to task_pt_regs() as 
>> defined on most architecture.
>> Knowing whether a thread is a kernel or user thread can for instance be 
>> known with PF_KTHREAD flag, so no need of thread.regs for that.
>
> That would be nice if we can define regs that way, I agree. We should
> look into doing that.

Yeah it's on the long-list of things that need cleaning up.

I think there's some complication in working out which sites are OK to
use/give-out the value in pt_regs that's potentially a dummy value, vs
cases that actually want to check PF_KTHREAD and do something different.
But that's just my hunch I haven't looked through all the sites.
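(For illustration, the kind of flag-based check being discussed -- a
hypothetical helper, not existing code:)

#include <linux/sched.h>

/* Tasks that never return to user space can be identified from their
 * flags alone, without relying on thread.regs being NULL. */
static inline bool task_returns_to_user(struct task_struct *tsk)
{
	return !(tsk->flags & (PF_KTHREAD | PF_IO_WORKER));
}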

The thread.regs = NULL for kernel threads goes back to arch/ppc, about
2002 by the looks:

  
https://github.com/mpe/linux-fullhistory/commit/2a8e186c384c0c911f91cd12367658eabdc820d8#diff-939b705cff722ee75595fad30d56bb1175dfdce49a69adb4d5656f354be076c6

There's no change log of course :)

Still maybe it doesn't matter why it was originally done that way, if we
can do it differently now.

cheers


Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-28 Thread Michael Ellerman
Jens Axboe  writes:
> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
> from my (arguably very short) checking is not commonly done for other
> archs. This is fine, except when PF_IO_WORKER's have been created and
> the task does something that causes a coredump to be generated.

Do kthread's ever core dump? I didn't think they did, but I can't find
any logic to prevent it.

Maybe it's always been possible but just never happened due to luck.

As Nick said we should probably have a non-NULL regs for PF_IO_WORKERS,
but I'll still take this as a nice backportable fix for the immediate
crash.

I tagged it as Fixes: pointing back at the commit that added ppr_get(),
even though I don't know for sure the bug was triggerable back then
(v4.8).

cheers


[PATCH] perf vendor events power9: Remove UTF-8 characters from json files

2023-03-28 Thread Kajol Jain
Commit 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events")
added and updated power9 pmu json events. However some of the json
events which are part of other.json and pipeline.json files,
contains UTF-8 characters in their brief description.
Having UTF-8 character could brakes the perf build on some distros.
Fix this issue by removing the UTF-8 characters from other.json and
pipeline.json files.

Result without the fix patch:
[command]# file -i pmu-events/arch/powerpc/power9/*
pmu-events/arch/powerpc/power9/cache.json:  application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/floating-point.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/frontend.json:   application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/marked.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/memory.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/metrics.json:application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/nest_metrics.json:   application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/other.json:  application/json; 
charset=utf-8
pmu-events/arch/powerpc/power9/pipeline.json:   application/json; 
charset=utf-8
pmu-events/arch/powerpc/power9/pmc.json:application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/translation.json:application/json; 
charset=us-ascii

Result with the fix patch:

[command]# file -i pmu-events/arch/powerpc/power9/*
pmu-events/arch/powerpc/power9/cache.json:  application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/floating-point.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/frontend.json:   application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/marked.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/memory.json: application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/metrics.json:application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/nest_metrics.json:   application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/other.json:  application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/pipeline.json:   application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/pmc.json:application/json; 
charset=us-ascii
pmu-events/arch/powerpc/power9/translation.json:application/json; 
charset=us-ascii

Fixes: 3c22ba524304 ("perf vendor events powerpc: Update POWER9 events")
Reported-by: Arnaldo Carvalho de Melo 
Link: https://lore.kernel.org/lkml/zbxp77deq7ikt...@kernel.org/
Signed-off-by: Kajol Jain 
---
 tools/perf/pmu-events/arch/powerpc/power9/other.json| 4 ++--
 tools/perf/pmu-events/arch/powerpc/power9/pipeline.json | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/pmu-events/arch/powerpc/power9/other.json 
b/tools/perf/pmu-events/arch/powerpc/power9/other.json
index 3f69422c21f9..f10bd554521a 100644
--- a/tools/perf/pmu-events/arch/powerpc/power9/other.json
+++ b/tools/perf/pmu-events/arch/powerpc/power9/other.json
@@ -1417,7 +1417,7 @@
   {
 "EventCode": "0x45054",
 "EventName": "PM_FMA_CMPL",
-"BriefDescription": "two flops operation completed (fmadd, fnmadd, fmsub, 
fnmsub) Scalar instructions only. "
+"BriefDescription": "two flops operation completed (fmadd, fnmadd, fmsub, 
fnmsub) Scalar instructions only."
   },
   {
 "EventCode": "0x201E8",
@@ -2017,7 +2017,7 @@
   {
 "EventCode": "0xC0BC",
 "EventName": "PM_LSU_FLUSH_OTHER",
-"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 
caused search of LRQ for oldest snooped load, This will either signal a Precise 
Flush of the oldest snooped loa or a Flush Next PPC); Data Valid Flush Next 
(several cases of this, one example is store and reload are lined up such that 
a store-hit-reload scenario exists and the CDF has already launched and has 
gotten bad/stale data); Bad Data Valid Flush Next (might be a few cases of 
this, one example is a larxa (D$ hit) return data and dval but can't allocate 
to LMQ (LMQ full or other reason). Already gave dval but can't watch it for 
snoop_hit_larx. Need to take the “bad dval” back and flush all younger ops)"
+"BriefDescription": "Other LSU flushes including: Sync (sync ack from L2 
caused search of LRQ for oldest snooped load, This will either signal a Precise 
Flush of the oldest snooped loa or a Flush Next PPC); Data Valid Flush Next 
(several cases of this, one example is store and reload are lined up such that 
a store-hit-reload scenario exists and the CDF has already launched and has 
gotten bad/stale data); Bad Data Valid Flush Next (might be a few cases of 
this, one example is a larxa (D$ hit) return data and dval but can't allocate 
to LMQ (LMQ full or other reason). Already gave dval but can't watch it for 
snoop_hit_larx. Need to take the 'bad 

Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure

2023-03-28 Thread Kautuk Consul
On 2023-03-28 15:44:02, Kautuk Consul wrote:
> On 2023-03-28 20:44:48, Michael Ellerman wrote:
> > Kautuk Consul  writes:
> > > kvmppc_vcore_create() might not be able to allocate memory through
> > > kzalloc. In that case the kvm->arch.online_vcores shouldn't be
> > > incremented.
> > 
> > I agree that looks wrong.
> > 
> > Have you tried to test what goes wrong if it fails? It looks like it
> > will break the LPCR update, which likely will cause the guest to crash
> > horribly.
Also, are you referring to the code in kvmppc_update_lpcr()?
That code will not crash as it checks for the vc before trying to
dereference it.
But the following 2 places that utilize the arch.online_vcores will have
problems in logic if the usermode test-case doesn't pull down the
kvm context after the -ENOMEM vcpu allocation failure:
book3s_hv.c:3030:   if (!kvm->arch.online_vcores) {
book3s_hv_rm_mmu.c:44:  if (kvm->arch.online_vcores == 1 && 
local_paca->kvm_hstate.kvm_vcpu)

> Not sure about LPCR update, but with and without the patch qemu exits
> and so the kvm context is pulled down fine.
> > 
> > You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one
> > allocation for a guest. Or probably easier to just hack the code to fail
> > the 4th time it's called using a static counter.
> I am using live debug and I set the r3 return value to 0x0 after the
> call to kzalloc.
> > 
> > Doesn't really matter but could be interesting.
> With and without this patch qemu quits with:
> qemu-system-ppc64: kvm_init_vcpu: kvm_get_vcpu failed (0): Cannot allocate 
> memory
> 
> That's because qemu will shut down when any vcpu is not able
> to be allocated.
> > 
> > > Add a check for kzalloc failure and return with -ENOMEM from
> > > kvmppc_core_vcpu_create_hv().
> > >
> > > Signed-off-by: Kautuk Consul 
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c | 10 +++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index 6ba68dd6190b..e29ee755c920 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct 
> > > kvm_vcpu *vcpu)
> > >   pr_devel("KVM: collision on id %u", id);
> > >   vcore = NULL;
> > >   } else if (!vcore) {
> > > + vcore = kvmppc_vcore_create(kvm,
> > > + id & ~(kvm->arch.smt_mode - 1));
> > 
> > That line doesn't need to be wrapped, we allow 90 columns.
> > 
> > > + if (unlikely(!vcore)) {
> > > + mutex_unlock(&kvm->lock);
> > > + return -ENOMEM;
> > > + }
> > 
> > Rather than introducing a new return point here, I think it would be
> > preferable to use the existing !vcore case below.
> > 
> > >   /*
> > >* Take mmu_setup_lock for mutual exclusion
> > >* with kvmppc_update_lpcr().
> > >*/
> > > - err = -ENOMEM;
> > > - vcore = kvmppc_vcore_create(kvm,
> > > - id & ~(kvm->arch.smt_mode - 1));
> > 
> > So leave that as is (maybe move the comment down).
> > 
> > And wrap the below in:
> > 
> >  +  if (vcore) {
> > 
> > >   mutex_lock(&kvm->arch.mmu_setup_lock);
> > >   kvm->arch.vcores[core] = vcore;
> > >   kvm->arch.online_vcores++;
> > 
> > mutex_unlock(&kvm->arch.mmu_setup_lock);
> >  +  }
> > }
> > }
> > 
> > Meaning the vcore == NULL case will fall through to here and return via
> > this existing path:
> > 
> > mutex_unlock(&kvm->lock);
> > 
> > if (!vcore)
> > return err;
> > 
> > 
> > cheers


Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure

2023-03-28 Thread Kautuk Consul
On 2023-03-28 20:44:48, Michael Ellerman wrote:
> Kautuk Consul  writes:
> > kvmppc_vcore_create() might not be able to allocate memory through
> > kzalloc. In that case the kvm->arch.online_vcores shouldn't be
> > incremented.
> 
> I agree that looks wrong.
> 
> Have you tried to test what goes wrong if it fails? It looks like it
> will break the LPCR update, which likely will cause the guest to crash
> horribly.
Not sure about LPCR update, but with and without the patch qemu exits
and so the kvm context is pulled down fine.
> 
> You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one
> allocation for a guest. Or probably easier to just hack the code to fail
> the 4th time it's called using a static counter.
I am using live debug and I set the r3 return value to 0x0 after the
call to kzalloc.
> 
> Doesn't really matter but could be interesting.
With and without this patch qemu quits with:
qemu-system-ppc64: kvm_init_vcpu: kvm_get_vcpu failed (0): Cannot allocate 
memory

That's because qemu will shut down when any vcpu is not able
to be allocated.
> 
> > Add a check for kzalloc failure and return with -ENOMEM from
> > kvmppc_core_vcpu_create_hv().
> >
> > Signed-off-by: Kautuk Consul 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 10 +++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 6ba68dd6190b..e29ee755c920 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct 
> > kvm_vcpu *vcpu)
> > pr_devel("KVM: collision on id %u", id);
> > vcore = NULL;
> > } else if (!vcore) {
> > +   vcore = kvmppc_vcore_create(kvm,
> > +   id & ~(kvm->arch.smt_mode - 1));
> 
> That line doesn't need to be wrapped, we allow 90 columns.
> 
> > +   if (unlikely(!vcore)) {
> > +   mutex_unlock(&kvm->lock);
> > +   return -ENOMEM;
> > +   }
> 
> Rather than introducing a new return point here, I think it would be
> preferable to use the existing !vcore case below.
> 
> > /*
> >  * Take mmu_setup_lock for mutual exclusion
> >  * with kvmppc_update_lpcr().
> >  */
> > -   err = -ENOMEM;
> > -   vcore = kvmppc_vcore_create(kvm,
> > -   id & ~(kvm->arch.smt_mode - 1));
> 
> So leave that as is (maybe move the comment down).
> 
> And wrap the below in:
> 
>  +  if (vcore) {
> 
> > mutex_lock(&kvm->arch.mmu_setup_lock);
> > kvm->arch.vcores[core] = vcore;
> > kvm->arch.online_vcores++;
>   
>   mutex_unlock(&kvm->arch.mmu_setup_lock);
>  +  }
>   }
>   }
> 
> Meaning the vcore == NULL case will fall through to here and return via
> this existing path:
> 
>   mutex_unlock(&kvm->lock);
> 
>   if (!vcore)
>   return err;
> 
> 
> cheers


Re: [PATCH] arch/powerpc/kvm: kvmppc_core_vcpu_create_hv: check for kzalloc failure

2023-03-28 Thread Michael Ellerman
Kautuk Consul  writes:
> kvmppc_vcore_create() might not be able to allocate memory through
> kzalloc. In that case the kvm->arch.online_vcores shouldn't be
> incremented.

I agree that looks wrong.

Have you tried to test what goes wrong if it fails? It looks like it
will break the LPCR update, which likely will cause the guest to crash
horribly.

You could use CONFIG_FAIL_SLAB and fail-nth etc. to fail just one
allocation for a guest. Or probably easier to just hack the code to fail
the 4th time it's called using a static counter.

Doesn't really matter but could be interesting.

> Add a check for kzalloc failure and return with -ENOMEM from
> kvmppc_core_vcpu_create_hv().
>
> Signed-off-by: Kautuk Consul 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6ba68dd6190b..e29ee755c920 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -2968,13 +2968,17 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
> *vcpu)
>   pr_devel("KVM: collision on id %u", id);
>   vcore = NULL;
>   } else if (!vcore) {
> + vcore = kvmppc_vcore_create(kvm,
> + id & ~(kvm->arch.smt_mode - 1));

That line doesn't need to be wrapped, we allow 90 columns.

> + if (unlikely(!vcore)) {
> + mutex_unlock(&kvm->lock);
> + return -ENOMEM;
> + }

Rather than introducing a new return point here, I think it would be
preferable to use the existing !vcore case below.

>   /*
>* Take mmu_setup_lock for mutual exclusion
>* with kvmppc_update_lpcr().
>*/
> - err = -ENOMEM;
> - vcore = kvmppc_vcore_create(kvm,
> - id & ~(kvm->arch.smt_mode - 1));

So leave that as is (maybe move the comment down).

And wrap the below in:

 +  if (vcore) {

>   mutex_lock(&kvm->arch.mmu_setup_lock);
>   kvm->arch.vcores[core] = vcore;
>   kvm->arch.online_vcores++;

mutex_unlock(&kvm->arch.mmu_setup_lock);
 +  }
}
}

Meaning the vcore == NULL case will fall through to here and return via
this existing path:

mutex_unlock(&kvm->lock);

if (!vcore)
return err;
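Putting the suggestions above together, the modified region of
kvmppc_core_vcpu_create_hv() would look roughly like this (an untested sketch
of the suggested shape, not a final patch):

	} else if (!vcore) {
		/*
		 * Take mmu_setup_lock for mutual exclusion
		 * with kvmppc_update_lpcr().
		 */
		err = -ENOMEM;
		vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
		if (vcore) {
			mutex_lock(&kvm->arch.mmu_setup_lock);
			kvm->arch.vcores[core] = vcore;
			kvm->arch.online_vcores++;
			mutex_unlock(&kvm->arch.mmu_setup_lock);
		}
	}

	mutex_unlock(&kvm->lock);

	if (!vcore)
		return err;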


cheers


Re: [PATCH v12 13/13] arm64: dts: ls1088ardb: Add serdes descriptions

2023-03-28 Thread Ioana Ciornei
On Mon, Mar 27, 2023 at 02:15:47PM -0400, Sean Anderson wrote:
> On 3/24/23 09:17, Ioana Ciornei wrote:
> > On Tue, Mar 21, 2023 at 04:13:12PM -0400, Sean Anderson wrote:
> >> This adds serdes support to the LS1088ARDB. I have tested the QSGMII
> >> ports as well as the two 10G ports. The SFP slot is now fully supported,
> >> instead of being modeled as a fixed-link.
> >> 
> >> Linux hangs around when the serdes is initialized if the si5341 is
> >> enabled with the in-tree driver, so I have modeled it as a two fixed
> >> clocks instead. There are a few registers in the QIXIS FPGA which
> >> control the SFP GPIOs; I have modeled them as discrete GPIO controllers
> >> for now. I never saw the AQR105 interrupt fire; not sure what was going
> >> on, but I have removed it to force polling.
> > 
> > So you didn't see the interrupt fire even without these patches?
> 
> Not sure. I went to check this, and discovered I could no longer get the
> link to come up in Linux, even on v6.0 (before the rate adaptation
> stuff). I see the LEDs blinking in U-Boot, so presumably it's some
> configuration problem. I'm going to look into this further when I have
> more time.
> 
> > I just tested this on a LS1088ARDB and it works.
> > 
> > root@localhost:~# cat /proc/interrupts | grep extirq
> >  99:  5  ls-extirq   2 Level 0x08b97000:00
> > root@localhost:~# ip link set dev endpmac2 up
> > root@localhost:~# cat /proc/interrupts | grep extirq
> >  99:  6  ls-extirq   2 Level 0x08b97000:00
> > root@localhost:~# ip link set dev endpmac2 down
> > root@localhost:~# cat /proc/interrupts | grep extirq
> >  99:  7  ls-extirq   2 Level 0x08b97000:00
> > 
> > Please don't just remove things.
> 
> Well, polling isn't the worst thing for a single interface... I do
> remember having a problem with the interrupt. If this series works
> with interrupts enabled, I can leave it in.
> 
> Did you have a chance to look at the core (patches 7 and 8) of this
> series? Does it make sense to you? Am I missing something which would
> allow switching from 1G->10G?
> 

For a bit of context, I also attempted dynamic switching from 1G to 10G
on my own even before this patch set but I did not get a link up on the
PCS (CDR lock was there though). Pretty much the same state as you.

What I propose is to take this whole endeavor step by step.
I am also interested in getting this feature to actually work but I
just didn't have the time to investigate in depth what is missing.
And without the dynamic switching I cannot say that I find the addition
of the SerDes PHY driver useful.

I have the Lynx 10G on my TODO list but I still have some other tasks
on the Lynx 28G for the next 2-3 weeks. Once I get those done, I will
look closer at the patches.

In the meantime, some small things from this patch set can be submitted
separately. For example, describing the SFP cage on the LS1088ARDB.
I still have some small questions on the DTS implementation for the gpio
controllers but I would be able to submit this myself if you do not find
the time (with your authorship of course).

Ioana


Re: [PATCH 0/2] KVM: PPC: support kvm selftests

2023-03-28 Thread Michael Ellerman
"Nicholas Piggin"  writes:
> On Tue Mar 28, 2023 at 3:43 AM AEST, Sean Christopherson wrote:
>> On Mon, Mar 27, 2023, Nicholas Piggin wrote:
>> > On Thu Mar 23, 2023 at 3:41 AM AEST, Sean Christopherson wrote:
...
>> > >
>> > > What is the long term plan for KVM PPC maintenance?  I was under the 
>> > > impression
>> > > that KVM PPC was trending toward "bug fixes only", but the addition of 
>> > > selftests
>> > > support suggests otherwise.
...
>
>> and ideally there would be one or more M: (and R:) entries as well.  I'm not
>> all that concerned about the selftests support being abandoned, but the lack 
>> of
>> specific contacts makes it look like KVM PPC is in maintenance-only mode, 
>> and it
>> sounds like that's not the case.
>
> Yeah, I guess the intention was to bring it a bit more under general
> arch/powerpc maintainership but it does look a bit odd having a top
> level entry and no contacts. We'll reconsider what to do with that.

Yeah I agree it ends up looking a bit weird.

The intention was just to make it clear that Paul's tree was no longer
where patches were being handled, and that they'd be handled as regular
powerpc patches.

At the time I hoped we'd relatively quickly be able to add someone as at
least a KVM "R:"eviewer, but circumstances intervened.

Hopefully we can fix that soon.

cheers


Re: [PATCH v3 0/4] Use dma_default_coherent for devicetree default coherency

2023-03-28 Thread Thomas Bogendoerfer
On Tue, Mar 28, 2023 at 03:18:12AM +0200, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 09:17:38AM +, Jiaxun Yang wrote:
> > > 
> > > Is patch a 6.3 candidate or should all of it go into 6.4?
> > 
> > Please leave it for 6.4, as corresponding MIPS arch part will be a part of 
> > 6.4.
> 
> Ok.  I'll really need review from the MIPS and drivers/of/ maintainers,
> through.

I don't see any MIPS changes in the series besides the ifdef CONFIG_MIPS
part in patch 1, which gets removed again in patch 4 (chance to drop
that completely ?).

I've merged the corresponding MIPS patches into mips-next last week.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


Re: [kvm-unit-tests v3 00/13] powerpc: updates, P10, PNV support

2023-03-28 Thread Cédric Le Goater

On 3/28/23 09:15, Nicholas Piggin wrote:

On Tue Mar 28, 2023 at 2:09 AM AEST, Cédric Le Goater wrote:

On 3/27/23 14:45, Nicholas Piggin wrote:

This series is growing a bit I'm sorry. v2 series added extra interrupt
vectors support which was actually wrong because interrupt handling
code can only cope with 0x100-size vectors and new ones are 0x80 and
0x20. It managed to work because those alias to the 0x100 boundary, but
if more than one handler were installed in the same 0x100-aligned
block it would crash. So a couple of patches added to cope with that.



I gave them a try on P9 box


Thanks!



$ ./run_tests.sh
PASS selftest-setup (2 tests)
PASS spapr_hcall (9 tests, 1 skipped)
PASS spapr_vpa (13 tests)
PASS rtas-get-time-of-day (10 tests)
PASS rtas-get-time-of-day-base (10 tests)
PASS rtas-set-time-of-day (5 tests)
PASS emulator (4 tests)
PASS h_cede_tm (2 tests)
FAIL sprs (75 tests, 1 unexpected failures)


Oh you have a SPR failure too? I'll check that on a P9.


I think it was the WORT SPR




FAIL sprs-migration (75 tests, 5 unexpected failures)

And with TCG:

$ ACCEL=tcg ./run_tests.sh
PASS selftest-setup (2 tests)
PASS spapr_hcall (9 tests, 1 skipped)
FAIL spapr_vpa (13 tests, 1 unexpected failures)

The dispatch count seems bogus after unregister


Yeah, that dispatch count after unregister test may be bogus actually.
PAPR doesn't specify what should happen in that case. It was working
here for me though so interesting it's different for you. I'll
investigate it and maybe just remove that test for now.


It would be nice to keep it and skip it until the emulation is fixed.
 




PASS rtas-get-time-of-day (10 tests)
PASS rtas-get-time-of-day-base (10 tests)
PASS rtas-set-time-of-day (5 tests)
PASS emulator (4 tests)
SKIP h_cede_tm (qemu-system-ppc64: TCG cannot support more than 1 thread/core 
on a pseries machine)
FAIL sprs (75 tests, 16 unexpected failures)


These should be TCG errors. I have it passing them all with patches
posted to qemu lists. Very simple but effective way to catch a few
classes of errors.


Ah  I didn't try with your QEMU patches. Make sense then.

Thanks,

C.



Re: [kvm-unit-tests v3 00/13] powerpc: updates, P10, PNV support

2023-03-28 Thread Nicholas Piggin
On Tue Mar 28, 2023 at 2:09 AM AEST, Cédric Le Goater wrote:
> On 3/27/23 14:45, Nicholas Piggin wrote:
> > This series is growing a bit I'm sorry. v2 series added extra interrupt
> > vectors support which was actually wrong because interrupt handling
> > code can only cope with 0x100-size vectors and new ones are 0x80 and
> > 0x20. It managed to work because those alias to the 0x100 boundary, but
> > if more than one handler were installed in the same 0x100-aligned
> > block it would crash. So a couple of patches added to cope with that.
> > 
>
> I gave them a try on P9 box

Thanks!

>
> $ ./run_tests.sh
> PASS selftest-setup (2 tests)
> PASS spapr_hcall (9 tests, 1 skipped)
> PASS spapr_vpa (13 tests)
> PASS rtas-get-time-of-day (10 tests)
> PASS rtas-get-time-of-day-base (10 tests)
> PASS rtas-set-time-of-day (5 tests)
> PASS emulator (4 tests)
> PASS h_cede_tm (2 tests)
> FAIL sprs (75 tests, 1 unexpected failures)

Oh you have a SPR failure too? I'll check that on a P9.

> FAIL sprs-migration (75 tests, 5 unexpected failures)
>
> And with TCG:
>
> $ ACCEL=tcg ./run_tests.sh
> PASS selftest-setup (2 tests)
> PASS spapr_hcall (9 tests, 1 skipped)
> FAIL spapr_vpa (13 tests, 1 unexpected failures)
>
> The dispatch count seems bogus after unregister

Yeah, that dispatch count after unregister test may be bogus actually.
PAPR doesn't specify what should happen in that case. It was working
here for me though so interesting it's different for you. I'll
investigate it and maybe just remove that test for now.

>
> PASS rtas-get-time-of-day (10 tests)
> PASS rtas-get-time-of-day-base (10 tests)
> PASS rtas-set-time-of-day (5 tests)
> PASS emulator (4 tests)
> SKIP h_cede_tm (qemu-system-ppc64: TCG cannot support more than 1 thread/core 
> on a pseries machine)
> FAIL sprs (75 tests, 16 unexpected failures)

These should be TCG errors. I have it passing them all with patches
posted to qemu lists. Very simple but effective way to catch a few
classes of errors.

Thanks,
Nick


Re: [PATCH 2/2] powerpc/64: Rename entry_64.S to prom_entry.S

2023-03-28 Thread Christophe Leroy


Le 28/03/2023 à 08:51, Nicholas Piggin a écrit :
> On Tue Mar 28, 2023 at 3:48 AM AEST, Christophe Leroy wrote:
>>
>>
>> Le 25/03/2023 à 14:06, Nicholas Piggin a écrit :
>>> This file contains only the enter_prom implementation now.
>>> Trim includes and update header comment while we're here.
>>>
>>> Signed-off-by: Nicholas Piggin 
>>> ---
>>>arch/powerpc/kernel/Makefile  |  8 +++--
>>>.../kernel/{entry_64.S => prom_entry.S}   | 30 ++-
>>>scripts/head-object-list.txt  |  2 +-
>>>3 files changed, 9 insertions(+), 31 deletions(-)
>>>rename arch/powerpc/kernel/{entry_64.S => prom_entry.S} (73%)
>>>
>>> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
>>> index ec70a1748506..ebba0896998a 100644
>>> --- a/arch/powerpc/kernel/Makefile
>>> +++ b/arch/powerpc/kernel/Makefile
>>> @@ -209,10 +209,12 @@ CFLAGS_paca.o += -fno-stack-protector
>>>
>>>obj-$(CONFIG_PPC_FPU)+= fpu.o
>>>obj-$(CONFIG_ALTIVEC)+= vector.o
>>> -obj-$(CONFIG_PPC64)+= entry_64.o
>>> -obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)   += prom_init.o
>>>
>>> -extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check
>>> +ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE
>>
>> You don't need that ifdef construct, you can do:
>>
>> obj64-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_entry.o
> 
> Nice. So that could have been obj64-y from the start?

Yes, although it is not used that way in ppc/kernel/Makefile:

$ git grep -e obj64 -e obj32 arch/powerpc/kernel/Makefile
arch/powerpc/kernel/Makefile:obj64-$(CONFIG_HIBERNATION)       += swsusp_asm64.o
arch/powerpc/kernel/Makefile:obj64-$(CONFIG_AUDIT)             += compat_audit.o
arch/powerpc/kernel/Makefile:obj64-$(CONFIG_PPC_TRANSACTIONAL_MEM) += tm.o
arch/powerpc/kernel/Makefile:obj-$(CONFIG_PPC64)               += $(obj64-y)
arch/powerpc/kernel/Makefile:obj-$(CONFIG_PPC32)               += $(obj32-y)

But it is in ppc/lib/Makefile:

$ git grep -e obj64 -e obj32 arch/powerpc/lib/Makefile
arch/powerpc/lib/Makefile:obj64-y                         += copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
arch/powerpc/lib/Makefile:obj64-$(CONFIG_SMP)             += locks.o
arch/powerpc/lib/Makefile:obj64-$(CONFIG_ALTIVEC)         += vmx-helper.o
arch/powerpc/lib/Makefile:obj64-$(CONFIG_KPROBES_SANITY_TEST) += test_emulate_step.o \
arch/powerpc/lib/Makefile:obj64-y                         += quad.o
arch/powerpc/lib/Makefile:obj-$(CONFIG_PPC64)             += $(obj64-y)

Christophe


Re: [kvm-unit-tests v3 03/13] powerpc: Add some checking to exception handler install

2023-03-28 Thread Nicholas Piggin
On Tue Mar 28, 2023 at 12:39 AM AEST, Thomas Huth wrote:
> On 27/03/2023 14.45, Nicholas Piggin wrote:
> > Check to ensure exception handlers are not being overwritten or
> > invalid exception numbers are used.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> > Since v2:
> > - New patch
> > 
> >   lib/powerpc/processor.c | 12 
> >   1 file changed, 12 insertions(+)
> > 
> > diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
> > index ec85b9d..70391aa 100644
> > --- a/lib/powerpc/processor.c
> > +++ b/lib/powerpc/processor.c
> > @@ -19,11 +19,23 @@ static struct {
> >   void handle_exception(int trap, void (*func)(struct pt_regs *, void *),
> >   void * data)
> >   {
> > +   if (trap & 0xff) {
>
> You could check for the other "invalid exception handler" condition here 
> already, i.e. if (trap & ~0xf00) ...
>
> I'd maybe simply do an "assert(!(trap & ~0xf00))" here.
>
> > +   printf("invalid exception handler %#x\n", trap);
> > +   abort();
> > +   }
> > +
> > trap >>= 8;
> >   
> > if (trap < 16) {
>
> ... then you could get rid of the if-statement here and remove one level of 
> indentation in the code below.

Yes, that's the way to do it. I feel embarrassed for not thinking
of it :)

Thanks,
Nick

>
> > +   if (func && handlers[trap].func) {
> > +   printf("exception handler installed twice %#x\n", trap);
> > +   abort();
> > +   }
> > handlers[trap].func = func;
> > handlers[trap].data = data;
> > +   } else {
> > +   printf("invalid exception handler %#x\n", trap);
> > +   abort();
> > }
> >   }
> >   
>
>   Thomas
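
For reference, a rough sketch of what handle_exception() could look like
with Thomas's assert() suggestion folded in. The handlers[] usage follows
the quoted patch; the exact form of the bounds check is an assumption,
not the final kvm-unit-tests code:

void handle_exception(int trap, void (*func)(struct pt_regs *, void *),
		      void *data)
{
	/* Only 0x100-aligned vectors below 0x1000 are supported */
	assert(!(trap & ~0xf00));

	trap >>= 8;

	if (func && handlers[trap].func) {
		printf("exception handler installed twice %#x\n", trap);
		abort();
	}
	handlers[trap].func = func;
	handlers[trap].data = data;
}

With the assert() up front, the "trap < 16" test and its extra level of
indentation are no longer needed, which is exactly the simplification
Thomas suggested.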



Re: [PATCH 2/2] powerpc/64: Rename entry_64.S to prom_entry.S

2023-03-28 Thread Nicholas Piggin
On Tue Mar 28, 2023 at 3:48 AM AEST, Christophe Leroy wrote:
>
>
> Le 25/03/2023 à 14:06, Nicholas Piggin a écrit :
> > This file contains only the enter_prom implementation now.
> > Trim includes and update header comment while we're here.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >   arch/powerpc/kernel/Makefile  |  8 +++--
> >   .../kernel/{entry_64.S => prom_entry.S}   | 30 ++-
> >   scripts/head-object-list.txt  |  2 +-
> >   3 files changed, 9 insertions(+), 31 deletions(-)
> >   rename arch/powerpc/kernel/{entry_64.S => prom_entry.S} (73%)
> > 
> > diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> > index ec70a1748506..ebba0896998a 100644
> > --- a/arch/powerpc/kernel/Makefile
> > +++ b/arch/powerpc/kernel/Makefile
> > @@ -209,10 +209,12 @@ CFLAGS_paca.o += -fno-stack-protector
> >   
> >   obj-$(CONFIG_PPC_FPU) += fpu.o
> >   obj-$(CONFIG_ALTIVEC) += vector.o
> > -obj-$(CONFIG_PPC64)+= entry_64.o
> > -obj-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE)   += prom_init.o
> >   
> > -extra-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_init_check
> > +ifdef CONFIG_PPC_OF_BOOT_TRAMPOLINE
>
> You don't need that ifdef construct, you can do:
>
> obj64-$(CONFIG_PPC_OF_BOOT_TRAMPOLINE) += prom_entry.o

Nice. So that could have been obj64-y from the start?

Thanks,
Nick

>
> > +obj-y  += prom_init.o
> > +obj-$(CONFIG_PPC64)+= prom_entry.o
> > +extra-y+= prom_init_check
> > +endif
> >   
> >   quiet_cmd_prom_init_check = PROMCHK $@
> > cmd_prom_init_check = $(CONFIG_SHELL) $< "$(NM)" 
> > $(obj)/prom_init.o; touch $@
>
>
> Christophe



Re: [PATCH 0/2] KVM: PPC: support kvm selftests

2023-03-28 Thread Nicholas Piggin
On Tue Mar 28, 2023 at 3:43 AM AEST, Sean Christopherson wrote:
> On Mon, Mar 27, 2023, Nicholas Piggin wrote:
> > On Thu Mar 23, 2023 at 3:41 AM AEST, Sean Christopherson wrote:
> > > On Thu, Mar 16, 2023, Michael Ellerman wrote:
> > > > Nicholas Piggin  writes:
> > > > > Hi,
> > > > >
> > > > > This series adds initial KVM selftests support for powerpc
> > > > > (64-bit, BookS).
> > > > 
> > > > Awesome.
> > > >  
> > > > > It spans 3 maintainers but it does not really
> > > > > affect arch/powerpc, and it is well contained in selftests
> > > > > code, just touches some makefiles and a tiny bit headers so
> > > > > conflicts should be unlikely and trivial.
> > > > >
> > > > > I guess Paolo is the best point to merge these, if no comments
> > > > > or objections?
> > > > 
> > > > Yeah. If it helps:
> > > > 
> > > > Acked-by: Michael Ellerman  (powerpc)
> > >
> > > What is the long term plan for KVM PPC maintenance?  I was under the 
> > > impression
> > > that KVM PPC was trending toward "bug fixes only", but the addition of 
> > > selftests
> > > support suggests otherwise.
> > 
> > We plan to continue maintaining it. New support and features has been a
> > bit low in the past couple of years, hopefully that will pick up a bit
> > though.
>
> Partly out of curiosity, but also to get a general feel for what types of 
> changes
> we might see, what are the main use cases for KVM PPC these days?  E.g. is it 
> mainly
> a vehicle for developing and testing, hosting VMs in the cloud, something 
> else?

I didn't have anything too specific in mind, just generally a bit more
development activity.

As for use cases, I think broadly KVM is increasingly seen as a Linux
feature and expected to be there, even in so-called enterprise world.
For software and workflows designed around Linux virt, it can be
difficult or impossible to substitute the proprietary partitioning layer
in our firmware. So there is always someone who wants KVM for something.
It could be all of the above, and even just as a traditional hypervisor
hosting VMs in-house, there are people who would like to use KVM.

>
> > > I ask primarily because routing KVM PPC patches through the PPC tree is 
> > > going to
> > > be problematic if KVM PPC sees signficiant development.  The current 
> > > situation is
> > > ok because the volume of patches is low and KVM PPC isn't trying to drive 
> > > anything
> > > substantial into common KVM code, but if that changes... 
> > 
> > Michael has done KVM topic branches to pull from a few times when such
> > conflicts came up (at smaller scale). If we end up with larger changes
> > or regular conflicts we might start up a kvm-ppc tree again I guess.
>
> A wait-and-see approach works for me.  I don't have any complaints with the 
> current
> process, I was just caught off guard.

Okay. Complaints and suggestions to improve the process are always
welcome, so if something could be done better, feel free to ping
the list or powerpc maintainers offline.

> > > My other concern is that for selftests specifically, us KVM folks are 
> > > taking on
> > > more maintenance burden by supporting PPC.  AFAIK, none of the people 
> > > that focus
> > > on KVM selftests in any meaningful capacity have access to PPC hardware, 
> > > let alone
> > > know enough about the architecture to make intelligent code changes.
> > >
> > > Don't get me wrong, I'm very much in favor of more testing, I just don't 
> > > want KVM
> > > to get left holding the bag.
> > 
> > Understood. I'll be happy to maintain powerpc part of kvm selftests and
> > do any fixes that are needed for core code changes. If support fell away
> > you could leave it broken (or rm -rf it if you prefer) -- I wouldn't ask
> > anybody to work on archs they don't know or aren't paid to.
> > 
> > Not sure if anything more can be done to help your process or ease your
> > mind. It (KVM and kvm-selftests) can run in QEMU at least.
>
> Updating the KVM/powerpc to include selftests would be very helpful, e.g
>
>   F:  tools/testing/selftests/kvm/*/powerpc/
>   F:  tools/testing/selftests/kvm/powerpc/

Oh good point, I should have included that. I'll send a patch.

> and ideally there would be one or more M: (and R:) entries as well.  I'm not
> all that concerned about the selftests support being abandoned, but the lack 
> of
> specific contacts makes it look like KVM PPC is in maintenance-only mode, and 
> it
> sounds like that's not the case.

Yeah, I guess the intention was to bring it a bit more under general
arch/powerpc maintainership but it does look a bit odd having a top
level entry and no contacts. We'll reconsider what to do with that.

Thanks,
Nick


Memory coherency issue with IO thread offloading?

2023-03-28 Thread Daniel Black
Thanks Jens, Nick, Christophe and Michael for your work so far.

Apologies for the out of thread email.

Confirming MariaDB 10.6+ is required (that is when we added liburing), and
previous versions used libaio (which tested without incident, as mpe
retested).

We were (we're now back on the last good kernel Jens indicated) getting
failures like https://buildbot.mariadb.org/#/builders/231/builds/16857
in a container (of various distro userspaces) on bare metal.

bare metal end of /proc/cpuinfo

processor: 127
cpu: POWER9, altivec supported
clock: 3283.00MHz
revision: 2.2 (pvr 004e 1202)

timebase: 51200
platform: PowerNV
model: 9006-22P
machine: PowerNV 9006-22P
firmware: OPAL
MMU: Radix


Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-28 Thread Nicholas Piggin
On Mon Mar 27, 2023 at 8:26 PM AEST, Christophe Leroy wrote:
>
>
> Le 27/03/2023 à 08:36, Nicholas Piggin a écrit :
> > On Mon Mar 27, 2023 at 8:15 AM AEST, Jens Axboe wrote:
> >> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
> >> from my (arguably very short) checking is not commonly done for other
> >> archs. This is fine, except when PF_IO_WORKER's have been created and
> >> the task does something that causes a coredump to be generated. Then we
> >> get this crash:
> > 
> > Hey Jens,
> > 
> > Thanks for the testing and the patch.
> > 
> > I think your patch would work, but I'd be inclined to give the IO worker
> > a pt_regs so it looks more like other archs and a regular user thread.
> > 
> > Your IO worker bug reminded me to resurrect some copy_thread patches I
> > had and I think they should do that
> > 
> > https://lists.ozlabs.org/pipermail/linuxppc-dev/2023-March/256271.html
> > 
> > I wouldn't ask you to test it until I've at least tried, do you have a
> > test case that triggers this?
>
> In fact, most architectures don't have thread.regs, but rely on something 
> like:
>
> #define task_pt_regs(p) \
>   ((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1)
>
>
> In powerpc, thread.regs was there because of the regs not being at the 
> same place in the stack depending on which interrupt it was.
>
> However with the two commits below, we now have stable fixed location 
> for the regs, so thread.regs shouldn't be needed anymore:
> - db297c3b07af ("powerpc/32: Don't save thread.regs on interrupt entry")
> - b5cfc9cd7b04 ("powerpc/32: Fix critical and debug interrupts on BOOKE")
>
> But in the meantime, thread.regs started to be used for additional 
> purposes, like knowing if it is a user thread or a kernel thread by using 
> thread.regs nullity.
>
>
> Now that thread.regs doesn't change anymore at each interrupt, it would 
> probably be worth dropping it and falling back to task_pt_regs() as 
> defined on most architectures.
> Whether a thread is a kernel or user thread can, for instance, be 
> determined with the PF_KTHREAD flag, so there is no need for thread.regs there.

It would be nice if we could define regs that way, I agree. We should
look into doing that.

Thanks,
Nick
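
As an illustration of the direction Christophe describes, a minimal
sketch of the generic pattern in kernel context. The macro mirrors what
several other architectures already do; task_is_kernel_thread() is a
hypothetical helper added here only to show the PF_KTHREAD check, not an
existing powerpc function:

/*
 * Generic-style task_pt_regs(): the user register frame sits at a fixed
 * location at the top of the task's kernel stack, so no per-task pointer
 * (thread.regs) is needed to find it.
 */
#define task_pt_regs(p) \
	((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1)

/*
 * Distinguish kernel threads without relying on thread.regs being NULL.
 * (Hypothetical helper, for illustration only.)
 */
static inline bool task_is_kernel_thread(struct task_struct *p)
{
	return p->flags & PF_KTHREAD;
}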