[PATCH v6 3/3] riscv: Check relocations at compile time

2021-05-18 Thread Alexandre Ghiti
Relocating the kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

A script in scripts/ extracts the relocations from vmlinux; it is then run
at postlink to check the relocations.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/riscv/Makefile.postlink | 36 
 arch/riscv/tools/relocs_check.sh | 26 +++
 2 files changed, 62 insertions(+)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh

diff --git a/arch/riscv/Makefile.postlink b/arch/riscv/Makefile.postlink
new file mode 100644
index ..bf2b2bca1845
--- /dev/null
+++ b/arch/riscv/Makefile.postlink
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0
+# ===
+# Post-link riscv pass
+# ===
+#
+# Check that vmlinux relocations look sane
+
+PHONY := __archpost
+__archpost:
+
+-include include/config/auto.conf
+include scripts/Kbuild.include
+
+quiet_cmd_relocs_check = CHKREL  $@
+cmd_relocs_check = \
+   $(CONFIG_SHELL) $(srctree)/arch/riscv/tools/relocs_check.sh 
"$(OBJDUMP)" "$(NM)" "$@"
+
+# `@true` prevents complaint when there is nothing to be done
+
+vmlinux: FORCE
+   @true
+ifdef CONFIG_RELOCATABLE
+   $(call if_changed,relocs_check)
+endif
+
+%.ko: FORCE
+   @true
+
+clean:
+   @true
+
+PHONY += FORCE clean
+
+FORCE:
+
+.PHONY: $(PHONY)
diff --git a/arch/riscv/tools/relocs_check.sh b/arch/riscv/tools/relocs_check.sh
new file mode 100755
index ..baeb2e7b2290
--- /dev/null
+++ b/arch/riscv/tools/relocs_check.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Based on powerpc relocs_check.sh
+
+# This script checks the relocations of a vmlinux for "suspicious"
+# relocations.
+
+if [ $# -lt 3 ]; then
+echo "$0 [path to objdump] [path to nm] [path to vmlinux]" 1>&2
+exit 1
+fi
+
+bad_relocs=$(
+${srctree}/scripts/relocs_check.sh "$@" |
+   # These relocations are okay
+   #   R_RISCV_RELATIVE
+   grep -F -w -v 'R_RISCV_RELATIVE'
+)
+
+if [ -z "$bad_relocs" ]; then
+   exit 0
+fi
+
+num_bad=$(echo "$bad_relocs" | wc -l)
+echo "WARNING: $num_bad bad relocations"
+echo "$bad_relocs"
-- 
2.30.2



[PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Miles Chen
This patchset was created to fix the __pa() warning messages seen when
CONFIG_DEBUG_VIRTUAL=y, by unifying the allocation of pglist_data
instances.

In the current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n,
we use a global variable named "contig_page_data".

If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both the
allocation and the symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
we will get the "virt_to_phys used for non-linear address" warning
when booting.

To fix the warning, always allocate pglist_data by memblock APIs and
remove the usage of contig_page_data.
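
For readers unfamiliar with the check, here is a minimal sketch of why the
symbol case trips CONFIG_DEBUG_VIRTUAL (illustration only, not the arm64
implementation; the helper names below are assumptions):

    /*
     * Sketch: with CONFIG_DEBUG_VIRTUAL, __pa() goes through a checked
     * helper that warns unless its argument is a linear-map address.
     * A pglist_data allocated with memblock_alloc() lives in the linear
     * map and passes; the contig_page_data symbol lives in the kernel
     * image and triggers the warning quoted below.
     */
    phys_addr_t checked_virt_to_phys(unsigned long x)
    {
            WARN(!addr_is_linear_mapped(x),         /* assumed helper */
                 "virt_to_phys used for non-linear address: %px\n",
                 (void *)x);

            return __pa_nodebug(x);                 /* unchecked translation */
    }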

Warning message:
[0.00] [ cut here ]
[0.00] virt_to_phys used for non-linear address: (ptrval) 
(contig_page_data+0x0/0x1c00)
[0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
__virt_to_phys+0x58/0x68
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
5.13.0-rc1-00074-g1140ab592e2e #3
[0.00] Hardware name: linux,dummy-virt (DT)
[0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
[0.00] pc : __virt_to_phys+0x58/0x68
[0.00] lr : __virt_to_phys+0x54/0x68
[0.00] sp : 800011833e70
[0.00] x29: 800011833e70 x28: 418a0018 x27: 
[0.00] x26: 000a x25: 800011b7 x24: 800011b7
[0.00] x23: fc0001c0 x22: 800011b7 x21: 47b0
[0.00] x20: 0008 x19: 800011b082c0 x18: 
[0.00] x17:  x16: 800011833bf9 x15: 0004
[0.00] x14: 0fff x13: 80001186a548 x12: 
[0.00] x11:  x10:  x9 : 
[0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 800011b62ef8
[0.00] x5 :  x4 : 0001 x3 : 
[0.00] x2 :  x1 : 80001159585e x0 : 0058
[0.00] Call trace:
[0.00]  __virt_to_phys+0x58/0x68
[0.00]  check_usemap_section_nr+0x50/0xfc
[0.00]  sparse_init_nid+0x1ac/0x28c
[0.00]  sparse_init+0x1c4/0x1e0
[0.00]  bootmem_init+0x60/0x90
[0.00]  setup_arch+0x184/0x1f0
[0.00]  start_kernel+0x78/0x488
[0.00] ---[ end trace f68728a0d3053b60 ]---

[1] https://lore.kernel.org/patchwork/patch/1425110/

Change since v1:
- use memblock_alloc() to create pglist_data when CONFIG_NUMA=n

Miles Chen (2):
  mm: introduce prepare_node_data
  mm: replace contig_page_data with node_data

 Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
 arch/powerpc/kexec/core.c  |  5 -
 include/linux/gfp.h|  3 ---
 include/linux/mm.h |  2 ++
 include/linux/mmzone.h |  4 ++--
 kernel/crash_core.c|  1 -
 mm/memblock.c  |  3 +--
 mm/page_alloc.c| 16 
 mm/sparse.c|  2 ++
 9 files changed, 23 insertions(+), 26 deletions(-)


base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f
-- 
2.18.0



[PATCH v2 2/2] mm: replace contig_page_data with node_data

2021-05-18 Thread Miles Chen
Replace contig_page_data with node_data. Change the definition
of NODE_DATA(nid) from (&contig_page_data) to (node_data[0]).

Remove contig_page_data from the tree.

Cc: Mike Rapoport 
Cc: Baoquan He 
Cc: Kazu 
Signed-off-by: Miles Chen 
---
 Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
 arch/powerpc/kexec/core.c  |  5 -
 include/linux/gfp.h|  3 ---
 include/linux/mmzone.h |  3 +--
 kernel/crash_core.c|  1 -
 mm/memblock.c  |  2 --
 6 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst 
b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 3861a25faae1..74185245c580 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -81,14 +81,6 @@ into that mem_map array.
 
 Used to map an address to the corresponding struct page.
 
-contig_page_data
-
-
-Makedumpfile gets the pglist_data structure from this symbol, which is
-used to describe the memory layout.
-
-User-space tools use this to exclude free pages when dumping memory.
-
 mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
 --
 
@@ -531,11 +523,6 @@ node_data|(node_data, MAX_NUMNODES)
 
 See above.
 
-contig_page_data
-
-
-See above.
-
 vmemmap_list
 
 
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 56da5eb2b923..41f31dfb540c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -68,13 +68,8 @@ void machine_kexec_cleanup(struct kimage *image)
 void arch_crash_save_vmcoreinfo(void)
 {
 
-#ifdef CONFIG_NEED_MULTIPLE_NODES
VMCOREINFO_SYMBOL(node_data);
VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
-#endif
-#ifndef CONFIG_NEED_MULTIPLE_NODES
-   VMCOREINFO_SYMBOL(contig_page_data);
-#endif
 #if defined(CONFIG_PPC64) && defined(CONFIG_SPARSEMEM_VMEMMAP)
VMCOREINFO_SYMBOL(vmemmap_list);
VMCOREINFO_SYMBOL(mmu_vmemmap_psize);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 11da8af06704..ba8c511c402f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -493,9 +493,6 @@ static inline int gfp_zonelist(gfp_t flags)
  * This zone list contains a maximum of MAX_NUMNODES*MAX_NR_ZONES zones.
  * There are two zonelists per node, one for all zones with memory and
  * one containing just zones from the node the zonelist belongs to.
- *
- * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
- * optimized to &contig_page_data at compile-time.
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 557918dcc755..c0769292187c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1043,9 +1043,8 @@ extern char numa_zonelist_order[];
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
-extern struct pglist_data contig_page_data;
-#define NODE_DATA(nid)         (&contig_page_data)
 extern struct pglist_data *node_data[];
+#define NODE_DATA(nid) (node_data[0])
 #define NODE_MEM_MAP(nid)  mem_map
 
 #else /* CONFIG_NEED_MULTIPLE_NODES */
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 825284baaf46..d1e324be67f9 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -457,7 +457,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
VMCOREINFO_SYMBOL(mem_map);
-   VMCOREINFO_SYMBOL(contig_page_data);
 #endif
 #ifdef CONFIG_SPARSEMEM
VMCOREINFO_SYMBOL_ARRAY(mem_section);
diff --git a/mm/memblock.c b/mm/memblock.c
index ebddb57ea62d..7cfc9a9d6243 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -93,8 +93,6 @@
  */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-struct pglist_data __refdata contig_page_data;
-EXPORT_SYMBOL(contig_page_data);
 struct pglist_data *node_data[MAX_NUMNODES];
 #endif
 
-- 
2.18.0



[PATCH v2 1/2] mm: introduce prepare_node_data

2021-05-18 Thread Miles Chen
When CONFIG_NEED_MULTIPLE_NODES=y (CONFIG_NUMA=y),
the pglist_data is allocated by a memblock API and stored in an array
named node_data[].
When CONFIG_NEED_MULTIPLE_NODES=n (CONFIG_NUMA=n), the pglist_data
is defined as the global variable contig_page_data. The difference
causes problems when we enable CONFIG_DEBUG_VIRTUAL and use __pa()
to get the physical address of NODE_DATA.

To solve the issue, introduce prepare_node_data() to allocate
pglist_data when CONFIG_NUMA=n and store it in node_data[],
i.e., use the same way to allocate node_data[] whether CONFIG_NUMA=y
or CONFIG_NUMA=n.
prepare_node_data() is called in sparse_init() and
free_area_init().

This is the first step to replace contig_page_data with allocated
pglist_data.

Cc: Mike Rapoport 
Cc: Baoquan He 
Cc: Kazu 
Signed-off-by: Miles Chen 
---
 include/linux/mm.h |  2 ++
 include/linux/mmzone.h |  1 +
 mm/memblock.c  |  1 +
 mm/page_alloc.c| 16 
 mm/sparse.c|  2 ++
 5 files changed, 22 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75efcf9..3052eeb87455 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2462,9 +2462,11 @@ static inline int early_pfn_to_nid(unsigned long pfn)
 {
return 0;
 }
+extern void prepare_node_data(void);
 #else
 /* please see mm/page_alloc.c */
 extern int __meminit early_pfn_to_nid(unsigned long pfn);
+static inline void prepare_node_data(void) {};
 #endif
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..557918dcc755 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1045,6 +1045,7 @@ extern char numa_zonelist_order[];
 
 extern struct pglist_data contig_page_data;
 #define NODE_DATA(nid)         (&contig_page_data)
+extern struct pglist_data *node_data[];
 #define NODE_MEM_MAP(nid)  mem_map
 
 #else /* CONFIG_NEED_MULTIPLE_NODES */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..ebddb57ea62d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -95,6 +95,7 @@
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 struct pglist_data __refdata contig_page_data;
 EXPORT_SYMBOL(contig_page_data);
+struct pglist_data *node_data[MAX_NUMNODES];
 #endif
 
 unsigned long max_low_pfn;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aaa1655cf682..0c6d421f4cfb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1659,6 +1659,20 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
 
return nid;
 }
+#else
+void __init prepare_node_data(void)
+{
+   if (node_data[0])
+   return;
+
+   node_data[0] = memblock_alloc(sizeof(struct pglist_data),
+   SMP_CACHE_BYTES);
+
+   if (!node_data[0])
+   panic("Cannot allocate node_data\n");
+
+   memset(node_data[0], 0, sizeof(struct pglist_data));
+}
 #endif /* CONFIG_NEED_MULTIPLE_NODES */
 
 void __init memblock_free_pages(struct page *page, unsigned long pfn,
@@ -7697,6 +7711,8 @@ void __init free_area_init(unsigned long *max_zone_pfn)
int i, nid, zone;
bool descending;
 
+   prepare_node_data();
+
/* Record where the zone boundaries are */
memset(arch_zone_lowest_possible_pfn, 0,
sizeof(arch_zone_lowest_possible_pfn));
diff --git a/mm/sparse.c b/mm/sparse.c
index b2ada9dc00cb..afcfe7463b4a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -580,6 +580,8 @@ void __init sparse_init(void)
 
memblocks_present();
 
+   prepare_node_data();
+
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
 
-- 
2.18.0



[PATCH v6 1/3] riscv: Introduce CONFIG_RELOCATABLE

2021-05-18 Thread Alexandre Ghiti
This config allows compiling the 64b kernel as PIE and relocating it at
any virtual address at runtime: this paves the way to KASLR.
Runtime relocation is possible since the relocation metadata are embedded
into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at, and that the
compiler options are those used by arm64, which uses the same RELA
relocation format.
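
As a rough illustration of what that runtime pass does, the sketch below
shows what applying one R_RISCV_RELATIVE entry amounts to (sketch only; the
offset names follow relocate_kernel() in the patch below, whose real loop
also special-cases vdso symbols):

    #include <linux/elf.h>

    static void __init apply_relative_reloc(Elf64_Rela *rela,
                                            uintptr_t reloc_offset,
                                            uintptr_t va_link_pa_offset)
    {
            /* Turn the linked VA in r_offset into a writable pointer. */
            u64 *where = (u64 *)(rela->r_offset - va_link_pa_offset);

            /* New value = linked VA in the addend + runtime/link gap. */
            if (ELF64_R_TYPE(rela->r_info) == R_RISCV_RELATIVE)
                    *where = rela->r_addend + reloc_offset;
    }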

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/Kconfig  | 12 
 arch/riscv/Makefile |  5 +++-
 arch/riscv/kernel/vmlinux.lds.S |  6 
 arch/riscv/mm/Makefile  |  4 +++
 arch/riscv/mm/init.c| 53 -
 5 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index a8ad8eb76120..7d49c9fa9a91 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -205,6 +205,18 @@ config PGTABLE_LEVELS
 config LOCKDEP_SUPPORT
def_bool y
 
+config RELOCATABLE
+   bool
+   depends on MMU && 64BIT && !XIP_KERNEL
+   help
+  This builds a kernel as a Position Independent Executable (PIE),
+  which retains all relocation metadata required to relocate the
+  kernel binary at runtime to a different virtual address than the
+  address it was linked at.
+  Since RISCV uses the RELA relocation format, this requires a
+  relocation pass at runtime even if the kernel is loaded at the
+  same address it was linked at.
+
 source "arch/riscv/Kconfig.socs"
 source "arch/riscv/Kconfig.erratas"
 
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 3eb9590a0775..2d217ecb6e6b 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -9,7 +9,10 @@
 #
 
 OBJCOPYFLAGS:= -O binary
-LDFLAGS_vmlinux :=
+ifeq ($(CONFIG_RELOCATABLE),y)
+LDFLAGS_vmlinux := -shared -Bsymbolic -z notext -z norelro
+KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
LDFLAGS_vmlinux := --no-relax
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 891742ff75a7..1517fd1c7246 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -133,6 +133,12 @@ SECTIONS
 
BSS_SECTION(PAGE_SIZE, PAGE_SIZE, 0)
 
+   .rela.dyn : ALIGN(8) {
+   __rela_dyn_start = .;
+   *(.rela .rela*)
+   __rela_dyn_end = .;
+   }
+
 #ifdef CONFIG_EFI
. = ALIGN(PECOFF_SECTION_ALIGNMENT);
__pecoff_data_virt_size = ABSOLUTE(. - __pecoff_text_end);
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index 7ebaef10ea1b..2d33ec574bbb 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 CFLAGS_init.o := -mcmodel=medany
+ifdef CONFIG_RELOCATABLE
+CFLAGS_init.o += -fno-pie
+endif
+
 ifdef CONFIG_FTRACE
 CFLAGS_REMOVE_init.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_cacheflush.o = $(CC_FLAGS_FTRACE)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 4faf8bd157ea..5e0a19d9d8fa 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -18,6 +18,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_RELOCATABLE
+#include <linux/elf.h>
+#endif
 
 #include 
 #include 
@@ -99,7 +102,7 @@ static void __init print_vm_layout(void)
print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
  (unsigned long)high_memory);
 #ifdef CONFIG_64BIT
-   print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR,
+   print_mlm("kernel", (unsigned long)kernel_virt_addr,
  (unsigned long)ADDRESS_SPACE_END);
 #endif
 }
@@ -454,6 +457,44 @@ asmlinkage void __init __copy_data(void)
 #error "setup_vm() is called from head.S before relocate so it should not use 
absolute addressing."
 #endif
 
+#ifdef CONFIG_RELOCATABLE
+extern unsigned long __rela_dyn_start, __rela_dyn_end;
+
+void __init relocate_kernel(uintptr_t load_pa)
+{
+   Elf64_Rela *rela = (Elf64_Rela *)&__rela_dyn_start;
+   /*
+* This holds the offset between the linked virtual address and the
+* relocated virtual address.
+*/
+   uintptr_t reloc_offset = kernel_virt_addr - KERNEL_LINK_ADDR;
+   /*
+* This holds the offset between kernel linked virtual address and
+* physical address.
+*/
+   uintptr_t va_kernel_link_pa_offset = KERNEL_LINK_ADDR - load_pa;
+
+   for ( ; rela < (Elf64_Rela *)&__rela_dyn_end; rela++) {
+   Elf64_Addr addr = (rela->r_offset - va_kernel_link_pa_offset);
+   Elf64_Addr relocated_addr = rela->r_addend;
+
+   if (rela->r_info != R_RISCV_RELATIVE)
+   continue;
+
+   /*
+* Make sure to not relocate vdso symbols like rt_sigreturn
+* which are linked from the address 0 in vmlinux since
+* vdso symbol 

[PATCH v6 0/3] Introduce 64b relocatable kernel

2021-05-18 Thread Alexandre Ghiti
After multiple attempts, this patchset is now based on the fact that the
64b kernel mapping was moved outside the linear mapping.

The first patch allows building relocatable kernels, but it is not selected
by default. That patch should ease KASLR implementation a lot.

The second and third patches take advantage of an already existing powerpc
script that checks relocations at compile time, and use it for riscv.

This patchset was tested on:

* kernel:
  - rv32: OK
  - rv64 with RELOCATABLE: OK, and checked that "suspicious" relocations
    are caught.
  - rv64 without RELOCATABLE: OK
  - powerpc: build only, and checked that "suspicious" relocations are
    caught.

* xipkernel:
  - rv32: build only
  - rv64: OK

* nommukernel:
  - rv64: build only

Changes in v6:
  * Remove the kernel move to vmalloc zone
  * Rebased on top of for-next
  * Remove relocatable property from 32b kernel as the kernel is mapped in
    the linear mapping and would then need to be copied physically too
  * CONFIG_RELOCATABLE depends on !XIP_KERNEL
  * Remove Reviewed-by from first patch as it changed a bit

Changes in v5:
  * Add "static __init" to create_kernel_page_table function as reported by
    Kbuild test robot
  * Add reviewed-by from Zong
  * Rebase onto v5.7

Changes in v4:
  * Fix BPF region that overlapped with kernel's as suggested by Zong
  * Fix end of module region that could be larger than 2GB as suggested by
    Zong
  * Fix the size of the vm area reserved for the kernel as we could lose
    PMD_SIZE if the size was already aligned on PMD_SIZE
  * Split compile time relocations check patch into 2 patches as suggested
    by Anup
  * Applied Reviewed-by from Zong and Anup

Changes in v3:
  * Move kernel mapping to vmalloc

Changes in v2:
  * Make RELOCATABLE depend on MMU as suggested by Anup
  * Rename kernel_load_addr into kernel_virt_addr as suggested by Anup
  * Use __pa_symbol instead of __pa, as suggested by Zong
  * Rebased on top of v5.6-rc3
  * Tested with sv48 patchset
  * Add Reviewed/Tested-by from Zong and Anup

Alexandre Ghiti (3):
  riscv: Introduce CONFIG_RELOCATABLE
  powerpc: Move script to check relocations at compile time in scripts/
  riscv: Check relocations at compile time

 arch/powerpc/tools/relocs_check.sh | 18 ++
 arch/riscv/Kconfig | 12 +++
 arch/riscv/Makefile|  5 ++-
 arch/riscv/Makefile.postlink   | 36 
 arch/riscv/kernel/vmlinux.lds.S|  6 
 arch/riscv/mm/Makefile |  4 +++
 arch/riscv/mm/init.c   | 53 +-
 arch/riscv/tools/relocs_check.sh   | 26 +++
 scripts/relocs_check.sh| 20 +++
 9 files changed, 162 insertions(+), 18 deletions(-)
 create mode 100644 arch/riscv/Makefile.postlink
 create mode 100755 arch/riscv/tools/relocs_check.sh
 create mode 100755 scripts/relocs_check.sh

-- 
2.30.2



[PATCH v6 2/3] powerpc: Move script to check relocations at compile time in scripts/

2021-05-18 Thread Alexandre Ghiti
Relocating the kernel at runtime is done very early in the boot process, so
it is not convenient to check for relocations there and react in case a
relocation was not expected.

The powerpc architecture has a script that allows checking at compile time
for such unexpected relocations: extract the common logic to scripts/
so that other architectures can take advantage of it.

Signed-off-by: Alexandre Ghiti 
Reviewed-by: Anup Patel 
---
 arch/powerpc/tools/relocs_check.sh | 18 ++
 scripts/relocs_check.sh| 20 
 2 files changed, 22 insertions(+), 16 deletions(-)
 create mode 100755 scripts/relocs_check.sh

diff --git a/arch/powerpc/tools/relocs_check.sh 
b/arch/powerpc/tools/relocs_check.sh
index 014e00e74d2b..e367895941ae 100755
--- a/arch/powerpc/tools/relocs_check.sh
+++ b/arch/powerpc/tools/relocs_check.sh
@@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
exit 1
 fi
 
-# Have Kbuild supply the path to objdump and nm so we handle cross compilation.
-objdump="$1"
-nm="$2"
-vmlinux="$3"
-
-# Remove from the bad relocations those that match an undefined weak symbol
-# which will result in an absolute relocation to 0.
-# Weak unresolved symbols are of that form in nm output:
-# "  w _binary__btf_vmlinux_bin_end"
-undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
-
 bad_relocs=$(
-$objdump -R "$vmlinux" |
-   # Only look at relocation lines.
-	grep -E '\<R_' |

Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Mike Rapoport
Hello Miles,

On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> This patches is created to fix the __pa() warning messages when
> CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> instances.
> 
> In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n,
> we use a global variable named "contig_page_data".
> 
> If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both
> allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> we will have the "virt_to_phys used for non-linear address" warning
> when booting.
> 
> To fix the warning, always allocate pglist_data by memblock APIs and
> remove the usage of contig_page_data.

Somehow I was sure that we can allocate pglist_data before it is accessed
in sparse_init() somewhere outside mm/sparse.c. It's really not the case,
and having two places that may allocate this structure is surely worse
than your previous suggestion.

Sorry about that.
 
> Warning message:
> [0.00] [ cut here ]
> [0.00] virt_to_phys used for non-linear address: (ptrval) 
> (contig_page_data+0x0/0x1c00)
> [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> __virt_to_phys+0x58/0x68
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> 5.13.0-rc1-00074-g1140ab592e2e #3
> [0.00] Hardware name: linux,dummy-virt (DT)
> [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> [0.00] pc : __virt_to_phys+0x58/0x68
> [0.00] lr : __virt_to_phys+0x54/0x68
> [0.00] sp : 800011833e70
> [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> 
> [0.00] x26: 000a x25: 800011b7 x24: 
> 800011b7
> [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> 47b0
> [0.00] x20: 0008 x19: 800011b082c0 x18: 
> 
> [0.00] x17:  x16: 800011833bf9 x15: 
> 0004
> [0.00] x14: 0fff x13: 80001186a548 x12: 
> 
> [0.00] x11:  x10:  x9 : 
> 
> [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> 800011b62ef8
> [0.00] x5 :  x4 : 0001 x3 : 
> 
> [0.00] x2 :  x1 : 80001159585e x0 : 
> 0058
> [0.00] Call trace:
> [0.00]  __virt_to_phys+0x58/0x68
> [0.00]  check_usemap_section_nr+0x50/0xfc
> [0.00]  sparse_init_nid+0x1ac/0x28c
> [0.00]  sparse_init+0x1c4/0x1e0
> [0.00]  bootmem_init+0x60/0x90
> [0.00]  setup_arch+0x184/0x1f0
> [0.00]  start_kernel+0x78/0x488
> [0.00] ---[ end trace f68728a0d3053b60 ]---
> 
> [1] https://lore.kernel.org/patchwork/patch/1425110/
> 
> Change since v1:
> - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> 
> Miles Chen (2):
>   mm: introduce prepare_node_data
>   mm: replace contig_page_data with node_data
> 
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
>  arch/powerpc/kexec/core.c  |  5 -
>  include/linux/gfp.h|  3 ---
>  include/linux/mm.h |  2 ++
>  include/linux/mmzone.h |  4 ++--
>  kernel/crash_core.c|  1 -
>  mm/memblock.c  |  3 +--
>  mm/page_alloc.c| 16 
>  mm/sparse.c|  2 ++
>  9 files changed, 23 insertions(+), 26 deletions(-)
> 
> 
> base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f
> -- 
> 2.18.0
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-18 Thread Ricardo Neri
On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> > On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > >  include/linux/sched/topology.h |   1 +
> > >  kernel/sched/fair.c| 101 +
> > >  2 files changed, 102 insertions(+)
> > > 
> > > diff --git a/include/linux/sched/topology.h 
> > > b/include/linux/sched/topology.h
> > > index 8f0f778b7c91..43bdb8b1e1df 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> > >  #endif
> > >  
> > >  extern int arch_asym_cpu_priority(int cpu);
> > > +extern bool arch_asym_check_smt_siblings(void);
> > >  
> > >  struct sched_domain_attr {
> > >   int relax_domain_level;
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c8b66a5d593e..3d6cc027e6e6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > >   return -cpu;
> > >  }
> > >  
> > > +/*
> > > + * For asym packing, first check the state of SMT siblings before 
> > > deciding to
> > > + * pull tasks.
> > > + */
> > > +bool __weak arch_asym_check_smt_siblings(void)
> > > +{
> > > + return false;
> > > +}
> > > +
> > >  /*
> > >   * The margin used when comparing utilization with CPU capacity.
> > >   *
> > 
> > > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats 
> > > *sds,  struct sg_lb_stats *sgs
> > >   if (group == sds->local)
> > >   return false;
> > >  
> > > + if (arch_asym_check_smt_siblings())
> > > + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > > +
> > >   return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> > >  }
> > 
> > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > level, rather than some arch special. Wouldn't something like this be
> > more appropriate?
> > 
> > ---
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >  
> >  extern int arch_asym_cpu_priority(int cpu);
> > -extern bool arch_asym_check_smt_siblings(void);
> >  
> >  struct sched_domain_attr {
> > int relax_domain_level;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cp
> >  }
> >  
> >  /*
> > - * For asym packing, first check the state of SMT siblings before deciding 
> > to
> > - * pull tasks.
> > - */
> > -bool __weak arch_asym_check_smt_siblings(void)
> > -{
> > -   return false;
> > -}
> > -
> > -/*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> >   * (default: ~20%)
> > @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd
> > if (group == sds->local)
> > return false;
> >  
> > -   if (arch_asym_check_smt_siblings())
> > +   if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> > +   (group->flags & SD_SHARE_CPUCAPACITY))
> > return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> 
> Thanks Peter for the quick review! This makes sense to me. The only
> reason we proposed arch_asym_check_smt_siblings() is because we were worried
> about breaking powerpc (I need to study how they set priorities for SMT,
> if applicable). If you think this is not an issue I can post a
> v4 with this update.

As far as I can see, priorities in powerpc are set by the CPU number.
However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
[1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
core would need to be busy before a new core is used.

Still, I think the issue described in the cover letter may be
reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
able to help since CPU0 has the highest priority.

I am cc'ing the linuxppc list to get some feedback.

Thanks and BR,
Ricardo


Re: [PATCH v8 27/30] powerpc/kprobes: Don't allow breakpoints on suffixes

2021-05-18 Thread Christophe Leroy




On 06/05/2020 at 05:40, Jordan Niethe wrote:

Do not allow inserting breakpoints on the suffix of a prefix instruction
in kprobes.

Signed-off-by: Jordan Niethe 
---
v8: Add this back from v3
---
  arch/powerpc/kernel/kprobes.c | 13 +
  1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 33d54b091c70..227510df8c55 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -106,7 +106,9 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
unsigned int offset)
  int arch_prepare_kprobe(struct kprobe *p)
  {
int ret = 0;
+   struct kprobe *prev;
struct ppc_inst insn = ppc_inst_read((struct ppc_inst *)p->addr);
+   struct ppc_inst prefix = ppc_inst_read((struct ppc_inst *)(p->addr - 
1));


What if p->addr is the first word of a page and the previous page is not
mapped?

  
  	if ((unsigned long)p->addr & 0x03) {

printk("Attempt to register kprobe at an unaligned address\n");
@@ -114,6 +116,17 @@ int arch_prepare_kprobe(struct kprobe *p)
} else if (IS_MTMSRD(insn) || IS_RFID(insn) || IS_RFI(insn)) {
printk("Cannot register a kprobe on rfi/rfid or mtmsr[d]\n");
ret = -EINVAL;
+   } else if (ppc_inst_prefixed(prefix)) {


If p->addr - 2 contains a valid prefixed instruction, then p->addr - 1
contains the suffix of that prefixed instruction. Are we sure a suffix can
never ever be misinterpreted as the prefix of a prefixed instruction?




+   printk("Cannot register a kprobe on the second word of prefixed 
instruction\n");
+   ret = -EINVAL;
+   }
+   preempt_disable();
+   prev = get_kprobe(p->addr - 1);
+   preempt_enable_no_resched();
+   if (prev &&
+   ppc_inst_prefixed(ppc_inst_read((struct ppc_inst 
*)prev->ainsn.insn))) {
+   printk("Cannot register a kprobe on the second word of prefixed 
instruction\n");
+   ret = -EINVAL;
}
  
  	/* insn must be on a special executable page on ppc64.  This is




Re: [PATCH] powerpc: Kconfig: disable CONFIG_COMPAT for clang < 12

2021-05-18 Thread Nathan Chancellor

On 5/18/2021 1:58 PM, Nick Desaulniers wrote:

Until clang-12, clang would attempt to assemble 32b powerpc assembler in
64b emulation mode when using a 64b target triple with -m32, leading to
errors during the build of the compat VDSO. Simply disable all of
CONFIG_COMPAT; users should upgrade to the latest release of clang for
proper support.

Link: https://github.com/ClangBuiltLinux/linux/issues/1160
Link: 
https://github.com/llvm/llvm-project/commits/2288319733cd5f525bf7e24dece08bfcf9d0ff9e
Link: https://groups.google.com/g/clang-built-linux/c/ayNmi3HoNdY/m/XJAGj_G2AgAJ
Suggested-by: Nathan Chancellor 
Signed-off-by: Nick Desaulniers 


Reviewed-by: Nathan Chancellor 


---
  arch/powerpc/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ce3f59531b51..2a02784b7ef0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -289,6 +289,7 @@ config PANIC_TIMEOUT
  config COMPAT
bool "Enable support for 32bit binaries"
depends on PPC64
+   depends on !CC_IS_CLANG || CLANG_VERSION >= 12
default y if !CPU_LITTLE_ENDIAN
select ARCH_WANT_OLD_COMPAT_IPC
select COMPAT_OLD_SIGACTION





Re: [PATCH v2] powerpc/powernv/pci: fix header guard

2021-05-18 Thread Nathan Chancellor

On 5/18/2021 1:40 PM, Nick Desaulniers wrote:

While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
possible candidate to convert to #ifdef CONFIG_EEH.

It seems that based on Kconfig dependencies it's not possible to build
this file without CONFIG_EEH enabled, but based on upstream discussion,
it's not clear yet that CONFIG_EEH should be enabled by default.

For now, simply fix the -Wundef warning.

Suggested-by: Nathan Chancellor 
Suggested-by: Joe Perches 
Link: https://github.com/ClangBuiltLinux/linux/issues/570
Link: 
https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/
Link: 
https://lore.kernel.org/lkml/CAOSf1CGoN5R0LUrU=Y=uwho1z_9slgcx8s3sbfjxwjxc5by...@mail.gmail.com/
Signed-off-by: Nick Desaulniers 


Makes sense, thanks for the patch!

Reviewed-by: Nathan Chancellor 


---
  arch/powerpc/platforms/powernv/pci.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index b18468dc31ff..6bb3c52633fb 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -711,7 +711,7 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
return PCIBIOS_SUCCESSFUL;
  }
  
-#if CONFIG_EEH

+#ifdef CONFIG_EEH
  static bool pnv_pci_cfg_check(struct pci_dn *pdn)
  {
struct eeh_dev *edev = NULL;





Re: [PATCH v5 3/9] mm/mremap: Use pmd/pud_poplulate to update page table entries

2021-05-18 Thread Nathan Chancellor
Hi Aneesh,

On Thu, Apr 22, 2021 at 11:13:17AM +0530, Aneesh Kumar K.V wrote:
> pmd/pud_populate is the right interface to be used to set the respective
> page table entries. Some architectures like ppc64 do assume that 
> set_pmd/pud_at
> can only be used to set a hugepage PTE. Since we are not setting up a hugepage
> PTE here, use the pmd/pud_populate interface.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/mremap.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index ec8f840399ed..574287f9bb39 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -26,6 +26,7 @@
>  
>  #include 
>  #include 
> +#include <asm/pgalloc.h>
>  
>  #include "internal.h"
>  
> @@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
> unsigned long old_addr,
>   pmd_clear(old_pmd);
>  
>   VM_BUG_ON(!pmd_none(*new_pmd));
> + pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
>  
> - /* Set the new pmd */
> - set_pmd_at(mm, new_addr, new_pmd, pmd);
>   flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
>   if (new_ptl != old_ptl)
>   spin_unlock(new_ptl);
> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
> unsigned long old_addr,
>  
>   VM_BUG_ON(!pud_none(*new_pud));
>  
> - /* Set the new pud */
> - set_pud_at(mm, new_addr, new_pud, pud);
> + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
>   flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
>   if (new_ptl != old_ptl)
>   spin_unlock(new_ptl);
> -- 
> 2.30.2
> 
> 

This commit causes my WSL2 VM to close when compiling something memory
intensive, such as an x86_64_defconfig + CONFIG_LTO_CLANG_FULL=y kernel
or LLVM/Clang. Unfortunately, I do not have much further information to
provide since I do not see any sort of splat in dmesg right before it
closes and I have found zero information about getting the previous
kernel message in WSL2 (custom init so no systemd or anything).

The config file is the stock one from Microsoft:

https://github.com/microsoft/WSL2-Linux-Kernel/blob/a571dc8cedc8e0e56487c0dc93243e0b5db8960a/Microsoft/config-wsl

I have attached my .config anyways, which includes CONFIG_DEBUG_VM,
which does not appear to show anything out of the ordinary. I have also
attached a dmesg just in case anything sticks out. I am happy to provide
any additional information or perform additional debugging steps as
needed.

Cheers,
Nathan

$ git bisect log
# bad: [cd557f1c605fc5a2c0eb0b540610f50dc67dd849] Add linux-next specific files 
for 20210514
# good: [315d99318179b9cd5077ccc9f7f26a164c9fa998] Merge tag 'pm-5.13-rc2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect start 'cd557f1c605fc5a2c0eb0b540610f50dc67dd849' 
'315d99318179b9cd5077ccc9f7f26a164c9fa998'
# good: [9634d7cb3c506ae886a5136d12b4af696b9cee8a] Merge remote-tracking branch 
'drm-misc/for-linux-next'
git bisect good 9634d7cb3c506ae886a5136d12b4af696b9cee8a
# good: [294636a24ae819a7caf0807d05d8eb5b964ec06f] Merge remote-tracking branch 
'rcu/rcu/next'
git bisect good 294636a24ae819a7caf0807d05d8eb5b964ec06f
# good: [cb753d0611f912439c8e814f4254d15fa8fa1d75] Merge remote-tracking branch 
'gpio-brgl/gpio/for-next'
git bisect good cb753d0611f912439c8e814f4254d15fa8fa1d75
# bad: [b1e7389449084b74a044a70860c6a1c7466781cb] lib/string_helpers: switch to 
use BIT() macro
git bisect bad b1e7389449084b74a044a70860c6a1c7466781cb
# bad: [bf5570ed0654a21000e5dad9243ea1ba30bfe208] kasan: use 
dump_stack_lvl(KERN_ERR) to print stacks
git bisect bad bf5570ed0654a21000e5dad9243ea1ba30bfe208
# good: [4a292ff7a819404039588c7a9af272aca22c869e] fixup! mm: gup: pack 
has_pinned in MMF_HAS_PINNED
git bisect good 4a292ff7a819404039588c7a9af272aca22c869e
# good: [5ed68c90c7fb884c3c493d5529aca79dcf125848] mm: memcontrol: move 
obj_cgroup_uncharge_pages() out of css_set_lock
git bisect good 5ed68c90c7fb884c3c493d5529aca79dcf125848
# good: [f96ae2c1e63b71134e216e9940df3f2793a9a4b1] mm/memory.c: fix comment of 
finish_mkwrite_fault()
git bisect good f96ae2c1e63b71134e216e9940df3f2793a9a4b1
# bad: [5b0a28a7f9f5fdc2fe5a5e2cce7ea17b98e5eaeb] mm/mremap: use range flush 
that does TLB and page walk cache flush
git bisect bad 5b0a28a7f9f5fdc2fe5a5e2cce7ea17b98e5eaeb
# bad: [dbee97d1f49a2f2f1f5c26bf15151cc998572e89] mm/mremap: use 
pmd/pud_poplulate to update page table entries
git bisect bad dbee97d1f49a2f2f1f5c26bf15151cc998572e89
# good: [c4c8a76d96a7d38d1ec8732e3f852418d18a7424] selftest/mremap_test: avoid 
crash with static build
git bisect good c4c8a76d96a7d38d1ec8732e3f852418d18a7424
# first bad commit: [dbee97d1f49a2f2f1f5c26bf15151cc998572e89] mm/mremap: use 
pmd/pud_poplulate to update page table entries 

Re: [PATCH v8 27/30] powerpc/kprobes: Don't allow breakpoints on suffixes

2021-05-18 Thread Gabriel Paubert
On Tue, May 18, 2021 at 08:43:39PM +0200, Christophe Leroy wrote:
> 
> 
> On 06/05/2020 at 05:40, Jordan Niethe wrote:
> > Do not allow inserting breakpoints on the suffix of a prefix instruction
> > in kprobes.
> > 
> > Signed-off-by: Jordan Niethe 
> > ---
> > v8: Add this back from v3
> > ---
> >   arch/powerpc/kernel/kprobes.c | 13 +
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> > index 33d54b091c70..227510df8c55 100644
> > --- a/arch/powerpc/kernel/kprobes.c
> > +++ b/arch/powerpc/kernel/kprobes.c
> > @@ -106,7 +106,9 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
> > unsigned int offset)
> >   int arch_prepare_kprobe(struct kprobe *p)
> >   {
> > int ret = 0;
> > +   struct kprobe *prev;
> > struct ppc_inst insn = ppc_inst_read((struct ppc_inst *)p->addr);
> > +   struct ppc_inst prefix = ppc_inst_read((struct ppc_inst *)(p->addr - 
> > 1));
> 
> What if p->addr is the first word of a page and the previous page is not
> mapped?

IIRC prefixed instructions can't straddle 64 byte boundaries (or was it
128 bytes?), much less page boundaries.

> 
> > if ((unsigned long)p->addr & 0x03) {
> > printk("Attempt to register kprobe at an unaligned address\n");
> > @@ -114,6 +116,17 @@ int arch_prepare_kprobe(struct kprobe *p)
> > } else if (IS_MTMSRD(insn) || IS_RFID(insn) || IS_RFI(insn)) {
> > printk("Cannot register a kprobe on rfi/rfid or mtmsr[d]\n");
> > ret = -EINVAL;
> > +   } else if (ppc_inst_prefixed(prefix)) {
> 
> If p->addr - 2 contains a valid prefixed instruction, then p->addr - 1
> contains the suffix of that prefixed instruction. Are we sure a suffix can
> never ever be misinterpreted as the prefix of a prefixed instruction?
> 

Prefixes are easy to decode, the 6 MSB are 0b000001 (from memory).

After some digging on the 'net: "All prefixes have the major opcode 1. A
prefix will never be a valid word instruction. A suffix may be an
existing word instruction or a new instruction."

IOW, detecting prefixes is trivial. It's not x86...
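
A minimal sketch of that decode, under the encoding described above (the
helper name is mine, not the kernel's ppc_inst_prefixed()):

    #include <linux/types.h>

    /*
     * Sketch only: every prefix carries primary opcode 1 (the six most
     * significant bits of the instruction word), and that opcode is
     * never a valid word instruction by itself.
     */
    static inline bool insn_word_is_prefix(u32 word)
    {
            return (word >> 26) == 0x1;
    }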

Gabriel

> 
> > +   printk("Cannot register a kprobe on the second word of prefixed 
> > instruction\n");
> > +   ret = -EINVAL;
> > +   }
> > +   preempt_disable();
> > +   prev = get_kprobe(p->addr - 1);
> > +   preempt_enable_no_resched();
> > +   if (prev &&
> > +   ppc_inst_prefixed(ppc_inst_read((struct ppc_inst 
> > *)prev->ainsn.insn))) {
> > +   printk("Cannot register a kprobe on the second word of prefixed 
> > instruction\n");
> > +   ret = -EINVAL;
> > }
> > /* insn must be on a special executable page on ppc64.  This is
> > 




Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Miles Chen
On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote:
> Hello Miles,
> 
> On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> > This patches is created to fix the __pa() warning messages when
> > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> > instances.
> > 
> > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> > pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n,
> > we use a global variable named "contig_page_data".
> > 
> > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both
> > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> > we will have the "virt_to_phys used for non-linear address" warning
> > when booting.
> > 
> > To fix the warning, always allocate pglist_data by memblock APIs and
> > remove the usage of contig_page_data.
> 
> Somehow I was sure that we can allocate pglist_data before it is accessed
> in sparse_init() somewhere outside mm/sparse.c. It's really not the case
> and having two places that may allocated this structure is surely worth
> than your previous suggestion.
> 
> Sorry about that.

Do you mean to call the allocation function somewhere in arch/*, after
paging_init() (so we can access pglist_data) and before sparse_init()
and free_area_init()?

Miles

>  
> > Warning message:
> > [0.00] [ cut here ]
> > [0.00] virt_to_phys used for non-linear address: (ptrval) 
> > (contig_page_data+0x0/0x1c00)
> > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> > __virt_to_phys+0x58/0x68
> > [0.00] Modules linked in:
> > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> > 5.13.0-rc1-00074-g1140ab592e2e #3
> > [0.00] Hardware name: linux,dummy-virt (DT)
> > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> > [0.00] pc : __virt_to_phys+0x58/0x68
> > [0.00] lr : __virt_to_phys+0x54/0x68
> > [0.00] sp : 800011833e70
> > [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> > 
> > [0.00] x26: 000a x25: 800011b7 x24: 
> > 800011b7
> > [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> > 47b0
> > [0.00] x20: 0008 x19: 800011b082c0 x18: 
> > 
> > [0.00] x17:  x16: 800011833bf9 x15: 
> > 0004
> > [0.00] x14: 0fff x13: 80001186a548 x12: 
> > 
> > [0.00] x11:  x10:  x9 : 
> > 
> > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> > 800011b62ef8
> > [0.00] x5 :  x4 : 0001 x3 : 
> > 
> > [0.00] x2 :  x1 : 80001159585e x0 : 
> > 0058
> > [0.00] Call trace:
> > [0.00]  __virt_to_phys+0x58/0x68
> > [0.00]  check_usemap_section_nr+0x50/0xfc
> > [0.00]  sparse_init_nid+0x1ac/0x28c
> > [0.00]  sparse_init+0x1c4/0x1e0
> > [0.00]  bootmem_init+0x60/0x90
> > [0.00]  setup_arch+0x184/0x1f0
> > [0.00]  start_kernel+0x78/0x488
> > [0.00] ---[ end trace f68728a0d3053b60 ]---
> > 
> > [1] 
> > https://urldefense.com/v3/__https://lore.kernel.org/patchwork/patch/1425110/__;!!CTRNKA9wMg0ARbw!x-wGFEC1wLzXho2kI1CrC2fjXNaQm5f-n0ADQyJDckCOKZHAP_q055DCSWYcQ7Zdcw$
> >  
> > 
> > Change since v1:
> > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> > 
> > Miles Chen (2):
> >   mm: introduce prepare_node_data
> >   mm: replace contig_page_data with node_data
> > 
> >  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
> >  arch/powerpc/kexec/core.c  |  5 -
> >  include/linux/gfp.h|  3 ---
> >  include/linux/mm.h |  2 ++
> >  include/linux/mmzone.h |  4 ++--
> >  kernel/crash_core.c|  1 -
> >  mm/memblock.c  |  3 +--
> >  mm/page_alloc.c| 16 
> >  mm/sparse.c|  2 ++
> >  9 files changed, 23 insertions(+), 26 deletions(-)
> > 
> > 
> > base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f
> > -- 
> > 2.18.0
> > 
> 



Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument

2021-05-18 Thread Guenter Roeck

On 5/18/21 5:26 PM, Michael Ellerman wrote:
[ ... ]

That was the generic header change in the patch. I was commenting about the
ppc64 specific change causing build failures.


Ah, sorry. I wasn't aware that the following is valid C code

void f1()
{
  return f2();
  ^^
}

as long as f2() is void as well. Confusing, but we live and learn.


It might be valid, but it's still bad IMHO.

It's confusing to readers, and serves no useful purpose.



Agreed, but it is surprisingly widespread. Try to run the coccinelle
script below, just for fun. The script doesn't even catch instances
in include files, yet there are more than 450 hits.

Guenter

---
virtual report

@d@
identifier f;
expression e;
position p;
@@

void f(...)
{
<...
  return e@p;
...>
}

@script:python depends on report@
f << d.f;
p << d.p;
@@

print "void function %s:%s() with non-void return in line %s" % (p[0].file, f, 
p[0].line)


Re: Linux powerpc new system call instruction and ABI

2021-05-18 Thread Dmitry V. Levin
Hi,

On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote:
[...]
> - Error handling: The consensus among kernel, glibc, and musl is to move to
>   using negative return values in r3 rather than CR0[SO]=1 to indicate error,
>   which matches most other architectures, and is closer to a function call.

Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
This includes syscall_get_error() and all its users including
PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
when scv is used.

See also https://bugzilla.redhat.com/1929836
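
For illustration, the kind of dual-ABI accessor being asked for might look
like the sketch below (not the eventual fix; trap_is_scv() and the CR0[SO]
mask are assumptions based on the ABI description above):

    #include <linux/err.h>
    #include <asm/ptrace.h>

    static inline long syscall_get_error(struct task_struct *task,
                                         struct pt_regs *regs)
    {
            /* scv ABI: a negative value in r3 means error. */
            if (trap_is_scv(regs))
                    return IS_ERR_VALUE(regs->gpr[3]) ? regs->gpr[3] : 0;

            /* sc ABI: error iff CR0[SO] is set; r3 holds a positive errno. */
            return (regs->ccr & 0x10000000) ? -regs->gpr[3] : 0;
    }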


-- 
ldv


Re: [PATCH v6 2/3] powerpc: Move script to check relocations at compile time in scripts/

2021-05-18 Thread Michael Ellerman
Alexandre Ghiti  writes:
> Relocating kernel at runtime is done very early in the boot process, so
> it is not convenient to check for relocations there and react in case a
> relocation was not expected.
>
> Powerpc architecture has a script that allows to check at compile time
> for such unexpected relocations: extract the common logic to scripts/
> so that other architectures can take advantage of it.
>
> Signed-off-by: Alexandre Ghiti 
> Reviewed-by: Anup Patel 
> ---
>  arch/powerpc/tools/relocs_check.sh | 18 ++
>  scripts/relocs_check.sh| 20 
>  2 files changed, 22 insertions(+), 16 deletions(-)
>  create mode 100755 scripts/relocs_check.sh

I'm not sure that script is really big/complicated enough to warrant
sharing vs just copying, but I don't mind either.

Acked-by: Michael Ellerman  (powerpc)

cheers

> diff --git a/arch/powerpc/tools/relocs_check.sh 
> b/arch/powerpc/tools/relocs_check.sh
> index 014e00e74d2b..e367895941ae 100755
> --- a/arch/powerpc/tools/relocs_check.sh
> +++ b/arch/powerpc/tools/relocs_check.sh
> @@ -15,21 +15,8 @@ if [ $# -lt 3 ]; then
>   exit 1
>  fi
>  
> -# Have Kbuild supply the path to objdump and nm so we handle cross 
> compilation.
> -objdump="$1"
> -nm="$2"
> -vmlinux="$3"
> -
> -# Remove from the bad relocations those that match an undefined weak symbol
> -# which will result in an absolute relocation to 0.
> -# Weak unresolved symbols are of that form in nm output:
> -# "  w _binary__btf_vmlinux_bin_end"
> -undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
> -
>  bad_relocs=$(
> -$objdump -R "$vmlinux" |
> - # Only look at relocation lines.
> -	grep -E '\<R_' |
> +${srctree}/scripts/relocs_check.sh "$@" |
>   # These relocations are okay
>   # On PPC64:
>   #   R_PPC64_RELATIVE, R_PPC64_NONE
> @@ -43,8 +30,7 @@ R_PPC_ADDR16_LO
>  R_PPC_ADDR16_HI
>  R_PPC_ADDR16_HA
>  R_PPC_RELATIVE
> -R_PPC_NONE' |
> - ([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" || 
> cat)
> +R_PPC_NONE'
>  )
>  
>  if [ -z "$bad_relocs" ]; then
> diff --git a/scripts/relocs_check.sh b/scripts/relocs_check.sh
> new file mode 100755
> index ..137c660499f3
> --- /dev/null
> +++ b/scripts/relocs_check.sh
> @@ -0,0 +1,20 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +
> +# Get a list of all the relocations, remove from it the relocations
> +# that are known to be legitimate and return this list to arch specific
> +# script that will look for suspicious relocations.
> +
> +objdump="$1"
> +nm="$2"
> +vmlinux="$3"
> +
> +# Remove from the possible bad relocations those that match an undefined
> +# weak symbol which will result in an absolute relocation to 0.
> +# Weak unresolved symbols are of that form in nm output:
> +# "  w _binary__btf_vmlinux_bin_end"
> +undef_weak_symbols=$($nm "$vmlinux" | awk '$1 ~ /w/ { print $2 }')
> +
> +$objdump -R "$vmlinux" |
> +	grep -E '\<R_' |
> +	([ "$undef_weak_symbols" ] && grep -F -w -v "$undef_weak_symbols" ||
> cat)
> -- 
> 2.30.2
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


[PATCH v2] powerpc/powernv/pci: fix header guard

2021-05-18 Thread Nick Desaulniers
While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
possible candidate to convert to #ifdef CONFIG_EEH.

It seems that based on Kconfig dependencies it's not possible to build
this file without CONFIG_EEH enabled, but based on upstream discussion,
it's not clear yet that CONFIG_EEH should be enabled by default.

For now, simply fix the -Wundef warning.
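
A two-line illustration of the difference (mine, not from the patch):

    /* With -Wundef, an undefined macro used in #if is diagnosed: */
    #if CONFIG_EEH      /* warning: "CONFIG_EEH" is not defined [-Wundef] */
    #endif

    /* #ifdef only tests whether the macro is defined, so it is silent
     * whether or not CONFIG_EEH is set: */
    #ifdef CONFIG_EEH
    #endif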

Suggested-by: Nathan Chancellor 
Suggested-by: Joe Perches 
Link: https://github.com/ClangBuiltLinux/linux/issues/570
Link: 
https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/
Link: 
https://lore.kernel.org/lkml/CAOSf1CGoN5R0LUrU=Y=uwho1z_9slgcx8s3sbfjxwjxc5by...@mail.gmail.com/
Signed-off-by: Nick Desaulniers 
---
 arch/powerpc/platforms/powernv/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index b18468dc31ff..6bb3c52633fb 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -711,7 +711,7 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
return PCIBIOS_SUCCESSFUL;
 }
 
-#if CONFIG_EEH
+#ifdef CONFIG_EEH
 static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
struct eeh_dev *edev = NULL;
-- 
2.31.1.751.gd2f1c929bd-goog



Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument

2021-05-18 Thread Segher Boessenkool
On Wed, May 19, 2021 at 10:26:22AM +1000, Michael Ellerman wrote:
> Guenter Roeck  writes:
> > Ah, sorry. I wasn't aware that the following is valid C code
> >
> > void f1()
> > {
> >  return f2();
> >  ^^
> > }
> >
> > as long as f2() is void as well. Confusing, but we live and learn.
> 
> It might be valid, but it's still bad IMHO.
> 
> It's confusing to readers, and serves no useful purpose.

And it actually is explicitly undefined behaviour in C90 already
(3.6.6.4 in C90, 6.8.6.4 in C99 and later).


Segher


[PATCH] powerpc: Kconfig: disable CONFIG_COMPAT for clang < 12

2021-05-18 Thread Nick Desaulniers
Until clang-12, clang would attempt to assemble 32b powerpc assembler in
64b emulation mode when using a 64b target triple with -m32, leading to
errors during the build of the compat VDSO. Simply disable all of
CONFIG_COMPAT; users should upgrade to the latest release of clang for
proper support.

Link: https://github.com/ClangBuiltLinux/linux/issues/1160
Link: 
https://github.com/llvm/llvm-project/commits/2288319733cd5f525bf7e24dece08bfcf9d0ff9e
Link: https://groups.google.com/g/clang-built-linux/c/ayNmi3HoNdY/m/XJAGj_G2AgAJ
Suggested-by: Nathan Chancellor 
Signed-off-by: Nick Desaulniers 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ce3f59531b51..2a02784b7ef0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -289,6 +289,7 @@ config PANIC_TIMEOUT
 config COMPAT
bool "Enable support for 32bit binaries"
depends on PPC64
+   depends on !CC_IS_CLANG || CLANG_VERSION >= 12
default y if !CPU_LITTLE_ENDIAN
select ARCH_WANT_OLD_COMPAT_IPC
select COMPAT_OLD_SIGACTION
-- 
2.31.1.751.gd2f1c929bd-goog



Re: [PATCH v5 5/9] powerpc/mm/book3s64: Update tlb flush routines to take a page walk cache flush argument

2021-05-18 Thread Michael Ellerman
Guenter Roeck  writes:
> On 5/17/21 6:55 AM, Aneesh Kumar K.V wrote:
>> Guenter Roeck  writes:
>> 
>>> On 5/17/21 1:40 AM, Aneesh Kumar K.V wrote:
 On 5/15/21 10:05 PM, Guenter Roeck wrote:
> On Thu, Apr 22, 2021 at 11:13:19AM +0530, Aneesh Kumar K.V wrote:
>> 
>> ...
>> 
>    extern void radix__local_flush_all_mm(struct mm_struct *mm);
>> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
>> b/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> index 215973b4cb26..f9f8a3a264f7 100644
>> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
>> @@ -45,13 +45,30 @@ static inline void tlbiel_all_lpid(bool radix)
>>    hash__tlbiel_all(TLB_INVAL_SCOPE_LPID);
>>    }
>> +static inline void flush_pmd_tlb_pwc_range(struct vm_area_struct *vma,
>    
>> +   unsigned long start,
>> +   unsigned long end,
>> +   bool flush_pwc)
>> +{
>> +    if (radix_enabled())
>> +    return radix__flush_pmd_tlb_range(vma, start, end, flush_pwc);
>> +    return hash__flush_tlb_range(vma, start, end);
>   ^^
>
>> +}

 In this specific case we won't have  build errors because,

 static inline void hash__flush_tlb_range(struct vm_area_struct *vma,
    unsigned long start, unsigned long end)
 {

>>>
>>> Sorry, you completely lost me.
>>>
>>> Building parisc:allnoconfig ... failed
>>> --
>>> Error log:
>>> In file included from arch/parisc/include/asm/cacheflush.h:7,
>>>from include/linux/highmem.h:12,
>>>from include/linux/pagemap.h:11,
>>>from include/linux/ksm.h:13,
>>>from mm/mremap.c:14:
>>> mm/mremap.c: In function 'flush_pte_tlb_pwc_range':
>>> arch/parisc/include/asm/tlbflush.h:20:2: error: 'return' with a value, in 
>>> function returning void
>> 
>> As replied here
>> https://lore.kernel.org/mm-commits/8eedb441-a612-1ec8-8bf7-b40184de9...@linux.ibm.com/
>> 
>> That was the generic header change in the patch. I was commenting about the
>> ppc64 specific change causing build failures.
>
> Ah, sorry. I wasn't aware that the following is valid C code
>
> void f1()
> {
>  return f2();
>  ^^
> }
>
> as long as f2() is void as well. Confusing, but we live and learn.

It might be valid, but it's still bad IMHO.

It's confusing to readers, and serves no useful purpose.

cheers


[Bug 213069] kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:147! Oops: Exception in kernel mode, sig: 5 [#1]

2021-05-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=213069

Michael Ellerman (mich...@ellerman.id.au) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |CODE_FIX

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Miles Chen
On Wed, 2021-05-19 at 06:48 +0300, Mike Rapoport wrote:
> On Wed, May 19, 2021 at 08:12:06AM +0800, Miles Chen wrote:
> > On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote:
> > > Hello Miles,
> > > 
> > > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> > > > This patches is created to fix the __pa() warning messages when
> > > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> > > > instances.
> > > > 
> > > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> > > > pglist_data is allocated by a memblock API. If 
> > > > CONFIG_NEED_MULTIPLE_NODES=n,
> > > > we use a global variable named "contig_page_data".
> > > > 
> > > > If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
> > > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> > > > we will have the "virt_to_phys used for non-linear address" warning
> > > > when booting.
> > > > 
> > > > To fix the warning, always allocate pglist_data by memblock APIs and
> > > > remove the usage of contig_page_data.
> > > 
> > > Somehow I was sure that we can allocate pglist_data before it is accessed
> > > in sparse_init() somewhere outside mm/sparse.c. It's really not the case
> > > and having two places that may allocate this structure is surely worse
> > > than your previous suggestion.
> > > 
> > > Sorry about that.
> > 
> > Do you mean that we should call the allocation function in arch/*, somewhere after
> > paging_init() (so we can access pglist_data) and before sparse_init()
> > and free_area_init()?
> 
> No, I meant that your original patch is better than adding allocation of
> NODE_DATA(0) in two places.

Got it. Will you re-review the original patch?


>  
> > Miles
> > 
> > >  
> > > > Warning message:
> > > > [0.00] [ cut here ]
> > > > [0.00] virt_to_phys used for non-linear address: 
> > > > (ptrval) (contig_page_data+0x0/0x1c00)
> > > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> > > > __virt_to_phys+0x58/0x68
> > > > [0.00] Modules linked in:
> > > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> > > > 5.13.0-rc1-00074-g1140ab592e2e #3
> > > > [0.00] Hardware name: linux,dummy-virt (DT)
> > > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> > > > [0.00] pc : __virt_to_phys+0x58/0x68
> > > > [0.00] lr : __virt_to_phys+0x54/0x68
> > > > [0.00] sp : 800011833e70
> > > > [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> > > > 
> > > > [0.00] x26: 000a x25: 800011b7 x24: 
> > > > 800011b7
> > > > [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> > > > 47b0
> > > > [0.00] x20: 0008 x19: 800011b082c0 x18: 
> > > > 
> > > > [0.00] x17:  x16: 800011833bf9 x15: 
> > > > 0004
> > > > [0.00] x14: 0fff x13: 80001186a548 x12: 
> > > > 
> > > > [0.00] x11:  x10:  x9 : 
> > > > 
> > > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> > > > 800011b62ef8
> > > > [0.00] x5 :  x4 : 0001 x3 : 
> > > > 
> > > > [0.00] x2 :  x1 : 80001159585e x0 : 
> > > > 0058
> > > > [0.00] Call trace:
> > > > [0.00]  __virt_to_phys+0x58/0x68
> > > > [0.00]  check_usemap_section_nr+0x50/0xfc
> > > > [0.00]  sparse_init_nid+0x1ac/0x28c
> > > > [0.00]  sparse_init+0x1c4/0x1e0
> > > > [0.00]  bootmem_init+0x60/0x90
> > > > [0.00]  setup_arch+0x184/0x1f0
> > > > [0.00]  start_kernel+0x78/0x488
> > > > [0.00] ---[ end trace f68728a0d3053b60 ]---
> > > > 
> > > > [1] 
> > > > https://lore.kernel.org/patchwork/patch/1425110/
> > > > 
> > > > Change since v1:
> > > > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> > > > 
> > > > Miles Chen (2):
> > > >   mm: introduce prepare_node_data
> > > >   mm: replace contig_page_data with node_data
> > > > 
> > > >  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
> > > >  arch/powerpc/kexec/core.c  |  5 -
> > > >  include/linux/gfp.h|  3 ---
> > > >  include/linux/mm.h |  2 ++
> > > >  include/linux/mmzone.h |  4 ++--
> > > >  kernel/crash_core.c|  1 -
> > > >  mm/memblock.c  |  3 +--
> > > >  mm/page_alloc.c| 16 
> > > >  mm/sparse.c|  2 ++
> > > >  9 files changed, 23 insertions(+), 26 deletions(-)

[PATCH net-next] ibmveth: fix kobj_to_dev.cocci warnings

2021-05-18 Thread YueHaibing
Use kobj_to_dev() instead of container_of()

Generated by: scripts/coccinelle/api/kobj_to_dev.cocci

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/ibm/ibmveth.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index 7fea9ae60f13..bc67a7ee872b 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1799,8 +1799,7 @@ static ssize_t veth_pool_store(struct kobject *kobj, 
struct attribute *attr,
struct ibmveth_buff_pool *pool = container_of(kobj,
  struct ibmveth_buff_pool,
  kobj);
-   struct net_device *netdev = dev_get_drvdata(
-   container_of(kobj->parent, struct device, kobj));
+   struct net_device *netdev = dev_get_drvdata(kobj_to_dev(kobj->parent));
struct ibmveth_adapter *adapter = netdev_priv(netdev);
long value = simple_strtol(buf, NULL, 10);
long rc;
-- 
2.17.1
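
For context, kobj_to_dev() is simply the typed form of the container_of()
pattern being replaced; a minimal sketch of the equivalence (simplified from
the kernel's own definition):

static inline struct device *kobj_to_dev(struct kobject *kobj)
{
	return container_of(kobj, struct device, kobj);
}

So the conversion is behavior-preserving; it just replaces the open-coded
container_of() with the named helper.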



Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Mike Rapoport
On Wed, May 19, 2021 at 08:12:06AM +0800, Miles Chen wrote:
> On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote:
> > Hello Miles,
> > 
> > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> > > This patch series was created to fix the __pa() warning messages when
> > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> > > instances.
> > > 
> > > In the current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> > > pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n,
> > > we use a global variable named "contig_page_data".
> > > 
> > > If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
> > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> > > we will have the "virt_to_phys used for non-linear address" warning
> > > when booting.
> > > 
> > > To fix the warning, always allocate pglist_data by memblock APIs and
> > > remove the usage of contig_page_data.
> > 
> > Somehow I was sure that we can allocate pglist_data before it is accessed
> > in sparse_init() somewhere outside mm/sparse.c. It's really not the case
> > and having two places that may allocate this structure is surely worse
> > than your previous suggestion.
> > 
> > Sorry about that.
> 
> Do you mean that we should call the allocation function in arch/*, somewhere after
> paging_init() (so we can access pglist_data) and before sparse_init()
> and free_area_init()?

No, I meant that your original patch is better than adding allocation of
NODE_DATA(0) in two places.
 
> Miles
> 
> >  
> > > Warning message:
> > > [0.00] [ cut here ]
> > > [0.00] virt_to_phys used for non-linear address: (ptrval) 
> > > (contig_page_data+0x0/0x1c00)
> > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> > > __virt_to_phys+0x58/0x68
> > > [0.00] Modules linked in:
> > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> > > 5.13.0-rc1-00074-g1140ab592e2e #3
> > > [0.00] Hardware name: linux,dummy-virt (DT)
> > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> > > [0.00] pc : __virt_to_phys+0x58/0x68
> > > [0.00] lr : __virt_to_phys+0x54/0x68
> > > [0.00] sp : 800011833e70
> > > [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> > > 
> > > [0.00] x26: 000a x25: 800011b7 x24: 
> > > 800011b7
> > > [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> > > 47b0
> > > [0.00] x20: 0008 x19: 800011b082c0 x18: 
> > > 
> > > [0.00] x17:  x16: 800011833bf9 x15: 
> > > 0004
> > > [0.00] x14: 0fff x13: 80001186a548 x12: 
> > > 
> > > [0.00] x11:  x10:  x9 : 
> > > 
> > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> > > 800011b62ef8
> > > [0.00] x5 :  x4 : 0001 x3 : 
> > > 
> > > [0.00] x2 :  x1 : 80001159585e x0 : 
> > > 0058
> > > [0.00] Call trace:
> > > [0.00]  __virt_to_phys+0x58/0x68
> > > [0.00]  check_usemap_section_nr+0x50/0xfc
> > > [0.00]  sparse_init_nid+0x1ac/0x28c
> > > [0.00]  sparse_init+0x1c4/0x1e0
> > > [0.00]  bootmem_init+0x60/0x90
> > > [0.00]  setup_arch+0x184/0x1f0
> > > [0.00]  start_kernel+0x78/0x488
> > > [0.00] ---[ end trace f68728a0d3053b60 ]---
> > > 
> > > [1] 
> > > https://lore.kernel.org/patchwork/patch/1425110/
> > > 
> > > Change since v1:
> > > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> > > 
> > > Miles Chen (2):
> > >   mm: introduce prepare_node_data
> > >   mm: replace contig_page_data with node_data
> > > 
> > >  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
> > >  arch/powerpc/kexec/core.c  |  5 -
> > >  include/linux/gfp.h|  3 ---
> > >  include/linux/mm.h |  2 ++
> > >  include/linux/mmzone.h |  4 ++--
> > >  kernel/crash_core.c|  1 -
> > >  mm/memblock.c  |  3 +--
> > >  mm/page_alloc.c| 16 
> > >  mm/sparse.c|  2 ++
> > >  9 files changed, 23 insertions(+), 26 deletions(-)
> > > 
> > > 
> > > base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f
> > > -- 
> > > 2.18.0
> > > 
> > 
> 

-- 
Sincerely yours,
Mike.


Re: Linux powerpc new system call instruction and ABI

2021-05-18 Thread Nicholas Piggin
Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am:
> Hi,
> 
> On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote:
> [...]
>> - Error handling: The consensus among kernel, glibc, and musl is to move to
>>   using negative return values in r3 rather than CR0[SO]=1 to indicate error,
>>   which matches most other architectures, and is closer to a function call.
> 
> Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
> incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
> arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
> This includes syscall_get_error() and all its users including
> PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
> when scv is used.
> 
> See also https://bugzilla.redhat.com/1929836

I see, thanks. Using latest strace from github.com, the attached kernel
patch makes strace -k check results a lot greener.

Some of the remaining failing tests look like this (I didn't look at all
of them yet):

signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, 
0xfacefeeddeadbeef) = 0 (SIG_DFL)
) = 50
signal(SIGUSR1, SIG_IGN)= 0xfacefeeddeadbeef
write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown 
errno 559038737) = 41
write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737
) = 26
exit_group(1)   = ?

I think the problem is glibc testing for -ve, but it should be comparing
against -4095 (+cc Matheus)

  #define RET_SCV \
  cmpdi r3,0; \
  bgelr+; \
  neg r3,r3;
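
For reference, a minimal user-space sketch of the comparison glibc needs (a
hypothetical helper, not glibc code): only r3 values in [-4095, -1] encode
-errno; other negative values are legitimate results.

#include <errno.h>

static long scv_ret_decode(long r3)
{
	/* Error window: r3 in [-4095, -1], i.e. unsigned >= (unsigned)-4095. */
	if ((unsigned long)r3 >= (unsigned long)-4095L) {
		errno = (int)-r3;
		return -1;
	}
	return r3;	/* a valid result, even if it looks "negative" */
}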

With this patch, I think the ptrace ABI should mostly be fixed. I think 
a problem remains with applications that look at system call return 
registers directly and have powerpc specific error cases. Those probably
will just need to be updated unfortunately. Michael thought it might be
possible to return an indication via ptrace somehow that the syscall is
using a new ABI, so such apps can be updated to test for it. I don't 
know how that would be done.

Thanks,
Nick

--
diff --git a/arch/powerpc/include/asm/ptrace.h 
b/arch/powerpc/include/asm/ptrace.h
index 9c9ab2746168..b476a685f066 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -19,6 +19,7 @@
 #ifndef _ASM_POWERPC_PTRACE_H
 #define _ASM_POWERPC_PTRACE_H
 
+#include <linux/err.h>
 #include <uapi/asm/ptrace.h>
 #include <asm/asm-const.h>
 
@@ -152,25 +153,6 @@ extern unsigned long profile_pc(struct pt_regs *regs);
 long do_syscall_trace_enter(struct pt_regs *regs);
 void do_syscall_trace_leave(struct pt_regs *regs);
 
-#define kernel_stack_pointer(regs) ((regs)->gpr[1])
-static inline int is_syscall_success(struct pt_regs *regs)
-{
-   return !(regs->ccr & 0x1000);
-}
-
-static inline long regs_return_value(struct pt_regs *regs)
-{
-   if (is_syscall_success(regs))
-   return regs->gpr[3];
-   else
-   return -regs->gpr[3];
-}
-
-static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
-{
-   regs->gpr[3] = rc;
-}
-
 #ifdef __powerpc64__
 #define user_mode(regs) ((((regs)->msr) >> MSR_PR_LG) & 0x1)
 #else
@@ -235,6 +217,31 @@ static __always_inline void set_trap_norestart(struct 
pt_regs *regs)
regs->trap |= 0x1;
 }
 
+#define kernel_stack_pointer(regs) ((regs)->gpr[1])
+static inline int is_syscall_success(struct pt_regs *regs)
+{
+   if (trap_is_scv(regs))
+   return !IS_ERR_VALUE((unsigned long)regs->gpr[3]);
+   else
+   return !(regs->ccr & 0x1000);
+}
+
+static inline long regs_return_value(struct pt_regs *regs)
+{
+   if (trap_is_scv(regs))
+   return regs->gpr[3];
+
+   if (is_syscall_success(regs))
+   return regs->gpr[3];
+   else
+   return -regs->gpr[3];
+}
+
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->gpr[3] = rc;
+}
+
 #define arch_has_single_step() (1)
 #define arch_has_block_step()  (true)
 #define ARCH_HAS_USER_SINGLE_STEP_REPORT
diff --git a/arch/powerpc/include/asm/syscall.h 
b/arch/powerpc/include/asm/syscall.h
index fd1b518eed17..e8b40149bf7e 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -41,11 +41,20 @@ static inline void syscall_rollback(struct task_struct 
*task,
 static inline long syscall_get_error(struct task_struct *task,
 struct pt_regs *regs)
 {
-   /*
-* If the system call failed,
-* regs->gpr[3] contains a positive ERRORCODE.
-*/
-   return (regs->ccr & 0x1000UL) ? -regs->gpr[3] : 0;
+   if (trap_is_scv(regs)) {
+   unsigned long error = regs->gpr[3];
+
+   if (task_is_32bit(task))
+   error = (long)(int)error;
+
+   return IS_ERR_VALUE(error) ? error : 0;
+   } else {
+   /*
+* If the system call failed,
+* regs->gpr[3] contains a positive ERRORCODE.
+*/
+   return (regs->ccr & 0x1000UL) ? -regs->gpr[3] : 0;
+   }
 }

Re: [PATCH v5 3/9] mm/mremap: Use pmd/pud_poplulate to update page table entries

2021-05-18 Thread Aneesh Kumar K.V
Nathan Chancellor  writes:

> Hi Aneesh,
>
> On Thu, Apr 22, 2021 at 11:13:17AM +0530, Aneesh Kumar K.V wrote:
>> pmd/pud_populate is the right interface to be used to set the respective
>> page table entries. Some architectures like ppc64 do assume that 
>> set_pmd/pud_at
>> can only be used to set a hugepage PTE. Since we are not setting up a 
>> hugepage
>> PTE here, use the pmd/pud_populate interface.
>> 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  mm/mremap.c | 7 +++
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index ec8f840399ed..574287f9bb39 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -26,6 +26,7 @@
>>  
>>  #include <asm/cacheflush.h>
>>  #include <asm/tlb.h>
>> +#include <asm/pgalloc.h>
>>  
>>  #include "internal.h"
>>  
>> @@ -257,9 +258,8 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
>> unsigned long old_addr,
>>  pmd_clear(old_pmd);
>>  
>>  VM_BUG_ON(!pmd_none(*new_pmd));
>> +pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
>>  
>> -/* Set the new pmd */
>> -set_pmd_at(mm, new_addr, new_pmd, pmd);
>>  flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
>>  if (new_ptl != old_ptl)
>>  spin_unlock(new_ptl);
>> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, 
>> unsigned long old_addr,
>>  
>>  VM_BUG_ON(!pud_none(*new_pud));
>>  
>> -/* Set the new pud */
>> -set_pud_at(mm, new_addr, new_pud, pud);
>> +pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
>>  flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
>>  if (new_ptl != old_ptl)
>>  spin_unlock(new_ptl);
>> -- 
>> 2.30.2
>> 
>> 
>
> This commit causes my WSL2 VM to close when compiling something memory
> intensive, such as an x86_64_defconfig + CONFIG_LTO_CLANG_FULL=y kernel
> or LLVM/Clang. Unfortunately, I do not have much further information to
> provide since I do not see any sort of splat in dmesg right before it
> closes and I have found zero information about getting the previous
> kernel message in WSL2 (custom init so no systemd or anything).
>
> The config file is the stock one from Microsoft:
>
> https://github.com/microsoft/WSL2-Linux-Kernel/blob/a571dc8cedc8e0e56487c0dc93243e0b5db8960a/Microsoft/config-wsl
>
> I have attached my .config anyways, which includes CONFIG_DEBUG_VM,
> which does not appear to show anything out of the ordinary. I have also
> attached a dmesg just in case anything sticks out. I am happy to provide
> any additional information or perform additional debugging steps as
> needed.
>

Can you try this change?

modified   mm/mremap.c
@@ -279,7 +279,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
pmd_clear(old_pmd);
 
VM_BUG_ON(!pmd_none(*new_pmd));
-   pmd_populate(mm, new_pmd, (pgtable_t)pmd_page_vaddr(pmd));
+   pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 
if (new_ptl != old_ptl)
spin_unlock(new_ptl);



Re: Linux powerpc new system call instruction and ABI

2021-05-18 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of May 19, 2021 12:50 pm:
> Excerpts from Dmitry V. Levin's message of May 19, 2021 9:13 am:
>> Hi,
>> 
>> On Thu, Jun 11, 2020 at 06:12:01PM +1000, Nicholas Piggin wrote:
>> [...]
>>> - Error handling: The consensus among kernel, glibc, and musl is to move to
>>>   using negative return values in r3 rather than CR0[SO]=1 to indicate 
>>> error,
>>>   which matches most other architectures, and is closer to a function call.
>> 
>> Apparently, the patchset merged by commit v5.9-rc1~100^2~164 was
>> incomplete: all functions defined in arch/powerpc/include/asm/ptrace.h and
>> arch/powerpc/include/asm/syscall.h that use ccr are broken when scv is used.
>> This includes syscall_get_error() and all its users including
>> PTRACE_GET_SYSCALL_INFO API, which in turn makes strace unusable
>> when scv is used.
>> 
>> See also https://bugzilla.redhat.com/1929836
> 
> I see, thanks. Using latest strace from github.com, the attached kernel
> patch makes strace -k check results a lot greener.
> 
> Some of the remaining failing tests look like this (I didn't look at all
> of them yet):
> 
> signal(SIGUSR1, 0xfacefeeddeadbeef) = 0 (SIG_DFL)
> write(1, "signal(SIGUSR1, 0xfacefeeddeadbe"..., 50signal(SIGUSR1, 
> 0xfacefeeddeadbeef) = 0 (SIG_DFL)
> ) = 50
> signal(SIGUSR1, SIG_IGN)= 0xfacefeeddeadbeef
> write(2, "errno2name.c:461: unknown errno "..., 41errno2name.c:461: unknown 
> errno 559038737) = 41
> write(2, ": Unknown error 559038737\n", 26: Unknown error 559038737
> ) = 26
> exit_group(1)   = ?
> 
> I think the problem is glibc testing for -ve, but it should be comparing
> against -4095 (+cc Matheus)
> 
>   #define RET_SCV \
>   cmpdi r3,0; \
>   bgelr+; \
>   neg r3,r3;

This glibc patch at least gets that signal test working. Haven't run the 
full suite yet because of trouble making it work with a local glibc
install...

Thanks,
Nick

---

diff --git a/sysdeps/powerpc/powerpc64/sysdep.h 
b/sysdeps/powerpc/powerpc64/sysdep.h
index c57bb1c05d..1ea4c3b917 100644
--- a/sysdeps/powerpc/powerpc64/sysdep.h
+++ b/sysdeps/powerpc/powerpc64/sysdep.h
@@ -398,8 +398,9 @@ LT_LABELSUFFIX(name,_name_end): ; \
 #endif
 
 #define RET_SCV \
-cmpdi r3,0; \
-bgelr+; \
+li r9,-4095; \
+cmpld r3,r9; \
+bltlr+; \
 neg r3,r3;
 
 #define RET_SC \


Re: [PATCH net-next] ibmveth: fix kobj_to_dev.cocci warnings

2021-05-18 Thread Lijun Pan



> On May 18, 2021, at 9:28 PM, YueHaibing  wrote:
> 
> Use kobj_to_dev() instead of container_of()
> 
> Generated by: scripts/coccinelle/api/kobj_to_dev.cocci
> 
> Signed-off-by: YueHaibing 
> ---

Acked-by: Lijun Pan 


> drivers/net/ethernet/ibm/ibmveth.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
> b/drivers/net/ethernet/ibm/ibmveth.c
> index 7fea9ae60f13..bc67a7ee872b 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1799,8 +1799,7 @@ static ssize_t veth_pool_store(struct kobject *kobj, 
> struct attribute *attr,
>   struct ibmveth_buff_pool *pool = container_of(kobj,
> struct ibmveth_buff_pool,
> kobj);
> - struct net_device *netdev = dev_get_drvdata(
> - container_of(kobj->parent, struct device, kobj));
> + struct net_device *netdev = dev_get_drvdata(kobj_to_dev(kobj->parent));
>   struct ibmveth_adapter *adapter = netdev_priv(netdev);
>   long value = simple_strtol(buf, NULL, 10);
>   long rc;
> -- 
> 2.17.1
> 



[PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-18 Thread Claire Chang
Add a new function, swiotlb_init_io_tlb_mem, for the io_tlb_mem struct
initialization to make the code reusable.

Note that we now also call set_memory_decrypted in swiotlb_init_with_tbl.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 51 ++--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8ca7d505d61c..d3232fc19385 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -168,9 +168,30 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
+   unsigned long nslabs, bool late_alloc)
 {
+   void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+
+   mem->nslabs = nslabs;
+   mem->start = start;
+   mem->end = mem->start + bytes;
+   mem->index = 0;
+   mem->late_alloc = late_alloc;
+   spin_lock_init(&mem->lock);
+   for (i = 0; i < mem->nslabs; i++) {
+   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
+   mem->slots[i].alloc_size = 0;
+   }
+
+   set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
+   memset(vaddr, 0, bytes);
+}
+
+int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+{
struct io_tlb_mem *mem;
size_t alloc_size;
 
@@ -186,16 +207,8 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
if (!mem)
panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
  __func__, alloc_size, PAGE_SIZE);
-   mem->nslabs = nslabs;
-   mem->start = __pa(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
+
+   swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
 
io_tlb_default_mem = mem;
if (verbose)
@@ -282,7 +295,6 @@ swiotlb_late_init_with_default_size(size_t default_size)
 int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
struct io_tlb_mem *mem;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
@@ -297,20 +309,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem)
return -ENOMEM;
 
-   mem->nslabs = nslabs;
-   mem->start = virt_to_phys(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   mem->late_alloc = 1;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
-
-   set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
-   memset(tlb, 0, bytes);
+   swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
 
io_tlb_default_mem = mem;
swiotlb_print_info();
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 00/15] Restricted DMA

2021-05-18 Thread Claire Chang
This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.

For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
full chain of exploits; [2], [3]).

To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).

[1a] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
[4] https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
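
As a hedged sketch of the intended use (the driver and function names here
are assumptions for illustration; the real plumbing is in the of/dt patches
at the end of this series): once a device has been attached to a
restricted-dma-pool reserved-memory region, ordinary streaming DMA is
bounced through that region, so the device never sees an address outside it.

#include <linux/dma-mapping.h>

static int wifi_tx_example(struct device *dev, void *buf, size_t len)
{
	/* Bounced through the device's restricted pool by swiotlb. */
	dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, handle))
		return -ENOMEM;
	/* ... start the transfer and wait for completion ... */
	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	return 0;
}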

v7:
Fix debugfs, PageHighMem and comment style in rmem_swiotlb_device_init

v6:
Address the comments in v5
https://lore.kernel.org/patchwork/cover/1423201/

v5:
Rebase on latest linux-next
https://lore.kernel.org/patchwork/cover/1416899/

v4:
- Fix spinlock bad magic
- Use rmem->name for debugfs entry
- Address the comments in v3
https://lore.kernel.org/patchwork/cover/1378113/

v3:
Using only one reserved memory region for both streaming DMA and memory
allocation.
https://lore.kernel.org/patchwork/cover/1360992/

v2:
Building on top of swiotlb.
https://lore.kernel.org/patchwork/cover/1280705/

v1:
Using dma_map_ops.
https://lore.kernel.org/patchwork/cover/1271660/

Claire Chang (15):
  swiotlb: Refactor swiotlb init functions
  swiotlb: Refactor swiotlb_create_debugfs
  swiotlb: Add DMA_RESTRICTED_POOL
  swiotlb: Add restricted DMA pool initialization
  swiotlb: Add a new get_io_tlb_mem getter
  swiotlb: Update is_swiotlb_buffer to add a struct device argument
  swiotlb: Update is_swiotlb_active to add a struct device argument
  swiotlb: Bounce data from/to restricted DMA pool if available
  swiotlb: Move alloc_size to find_slots
  swiotlb: Refactor swiotlb_tbl_unmap_single
  dma-direct: Add a new wrapper __dma_direct_free_pages()
  swiotlb: Add restricted DMA alloc/free support.
  dma-direct: Allocate memory from restricted DMA pool if available
  dt-bindings: of: Add restricted DMA pool
  of: Add plumbing for restricted DMA pool

 .../reserved-memory/reserved-memory.txt   |  27 ++
 drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c |   2 +-
 drivers/iommu/dma-iommu.c |  12 +-
 drivers/of/address.c  |  25 ++
 drivers/of/device.c   |   3 +
 drivers/of/of_private.h   |   5 +
 drivers/pci/xen-pcifront.c|   2 +-
 drivers/xen/swiotlb-xen.c |   2 +-
 include/linux/device.h|   4 +
 include/linux/swiotlb.h   |  41 ++-
 kernel/dma/Kconfig|  14 +
 kernel/dma/direct.c   |  63 +++--
 kernel/dma/direct.h   |   9 +-
 kernel/dma/swiotlb.c  | 242 +-
 15 files changed, 356 insertions(+), 97 deletions(-)

-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 09/15] swiotlb: Move alloc_size to find_slots

2021-05-18 Thread Claire Chang
Move the maintenance of alloc_size to find_slots for better code
reusability later.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 95f482c4408c..2ec6711071de 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -482,8 +482,11 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
return -1;
 
 found:
-   for (i = index; i < index + nslots; i++)
+   for (i = index; i < index + nslots; i++) {
mem->slots[i].list = 0;
+   mem->slots[i].alloc_size =
+   alloc_size - ((i - index) << IO_TLB_SHIFT);
+   }
for (i = index - 1;
 io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 mem->slots[i].list; i--)
@@ -538,11 +541,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
 * This is needed when we sync the memory.  Then we sync the buffer if
 * needed.
 */
-   for (i = 0; i < nr_slots(alloc_size + offset); i++) {
+   for (i = 0; i < nr_slots(alloc_size + offset); i++)
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
-   mem->slots[index + i].alloc_size =
-   alloc_size - (i << IO_TLB_SHIFT);
-   }
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 08/15] swiotlb: Bounce data from/to restricted DMA pool if available

2021-05-18 Thread Claire Chang
Regardless of swiotlb setting, the restricted DMA pool is preferred if
available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Note that is_dev_swiotlb_force doesn't check if
swiotlb_force == SWIOTLB_FORCE. Otherwise the memory allocation behavior
with the default swiotlb would be changed by the following patch
("dma-direct: Allocate memory from restricted DMA pool if available").

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 13 +
 kernel/dma/direct.c |  3 ++-
 kernel/dma/direct.h |  3 ++-
 kernel/dma/swiotlb.c|  8 
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c530c976d18b..0c5a18d9cf89 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -120,6 +120,15 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
return mem && paddr >= mem->start && paddr < mem->end;
 }
 
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev->dma_io_tlb_mem)
+   return true;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+   return false;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -131,6 +140,10 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 {
return false;
 }
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+   return false;
+}
 static inline void swiotlb_exit(void)
 {
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a88c34d0867..078f7087e466 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -496,7 +496,8 @@ size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
if (is_swiotlb_active(dev) &&
-   (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
+   (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE ||
+is_dev_swiotlb_force(dev)))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
 }
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 13e9e7158d94..f94813674e23 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -87,7 +87,8 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
+   if (unlikely(swiotlb_force == SWIOTLB_FORCE) ||
+   is_dev_swiotlb_force(dev))
return swiotlb_map(dev, phys, size, dir, attrs);
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 68e7633f11fe..95f482c4408c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -347,7 +347,7 @@ void __init swiotlb_exit(void)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t 
size,
   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -429,7 +429,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, 
unsigned int index)
 static int find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -506,7 +506,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int i;
int index;
@@ -557,7 +557,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
  size_t mapping_size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
unsigned long flags;
unsigned int offset = 

[PATCH v7 07/15] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-05-18 Thread Claire Chang
Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for the restricted DMA pool.

Signed-off-by: Claire Chang 
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
 drivers/pci/xen-pcifront.c   | 2 +-
 include/linux/swiotlb.h  | 4 ++--
 kernel/dma/direct.c  | 2 +-
 kernel/dma/swiotlb.c | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
 
max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
 
max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
 #endif
 
 ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
 spin_unlock(&pcifront_dev_lock);
 
-   if (!err && !is_swiotlb_active()) {
+   if (!err && !is_swiotlb_active(NULL)) {
err = pci_xen_swiotlb_init_late();
if (err)
 dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
 }
 
-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1d8eb4de0d01..68e7633f11fe 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -662,9 +662,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
 }
 
-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
 {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 10/15] swiotlb: Refactor swiotlb_tbl_unmap_single

2021-05-18 Thread Claire Chang
Add a new function, release_slots, to make the code reusable for supporting
different bounce buffer pools, e.g. the restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 2ec6711071de..cef856d23194 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -550,27 +550,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
- size_t mapping_size, enum dma_data_direction dir,
- unsigned long attrs)
+static void release_slots(struct device *dev, phys_addr_t tlb_addr)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long flags;
-   unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
+   unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
int nslots = nr_slots(mem->slots[index].alloc_size + offset);
int count, i;
 
-   /*
-* First, sync the memory before unmapping the entry
-*/
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
-
/*
 * Return the buffer to the free list by setting the corresponding
 * entries to indicate the number of contiguous entries available.
@@ -605,6 +593,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
 spin_unlock_irqrestore(&mem->lock, flags);
 }
 
+/*
+ * tlb_addr is the physical address of the bounce buffer to unmap.
+ */
+void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
+ size_t mapping_size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+   /*
+* First, sync the memory before unmapping the entry
+*/
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
+   swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+
+   release_slots(dev, tlb_addr);
+}
+
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
 {
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 11/15] dma-direct: Add a new wrapper __dma_direct_free_pages()

2021-05-18 Thread Claire Chang
Add a new wrapper __dma_direct_free_pages() that will be useful later
for swiotlb_free().

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 078f7087e466..eb4098323bbc 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -75,6 +75,12 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
 }
 
+static void __dma_direct_free_pages(struct device *dev, struct page *page,
+   size_t size)
+{
+   dma_free_contiguous(dev, page, size);
+}
+
 static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
gfp_t gfp)
 {
@@ -237,7 +243,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
return NULL;
}
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -273,7 +279,7 @@ void dma_direct_free(struct device *dev, size_t size,
else if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
arch_dma_clear_uncached(cpu_addr, size);
 
-   dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size);
+   __dma_direct_free_pages(dev, dma_direct_to_page(dev, dma_addr), size);
 }
 
 struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
@@ -310,7 +316,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
return page;
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -329,7 +335,7 @@ void dma_direct_free_pages(struct device *dev, size_t size,
if (force_dma_unencrypted(dev))
set_memory_encrypted((unsigned long)vaddr, 1 << page_order);
 
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
 }
 
 #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 12/15] swiotlb: Add restricted DMA alloc/free support.

2021-05-18 Thread Claire Chang
Add the functions swiotlb_{alloc,free} to support memory allocation
from the restricted DMA pool.

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h |  4 
 kernel/dma/swiotlb.c| 35 +--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0c5a18d9cf89..e8cf49bd90c5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -134,6 +134,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size);
+bool swiotlb_free(struct device *dev, struct page *page, size_t size);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index cef856d23194..d3fa4669229b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -457,8 +457,9 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
 
index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
do {
-   if ((slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
-   (orig_addr & iotlb_align_mask)) {
+   if (orig_addr &&
+   (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
+   (orig_addr & iotlb_align_mask)) {
index = wrap_index(mem, index + 1);
continue;
}
@@ -704,6 +705,36 @@ late_initcall(swiotlb_create_default_debugfs);
 #endif
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size)
+{
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   phys_addr_t tlb_addr;
+   int index;
+
+   if (!mem)
+   return NULL;
+
+   index = find_slots(dev, 0, size);
+   if (index == -1)
+   return NULL;
+
+   tlb_addr = slot_addr(mem->start, index);
+
+   return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+bool swiotlb_free(struct device *dev, struct page *page, size_t size)
+{
+   phys_addr_t tlb_addr = page_to_phys(page);
+
+   if (!is_swiotlb_buffer(dev, tlb_addr))
+   return false;
+
+   release_slots(dev, tlb_addr);
+
+   return true;
+}
+
 static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
struct device *dev)
 {
-- 
2.31.1.751.gd2f1c929bd-goog
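
A hedged sketch of how a caller is expected to pair these helpers (modelled
on the dma-direct hook-up later in this series; the wrapper names here are
illustrative only, not part of the patch):

#include <linux/swiotlb.h>
#include <linux/dma-map-ops.h>

static struct page *alloc_pages_maybe_restricted(struct device *dev,
						 size_t size)
{
	/* NULL means no pool attached or no free slots: fall back. */
	struct page *page = swiotlb_alloc(dev, size);

	return page ? page : dma_alloc_contiguous(dev, size, GFP_KERNEL);
}

static void free_pages_maybe_restricted(struct device *dev,
					struct page *page, size_t size)
{
	/* swiotlb_free() returns false if the page is not from the pool. */
	if (!swiotlb_free(dev, page, size))
		dma_free_contiguous(dev, page, size);
}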



Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH

2021-05-18 Thread Michael Ellerman
Nick Desaulniers  writes:
> On Thu, Apr 22, 2021 at 6:13 PM Oliver O'Halloran  wrote:
>>
>> On Fri, Apr 23, 2021 at 9:09 AM Daniel Axtens  wrote:
>> >
>> > Hi Nick,
>> >
>> > > While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
>> > > possible candidate to convert to #ifdef CONFIG_EEH, but it seems that
>> > > based on Kconfig dependencies it's not possible to build this file
>> > > without CONFIG_EEH enabled.
>> >
>> > This seemed odd to me, but I think you're right:
>> >
>> > arch/powerpc/platforms/Kconfig contains:
>> >
>> > config EEH
>> > bool
>> > depends on (PPC_POWERNV || PPC_PSERIES) && PCI
>> > default y
>> >
>> > It's not configurable from e.g. make menuconfig because there's no prompt.
>> > You can attempt to explicitly disable it with e.g. `scripts/config -d EEH`
>> > but then something like `make oldconfig` will silently re-enable it for
>> > you.
>> >
>> > It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet
>> > CONFIG_EEH for pSeries platform") in 2012 which fixed it for
>> > pseries. That moved out from pseries to pseries + powernv later on.
>> >
>> > There are other cleanups in the same vein that could be made, from the
>> > Makefile (which has files only built with CONFIG_EEH) through to other
>> > source files. It looks like there's one `#ifdef CONFIG_EEH` in
>> > arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for
>> > example.
>> >
>> > I think it's probably worth trying to rip out all of those in one patch?
>>
>> The change in commit e49f7a9997c6 ("powerpc/pseries: Rivet CONFIG_EEH
>> for pSeries platform") never should have been made.
>
> I'll change my patch to keep the conditionals, but use #ifdef instead
> of #if then?

Yeah, please.

I'm not sure I agree with oohal that untangling pseries/powernv from
EEH is something we should do, but let's kick that can down the road by
just fixing up the ifdef.

cheers


Re: [FSL P50x0] KVM HV doesn't work anymore

2021-05-18 Thread Christian Zigotzky



> On 17. May 2021, at 11:43, Christian Zigotzky  wrote:
> 
> On 17 May 2021 at 09:42am, Nicholas Piggin wrote:
>> Excerpts from Christian Zigotzky's message of May 15, 2021 11:46 pm:
>>> On 15 May 2021 at 12:08pm Christophe Leroy wrote:
 
> On 15/05/2021 at 11:48, Christian Zigotzky wrote:
>> Hi All,
>> 
>> I bisected today [1] and the bisecting itself was OK but the
>> reverting of the bad commit doesn't solve the issue. Do you have an
>> idea which commit could be responsible for this issue? Maybe the
>> bisecting wasn't successful. I will look in the kernel git log. Maybe
>> there is a commit that affected KVM HV on FSL P50x0 machines.
> If the uImage doesn't load, it may be because of the size of uImage.
> 
> See https://github.com/linuxppc/issues/issues/208
> 
> Is there a significant size difference with and without KVM HV ?
> 
> Maybe you can try to remove another option to reduce the size of the
> uImage.
>>> I tried it but it doesn't solve the issue. The uImage works without KVM
>>> HV in a virtual e5500 QEMU machine.
>> Any more progress with this? I would say that bisect might have just
>> been a bit unstable and maybe by chance some things did not crash so
>> it's pointing to the wrong patch.
>> 
>> Upstream merge of powerpc-5.13-1 was good and powerpc-5.13-2 was bad?
>> 
>> In between, it looks like some KVM MMU rework. You could try the patch
>> before this one b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU
>> notifier callbacks"). That won't revert cleanly so just try run the
>> tree at that point. If it works, test the patch and see if it fails.
>> 
>> Thanks,
>> Nick
> Hi Nick,
> 
> Thanks a lot for your answer. Yes, there is a little bit of progress. The RC2 
> of kernel 5.13 successfully boots with -smp 3 in a virtual e5500 QEMU machine.
> -smp 4 doesn't work anymore since the PowerPC updates 5.13-2. I used -smp 4 
> before 5.13 because my FSL P5040 machine has 4 cores.
> 
> Could you please post a patch for reverting the commit before b1c5356e873c 
> ("KVM: PPC: Convert to the gfn-based MMU notifier callbacks")?
> 
> Thanks in advance,
> 
> Christian
> 
> 
For me it is OK to work with -smp 1, 2, and 3, but I am curious why -smp 4 doesn't work.

-Christian


[PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-18 Thread Claire Chang
Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes.

Signed-off-by: Claire Chang 
---
 include/linux/device.h  |  4 +++
 include/linux/swiotlb.h |  3 +-
 kernel/dma/swiotlb.c| 76 +
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 38a2071cf776..4987608ea4ff 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,6 +416,7 @@ struct dev_links_info {
  * @dma_pools: Dma pools (if dma'ble device).
  * @dma_mem:   Internal for coherent mem override.
  * @cma_area:  Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
  * @archdata:  For arch-specific additions.
  * @of_node:   Associated device tree node.
  * @fwnode:Associated device node supplied by platform firmware.
@@ -521,6 +522,9 @@ struct device {
 #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   struct io_tlb_mem *dma_io_tlb_mem;
 #endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..03ad6e3b4056 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force;
  * range check to see if the memory was in fact allocated by this
  * API.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
- * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @end. For default swiotlb, this is command line adjustable via
+ * setup_io_tlb_npages.
  * @used:  The number of used IO TLB block.
  * @list:  The free list describing the number of free entries available
  * from each index.
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b849b01a446f..1d8eb4de0d01 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -39,6 +39,13 @@
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+#include <linux/io.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/of_reserved_mem.h>
+#include <linux/slab.h>
+#endif
 
 #include <asm/io.h>
 #include <asm/dma.h>
@@ -690,3 +697,72 @@ static int __init swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 
 #endif
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
+
+   if (dev->dma_io_tlb_mem)
+   return 0;
+
+   /*
+* Since multiple devices can share the same pool, the private data,
+* io_tlb_mem struct, will be initialized by the first device attached
+* to it.
+*/
+   if (!mem) {
+   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
+   kfree(mem);
+   return -EINVAL;
+   }
+
+   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
+
+   rmem->priv = mem;
+
+   if (IS_ENABLED(CONFIG_DEBUG_FS))
+   swiotlb_create_debugfs(mem, rmem->name);
+   }
+
+   dev->dma_io_tlb_mem = mem;
+
+   return 0;
+}
+
+static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   if (dev)
+   dev->dma_io_tlb_mem = NULL;
+}
+
+static const struct reserved_mem_ops rmem_swiotlb_ops = {
+   .device_init = rmem_swiotlb_device_init,
+   .device_release = rmem_swiotlb_device_release,
+};
+
+static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
+{
+   unsigned long node = rmem->fdt_node;
+
+   if (of_get_flat_dt_prop(node, "reusable", NULL) ||
+   of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
+   of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
+   of_get_flat_dt_prop(node, "no-map", NULL))
+   return -EINVAL;
+
+   rmem->ops = &rmem_swiotlb_ops;
+   pr_info("Reserved memory: created device swiotlb memory pool at %pa, size %ld MiB\n",
+   &rmem->base, (unsigned long)rmem->size / SZ_1M);
+   return 0;
+}
+
+RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 02/15] swiotlb: Refactor swiotlb_create_debugfs

2021-05-18 Thread Claire Chang
Split the debugfs creation to make the code reusable for supporting
different bounce buffer pools, e.g. restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d3232fc19385..b849b01a446f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -64,6 +64,7 @@
 enum swiotlb_force swiotlb_force;
 
 struct io_tlb_mem *io_tlb_default_mem;
+static struct dentry *debugfs_dir;
 
 /*
  * Max segment that we can provide which (if pages are contingous) will
@@ -662,18 +663,30 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
 
-static int __init swiotlb_create_debugfs(void)
+static void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
-
if (!mem)
-   return 0;
-   mem->debugfs = debugfs_create_dir("swiotlb", NULL);
+   return;
+
+   mem->debugfs = debugfs_create_dir(name, debugfs_dir);
 debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
 debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
+}
+
+static int __init swiotlb_create_default_debugfs(void)
+{
+   struct io_tlb_mem *mem = io_tlb_default_mem;
+
+   if (mem) {
+   swiotlb_create_debugfs(mem, "swiotlb");
+   debugfs_dir = mem->debugfs;
+   } else {
+   debugfs_dir = debugfs_create_dir("swiotlb", NULL);
+   }
+
return 0;
 }
 
-late_initcall(swiotlb_create_debugfs);
+late_initcall(swiotlb_create_default_debugfs);
 
 #endif
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 03/15] swiotlb: Add DMA_RESTRICTED_POOL

2021-05-18 Thread Claire Chang
Add a new kconfig symbol, DMA_RESTRICTED_POOL, for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/Kconfig | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 77b405508743..3e961dc39634 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -80,6 +80,20 @@ config SWIOTLB
bool
select NEED_DMA_MAP_STATE
 
+config DMA_RESTRICTED_POOL
+   bool "DMA Restricted Pool"
+   depends on OF && OF_RESERVED_MEM
+   select SWIOTLB
+   help
+ This enables support for restricted DMA pools which provide a level of
+ DMA memory protection on systems with limited hardware protection
+ capabilities, such as those lacking an IOMMU.
+
+ For more information see
+ <Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt>
+ and <kernel/dma/swiotlb.c>.
+ If unsure, say "n".
+
 #
 # Should be selected if we can mmap non-coherent mappings to userspace.
 # The only thing that is really required is a way to set an uncached bit
-- 
2.31.1.751.gd2f1c929bd-goog



[PATCH v7 06/15] swiotlb: Update is_swiotlb_buffer to add a struct device argument

2021-05-18 Thread Claire Chang
Update is_swiotlb_buffer to add a struct device argument. This will be
useful later to allow for the restricted DMA pool.

Signed-off-by: Claire Chang 
---
 drivers/iommu/dma-iommu.c | 12 ++--
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/swiotlb.h   |  6 +++---
 kernel/dma/direct.c   |  6 +++---
 kernel/dma/direct.h   |  6 +++---
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..a5df35bfd150 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -832,7 +832,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4c89afc0df62..0c6ed09f8513 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -100,7 +100,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(dev, paddr);
return 0;
 }
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b469f04cca26..2a6cca07540b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -113,9 +113,9 @@ static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
return io_tlb_default_mem;
 }
 
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
 
return mem && paddr >= mem->start && paddr < mem->end;
 }
@@ -127,7 +127,7 @@ bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..84c9feb5474a 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,
for_each_sg(sgl, sg, nents, i) {
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-   if (unlikely(is_swiotlb_buffer(paddr)))
+   if (unlikely(is_swiotlb_buffer(dev, paddr)))
swiotlb_sync_single_for_device(dev, paddr, sg->length,
   dir);
 
@@ -369,7 +369,7 @@ void dma_direct_sync_sg_for_cpu(struct device 

[PATCH v7 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-18 Thread Claire Chang
Add a new getter, get_io_tlb_mem, to help select the io_tlb_mem struct.
The restricted DMA pool is preferred if available.

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 03ad6e3b4056..b469f04cca26 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include 
 #include 
 #include 
 #include 
@@ -102,6 +103,16 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
+static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev && dev->dma_io_tlb_mem)
+   return dev->dma_io_tlb_mem;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+
+   return io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
struct io_tlb_mem *mem = io_tlb_default_mem;
-- 
2.31.1.751.gd2f1c929bd-goog
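
The dev->dma_io_tlb_mem pointer consulted above is added to struct device
elsewhere in the series (the include/linux/device.h hunk in the cover
letter's diffstat). Roughly, with the exact placement and comment wording
assumed:

#ifdef CONFIG_DMA_RESTRICTED_POOL
	/*
	 * Per-device swiotlb pool backed by a restricted-dma-pool
	 * reserved-memory region; NULL means the default pool is used.
	 */
	struct io_tlb_mem *dma_io_tlb_mem;
#endif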



[PATCH v7 13/15] dma-direct: Allocate memory from restricted DMA pool if available

2021-05-18 Thread Claire Chang
The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Note that since coherent allocation needs remapping, one must set up
another device coherent pool by shared-dma-pool and use
dma_alloc_from_dev_coherent instead for atomic coherent allocation.

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index eb4098323bbc..0d521f78c7b9 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
 static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
 {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
 }
 
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
 
gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
   _limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,18 +175,23 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
 
/*
 * Remapping or decrypting memory may block. If either is required and
 * we can't block, allocate the memory from the atomic pools.
+* If restricted DMA (i.e., is_dev_swiotlb_force) is required, one must
+* set up another device coherent pool by shared-dma-pool and use
+* dma_alloc_from_dev_coherent instead.
 */
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev))))
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
 
/* we always manually zero the memory once we are done */
@@ -253,15 +272,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +308,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
void *ret;
 
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
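
For reference, the is_dev_swiotlb_force() helper tested throughout this
patch is introduced earlier in the series; it boils down to "does this
device have a restricted pool", along these lines (reconstructed, not
quoted from the patch):

static inline bool is_dev_swiotlb_force(struct device *dev)
{
#ifdef CONFIG_DMA_RESTRICTED_POOL
	if (dev->dma_io_tlb_mem)
		return true;
#endif /* CONFIG_DMA_RESTRICTED_POOL */
	return false;
}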

[PATCH v7 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-18 Thread Claire Chang
Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region with a restricted-dma-pool node under the reserved-memory node.

Signed-off-by: Claire Chang 
---
 .../reserved-memory/reserved-memory.txt   | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..284aea659015 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,23 @@ compatible (optional) - standard definition
   used as a shared pool of DMA buffers for a set of devices. It can
   be used by an operating system to instantiate the necessary pool
   management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage and
+  system memory corruption, the system needs to provide a way to lock
+  down the memory access, e.g., MPU. Note that since coherent allocation
+  needs remapping, one must set up another device coherent pool by
+  shared-dma-pool and use dma_alloc_from_dev_coherent instead for atomic
+  coherent allocation.
- vendor specific string in the form <vendor>,[<device>-]<usage>
 no-map (optional) - empty property
 - Indicates the operating system must not create a virtual mapping
@@ -120,6 +137,11 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_mem_reserved: restricted_dma_mem_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x40>;
+   };
};
 
/* ... */
@@ -138,4 +160,9 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB).
memory-region = <&multimedia_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   memory-region = <&restricted_dma_mem_reserved>;
+   /* ... */
+   };
 };
-- 
2.31.1.751.gd2f1c929bd-goog
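
From a driver's point of view, no source changes are needed once its
memory-region phandle points at a restricted-dma-pool region: streaming
DMA mappings are bounced through the pool transparently (coherent
allocations have the caveat noted above). An illustrative sketch; the
function and buffer here are made up:

static int wifi_map_rx_buffer(struct device *dev, void *buf, size_t len,
			      dma_addr_t *handle)
{
	/*
	 * buf is bounced into the restricted region by swiotlb; the
	 * device only ever sees bus addresses inside that region.
	 */
	*handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, *handle))
		return -ENOMEM;
	return 0;
}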



[PATCH v7 15/15] of: Add plumbing for restricted DMA pool

2021-05-18 Thread Claire Chang
If a device is not behind an IOMMU, we look up the device node and set
up restricted DMA when a restricted-dma-pool region is present.

Signed-off-by: Claire Chang 
---
 drivers/of/address.c| 25 +
 drivers/of/device.c |  3 +++
 drivers/of/of_private.h |  5 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index aca94c348bd4..c562a9ff5f0b 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1112,6 +1113,30 @@ bool of_dma_is_coherent(struct device_node *np)
 }
 EXPORT_SYMBOL_GPL(of_dma_is_coherent);
 
+int of_dma_set_restricted_buffer(struct device *dev)
+{
+   struct device_node *node;
+   int count, i;
+
+   if (!dev->of_node)
+   return 0;
+
+   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+   sizeof(phandle));
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(dev->of_node, "memory-region", i);
+   /* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/
+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(
+   dev, dev->of_node, i);
+   }
+
+   return 0;
+}
+
 /**
  * of_mmio_is_nonposted - Check if device uses non-posted MMIO
  * @np:device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
 
arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
 
+   if (!iommu)
+   return of_dma_set_restricted_buffer(dev);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..9fc874548528 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
 #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
 int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
 #else
 static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
 {
return -ENODEV;
 }
+static inline int of_dma_set_restricted_buffer(struct device *dev)
+{
+   return -ENODEV;
+}
 #endif
 
 #endif /* _LINUX_OF_PRIVATE_H */
-- 
2.31.1.751.gd2f1c929bd-goog



Re: [PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-18 Thread Claire Chang
I didn't move this to a separate file because I feel it might be
confusing for swiotlb_alloc/free (and it would need more functions to be
non-static).
Maybe instead of moving it to a separate file, we can try to come up
with better naming?


Re: [PATCH v7 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-18 Thread Claire Chang
Still keeping this function, because directly using dev->dma_io_tlb_mem
would cause issues for memory allocation on existing devices: the pool
can't support atomic coherent allocation, so we need to distinguish
between the per-device pool and the default pool in swiotlb_alloc.
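
In other words, swiotlb_alloc() deliberately bypasses the getter: only a
per-device restricted pool may back it, and a NULL dev->dma_io_tlb_mem
tells dma-direct to fall back to the normal page allocators. A rough
shape of that distinction (a sketch based on this discussion, not the
exact patch):

struct page *swiotlb_alloc(struct device *dev, size_t size)
{
	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
	phys_addr_t tlb_addr;
	int index;

	/* No restricted pool: let dma-direct use the page allocators. */
	if (!mem)
		return NULL;

	index = find_slots(dev, 0, size);
	if (index == -1)
		return NULL;

	tlb_addr = slot_addr(mem->start, index);
	return pfn_to_page(PFN_DOWN(tlb_addr));
}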


Re: [PATCH v6 00/15] Restricted DMA

2021-05-18 Thread Claire Chang
v7: https://lore.kernel.org/patchwork/cover/1431031/

On Mon, May 10, 2021 at 5:50 PM Claire Chang  wrote:
>
> From: Claire Chang 
>
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
>
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
>
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
>
> [1a] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] 
> https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] 
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
>
> v6:
> Address the comments in v5
>
> v5:
> Rebase on latest linux-next
> https://lore.kernel.org/patchwork/cover/1416899/
>
> v4:
> - Fix spinlock bad magic
> - Use rmem->name for debugfs entry
> - Address the comments in v3
> https://lore.kernel.org/patchwork/cover/1378113/
>
> v3:
> Using only one reserved memory region for both streaming DMA and memory
> allocation.
> https://lore.kernel.org/patchwork/cover/1360992/
>
> v2:
> Building on top of swiotlb.
> https://lore.kernel.org/patchwork/cover/1280705/
>
> v1:
> Using dma_map_ops.
> https://lore.kernel.org/patchwork/cover/1271660/
> *** BLURB HERE ***
>
> Claire Chang (15):
>   swiotlb: Refactor swiotlb init functions
>   swiotlb: Refactor swiotlb_create_debugfs
>   swiotlb: Add DMA_RESTRICTED_POOL
>   swiotlb: Add restricted DMA pool initialization
>   swiotlb: Add a new get_io_tlb_mem getter
>   swiotlb: Update is_swiotlb_buffer to add a struct device argument
>   swiotlb: Update is_swiotlb_active to add a struct device argument
>   swiotlb: Bounce data from/to restricted DMA pool if available
>   swiotlb: Move alloc_size to find_slots
>   swiotlb: Refactor swiotlb_tbl_unmap_single
>   dma-direct: Add a new wrapper __dma_direct_free_pages()
>   swiotlb: Add restricted DMA alloc/free support.
>   dma-direct: Allocate memory from restricted DMA pool if available
>   dt-bindings: of: Add restricted DMA pool
>   of: Add plumbing for restricted DMA pool
>
>  .../reserved-memory/reserved-memory.txt   |  27 ++
>  drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_ttm.c |   2 +-
>  drivers/iommu/dma-iommu.c |  12 +-
>  drivers/of/address.c  |  25 ++
>  drivers/of/device.c   |   3 +
>  drivers/of/of_private.h   |   5 +
>  drivers/pci/xen-pcifront.c|   2 +-
>  drivers/xen/swiotlb-xen.c |   2 +-
>  include/linux/device.h|   4 +
>  include/linux/swiotlb.h   |  41 ++-
>  kernel/dma/Kconfig|  14 +
>  kernel/dma/direct.c   |  63 +++--
>  kernel/dma/direct.h   |   9 +-
>  kernel/dma/swiotlb.c  | 242 +-
>  15 files changed, 356 insertions(+), 97 deletions(-)
>
> --
> 2.31.1.607.g51e8a6a459-goog
>


[PATCH 1/1] powerpc/ps3: Fix error return code in ps3_register_devices()

2021-05-18 Thread Zhen Lei
When the call to ps3_start_probe_thread() fails, further initialization
should be stopped and the returned error code should be propagated.

Reported-by: Hulk Robot 
Signed-off-by: Zhen Lei 
---
 arch/powerpc/platforms/ps3/device-init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/ps3/device-init.c b/arch/powerpc/platforms/ps3/device-init.c
index e87360a0fb40..9b6d8ca8fc01 100644
--- a/arch/powerpc/platforms/ps3/device-init.c
+++ b/arch/powerpc/platforms/ps3/device-init.c
@@ -955,6 +955,8 @@ static int __init ps3_register_devices(void)
/* ps3_repository_dump_bus_info(); */
 
result = ps3_start_probe_thread(PS3_BUS_TYPE_STORAGE);
+   if (result < 0)
+   return result;
 
ps3_register_vuart_devices();
 
-- 
2.25.1