Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-19 Thread Christophe Leroy


On 19/04/2024 at 17:49, Mike Rapoport wrote:
> Hi Masami,
> 
> On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote:
>> Hi Mike,
>>
>> On Thu, 11 Apr 2024 19:00:50 +0300
>> Mike Rapoport  wrote:
>>
>>> From: "Mike Rapoport (IBM)" 
>>>
>>> kprobes depended on CONFIG_MODULES because it has to allocate memory for
>>> code.
>>>
>>> Since code allocations are now implemented with execmem, kprobes can be
>>> enabled in non-modular kernels.
>>>
>>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
>>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
>>> dependency of CONFIG_KPROBES on CONFIG_MODULES.
>>
>> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
>> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
>> function body? We have enough dummy functions for that, so it should
>> not be a problem.
> 
> The code in check_kprobe_address_safe() that gets the module and checks for
> __init functions does not compile with IS_ENABLED(CONFIG_MODULES).
> I can pull it out to a helper or leave #ifdef in the function body,
> whichever you prefer.

As far as I can see, the only problem is MODULE_STATE_COMING.
Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h  ?
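
Something like the below, i.e. hoist the enum above the #ifdef so that
kprobes can reference MODULE_STATE_COMING even with CONFIG_MODULES=n.
Only a sketch, the comments are from memory:

	/* include/linux/module.h: visible regardless of CONFIG_MODULES */
	enum module_state {
		MODULE_STATE_LIVE,	/* Normal state. */
		MODULE_STATE_COMING,	/* Full formed, running module_init. */
		MODULE_STATE_GOING,	/* Going away. */
		MODULE_STATE_UNFORMED,	/* Still setting it up. */
	};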


>   
>> -- 
>> Masami Hiramatsu
> 


Re: [RFC PATCH 2/7] mm: vmalloc: don't account for number of nodes for HUGE_VMAP allocations

2024-04-12 Thread Christophe Leroy


On 11/04/2024 at 18:05, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> vmalloc allocations with VM_ALLOW_HUGE_VMAP that do not explicitly
> specify node ID will use huge pages only if size_per_node is larger than
> PMD_SIZE.
> Still the actual allocated memory is not distributed between nodes and
> there is no advantage in such an approach.
> On the contrary, BPF allocates PMD_SIZE * num_possible_nodes() for each
> new bpf_prog_pack, while it could do with PMD_SIZE'ed packs.
> 
> Don't account for number of nodes for VM_ALLOW_HUGE_VMAP with
> NUMA_NO_NODE and use huge pages whenever the requested allocation size
> is larger than PMD_SIZE.

Patch looks ok but message is confusing. We also use huge pages at PTE 
size, for instance 512k pages or 16k pages on powerpc 8xx, while 
PMD_SIZE is 4M.
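
For reference, the PTE-level shift selection on the 8xx looks roughly
like this (a sketch from memory, using the sizes mentioned above):

	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		if (size >= SZ_512K)
			return 19;		/* 512k pages */
		else if (size >= SZ_16K)
			return 14;		/* 16k pages */
		else
			return PAGE_SHIFT;	/* standard 4k pages */
	}

So dropping the num_online_nodes() division also benefits those
PTE-level huge mappings, not only the PMD-sized ones.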

Christophe

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>   mm/vmalloc.c | 9 ++---
>   1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 22aa63f4ef63..5fc8b514e457 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3737,8 +3737,6 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
> long align,
>   }
>   
>   if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> - unsigned long size_per_node;
> -
>   /*
>* Try huge pages. Only try for PAGE_KERNEL allocations,
>* others like modules don't yet expect huge pages in
> @@ -3746,13 +3744,10 @@ void *__vmalloc_node_range(unsigned long size, 
> unsigned long align,
>* supporting them.
>*/
>   
> - size_per_node = size;
> - if (node == NUMA_NO_NODE)
> - size_per_node /= num_online_nodes();
> - if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
> + if (arch_vmap_pmd_supported(prot) && size >= PMD_SIZE)
>   shift = PMD_SHIFT;
>   else
> - shift = arch_vmap_pte_supported_shift(size_per_node);
> + shift = arch_vmap_pte_supported_shift(size);
>   
>   align = max(real_align, 1UL << shift);
>   size = ALIGN(real_size, 1UL << shift);


Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-28 Thread Christophe Leroy


On 26/03/2024 at 14:46, Jarkko Sakkinen wrote:
> Tracing with kprobes while running a monolithic kernel is currently
> impossible due to the kernel module allocator dependency.
> 
> Address the issue by implementing textmem API for RISC-V.
> 
> Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> stack
> Link: 
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> continuation
> Signed-off-by: Jarkko Sakkinen 
> ---
> v5-v7:
> - No changes.
> v4:
> - Include linux/execmem.h.
> v3:
> - Architecture independent parts have been split to separate patches.
> - Do not change arch/riscv/kernel/module.c as it is out of scope for
>this patch set now.
> v2:
> - Better late than never right? :-)
> - Focus only on RISC-V for now to make the patch more digestible. This
>is the arch where I use the patch on a daily basis to help with QA.
> - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> ---
>   arch/riscv/Kconfig  |  1 +
>   arch/riscv/kernel/Makefile  |  3 +++
>   arch/riscv/kernel/execmem.c | 22 ++
>   3 files changed, 26 insertions(+)
>   create mode 100644 arch/riscv/kernel/execmem.c
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index e3142ce531a0..499512fb17ff 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -132,6 +132,7 @@ config RISCV
>   select HAVE_KPROBES if !XIP_KERNEL
>   select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
>   select HAVE_KRETPROBES if !XIP_KERNEL
> + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
>   # https://github.com/ClangBuiltLinux/linux/issues/1881
>   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
>   select HAVE_MOVE_PMD
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index 604d6bf7e476..337797f10d3e 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP)   += cpu_ops.o
>   
>   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
>   obj-$(CONFIG_MODULES)   += module.o
> +ifeq ($(CONFIG_ALLOC_EXECMEM),y)
> +obj-y+= execmem.o

Why not just :

obj-$(CONFIG_ALLOC_EXECMEM) += execmem.o

> +endif
>   obj-$(CONFIG_MODULE_SECTIONS)   += module-sections.o
>   
>   obj-$(CONFIG_CPU_PM)+= suspend_entry.o suspend.o
> diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
> new file mode 100644
> index ..3e52522ead32
> --- /dev/null
> +++ b/arch/riscv/kernel/execmem.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
> +{
> + return __vmalloc_node_range(size, 1, MODULES_VADDR,
> + MODULES_END, GFP_KERNEL,

Why not use gfp argument ?

> + PAGE_KERNEL, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));
> +}
> +
> +void free_execmem(void *region)
> +{
> + if (in_interrupt())
> + pr_warn("In interrupt context: vmalloc may not work.\n");

Do you expect that to happen ? module_memfree() has a WARN_ON(), meaning 
this should never happen, and if it really does, it deserves more than a 
mere dmesg warning.

> +
> + vfree(region);
> +}
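
To illustrate both comments, here is a sketch of what I would expect
instead (untested, just derived from the code above):

	void *alloc_execmem(unsigned long size, gfp_t gfp)
	{
		/* Honour the caller-provided gfp instead of hardcoding GFP_KERNEL */
		return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
					    gfp, PAGE_KERNEL, 0, NUMA_NO_NODE,
					    __builtin_return_address(0));
	}

	void free_execmem(void *region)
	{
		/* Like module_memfree(): calling this in interrupt context is a bug */
		WARN_ON(in_interrupt());
		vfree(region);
	}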


Re: [PATCH v7 1/2] kprobes: Implement trampoline memory allocator for tracing

2024-03-28 Thread Christophe Leroy


On 26/03/2024 at 14:46, Jarkko Sakkinen wrote:
> Tracing with kprobes while running a monolithic kernel is currently
> impossible because CONFIG_KPROBES depends on CONFIG_MODULES.
> 
> Introduce alloc_execmem() and free_execmem() for allocating executable
> memory. If an arch implements these functions, it can mark this up with
> the HAVE_ALLOC_EXECMEM kconfig flag.
> 
> The second new kconfig flag is ALLOC_EXECMEM, which can be selected if
> either MODULES is selected or HAVE_ALLOC_EXECMEM is supported by the arch. If
> HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and
> module_memfree() are used as a fallback, thus retaining backwards
> compatibility with earlier kernel versions.
> 
> This will allow architectures to enable kprobes tracing without requiring
> module support to be enabled.
> 
> The support can be implemented with four easy steps:
> 
> 1. Implement alloc_execmem().
> 2. Implement free_execmem().
> 3. Edit arch//Makefile.
> 4. Set HAVE_ALLOC_EXECMEM in arch//Kconfig.
> 
> Link: 
> https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
> Suggested-by: Masami Hiramatsu 
> Signed-off-by: Jarkko Sakkinen 
> ---
> v7:
> - Use "depends on" for ALLOC_EXECMEM instead of "select"
> - Reduced and narrowed CONFIG_MODULES checks further in kprobes.c.
> v6:
> - Use null pointer for notifiers and register the module notifier only if
>IS_ENABLED(CONFIG_MODULES) is set.
> - Fixed typo in the commit message and wrote more verbose description
>of the feature.
> v5:
> - alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
>compile because 2/2 had the fixup (leaked there when rebasing the
>patch set).
> v4:
> - Squashed a couple of unrequired CONFIG_MODULES checks.
> - See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
> v3:
> - A new patch added.
> - For IS_DEFINED() I need advice as I could not really find that many
>locations where it would be applicable.
> ---
>   arch/Kconfig| 17 +++-
>   include/linux/execmem.h | 13 +
>   kernel/kprobes.c| 53 ++---
>   kernel/trace/trace_kprobe.c | 15 +--
>   4 files changed, 73 insertions(+), 25 deletions(-)
>   create mode 100644 include/linux/execmem.h
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a5af0edd3eb8..5e9735f60f3c 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,8 +52,8 @@ config GENERIC_ENTRY
>   
>   config KPROBES
>   bool "Kprobes"
> - depends on MODULES
>   depends on HAVE_KPROBES
> + depends on ALLOC_EXECMEM
>   select KALLSYMS
>   select TASKS_RCU if PREEMPTION
>   help
> @@ -215,6 +215,21 @@ config HAVE_OPTPROBES
>   config HAVE_KPROBES_ON_FTRACE
>   bool
>   
> +config HAVE_ALLOC_EXECMEM
> + bool
> + help
> +   Architectures that select this option are capable of allocating 
> trampoline
> +   executable memory for tracing subsystems, independently of the kernel 
> module
> +   subsystem.
> +
> +config ALLOC_EXECMEM
> + bool "Executable (trampoline) memory allocation"

Why make it user selectable ? Previously I was able to select KPROBES as 
soon as MODULES was selected. Now I will have to first select 
ALLOC_EXECMEM in addition ? What is the added value of allowing the user 
to disable it ?

> + default y
> + depends on MODULES || HAVE_ALLOC_EXECMEM
> + help
> +   Select this for executable (trampoline) memory. Can be enabled when 
> either
> +   module allocator or arch-specific allocator is available.
> +
>   config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
>   bool
>   help
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> new file mode 100644
> index ..ae2ff151523a
> --- /dev/null
> +++ b/include/linux/execmem.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_EXECMEM_H
> +#define _LINUX_EXECMEM_H
> +

It should include moduleloader.h, otherwise the user of alloc_execmem() 
must include both this header and moduleloader.h to use alloc_execmem().

> +#ifdef CONFIG_HAVE_ALLOC_EXECMEM
> +void *alloc_execmem(unsigned long size, gfp_t gfp);
> +void free_execmem(void *region);
> +#else
> +#define alloc_execmem(size, gfp) module_alloc(size)

Then gfp is silently ignored in that case. Is that expected ?

> +#define free_execmem(region) module_memfree(region)
> +#endif
> +
> +#endif /* _LINUX_EXECMEM_H */
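
Putting both remarks together, I would rather expect the header to look
like this (only a sketch):

	/* SPDX-License-Identifier: GPL-2.0 */
	#ifndef _LINUX_EXECMEM_H
	#define _LINUX_EXECMEM_H

	#include <linux/moduleloader.h>

	#ifdef CONFIG_HAVE_ALLOC_EXECMEM
	void *alloc_execmem(unsigned long size, gfp_t gfp);
	void free_execmem(void *region);
	#else
	static inline void *alloc_execmem(unsigned long size, gfp_t gfp)
	{
		/* module_alloc() cannot honour gfp, but at least it is visible here */
		return module_alloc(size);
	}

	static inline void free_execmem(void *region)
	{
		module_memfree(region);
	}
	#endif

	#endif /* _LINUX_EXECMEM_H */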
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 9d9095e81792..13bef5de315c 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -44,6 +44,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>   
>   #define KPROBE_HASH_BITS 6
>   #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> @@ -113,17 +114,17 @@ enum kprobe_slot_state {
>   void __weak *alloc_insn_page(void)
>   {
>   /*
> -  * Use module_alloc() so this page is within +/- 2GB of where the
> +  * Use alloc_execmem() so this 

Re: [RFC][PATCH 3/4] kprobes: Allow kprobes with CONFIG_MODULES=n

2024-03-07 Thread Christophe Leroy


On 06/03/2024 at 21:05, Calvin Owens wrote:
> 
> If something like this is merged down the road, it can go in at leisure
> once the module_alloc change is in: it's a one-way dependency.

Too many #ifdefs, please reorganise stuff to avoid that and avoid 
changing prototypes based on CONFIG_MODULES.

A few other comments below.

> 
> Signed-off-by: Calvin Owens 
> ---
>   arch/Kconfig|  2 +-
>   kernel/kprobes.c| 22 ++
>   kernel/trace/trace_kprobe.c | 11 +++
>   3 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index cfc24ced16dd..e60ce984d095 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,8 +52,8 @@ config GENERIC_ENTRY
> 
>   config KPROBES
>  bool "Kprobes"
> -   depends on MODULES
>  depends on HAVE_KPROBES
> +   select MODULE_ALLOC
>  select KALLSYMS
>  select TASKS_RCU if PREEMPTION
>  help
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 9d9095e81792..194270e17d57 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1556,8 +1556,12 @@ static bool is_cfi_preamble_symbol(unsigned long addr)
>  str_has_prefix("__pfx_", symbuf);
>   }
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>   static int check_kprobe_address_safe(struct kprobe *p,
>   struct module **probed_mod)
> +#else
> +static int check_kprobe_address_safe(struct kprobe *p)
> +#endif

A bit ugly to have to change the prototype, why not just keep probed_mod 
at all times ?

When CONFIG_MODULES is not selected, __module_text_address() returns 
NULL so it should work without that many #ifdefs.
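
For instance something like this (sketch, the other address checks are
omitted for brevity):

	static int check_kprobe_address_safe(struct kprobe *p,
					     struct module **probed_mod)
	{
		int ret = 0;

		jump_label_lock();
		preempt_disable();

		/*
		 * With CONFIG_MODULES=n, __module_text_address() is a stub
		 * returning NULL, so this block simply never triggers.
		 */
		*probed_mod = __module_text_address((unsigned long)p->addr);
		if (*probed_mod) {
			if (unlikely(!try_module_get(*probed_mod)))
				ret = -ENOENT;
			/* ... same __init-text handling as today ... */
		}

		preempt_enable();
		jump_label_unlock();
		return ret;
	}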

>   {
>  int ret;
> 
> @@ -1580,6 +1584,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>  goto out;
>  }
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>  /* Check if 'p' is probing a module. */
>  *probed_mod = __module_text_address((unsigned long) p->addr);
>  if (*probed_mod) {
> @@ -1603,6 +1608,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
>  ret = -ENOENT;
>  }
>  }
> +#endif
> +
>   out:
>  preempt_enable();
>  jump_label_unlock();
> @@ -1614,7 +1621,9 @@ int register_kprobe(struct kprobe *p)
>   {
>  int ret;
>  struct kprobe *old_p;
> +#if IS_ENABLED(CONFIG_MODULES)
>  struct module *probed_mod;
> +#endif
>  kprobe_opcode_t *addr;
>  bool on_func_entry;
> 
> @@ -1633,7 +1642,11 @@ int register_kprobe(struct kprobe *p)
>  p->nmissed = 0;
>  INIT_LIST_HEAD(&p->list);
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>  ret = check_kprobe_address_safe(p, &probed_mod);
> +#else
> +   ret = check_kprobe_address_safe(p);
> +#endif
>  if (ret)
>  return ret;
> 
> @@ -1676,8 +1689,10 @@ int register_kprobe(struct kprobe *p)
>   out:
>  mutex_unlock(&kprobe_mutex);
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>  if (probed_mod)
>  module_put(probed_mod);
> +#endif
> 
>  return ret;
>   }
> @@ -2482,6 +2497,7 @@ int kprobe_add_area_blacklist(unsigned long start, 
> unsigned long end)
>  return 0;
>   }
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>   /* Remove all symbols in given area from kprobe blacklist */
>   static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
>   {
> @@ -2499,6 +2515,7 @@ static void kprobe_remove_ksym_blacklist(unsigned long 
> entry)
>   {
>  kprobe_remove_area_blacklist(entry, entry + 1);
>   }
> +#endif
> 
>   int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long 
> *value,
> char *type, char *sym)
> @@ -2564,6 +2581,7 @@ static int __init populate_kprobe_blacklist(unsigned 
> long *start,
>  return ret ? : arch_populate_kprobe_blacklist();
>   }
> 
> +#if IS_ENABLED(CONFIG_MODULES)
>   static void add_module_kprobe_blacklist(struct module *mod)
>   {
>  unsigned long start, end;
> @@ -2665,6 +2683,7 @@ static struct notifier_block kprobe_module_nb = {
>  .notifier_call = kprobes_module_callback,
>  .priority = 0
>   };
> +#endif /* IS_ENABLED(CONFIG_MODULES) */
> 
>   void kprobe_free_init_mem(void)
>   {
> @@ -2724,8 +2743,11 @@ static int __init init_kprobes(void)
>  err = arch_init_kprobes();
>  if (!err)
>  err = register_die_notifier(&kprobe_exceptions_nb);
> +
> +#if IS_ENABLED(CONFIG_MODULES)
>  if (!err)
>  err = register_module_notifier(&kprobe_module_nb);
> +#endif
> 
>  kprobes_initialized = (err == 0);
>  kprobe_sysctls_init();
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 

Re: [RFC][PATCH 2/4] bpf: Allow BPF_JIT with CONFIG_MODULES=n

2024-03-07 Thread Christophe Leroy


On 06/03/2024 at 21:05, Calvin Owens wrote:
> 
> No BPF code has to change, except in struct_ops (for module refs).
> 
> This conflicts with bpf-next because of this (relevant) series:
> 
>  https://lore.kernel.org/all/20240119225005.668602-1-thinker...@gmail.com/
> 
> If something like this is merged down the road, it can go through
> bpf-next at leisure once the module_alloc change is in: it's a one-way
> dependency.
> 
> Signed-off-by: Calvin Owens 
> ---
>   kernel/bpf/Kconfig  |  2 +-
>   kernel/bpf/bpf_struct_ops.c | 28 
>   2 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index 6a906ff93006..77df483a8925 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -42,7 +42,7 @@ config BPF_JIT
>  bool "Enable BPF Just In Time compiler"
>  depends on BPF
>  depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> -   depends on MODULES
> +   select MODULE_ALLOC
>  help
>BPF programs are normally handled by a BPF interpreter. This option
>allows the kernel to generate native code when a program is loaded
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 02068bd0e4d9..fbf08a1bb00c 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -108,11 +108,30 @@ const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
>   #endif
>   };
> 
> +#if IS_ENABLED(CONFIG_MODULES)

Can you avoid ifdefs as much as possible ?

>   static const struct btf_type *module_type;
> 
> +static int bpf_struct_module_type_init(struct btf *btf)
> +{
> +   s32 module_id;

Could be:

if (!IS_ENABLED(CONFIG_MODULES))
return 0;
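
i.e. a single definition along these lines (sketch, assuming module_type
itself stays declared unconditionally):

	static int bpf_struct_module_type_init(struct btf *btf)
	{
		s32 module_id;

		/* Constant-folded away when CONFIG_MODULES is not selected */
		if (!IS_ENABLED(CONFIG_MODULES))
			return 0;

		module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
		if (module_id < 0)
			return 1;

		module_type = btf_type_by_id(btf, module_id);
		return 0;
	}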

> +
> +   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> +   if (module_id < 0)
> +   return 1;
> +
> +   module_type = btf_type_by_id(btf, module_id);
> +   return 0;
> +}
> +#else
> +static int bpf_struct_module_type_init(struct btf *btf)
> +{
> +   return 0;
> +}
> +#endif
> +
>   void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
>   {
> -   s32 type_id, value_id, module_id;
> +   s32 type_id, value_id;
>  const struct btf_member *member;
>  struct bpf_struct_ops *st_ops;
>  const struct btf_type *t;
> @@ -125,12 +144,10 @@ void bpf_struct_ops_init(struct btf *btf, struct 
> bpf_verifier_log *log)
>   #include "bpf_struct_ops_types.h"
>   #undef BPF_STRUCT_OPS_TYPE
> 
> -   module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> -   if (module_id < 0) {
> +   if (bpf_struct_module_type_init(btf)) {
>  pr_warn("Cannot find struct module in btf_vmlinux\n");
>  return;
>  }
> -   module_type = btf_type_by_id(btf, module_id);
> 
>  for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
>  st_ops = bpf_struct_ops[i];
> @@ -433,12 +450,15 @@ static long bpf_struct_ops_map_update_elem(struct 
> bpf_map *map, void *key,
> 
>  moff = __btf_member_bit_offset(t, member) / 8;
>  ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, 
> NULL);
> +
> +#if IS_ENABLED(CONFIG_MODULES)

Can't see anything depending on CONFIG_MODULES here, can you instead do:

if (IS_ENABLED(CONFIG_MODULES) && ptype == module_type) {

>  if (ptype == module_type) {
>  if (*(void **)(udata + moff))
>  goto reset_unlock;
>  *(void **)(kdata + moff) = BPF_MODULE_OWNER;
>  continue;
>  }
> +#endif
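
Combining that with the existing block, i.e. something like (sketch):

	if (IS_ENABLED(CONFIG_MODULES) && ptype == module_type) {
		/* Module owner pointers are only meaningful with modules */
		if (*(void **)(udata + moff))
			goto reset_unlock;
		*(void **)(kdata + moff) = BPF_MODULE_OWNER;
		continue;
	}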
> 
>  err = st_ops->init_member(t, member, kdata, udata);
>  if (err < 0)
> --
> 2.43.0
> 
> 


Re: [RFC][PATCH 1/4] module: mm: Make module_alloc() generally available

2024-03-07 Thread Christophe Leroy
Hi Calvin,

On 06/03/2024 at 21:05, Calvin Owens wrote:
> 
> Both BPF_JIT and KPROBES depend on CONFIG_MODULES, but only require
> module_alloc() itself, which can be easily separated into a standalone
> allocator for executable kernel memory.

Easily maybe, but not as easily as you think, see below.

> 
> Thomas Gleixner sent a patch to do that for x86 as part of a larger
> series a couple years ago:
> 
>  https://lore.kernel.org/all/20220716230953.442937...@linutronix.de/
> 
> I've simply extended that approach to the whole kernel.
> 
> Signed-off-by: Calvin Owens 
> ---
>   arch/Kconfig |   2 +-
>   arch/arm/kernel/module.c |  35 -
>   arch/arm/mm/Makefile |   2 +
>   arch/arm/mm/module_alloc.c   |  40 ++
>   arch/arm64/kernel/module.c   | 127 --
>   arch/arm64/mm/Makefile   |   1 +
>   arch/arm64/mm/module_alloc.c | 130 +++
>   arch/loongarch/kernel/module.c   |   6 --
>   arch/loongarch/mm/Makefile   |   2 +
>   arch/loongarch/mm/module_alloc.c |  10 +++
>   arch/mips/kernel/module.c|  10 ---
>   arch/mips/mm/Makefile|   2 +
>   arch/mips/mm/module_alloc.c  |  13 
>   arch/nios2/kernel/module.c   |  20 -
>   arch/nios2/mm/Makefile   |   2 +
>   arch/nios2/mm/module_alloc.c |  22 ++
>   arch/parisc/kernel/module.c  |  12 ---
>   arch/parisc/mm/Makefile  |   1 +
>   arch/parisc/mm/module_alloc.c|  15 
>   arch/powerpc/kernel/module.c |  36 -
>   arch/powerpc/mm/Makefile |   1 +
>   arch/powerpc/mm/module_alloc.c   |  41 ++

Missing several powerpc changes to make it work. You must audit every 
use of CONFIG_MODULES inside powerpc. Here are a few examples:

Function get_patch_pfn() to enable text code patching.

arch/powerpc/Kconfig: select KASAN_VMALLOC if KASAN && MODULES

arch/powerpc/include/asm/kasan.h:

#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
#define KASAN_KERN_START	ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
#else
#define KASAN_KERN_START	PAGE_OFFSET
#endif

arch/powerpc/kernel/head_8xx.S and arch/powerpc/kernel/head_book3s_32.S: 
the InstructionTLBMiss interrupt handler must know that there is executable 
kernel text outside the kernel core.

Function is_module_segment() to identify segments used for module text 
and set the NX (NoExec) MMU flag on non-module segments.



>   arch/riscv/kernel/module.c   |  11 ---
>   arch/riscv/mm/Makefile   |   1 +
>   arch/riscv/mm/module_alloc.c |  17 
>   arch/s390/kernel/module.c|  37 -
>   arch/s390/mm/Makefile|   1 +
>   arch/s390/mm/module_alloc.c  |  42 ++
>   arch/sparc/kernel/module.c   |  31 
>   arch/sparc/mm/Makefile   |   2 +
>   arch/sparc/mm/module_alloc.c |  31 
>   arch/x86/kernel/ftrace.c |   2 +-
>   arch/x86/kernel/module.c |  56 -
>   arch/x86/mm/Makefile |   2 +
>   arch/x86/mm/module_alloc.c   |  59 ++
>   fs/proc/kcore.c  |   2 +-
>   kernel/module/Kconfig|   1 +
>   kernel/module/main.c |  17 
>   mm/Kconfig   |   3 +
>   mm/Makefile  |   1 +
>   mm/module_alloc.c|  21 +
>   mm/vmalloc.c |   2 +-
>   42 files changed, 467 insertions(+), 402 deletions(-)

...

> diff --git a/mm/Kconfig b/mm/Kconfig
> index ffc3a2ba3a8c..92bfb5ae2e95 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1261,6 +1261,9 @@ config LOCK_MM_AND_FIND_VMA
>   config IOMMU_MM_DATA
>  bool
> 
> +config MODULE_ALLOC
> +   def_bool n
> +

I'd call it something other than CONFIG_MODULE_ALLOC, as you want to use 
it when CONFIG_MODULES is not selected.

Something like CONFIG_EXECMEM_ALLOC or CONFIG_DYNAMIC_EXECMEM ?



Christophe


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Christophe Leroy


On 31/01/2024 at 16:17, Marek Szyprowski wrote:
> 
> Hi Christophe,
> 
> On 31.01.2024 12:58, Christophe Leroy wrote:
>> On 30/01/2024 at 18:48, Marek Szyprowski wrote:
>>>
>>> On 30.01.2024 12:03, Christophe Leroy wrote:
>>>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>>>>
>>>>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>>>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>>>>> helps removing related #ifdefery in C files.
>>>>>>>
>>>>>>> Signed-off-by: Christophe Leroy 
>>>>>> Very nice cleanup, thanks!, applied and pushed
>>>>>>
>>>>>>Luis
>>>>> On next-20240130, which has your modules-next branch, and thus this
>>>>> series and the other "module: Use set_memory_rox()" series applied,
>>>>> my kernel crashes in some very weird way. Reverting your branch
>>>>> makes the crash go away.
>>>>>
>>>>> I thought I'd report it right away. Maybe you folks would know what's
>>>>> happening here? This is on arm64.
>>>> That's strange, it seems to bug in module_bug_finalize() which is
>>>> _before_ calls to module_enable_ro() and such.
>>>>
>>>> Can you try to revert the 6 patches one by one to see which one
>>>> introduces the problem ?
>>>>
>>>> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
>>>> more or less only cleanup.
>>> I've also run into this issue with today's (20240130) linux-next on my
>>> test farm. The issue is not fully reproducible, so it was a bit hard to
>>> bisect it automatically. I've spent some time on manual testing and it
>>> looks that reverting the following 2 commits on top of linux-next fixes
>>> the problem:
>>>
>>> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
>>> rodata_enabled")
>>> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
>>>
>>> This in fact means that commit 677bfb9db8a3 is responsible for this
>>> regression, as 65929884f868 has to be reverted only because the latter
>>> depends on it. Let me know what I can do to help debugging this issue.
>>>
>> Thanks for the bisect. I suspect you hit one of the errors and something
>> goes wrong in the error path.
>>
>> To confirm this assumption, could you try with the following change on
>> top of everything ?
> 
> 
> Yes, this is the problem. I've added printing a mod->name to the log.
> Here is a log from kernel build from next-20240130 (sometimes it even
> boots to shell):
> 
> # dmesg | grep module_set_memory
> [    8.061525] module_set_memory(6, , 0) name ipv6
> returned -22
> [    8.067543] WARNING: CPU: 3 PID: 1 at kernel/module/strict_rwx.c:22
> module_set_memory+0x9c/0xb8

It would be good if you could show the backtrace too so that we know who 
the caller is. I guess what you show here is what you get on the screen ? 
The backtrace should be available through 'dmesg'.

I guess we will now seek help from ARM64 people to understand why 
module_set_memory_something() fails with -EINVAL when loading modules.


> [    8.097821] pc : module_set_memory+0x9c/0xb8
> [    8.102068] lr : module_set_memory+0x9c/0xb8
> [    8.183101]  module_set_memory+0x9c/0xb8
> [    8.472862] module_set_memory(6, , 0) name x_tables
> returned -22
> [    8.479215] WARNING: CPU: 2 PID: 1 at kernel/module/strict_rwx.c:22
> module_set_memory+0x9c/0xb8
> [    8.510978] pc : module_set_memory+0x9c/0xb8
> [    8.515225] lr : module_set_memory+0x9c/0xb8
> [    8.596259]  module_set_memory+0x9c/0xb8
> [   10.529879] module_set_memory(6, , 0) name dm_mod
> returned -22
> [   10.536087] WARNING: CPU: 3 PID: 127 at kernel/module/strict_rwx.c:22
> mod

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Christophe Leroy
Hi,

On 30/01/2024 at 18:48, Marek Szyprowski wrote:
> 
> Dear All,
> 
> On 30.01.2024 12:03, Christophe Leroy wrote:
>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>>
>>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>>> helps removing related #ifdefery in C files.
>>>>>
>>>>> Signed-off-by: Christophe Leroy 
>>>> Very nice cleanup, thanks!, applied and pushed
>>>>
>>>>  Luis
>>> On next-20240130, which has your modules-next branch, and thus this
>>> series and the other "module: Use set_memory_rox()" series applied,
>>> my kernel crashes in some very weird way. Reverting your branch
>>> makes the crash go away.
>>>
>>> I thought I'd report it right away. Maybe you folks would know what's
>>> happening here? This is on arm64.
>> That's strange, it seems to bug in module_bug_finalize() which is
>> _before_ calls to module_enable_ro() and such.
>>
>> Can you try to revert the 6 patches one by one to see which one
>> introduces the problem ?
>>
>> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
>> more or less only cleanup.
> 
> I've also run into this issue with today's (20240130) linux-next on my
> test farm. The issue is not fully reproducible, so it was a bit hard to
> bisect it automatically. I've spent some time on manual testing and it
> looks that reverting the following 2 commits on top of linux-next fixes
> the problem:
> 
> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
> rodata_enabled")
> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
> 
> This in fact means that commit 677bfb9db8a3 is responsible for this
> regression, as 65929884f868 has to be reverted only because the latter
> depends on it. Let me know what I can do to help debugging this issue.
> 

Thanks for the bisect. I suspect you hit one of the errors and something 
goes wrong in the error path.

To confirm this assumption, could you try with the following change on 
top of everything ?

diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a14df9655dbe..fdf8484154dd 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -15,9 +15,12 @@ static int module_set_memory(const struct module 
*mod, enum mod_mem_type type,
  int (*set_memory)(unsigned long start, int 
num_pages))
  {
	const struct module_memory *mod_mem = &mod->mem[type];
+   int err;

set_vm_flush_reset_perms(mod_mem->base);
-   return set_memory((unsigned long)mod_mem->base, mod_mem->size >> 
PAGE_SHIFT);
+   err = set_memory((unsigned long)mod_mem->base, mod_mem->size >> 
PAGE_SHIFT);
+   WARN(err, "module_set_memory(%d, %px, %x) returned %d\n", type, 
mod_mem->base, mod_mem->size, err);
+   return err;
  }

  /*


Thanks for your help
Christophe


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Christophe Leroy


On 30/01/2024 at 21:27, Luis Chamberlain wrote:
> On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote:
>> Dear All,
>>
>> On 30.01.2024 12:03, Christophe Leroy wrote:
>>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>>>
>>>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>>>> helps removing related #ifdefery in C files.
>>>>>>
>>>>>> Signed-off-by: Christophe Leroy 
>>>>> Very nice cleanup, thanks!, applied and pushed
>>>>>
>>>>>  Luis
>>>> On next-20240130, which has your modules-next branch, and thus this
>>>> series and the other "module: Use set_memory_rox()" series applied,
>>>> my kernel crashes in some very weird way. Reverting your branch
>>>> makes the crash go away.
>>>>
>>>> I thought I'd report it right away. Maybe you folks would know what's
>>>> happening here? This is on arm64.
>>> That's strange, it seems to bug in module_bug_finalize() which is
>>> _before_ calls to module_enable_ro() and such.
>>>
>>> Can you try to revert the 6 patches one by one to see which one
>>> introduces the problem ?
>>>
>>> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
>>> more or less only cleanup.
>>
>> I've also run into this issue with today's (20240130) linux-next on my
>> test farm. The issue is not fully reproducible, so it was a bit hard to
>> bisect it automatically. I've spent some time on manual testing and it
>> looks that reverting the following 2 commits on top of linux-next fixes
>> the problem:
>>
>> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
>> rodata_enabled")
>> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
>>
>> This in fact means that commit 677bfb9db8a3 is responsible for this
>> regression, as 65929884f868 has to be reverted only because the latter
>> depends on it. Let me know what I can do to help debugging this issue.
> 
> Thanks for the bisect, I've reset my tree to commit
> 3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more
> explicit names") for now then, so to remove those commits.
> 

The problem being identified in commit 677bfb9db8a3 ("module: Don't 
ignore errors from set_memory_XX()"), you can keep/re-apply the series 
[PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time.

Christophe


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Christophe Leroy


On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
> 
> Hi,
> 
> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>> helps removing related #ifdefery in C files.
>>>
>>> Signed-off-by: Christophe Leroy 
>>
>> Very nice cleanup, thanks!, applied and pushed
>>
>>Luis
> 
> On next-20240130, which has your modules-next branch, and thus this
> series and the other "module: Use set_memory_rox()" series applied,
> my kernel crashes in some very weird way. Reverting your branch
> makes the crash go away.
> 
> I thought I'd report it right away. Maybe you folks would know what's
> happening here? This is on arm64.

That's strange, it seems to bug in module_bug_finalize() which is 
_before_ calls to module_enable_ro() and such.

Can you try to revert the 6 patches one by one to see which one 
introduces the problem ?

In reality, only patch 677bfb9db8a3 really changes things. The other ones are 
more or less only cleanup.

Thanks
Christophe


> 
> [   10.481015] Unable to handle kernel paging request at virtual address 
> ffde85245d30
> [   10.490369] KASAN: maybe wild-memory-access in range 
> [0x00f42922e980-0x00f42922e987]
> [   10.503744] Mem abort info:
> [   10.509383]   ESR = 0x9647
> [   10.514400]   EC = 0x25: DABT (current EL), IL = 32 bits
> [   10.522366]   SET = 0, FnV = 0
> [   10.526343]   EA = 0, S1PTW = 0
> [   10.530695]   FSC = 0x07: level 3 translation fault
> [   10.537081] Data abort info:
> [   10.540839]   ISV = 0, ISS = 0x0047, ISS2 = 0x
> [   10.546456]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [   10.551726]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [   10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000
> [   10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, 
> pud=10023003, pmd=1001121eb003, pte=
> [   10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP
> [   10.585815] Modules linked in:
> [   10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB
>   6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 
> 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60
> [   10.607021] Hardware name: Google Tentacruel board (DT)
> [   10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   10.621954] pc : module_bug_finalize+0x118/0x148
> [   10.626823] lr : module_bug_finalize+0x118/0x148
> [   10.631463] sp : ffc0820478d0
> [   10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: 
> ffde8d7d31a0
> [   10.631477] x26: ffde85223780 x25:  x24: 
> ffde8c413cc0
> [   10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: 
> ffde85223ba8
> [   10.631495] x20: ffde85223780 x19: ffde85245d28 x18: 
> 
> [   10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: 
> ffde8aab8124
> [   10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: 
> ffbbd1bf9d91
> [   10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : 
> dfc0
> [   10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : 
> 0001
> [   10.631539] x5 : ffde8dfcec80 x4 :  x3 : 
> ffde8bbadf08
> [   10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : 
> 
> [   10.631556] Call trace:
> [   10.631559]  module_bug_finalize+0x118/0x148
> [   10.631565]  load_module+0x25ec/0x2a78
> [   10.631572]  __do_sys_init_module+0x234/0x418
> [   10.631578]  __arm64_sys_init_module+0x4c/0x68
> [   10.631584]  invoke_syscall+0x68/0x198
> [   10.631589]  el0_svc_common.constprop.0+0x11c/0x150
> [   10.631594]  do_el0_svc+0x38/0x50
> [   10.631598]  el0_svc+0x50/0xa0
> [   10.631604]  el0t_64_sync_handler+0x120/0x130
> [   10.631609]  el0t_64_sync+0x1a8/0x1b0
> [   10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675)
> [   10.631624] ---[ end trace  ]---
> [   10.642965] Kernel panic - not syncing: Oops: Fatal exception
> [   10.642975] SMP: stopping secondary CPUs
> [   10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000
> [   10.648343] PHYS_OFFSET: 0x4000
> [   10.648345] CPU features: 0x0,c061,7002814a,2100720b
> [   10.648350] Memory Limit: none
> 


Re: [PATCH] powerpc/papr_scm: Move duplicate definitions to common header files

2024-01-25 Thread Christophe Leroy


On 18/04/2022 at 06:38, Shivaprasad G Bhat wrote:
> papr_scm and ndtest share common PDSM payload structs like
> nd_papr_pdsm_health. Presently these structs are duplicated across
> papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
> arch independent and can run on platforms other than PPC64, a way
> needs to be devised to avoid redundancy and duplication of PDSM
> structs in future.
> 
> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/
> to the generic include/uapi/linux directory. Also, there are
> some #defines common between papr_scm and ndtest which are not exported
> to the user space. So, move them to a header file which can be shared
> across ndtest and papr_scm via newly introduced include/linux/papr_scm.h.
> 
> Signed-off-by: Shivaprasad G Bhat 
> Signed-off-by: Vaibhav Jain 
> Suggested-by: "Aneesh Kumar K.V" 

This patch doesn't apply, if still relevant can you please rebase and 
re-submit ?

Thanks
Christophe

> ---
> Changelog:
> Since v2:
> Link: 
> https://patchwork.kernel.org/project/linux-nvdimm/patch/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com/
> * Made it like v1, and rebased.
> * Fixed repeating words in comments of the header file papr_scm.h
> 
> Since v1:
> Link: 
> https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
> * Removed dependency on this patch for the other patches
> 
>   MAINTAINERS   |2
>   arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 
> -
>   arch/powerpc/platforms/pseries/papr_scm.c |   43 
>   include/linux/papr_scm.h  |   49 +
>   include/uapi/linux/papr_pdsm.h|  165 
> +
>   tools/testing/nvdimm/test/ndtest.c|2
>   tools/testing/nvdimm/test/ndtest.h|   31 -
>   7 files changed, 220 insertions(+), 237 deletions(-)
>   delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
>   create mode 100644 include/linux/papr_scm.h
>   create mode 100644 include/uapi/linux/papr_pdsm.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1699bb7cc867..03685b074dda 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11254,6 +11254,8 @@ F:drivers/rtc/rtc-opal.c
>   F:  drivers/scsi/ibmvscsi/
>   F:  drivers/tty/hvc/hvc_opal.c
>   F:  drivers/watchdog/wdrtas.c
> +F:   include/linux/papr_scm.h
> +F:   include/uapi/linux/papr_pdsm.h
>   F:  tools/testing/selftests/powerpc
>   N:  /pmac
>   N:  powermac
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> deleted file mode 100644
> index 17439925045c..
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ /dev/null
> @@ -1,165 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> -/*
> - * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
> - *
> - * (C) Copyright IBM 2020
> - *
> - * Author: Vaibhav Jain 
> - */
> -
> -#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
> -#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
> -
> -#include 
> -#include 
> -
> -/*
> - * PDSM Envelope:
> - *
> - * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
> - * envelope which consists of 2 headers sections and payload sections as
> - * illustrated below:
> - *  +-+---+---+
> - *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
> - *  +-+---+---+
> - *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
> - *  +-+---+---+
> - *  | nd_family   |   |   |
> - *  | nd_size_out | cmd_status|   |
> - *  | nd_size_in  | reserved  | nd_pdsm_payload   |
> - *  | nd_command  | payload   --> |   |
> - *  | nd_fw_size  |   |   |
> - *  | nd_payload ---> |   |   |
> - *  +---+-+---+
> - *
> - * ND Header:
> - * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
> - * which is interpreted by libnvdimm before passed on to papr_scm. Important
> - * member fields used are:
> - * 'nd_family'   : (In) NVDIMM_FAMILY_PAPR_SCM
> - * 'nd_size_in'  : (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
> - * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
> - * 'nd_command' : (In) One of PAPR_PDSM_XXX
> - * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
> - *
> - * PDSM Header:
> - * This is papr-scm specific header that precedes the payload. This is 
> defined
> - * as nd_cmd_pdsm_pkg. The following fields are available in this header:
> - *
> - * 'cmd_status' 

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-22 Thread Christophe Leroy


On 22/12/2023 at 06:35, Kees Cook wrote:
> 
> On December 21, 2023 4:16:56 AM PST, Michael Ellerman  
> wrote:
>> Cc +Kees
>>
>> Christophe Leroy  writes:
>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>> helps removing related #ifdefery in C files.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   include/linux/init.h |  4 
>>>   init/main.c  | 21 +++--
>>>   2 files changed, 7 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/init.h b/include/linux/init.h
>>> index 01b52c9c7526..d2b47be38a07 100644
>>> --- a/include/linux/init.h
>>> +++ b/include/linux/init.h
>>> @@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
>>>
>>>   extern struct file_system_type rootfs_fs_type;
>>>
>>> -#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
>>>   extern bool rodata_enabled;
>>> -#endif
>>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>>   void mark_rodata_ro(void);
>>> -#endif
>>>
>>>   extern void (*late_time_init)(void);
>>>
>>> diff --git a/init/main.c b/init/main.c
>>> index e24b0780fdff..807df08c501f 100644
>>> --- a/init/main.c
>>> +++ b/init/main.c
>>> @@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
>>>   early_param("rodata", set_debug_rodata);
>>>   #endif
>>>
>>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>>   static void mark_readonly(void)
>>>   {
>>> -if (rodata_enabled) {
>>> +if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {
> 
> I think this will break without rodata_enabled actually existing on other 
> architectures. (Only declaration was made visible, not the definition, which 
> is above here and still behind ifdefs?)

The compiler constant-folds IS_ENABLED(CONFIG_STRICT_KERNEL_RWX).
When it is false, the second part is dropped.

Example:

bool test(void)
{
if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled)
return true;
else
return false;
}

With CONFIG_STRICT_KERNEL_RWX set, it directly returns the content of 
rodata_enabled:

0160 :
  160:  3d 20 00 00 lis r9,0
162: R_PPC_ADDR16_HArodata_enabled
  164:  88 69 00 00 lbz r3,0(r9)
166: R_PPC_ADDR16_LOrodata_enabled
  168:  4e 80 00 20 blr

With CONFIG_STRICT_KERNEL_RWX unset, it returns 0 and doesn't reference 
rodata_enabled at all:

00bc :
   bc:  38 60 00 00 li  r3,0
   c0:  4e 80 00 20 blr

Many places in the kernel use this approach to minimise the amount of #ifdefs.

Christophe


[PATCH 3/3] powerpc: Simplify strict_kernel_rwx_enabled()

2023-12-21 Thread Christophe Leroy
Now that rodata_enabled is always declared, remove #ifdef
and define a single version of strict_kernel_rwx_enabled().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu.h | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index d8b7e246a32f..24241995f740 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -330,17 +330,10 @@ static __always_inline bool early_radix_enabled(void)
return early_mmu_has_feature(MMU_FTR_TYPE_RADIX);
 }
 
-#ifdef CONFIG_STRICT_KERNEL_RWX
 static inline bool strict_kernel_rwx_enabled(void)
 {
-   return rodata_enabled;
+   return IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled;
 }
-#else
-static inline bool strict_kernel_rwx_enabled(void)
-{
-   return false;
-}
-#endif
 
 static inline bool strict_module_rwx_enabled(void)
 {
-- 
2.41.0




[PATCH 2/3] modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled

2023-12-21 Thread Christophe Leroy
Now that rodata_enabled is declared at all time, the #ifdef
CONFIG_STRICT_MODULE_RWX can be removed.

Signed-off-by: Christophe Leroy 
---
 kernel/module/strict_rwx.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a2b656b4e3d2..eadff63b6e80 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -34,12 +34,8 @@ void module_enable_x(const struct module *mod)
 
 void module_enable_ro(const struct module *mod, bool after_init)
 {
-   if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
-   return;
-#ifdef CONFIG_STRICT_MODULE_RWX
-   if (!rodata_enabled)
+   if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX) || !rodata_enabled)
return;
-#endif
 
module_set_memory(mod, MOD_TEXT, set_memory_ro);
module_set_memory(mod, MOD_INIT_TEXT, set_memory_ro);
-- 
2.41.0




[PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-21 Thread Christophe Leroy
Declaring rodata_enabled and mark_rodata_ro() at all time
helps removing related #ifdefery in C files.

Signed-off-by: Christophe Leroy 
---
 include/linux/init.h |  4 
 init/main.c  | 21 +++--
 2 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/include/linux/init.h b/include/linux/init.h
index 01b52c9c7526..d2b47be38a07 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
 
 extern struct file_system_type rootfs_fs_type;
 
-#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
 extern bool rodata_enabled;
-#endif
-#ifdef CONFIG_STRICT_KERNEL_RWX
 void mark_rodata_ro(void);
-#endif
 
 extern void (*late_time_init)(void);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..807df08c501f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
 early_param("rodata", set_debug_rodata);
 #endif
 
-#ifdef CONFIG_STRICT_KERNEL_RWX
 static void mark_readonly(void)
 {
-   if (rodata_enabled) {
+   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {
/*
 * load_module() results in W+X mappings, which are cleaned
 * up with call_rcu().  Let's make sure that queued work is
@@ -1409,20 +1408,14 @@ static void mark_readonly(void)
rcu_barrier();
mark_rodata_ro();
rodata_test();
-   } else
+   } else if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) {
pr_info("Kernel memory protection disabled.\n");
+   } else if (IS_ENABLED(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)) {
+   pr_warn("Kernel memory protection not selected by kernel 
config.\n");
+   } else {
+   pr_warn("This architecture does not have kernel memory 
protection.\n");
+   }
 }
-#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
-static inline void mark_readonly(void)
-{
-   pr_warn("Kernel memory protection not selected by kernel config.\n");
-}
-#else
-static inline void mark_readonly(void)
-{
-   pr_warn("This architecture does not have kernel memory protection.\n");
-}
-#endif
 
 void __weak free_initmem(void)
 {
-- 
2.41.0




[PATCH 3/3] module: Don't ignore errors from set_memory_XX()

2023-12-20 Thread Christophe Leroy
set_memory_ro(), set_memory_nx(), set_memory_x() and other helpers
can fail and return an error. In that case the memory might not be
protected as expected and the module loading has to be aborted to
avoid security issues.

Check the return value of all calls to set_memory_XX() and handle
errors if any.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h   |  6 ++---
 kernel/module/main.c   | 18 ++
 kernel/module/strict_rwx.c | 48 ++
 3 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index 4f1b98f011da..2ebece8a789f 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -322,9 +322,9 @@ static inline struct module *mod_find(unsigned long addr, 
struct mod_tree_root *
 }
 #endif /* CONFIG_MODULES_TREE_LOOKUP */
 
-void module_enable_rodata_ro(const struct module *mod, bool after_init);
-void module_enable_data_nx(const struct module *mod);
-void module_enable_text_rox(const struct module *mod);
+int module_enable_rodata_ro(const struct module *mod, bool after_init);
+int module_enable_data_nx(const struct module *mod);
+int module_enable_text_rox(const struct module *mod);
 int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
char *secstrings, struct module *mod);
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 64662e55e275..cfe197455d64 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2568,7 +2568,9 @@ static noinline int do_init_module(struct module *mod)
/* Switch to core kallsyms now init is done: kallsyms may be walking! */
	rcu_assign_pointer(mod->kallsyms, &mod->core_kallsyms);
 #endif
-   module_enable_rodata_ro(mod, true);
+   ret = module_enable_rodata_ro(mod, true);
+   if (ret)
+   goto fail_mutex_unlock;
mod_tree_remove_init(mod);
module_arch_freeing_init(mod);
for_class_mod_mem_type(type, init) {
@@ -2606,6 +2608,8 @@ static noinline int do_init_module(struct module *mod)
 
return 0;
 
+fail_mutex_unlock:
+   mutex_unlock(&module_mutex);
 fail_free_freeinit:
kfree(freeinit);
 fail:
@@ -2733,9 +2737,15 @@ static int complete_formation(struct module *mod, struct 
load_info *info)
module_bug_finalize(info->hdr, info->sechdrs, mod);
module_cfi_finalize(info->hdr, info->sechdrs, mod);
 
-   module_enable_rodata_ro(mod, false);
-   module_enable_data_nx(mod);
-   module_enable_text_rox(mod);
+   err = module_enable_rodata_ro(mod, false);
+   if (err)
+   goto out;
+   err = module_enable_data_nx(mod);
+   if (err)
+   goto out;
+   err = module_enable_text_rox(mod);
+   if (err)
+   goto out;
 
/*
 * Mark state as coming so strong_try_module_get() ignores us,
diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index 9b2d58a8d59d..a14df9655dbe 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -11,13 +11,13 @@
 #include 
 #include "internal.h"
 
-static void module_set_memory(const struct module *mod, enum mod_mem_type type,
+static int module_set_memory(const struct module *mod, enum mod_mem_type type,
  int (*set_memory)(unsigned long start, int 
num_pages))
 {
	const struct module_memory *mod_mem = &mod->mem[type];
 
set_vm_flush_reset_perms(mod_mem->base);
-   set_memory((unsigned long)mod_mem->base, mod_mem->size >> PAGE_SHIFT);
+   return set_memory((unsigned long)mod_mem->base, mod_mem->size >> 
PAGE_SHIFT);
 }
 
 /*
@@ -26,39 +26,57 @@ static void module_set_memory(const struct module *mod, 
enum mod_mem_type type,
  * CONFIG_STRICT_MODULE_RWX because they are needed regardless of whether we
  * are strict.
  */
-void module_enable_text_rox(const struct module *mod)
+int module_enable_text_rox(const struct module *mod)
 {
for_class_mod_mem_type(type, text) {
+   int ret;
+
if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
-   module_set_memory(mod, type, set_memory_rox);
+   ret = module_set_memory(mod, type, set_memory_rox);
else
-   module_set_memory(mod, type, set_memory_x);
+   ret = module_set_memory(mod, type, set_memory_x);
+   if (ret)
+   return ret;
}
+   return 0;
 }
 
-void module_enable_rodata_ro(const struct module *mod, bool after_init)
+int module_enable_rodata_ro(const struct module *mod, bool after_init)
 {
+   int ret;
+
if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
-   return;
+   return 0;
 #ifdef CONFIG_STRICT_MODULE_RWX
if (!rodata_enabled)
-   return;
+   return 0;
 #endif
 
-   module_set_memory(mod,

[PATCH 2/3] module: Change module_enable_{nx/x/ro}() to more explicit names

2023-12-20 Thread Christophe Leroy
It's a bit puzzling to see a call to module_enable_nx() followed by a
call to module_enable_x(). This is because one applies to text while
the other applies to data.

Change the names to make that clearer.

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h   | 6 +++---
 kernel/module/main.c   | 8 
 kernel/module/strict_rwx.c | 6 +++---
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index a647ab17193d..4f1b98f011da 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -322,9 +322,9 @@ static inline struct module *mod_find(unsigned long addr, 
struct mod_tree_root *
 }
 #endif /* CONFIG_MODULES_TREE_LOOKUP */
 
-void module_enable_ro(const struct module *mod, bool after_init);
-void module_enable_nx(const struct module *mod);
-void module_enable_rox(const struct module *mod);
+void module_enable_rodata_ro(const struct module *mod, bool after_init);
+void module_enable_data_nx(const struct module *mod);
+void module_enable_text_rox(const struct module *mod);
 int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
char *secstrings, struct module *mod);
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 1c8f328ca015..64662e55e275 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2568,7 +2568,7 @@ static noinline int do_init_module(struct module *mod)
/* Switch to core kallsyms now init is done: kallsyms may be walking! */
	rcu_assign_pointer(mod->kallsyms, &mod->core_kallsyms);
 #endif
-   module_enable_ro(mod, true);
+   module_enable_rodata_ro(mod, true);
mod_tree_remove_init(mod);
module_arch_freeing_init(mod);
for_class_mod_mem_type(type, init) {
@@ -2733,9 +2733,9 @@ static int complete_formation(struct module *mod, struct 
load_info *info)
module_bug_finalize(info->hdr, info->sechdrs, mod);
module_cfi_finalize(info->hdr, info->sechdrs, mod);
 
-   module_enable_ro(mod, false);
-   module_enable_nx(mod);
-   module_enable_rox(mod);
+   module_enable_rodata_ro(mod, false);
+   module_enable_data_nx(mod);
+   module_enable_text_rox(mod);
 
/*
 * Mark state as coming so strong_try_module_get() ignores us,
diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index 9345b09f28a5..9b2d58a8d59d 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -26,7 +26,7 @@ static void module_set_memory(const struct module *mod, enum 
mod_mem_type type,
  * CONFIG_STRICT_MODULE_RWX because they are needed regardless of whether we
  * are strict.
  */
-void module_enable_rox(const struct module *mod)
+void module_enable_text_rox(const struct module *mod)
 {
for_class_mod_mem_type(type, text) {
if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
@@ -36,7 +36,7 @@ void module_enable_rox(const struct module *mod)
}
 }
 
-void module_enable_ro(const struct module *mod, bool after_init)
+void module_enable_rodata_ro(const struct module *mod, bool after_init)
 {
if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
return;
@@ -52,7 +52,7 @@ void module_enable_ro(const struct module *mod, bool 
after_init)
module_set_memory(mod, MOD_RO_AFTER_INIT, set_memory_ro);
 }
 
-void module_enable_nx(const struct module *mod)
+void module_enable_data_nx(const struct module *mod)
 {
if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
return;
-- 
2.41.0




[PATCH 1/3] module: Use set_memory_rox()

2023-12-20 Thread Christophe Leroy
A couple of architectures seem concerned about calling set_memory_ro()
and set_memory_x() too frequently and have implemented a version of
set_memory_rox(), see commit 60463628c9e0 ("x86/mm: Implement native
set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement
set_memory_rox()")

Use set_memory_rox() in modules when STRICT_MODULE_RWX is set.
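
As an illustration of the intent (a sketch, not code taken from this patch;
'addr' and 'numpages' are placeholders), the combined helper replaces two
separate attribute changes over the same range:

        /* two walks over the same range ... */
        set_memory_ro((unsigned long)addr, numpages);
        set_memory_x((unsigned long)addr, numpages);

        /* ... collapsed into a single one where the architecture provides it */
        set_memory_rox((unsigned long)addr, numpages);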

Signed-off-by: Christophe Leroy 
---
 kernel/module/internal.h   |  2 +-
 kernel/module/main.c   |  2 +-
 kernel/module/strict_rwx.c | 12 +++-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index c8b7b4dcf782..a647ab17193d 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -324,7 +324,7 @@ static inline struct module *mod_find(unsigned long addr, 
struct mod_tree_root *
 
 void module_enable_ro(const struct module *mod, bool after_init);
 void module_enable_nx(const struct module *mod);
-void module_enable_x(const struct module *mod);
+void module_enable_rox(const struct module *mod);
 int module_enforce_rwx_sections(Elf_Ehdr *hdr, Elf_Shdr *sechdrs,
char *secstrings, struct module *mod);
 
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..1c8f328ca015 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2735,7 +2735,7 @@ static int complete_formation(struct module *mod, struct 
load_info *info)
 
module_enable_ro(mod, false);
module_enable_nx(mod);
-   module_enable_x(mod);
+   module_enable_rox(mod);
 
/*
 * Mark state as coming so strong_try_module_get() ignores us,
diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a2b656b4e3d2..9345b09f28a5 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -26,10 +26,14 @@ static void module_set_memory(const struct module *mod, 
enum mod_mem_type type,
  * CONFIG_STRICT_MODULE_RWX because they are needed regardless of whether we
  * are strict.
  */
-void module_enable_x(const struct module *mod)
+void module_enable_rox(const struct module *mod)
 {
-   for_class_mod_mem_type(type, text)
-   module_set_memory(mod, type, set_memory_x);
+   for_class_mod_mem_type(type, text) {
+   if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
+   module_set_memory(mod, type, set_memory_rox);
+   else
+   module_set_memory(mod, type, set_memory_x);
+   }
 }
 
 void module_enable_ro(const struct module *mod, bool after_init)
@@ -41,8 +45,6 @@ void module_enable_ro(const struct module *mod, bool 
after_init)
return;
 #endif
 
-   module_set_memory(mod, MOD_TEXT, set_memory_ro);
-   module_set_memory(mod, MOD_INIT_TEXT, set_memory_ro);
module_set_memory(mod, MOD_RODATA, set_memory_ro);
module_set_memory(mod, MOD_INIT_RODATA, set_memory_ro);
 
-- 
2.41.0




Re: [PATCH 12/27] tty: hvc: convert to u8 and size_t

2023-12-06 Thread Christophe Leroy


Le 06/12/2023 à 08:36, Jiri Slaby (SUSE) a écrit :
> Switch character types to u8 and sizes to size_t. To conform to
> characters/sizes in the rest of the tty layer.
> 
> Signed-off-by: Jiri Slaby (SUSE) 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: Amit Shah 
> Cc: Arnd Bergmann 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: Albert Ou 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: virtualizat...@lists.linux.dev
> Cc: linux-ri...@lists.infradead.org
> ---
>   arch/powerpc/include/asm/hvconsole.h   |  4 ++--
>   arch/powerpc/include/asm/hvsi.h| 18 
>   arch/powerpc/include/asm/opal.h|  8 +---
>   arch/powerpc/platforms/powernv/opal.c  | 14 +++--
>   arch/powerpc/platforms/pseries/hvconsole.c |  4 ++--
>   drivers/char/virtio_console.c  | 10 -
>   drivers/tty/hvc/hvc_console.h  |  4 ++--
>   drivers/tty/hvc/hvc_dcc.c  | 24 +++---
>   drivers/tty/hvc/hvc_iucv.c | 18 
>   drivers/tty/hvc/hvc_opal.c |  5 +++--
>   drivers/tty/hvc/hvc_riscv_sbi.c|  9 
>   drivers/tty/hvc/hvc_rtas.c | 11 +-
>   drivers/tty/hvc/hvc_udbg.c |  9 
>   drivers/tty/hvc/hvc_vio.c  | 18 
>   drivers/tty/hvc/hvc_xen.c  | 23 +++--
>   drivers/tty/hvc/hvsi_lib.c | 20 ++
>   16 files changed, 107 insertions(+), 92 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hvconsole.h 
> b/arch/powerpc/include/asm/hvconsole.h
> index ccb2034506f0..d841a97010a0 100644
> --- a/arch/powerpc/include/asm/hvconsole.h
> +++ b/arch/powerpc/include/asm/hvconsole.h
> @@ -21,8 +21,8 @@
>* Vio firmware always attempts to fetch MAX_VIO_GET_CHARS chars.  The 
> 'count'
>* parm is included to conform to put_chars() function pointer template
>*/
> -extern int hvc_get_chars(uint32_t vtermno, char *buf, int count);
> -extern int hvc_put_chars(uint32_t vtermno, const char *buf, int count);
> +extern ssize_t hvc_get_chars(uint32_t vtermno, u8 *buf, size_t count);
> +extern ssize_t hvc_put_chars(uint32_t vtermno, const u8 *buf, size_t count);

Would be a good opportunity to drop this pointless deprecated 'extern' 
keyword on all function prototypes you are changing.
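
For example, the two prototypes quoted above would then simply read
(illustration only, not part of the posted patch):

        ssize_t hvc_get_chars(uint32_t vtermno, u8 *buf, size_t count);
        ssize_t hvc_put_chars(uint32_t vtermno, const u8 *buf, size_t count);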

Christophe


Re: [PATCH v3 09/13] powerpc: extend execmem_params for kprobes allocations

2023-09-22 Thread Christophe Leroy
Hi Mike,

Le 18/09/2023 à 09:29, Mike Rapoport a écrit :
> From: "Mike Rapoport (IBM)" 
> 
> powerpc overrides kprobes::alloc_insn_page() to remove writable
> permissions when STRICT_MODULE_RWX is on.
> 
> Add definition of EXECMEM_KPROBES to execmem_params to allow using the
> generic kprobes::alloc_insn_page() with the desired permissions.
> 
> As powerpc uses breakpoint instructions to inject kprobes, it does not
> need to constrain kprobe allocations to the modules area and can use the
> entire vmalloc address space.

I don't understand what you mean here. Does it mean the kprobe allocation 
doesn't need to be executable? I don't think so, based on the pgprot you 
set.

On powerpc book3s/32, vmalloc space is not executable. Only the modules 
space is executable. X/NX cannot be set on a per-page basis; it can only 
be set per 256 Mbyte segment.

See commit c49643319715 ("powerpc/32s: Only leave NX unset on segments 
used for modules") and 6ca055322da8 ("powerpc/32s: Use dedicated segment 
for modules with STRICT_KERNEL_RWX") and 7bee31ad8e2f ("powerpc/32s: Fix 
is_module_segment() when MODULES_VADDR is defined").

So if your intention is still to have executable kprobes, then you 
can't use the vmalloc address space.
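
As a rough sketch of what that would mean for this patch (hypothetical, not
something that was posted; the MODULES_VADDR/MODULES_END bounds and the
config guard are assumptions), the kprobes range would have to be tied to
the executable modules segment instead of the vmalloc range:

        /* hypothetical adjustment, not part of the posted patch */
        if (IS_ENABLED(CONFIG_PPC_BOOK3S_32)) {
                execmem_params.ranges[EXECMEM_KPROBES].start = MODULES_VADDR;
                execmem_params.ranges[EXECMEM_KPROBES].end   = MODULES_END;
        }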

Christophe

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>   arch/powerpc/kernel/kprobes.c | 14 --
>   arch/powerpc/kernel/module.c  | 11 +++
>   2 files changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index 62228c7072a2..14c5ddec3056 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -126,20 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long 
> addr, unsigned long offse
>   return (kprobe_opcode_t *)(addr + offset);
>   }
>   
> -void *alloc_insn_page(void)
> -{
> - void *page;
> -
> - page = execmem_text_alloc(EXECMEM_KPROBES, PAGE_SIZE);
> - if (!page)
> - return NULL;
> -
> - if (strict_module_rwx_enabled())
> - set_memory_rox((unsigned long)page, 1);
> -
> - return page;
> -}
> -
>   int arch_prepare_kprobe(struct kprobe *p)
>   {
>   int ret = 0;
> diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
> index 824d9541a310..bf2c62aef628 100644
> --- a/arch/powerpc/kernel/module.c
> +++ b/arch/powerpc/kernel/module.c
> @@ -95,6 +95,9 @@ static struct execmem_params execmem_params __ro_after_init 
> = {
>   [EXECMEM_DEFAULT] = {
>   .alignment = 1,
>   },
> + [EXECMEM_KPROBES] = {
> + .alignment = 1,
> + },
>   [EXECMEM_MODULE_DATA] = {
>   .alignment = 1,
>   },
> @@ -135,5 +138,13 @@ struct execmem_params __init *execmem_arch_params(void)
>   
>   range->pgprot = prot;
>   
> + execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
> + execmem_params.ranges[EXECMEM_KPROBES].end = VMALLOC_END;
> +
> + if (strict_module_rwx_enabled())
> + execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
> + else
> + execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_EXEC;
> +
>   return &execmem_params;
>   }


Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-22 Thread Christophe Leroy


Le 22/09/2023 à 10:55, Song Liu a écrit :
> On Fri, Sep 22, 2023 at 12:17 AM Christophe Leroy
>  wrote:
>>
>>
>>
>> Le 22/09/2023 à 00:52, Song Liu a écrit :
>>> On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>>>>
>>> [...]
>>>> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
>>>> index 519bdfdca595..09d45ac786e9 100644
>>>> --- a/include/linux/execmem.h
>>>> +++ b/include/linux/execmem.h
>>>> @@ -29,6 +29,7 @@
>>>> * @EXECMEM_KPROBES: parameters for kprobes
>>>> * @EXECMEM_FTRACE: parameters for ftrace
>>>> * @EXECMEM_BPF: parameters for BPF
>>>> + * @EXECMEM_MODULE_DATA: parameters for module data sections
>>>> * @EXECMEM_TYPE_MAX:
>>>> */
>>>>enum execmem_type {
>>>> @@ -37,6 +38,7 @@ enum execmem_type {
>>>>   EXECMEM_KPROBES,
>>>>   EXECMEM_FTRACE,
>>>
>>> In longer term, I think we can improve the JITed code and merge
>>> kprobe/ftrace/bpf. to use the same ranges. Also, do we need special
>>> setting for FTRACE? If not, let's just remove it.
>>
>> How can we do that ? Some platforms like powerpc require executable
>> memory for BPF and non-exec mem for KPROBE so it can't be in the same
>> area/ranges.
> 
> Hmm... non-exec mem for kprobes?
> 
> if (strict_module_rwx_enabled())
> execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_ROX;
> else
> execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_EXEC;
> 
> Do you mean the latter case?
> 

In fact I may have misunderstood patch 9. I'll provide a response there.

Christophe


Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-22 Thread Christophe Leroy


Le 22/09/2023 à 00:52, Song Liu a écrit :
> On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>>
> [...]
>> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
>> index 519bdfdca595..09d45ac786e9 100644
>> --- a/include/linux/execmem.h
>> +++ b/include/linux/execmem.h
>> @@ -29,6 +29,7 @@
>>* @EXECMEM_KPROBES: parameters for kprobes
>>* @EXECMEM_FTRACE: parameters for ftrace
>>* @EXECMEM_BPF: parameters for BPF
>> + * @EXECMEM_MODULE_DATA: parameters for module data sections
>>* @EXECMEM_TYPE_MAX:
>>*/
>>   enum execmem_type {
>> @@ -37,6 +38,7 @@ enum execmem_type {
>>  EXECMEM_KPROBES,
>>  EXECMEM_FTRACE,
> 
> In longer term, I think we can improve the JITed code and merge
> kprobe/ftrace/bpf. to use the same ranges. Also, do we need special
> setting for FTRACE? If not, let's just remove it.

How can we do that? Some platforms like powerpc require executable 
memory for BPF and non-exec memory for KPROBES, so they can't be in the 
same area/ranges.

> 
>>  EXECMEM_BPF,
>> +   EXECMEM_MODULE_DATA,
>>  EXECMEM_TYPE_MAX,
>>   };
> 
> Overall, it is great that kprobe/ftrace/bpf no longer depend on modules.
> 
> OTOH, I think we should merge execmem_type and existing mod_mem_type.
> Otherwise, we still need to handle page permissions in multiple places.
> What is our plan for that?
> 

Christophe


Re: [PATCH v4 19/20] mips: Convert to GENERIC_CMDLINE

2021-04-20 Thread Christophe Leroy




Le 09/04/2021 à 03:23, Daniel Walker a écrit :

On Thu, Apr 08, 2021 at 02:04:08PM -0500, Rob Herring wrote:

On Tue, Apr 06, 2021 at 10:38:36AM -0700, Daniel Walker wrote:

On Fri, Apr 02, 2021 at 03:18:21PM +, Christophe Leroy wrote:

-config CMDLINE_BOOL
-   bool "Built-in kernel command line"
-   help
- For most systems, it is firmware or second stage bootloader that
- by default specifies the kernel command line options.  However,
- it might be necessary or advantageous to either override the
- default kernel command line or add a few extra options to it.
- For such cases, this option allows you to hardcode your own
- command line options directly into the kernel.  For that, you
- should choose 'Y' here, and fill in the extra boot arguments
- in CONFIG_CMDLINE.
-
- The built-in options will be concatenated to the default command
- line if CMDLINE_OVERRIDE is set to 'N'. Otherwise, the default
- command line will be ignored and replaced by the built-in string.
-
- Most MIPS systems will normally expect 'N' here and rely upon
- the command line from the firmware or the second-stage bootloader.
-



See how you complained that I have CMDLINE_BOOL in my changes, and you think it
shouldn't exist.

Yet here mips has it, and you just deleted it with no feature parity in your
changes for this.


AFAICT, CMDLINE_BOOL equates to a non-empty or empty CONFIG_CMDLINE. You
seem to need it just because you have CMDLINE_PREPEND and
CMDLINE_APPEND. If that's not it, what feature is missing? CMDLINE_BOOL
is not a feature, but an implementation detail.


Not true.

It makes it easier to turn it all off inside the Kconfig, so it's for usability,
and multiple architectures have it even with just CMDLINE, as I was commenting
here.



Among the 13 architectures having CONFIG_CMDLINE, today only 6 have a 
CONFIG_CMDLINE_BOOL in addition:

arch/arm/Kconfig:config CMDLINE
arch/arm64/Kconfig:config CMDLINE
arch/hexagon/Kconfig:config CMDLINE
arch/microblaze/Kconfig:config CMDLINE
arch/mips/Kconfig.debug:config CMDLINE
arch/nios2/Kconfig:config CMDLINE
arch/openrisc/Kconfig:config CMDLINE
arch/powerpc/Kconfig:config CMDLINE
arch/riscv/Kconfig:config CMDLINE
arch/sh/Kconfig:config CMDLINE
arch/sparc/Kconfig:config CMDLINE
arch/x86/Kconfig:config CMDLINE
arch/xtensa/Kconfig:config CMDLINE

arch/microblaze/Kconfig:config CMDLINE_BOOL
arch/mips/Kconfig.debug:config CMDLINE_BOOL
arch/nios2/Kconfig:config CMDLINE_BOOL
arch/sparc/Kconfig:config CMDLINE_BOOL
arch/x86/Kconfig:config CMDLINE_BOOL
arch/xtensa/Kconfig:config CMDLINE_BOOL


In the beginning I hesitated about CMDLINE_BOOL; in the end I decided to go the same way as what 
is done today in the kernel for initramfs with CONFIG_INITRAMFS_SOURCE.


The problem I see with adding CONFIG_CMDLINE_BOOL for every architecture that doesn't have it today 
is that when doing a "make oldconfig" on their custom configs, thousands of users will lose their 
CMDLINE without notice.


Going the other way round, removing CONFIG_CMDLINE_BOOL on the 6 architectures that have it 
today will have no impact on existing configs.


Also, in order to avoid tons of #ifdefs in the code, as mandated by Kernel Coding Style §21, we have 
to have CONFIG_CMDLINE defined at all times, so in the end CONFIG_CMDLINE_BOOL is really redundant 
with an empty CONFIG_CMDLINE, as the sketch below illustrates.
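
A small illustration of the point (a sketch only, not code from this series;
CMDLINE_OVERRIDE and the generic helpers are assumed here): with CONFIG_CMDLINE
always defined, possibly as an empty string, the generic code stays free of
#ifdefs:

        /* sketch: CONFIG_CMDLINE is always defined ("" when unused) */
        if (IS_ENABLED(CONFIG_CMDLINE_OVERRIDE))
                strscpy(boot_command_line, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
        else if (CONFIG_CMDLINE[0])
                strlcat(boot_command_line, " " CONFIG_CMDLINE, COMMAND_LINE_SIZE);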


Unlike yours, the approach I took for my series is to minimise the impact on the existing 
implementation and existing configurations as much as possible.


I know you have a different approach where you break every existing config 
anyway.

https://www.kernel.org/doc/html/latest/process/coding-style.html#conditional-compilation

Christophe


[PATCH v2 1/2] powerpc/inst: ppc_inst_as_u64() becomes ppc_inst_as_ulong()

2021-04-20 Thread Christophe Leroy
In order to simplify use on PPC32, change ppc_inst_as_u64()
into ppc_inst_as_ulong(), which returns the 32-bit instruction
on PPC32.

Will be used when porting OPTPROBES to PPC32.
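
A minimal usage sketch, mirroring the caller added by the follow-up OPTPROBES
patch in this series (temp, p, buff and TMPL_INSN_IDX are taken from that
patch):

        /* same call site on PPC32 (32-bit insn) and PPC64 (insn | suffix) */
        temp = ppc_inst_read((struct ppc_inst *)p->ainsn.insn);
        patch_imm_load_insns(ppc_inst_as_ulong(temp), 4, buff + TMPL_INSN_IDX);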

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h  | 13 +++--
 arch/powerpc/kernel/optprobes.c  |  2 +-
 arch/powerpc/lib/code-patching.c |  2 +-
 arch/powerpc/xmon/xmon.c |  2 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 19e18af2fac9..9646c63f7420 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -147,13 +147,14 @@ static inline struct ppc_inst *ppc_inst_next(void 
*location, struct ppc_inst *va
return location + ppc_inst_len(tmp);
 }
 
-static inline u64 ppc_inst_as_u64(struct ppc_inst x)
+static inline unsigned long ppc_inst_as_ulong(struct ppc_inst x)
 {
-#ifdef CONFIG_CPU_LITTLE_ENDIAN
-   return (u64)ppc_inst_suffix(x) << 32 | ppc_inst_val(x);
-#else
-   return (u64)ppc_inst_val(x) << 32 | ppc_inst_suffix(x);
-#endif
+   if (IS_ENABLED(CONFIG_PPC32))
+   return ppc_inst_val(x);
+   else if (IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN))
+   return (u64)ppc_inst_suffix(x) << 32 | ppc_inst_val(x);
+   else
+   return (u64)ppc_inst_val(x) << 32 | ppc_inst_suffix(x);
 }
 
 #define PPC_INST_STR_LEN sizeof(" ")
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 7f7cdbeacd1a..58fdb9f66e0f 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -264,7 +264,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
 * 3. load instruction to be emulated into relevant register, and
 */
temp = ppc_inst_read((struct ppc_inst *)p->ainsn.insn);
-   patch_imm64_load_insns(ppc_inst_as_u64(temp), 4, buff + TMPL_INSN_IDX);
+   patch_imm64_load_insns(ppc_inst_as_ulong(temp), 4, buff + 
TMPL_INSN_IDX);
 
/*
 * 4. branch back from trampoline
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 65aec4d6d9ba..870b30d9be2f 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -26,7 +26,7 @@ static int __patch_instruction(struct ppc_inst *exec_addr, 
struct ppc_inst instr
 
__put_kernel_nofault(patch_addr, &val, u32, failed);
} else {
-   u64 val = ppc_inst_as_u64(instr);
+   u64 val = ppc_inst_as_ulong(instr);
 
__put_kernel_nofault(patch_addr, &val, u64, failed);
}
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index a619b9ed8458..ff2b92bfeedc 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2953,7 +2953,7 @@ generic_inst_dump(unsigned long adr, long count, int 
praddr,
if (!ppc_inst_prefixed(inst))
dump_func(ppc_inst_val(inst), adr);
else
-   dump_func(ppc_inst_as_u64(inst), adr);
+   dump_func(ppc_inst_as_ulong(inst), adr);
printf("\n");
}
return adr - first_adr;
-- 
2.25.0



[PATCH v2 2/2] powerpc: Enable OPTPROBES on PPC32

2021-04-20 Thread Christophe Leroy
For that, create a 32-bit version of patch_imm64_load_insns()
and create a patch_imm_load_insns() which calls
patch_imm32_load_insns() on PPC32 and patch_imm64_load_insns()
on PPC64.

Adapt optprobes_head.S for PPC32. Use the PPC_LL/PPC_STL macros instead
of raw ld/std, leave out the paca-related parts and use stmw/lmw to
save/restore registers.
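
For reference (paraphrased from memory from asm/asm-compat.h, so treat the
exact definitions as approximate), the PPC_LL/PPC_STL/PPC_STLU macros pick
the right load/store form for the build:

        /* approximate definitions, for illustration only */
        #ifdef __powerpc64__
        #define PPC_LL          stringify_in_c(ld)
        #define PPC_STL         stringify_in_c(std)
        #define PPC_STLU        stringify_in_c(stdu)
        #else
        #define PPC_LL          stringify_in_c(lwz)
        #define PPC_STL         stringify_in_c(stw)
        #define PPC_STLU        stringify_in_c(stwu)
        #endif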

Signed-off-by: Christophe Leroy 
---
v2: Comments from Naveen.
---
 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/kernel/optprobes.c  | 24 --
 arch/powerpc/kernel/optprobes_head.S | 65 +++-
 3 files changed, 56 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 475d77a6ebbe..d2e31a578e26 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -229,7 +229,7 @@ config PPC
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI if PERF_EVENTS || (PPC64 && 
PPC_BOOK3S)
select HAVE_HARDLOCKUP_DETECTOR_ARCH if PPC64 && PPC_BOOK3S && SMP
-   select HAVE_OPTPROBES   if PPC64
+   select HAVE_OPTPROBES
select HAVE_PERF_EVENTS
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && 
HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 58fdb9f66e0f..cdf87086fa33 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -141,11 +141,21 @@ void arch_remove_optimized_kprobe(struct optimized_kprobe 
*op)
}
 }
 
+static void patch_imm32_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+{
+   patch_instruction((struct ppc_inst *)addr,
+ ppc_inst(PPC_RAW_LIS(reg, IMM_H(val))));
+   addr++;
+
+   patch_instruction((struct ppc_inst *)addr,
+ ppc_inst(PPC_RAW_ORI(reg, reg, IMM_L(val))));
+}
+
 /*
  * Generate instructions to load provided immediate 64-bit value
  * to register 'reg' and patch these instructions at 'addr'.
  */
-static void patch_imm64_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+static void patch_imm64_load_insns(unsigned long long val, int reg, 
kprobe_opcode_t *addr)
 {
/* lis reg,(op)@highest */
patch_instruction((struct ppc_inst *)addr,
@@ -177,6 +187,14 @@ static void patch_imm64_load_insns(unsigned long val, int 
reg, kprobe_opcode_t *
   ___PPC_RS(reg) | (val & 0xffff)));
 }
 
+static void patch_imm_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+{
+   if (IS_ENABLED(CONFIG_PPC64))
+   patch_imm64_load_insns(val, reg, addr);
+   else
+   patch_imm32_load_insns(val, reg, addr);
+}
+
 int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe 
*p)
 {
struct ppc_inst branch_op_callback, branch_emulate_step, temp;
@@ -230,7 +248,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
 * Fixup the template with instructions to:
 * 1. load the address of the actual probepoint
 */
-   patch_imm64_load_insns((unsigned long)op, 3, buff + TMPL_OP_IDX);
+   patch_imm_load_insns((unsigned long)op, 3, buff + TMPL_OP_IDX);
 
/*
 * 2. branch to optimized_callback() and emulate_step()
@@ -264,7 +282,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
 * 3. load instruction to be emulated into relevant register, and
 */
temp = ppc_inst_read((struct ppc_inst *)p->ainsn.insn);
-   patch_imm64_load_insns(ppc_inst_as_ulong(temp), 4, buff + 
TMPL_INSN_IDX);
+   patch_imm_load_insns(ppc_inst_as_ulong(temp), 4, buff + TMPL_INSN_IDX);
 
/*
 * 4. branch back from trampoline
diff --git a/arch/powerpc/kernel/optprobes_head.S 
b/arch/powerpc/kernel/optprobes_head.S
index ff8ba4d3824d..19ea3312403c 100644
--- a/arch/powerpc/kernel/optprobes_head.S
+++ b/arch/powerpc/kernel/optprobes_head.S
@@ -9,6 +9,16 @@
 #include 
 #include 
 
+#ifdef CONFIG_PPC64
+#define SAVE_30GPRS(base) SAVE_10GPRS(2,base); SAVE_10GPRS(12,base); 
SAVE_10GPRS(22,base)
+#define REST_30GPRS(base) REST_10GPRS(2,base); REST_10GPRS(12,base); 
REST_10GPRS(22,base)
+#define TEMPLATE_FOR_IMM_LOAD_INSNS nop; nop; nop; nop; nop
+#else
+#define SAVE_30GPRS(base) stmw r2, GPR2(base)
+#define REST_30GPRS(base) lmw  r2, GPR2(base)
+#define TEMPLATE_FOR_IMM_LOAD_INSNS nop; nop; nop
+#endif
+
 #define OPT_SLOT_SIZE   65536
 
.balign 4
@@ -30,39 +40,41 @@ optinsn_slot:
.global optprobe_template_entry
 optprobe_template_entry:
/* Create an in-memory pt_regs */
-   stdu r1,-INT_FRAME_SIZE(r1)
+   PPC_STLU r1,-INT_FRAME_SIZE(r1)
SAVE_GPR(0,r1)
/* Save the previous SP into stack */
addi

Re: [PATCH v2 2/2] powerpc/legacy_serial: Use early_ioremap()

2021-04-20 Thread Christophe Leroy




Le 20/04/2021 à 15:32, Christophe Leroy a écrit :

From: Christophe Leroy 


Oops, I forgot to reset the Author. Michael, if you apply this patch, please update the author and 
remove the old Signed-off-by.


Thanks



[0.00] ioremap() called early from 
find_legacy_serial_ports+0x3cc/0x474. Use early_ioremap() instead

find_legacy_serial_ports() is called early from setup_arch(), before
paging_init(). vmalloc is not available yet, ioremap shouldn't be
used that early.

Use early_ioremap() and switch to a regular ioremap() later.

Signed-off-by: Christophe Leroy 
Signed-off-by: Christophe Leroy 
---
  arch/powerpc/kernel/legacy_serial.c | 33 +
  1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index f061e06e9f51..8b2c1a8553a0 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #undef DEBUG
  
@@ -34,6 +35,7 @@ static struct legacy_serial_info {

unsigned int clock;
int irq_check_parent;
phys_addr_t taddr;
+   void __iomem *early_addr;
  } legacy_serial_infos[MAX_LEGACY_SERIAL_PORTS];
  
  static const struct of_device_id legacy_serial_parents[] __initconst = {

@@ -325,17 +327,16 @@ static void __init setup_legacy_serial_console(int 
console)
  {
struct legacy_serial_info *info = &legacy_serial_infos[console];
struct plat_serial8250_port *port = &legacy_serial_ports[console];
-   void __iomem *addr;
unsigned int stride;
  
  	stride = 1 << port->regshift;
  
  	/* Check if a translated MMIO address has been found */

if (info->taddr) {
-   addr = ioremap(info->taddr, 0x1000);
-   if (addr == NULL)
+   info->early_addr = early_ioremap(info->taddr, 0x1000);
+   if (info->early_addr == NULL)
return;
-   udbg_uart_init_mmio(addr, stride);
+   udbg_uart_init_mmio(info->early_addr, stride);
} else {
/* Check if it's PIO and we support untranslated PIO */
if (port->iotype == UPIO_PORT && isa_io_special)
@@ -353,6 +354,30 @@ static void __init setup_legacy_serial_console(int console)
udbg_uart_setup(info->speed, info->clock);
  }
  
+static int __init ioremap_legacy_serial_console(void)

+{
+   struct legacy_serial_info *info = &legacy_serial_infos[legacy_serial_console];
+   struct plat_serial8250_port *port = &legacy_serial_ports[legacy_serial_console];
+   void __iomem *vaddr;
+
+   if (legacy_serial_console < 0)
+   return 0;
+
+   if (!info->early_addr)
+   return 0;
+
+   vaddr = ioremap(info->taddr, 0x1000);
+   if (WARN_ON(!vaddr))
+   return -ENOMEM;
+
+   udbg_uart_init_mmio(vaddr, 1 << port->regshift);
+   early_iounmap(info->early_addr, 0x1000);
+   info->early_addr = NULL;
+
+   return 0;
+}
+early_initcall(ioremap_legacy_serial_console);
+
  /*
   * This is called very early, as part of setup_system() or eventually
   * setup_arch(), basically before anything else in this file. This function



Re: [PATCH v1 2/2] powerpc: Enable OPTPROBES on PPC32

2021-04-20 Thread Christophe Leroy




Le 20/04/2021 à 08:51, Naveen N. Rao a écrit :

Christophe Leroy wrote:

For that, create a 32 bits version of patch_imm64_load_insns()
and create a patch_imm_load_insns() which calls
patch_imm32_load_insns() on PPC32 and patch_imm64_load_insns()
on PPC64.

Adapt optprobes_head.S for PPC32. Use PPC_LL/PPC_STL macros instead
of raw ld/std, opt out things linked to paca and use stmw/lmw to
save/restore registers.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/kernel/optprobes.c  | 24 +--
 arch/powerpc/kernel/optprobes_head.S | 46 +++-
 3 files changed, 53 insertions(+), 19 deletions(-)


Thanks for adding support for ppc32. It is good to see that it works without 
too many changes.



diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c1344c05226c..49b538e54efb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -227,7 +227,7 @@ config PPC
 select HAVE_MOD_ARCH_SPECIFIC
 select HAVE_NMI    if PERF_EVENTS || (PPC64 && PPC_BOOK3S)
 select HAVE_HARDLOCKUP_DETECTOR_ARCH    if PPC64 && PPC_BOOK3S && SMP
-    select HAVE_OPTPROBES    if PPC64
+    select HAVE_OPTPROBES
 select HAVE_PERF_EVENTS
 select HAVE_PERF_EVENTS_NMI    if PPC64
 select HAVE_HARDLOCKUP_DETECTOR_PERF    if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && 
!HAVE_HARDLOCKUP_DETECTOR_ARCH

diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 58fdb9f66e0f..cdf87086fa33 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -141,11 +141,21 @@ void arch_remove_optimized_kprobe(struct optimized_kprobe 
*op)
 }
 }

+static void patch_imm32_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+{
+    patch_instruction((struct ppc_inst *)addr,
+  ppc_inst(PPC_RAW_LIS(reg, IMM_H(val))));
+    addr++;
+
+    patch_instruction((struct ppc_inst *)addr,
+  ppc_inst(PPC_RAW_ORI(reg, reg, IMM_L(val))));
+}
+
 /*
  * Generate instructions to load provided immediate 64-bit value
  * to register 'reg' and patch these instructions at 'addr'.
  */
-static void patch_imm64_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+static void patch_imm64_load_insns(unsigned long long val, int reg, 
kprobe_opcode_t *addr)


Do you really need this?


Without it I get:

 from arch/powerpc/kernel/optprobes.c:8:
arch/powerpc/kernel/optprobes.c: In function 'patch_imm64_load_insns':
arch/powerpc/kernel/optprobes.c:163:14: error: right shift count >= width of type 
[-Werror=shift-count-overflow]

  163 |((val >> 48) & 0xffff)));
  |  ^~
./arch/powerpc/include/asm/inst.h:69:48: note: in definition of macro 'ppc_inst'
   69 | #define ppc_inst(x) ((struct ppc_inst){ .val = x })
  |^
arch/powerpc/kernel/optprobes.c:169:31: error: right shift count >= width of type 
[-Werror=shift-count-overflow]

  169 |___PPC_RS(reg) | ((val >> 32) & 0xffff)));
  |   ^~
./arch/powerpc/include/asm/inst.h:69:48: note: in definition of macro 'ppc_inst'
   69 | #define ppc_inst(x) ((struct ppc_inst){ .val = x })
  |^
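
In other words (a standalone illustration, not from the patch): 'unsigned long'
is only 32 bits wide on PPC32, so the 32- and 48-bit shifts above exceed the
width of the type unless the parameter is widened to 'unsigned long long':

        /* hypothetical example assuming a 32-bit unsigned long (PPC32) */
        unsigned long narrow = 0x12345678UL;
        unsigned long long wide = 0x0123456789abcdefULL;

        /* (narrow >> 48) trips -Wshift-count-overflow: shift >= type width */
        unsigned int hi = (wide >> 48) & 0xffff;        /* fine: 64-bit operand */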





 {
 /* lis reg,(op)@highest */
 patch_instruction((struct ppc_inst *)addr,
@@ -177,6 +187,14 @@ static void patch_imm64_load_insns(unsigned long val, int 
reg, kprobe_opcode_t *
    ___PPC_RS(reg) | (val & 0xffff)));
 }

+static void patch_imm_load_insns(unsigned long val, int reg, kprobe_opcode_t 
*addr)
+{
+    if (IS_ENABLED(CONFIG_PPC64))
+    patch_imm64_load_insns(val, reg, addr);
+    else
+    patch_imm32_load_insns(val, reg, addr);
+}
+
 int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe 
*p)
 {
 struct ppc_inst branch_op_callback, branch_emulate_step, temp;
@@ -230,7 +248,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
  * Fixup the template with instructions to:
  * 1. load the address of the actual probepoint
  */
-    patch_imm64_load_insns((unsigned long)op, 3, buff + TMPL_OP_IDX);
+    patch_imm_load_insns((unsigned long)op, 3, buff + TMPL_OP_IDX);

 /*
  * 2. branch to optimized_callback() and emulate_step()
@@ -264,7 +282,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
  * 3. load instruction to be emulated into relevant register, and
  */
 temp = ppc_inst_read((struct ppc_inst *)p->ainsn.insn);
-    patch_imm64_load_insns(ppc_inst_as_ulong(temp), 4, buff + TMPL_INSN_IDX);
+    patch_imm_load_insns(ppc_inst_as_ulong(temp), 4, buff + TMPL_INSN_IDX);

 /*
  * 4. branch back from trampoline
diff --git a/arch/powerpc/kernel/optprobes_head.S 
b/arch/powerpc/kernel/optprobes

[PATCH v2 2/2] powerpc/legacy_serial: Use early_ioremap()

2021-04-20 Thread Christophe Leroy
From: Christophe Leroy 

[0.00] ioremap() called early from 
find_legacy_serial_ports+0x3cc/0x474. Use early_ioremap() instead

find_legacy_serial_ports() is called early from setup_arch(), before
paging_init(). vmalloc is not available yet, ioremap shouldn't be
used that early.

Use early_ioremap() and switch to a regular ioremap() later.
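
A condensed sketch of the flow implemented by the diff below (error handling
omitted):

        /* at setup_arch() time, before paging_init(): fixmap-based mapping */
        info->early_addr = early_ioremap(info->taddr, 0x1000);
        udbg_uart_init_mmio(info->early_addr, 1 << port->regshift);

        /* later, from an early_initcall(), once ioremap()/vmalloc are usable */
        vaddr = ioremap(info->taddr, 0x1000);
        udbg_uart_init_mmio(vaddr, 1 << port->regshift);
        early_iounmap(info->early_addr, 0x1000);
        info->early_addr = NULL;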

Signed-off-by: Christophe Leroy 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/legacy_serial.c | 33 +
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index f061e06e9f51..8b2c1a8553a0 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #undef DEBUG
 
@@ -34,6 +35,7 @@ static struct legacy_serial_info {
unsigned int clock;
int irq_check_parent;
phys_addr_t taddr;
+   void __iomem *early_addr;
 } legacy_serial_infos[MAX_LEGACY_SERIAL_PORTS];
 
 static const struct of_device_id legacy_serial_parents[] __initconst = {
@@ -325,17 +327,16 @@ static void __init setup_legacy_serial_console(int 
console)
 {
struct legacy_serial_info *info = &legacy_serial_infos[console];
struct plat_serial8250_port *port = &legacy_serial_ports[console];
-   void __iomem *addr;
unsigned int stride;
 
stride = 1 << port->regshift;
 
/* Check if a translated MMIO address has been found */
if (info->taddr) {
-   addr = ioremap(info->taddr, 0x1000);
-   if (addr == NULL)
+   info->early_addr = early_ioremap(info->taddr, 0x1000);
+   if (info->early_addr == NULL)
return;
-   udbg_uart_init_mmio(addr, stride);
+   udbg_uart_init_mmio(info->early_addr, stride);
} else {
/* Check if it's PIO and we support untranslated PIO */
if (port->iotype == UPIO_PORT && isa_io_special)
@@ -353,6 +354,30 @@ static void __init setup_legacy_serial_console(int console)
udbg_uart_setup(info->speed, info->clock);
 }
 
+static int __init ioremap_legacy_serial_console(void)
+{
+   struct legacy_serial_info *info = &legacy_serial_infos[legacy_serial_console];
+   struct plat_serial8250_port *port = &legacy_serial_ports[legacy_serial_console];
+   void __iomem *vaddr;
+
+   if (legacy_serial_console < 0)
+   return 0;
+
+   if (!info->early_addr)
+   return 0;
+
+   vaddr = ioremap(info->taddr, 0x1000);
+   if (WARN_ON(!vaddr))
+   return -ENOMEM;
+
+   udbg_uart_init_mmio(vaddr, 1 << port->regshift);
+   early_iounmap(info->early_addr, 0x1000);
+   info->early_addr = NULL;
+
+   return 0;
+}
+early_initcall(ioremap_legacy_serial_console);
+
 /*
  * This is called very early, as part of setup_system() or eventually
  * setup_arch(), basically before anything else in this file. This function
-- 
2.25.0



[PATCH v2 1/2] powerpc/64: Fix the definition of the fixmap area

2021-04-20 Thread Christophe Leroy
At the moment, the fixmap area is defined at the top of
the address space or just below KASAN.

This definition is not valid for PPC64.

For PPC64, use the top of the I/O space.

Because of circular dependencies, it is not possible to include
asm/fixmap.h in asm/book3s/64/pgtable.h, so define a fixed-size
area at the top of the I/O space for the fixmap and ensure at
build time that the size is big enough.

Fixes: 265c3491c4bc ("powerpc: Add support for GENERIC_EARLY_IOREMAP")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +++-
 arch/powerpc/include/asm/fixmap.h| 9 +
 arch/powerpc/include/asm/nohash/64/pgtable.h | 5 -
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 0c89977ec10b..a666d561b44d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -7,6 +7,7 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
+#include 
 #endif
 
 /*
@@ -324,7 +325,8 @@ extern unsigned long pci_io_base;
 #define  PHB_IO_END (KERN_IO_START + FULL_IO_SIZE)
 #define IOREMAP_BASE   (PHB_IO_END)
 #define IOREMAP_START  (ioremap_bot)
-#define IOREMAP_END (KERN_IO_END)
+#define IOREMAP_END (KERN_IO_END - FIXADDR_SIZE)
+#define FIXADDR_SIZE   SZ_32M
 
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
diff --git a/arch/powerpc/include/asm/fixmap.h 
b/arch/powerpc/include/asm/fixmap.h
index 8d03c16a3663..947b5b9c4424 100644
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -23,12 +23,17 @@
 #include 
 #endif
 
+#ifdef CONFIG_PPC64
+#define FIXADDR_TOP (IOREMAP_END + FIXADDR_SIZE)
+#else
+#define FIXADDR_SIZE   0
 #ifdef CONFIG_KASAN
 #include 
 #define FIXADDR_TOP (KASAN_SHADOW_START - PAGE_SIZE)
 #else
 #define FIXADDR_TOP ((unsigned long)(-PAGE_SIZE))
 #endif
+#endif
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -50,6 +55,7 @@
  */
 enum fixed_addresses {
FIX_HOLE,
+#ifdef CONFIG_PPC32
/* reserve the top 128K for early debugging purposes */
FIX_EARLY_DEBUG_TOP = FIX_HOLE,
FIX_EARLY_DEBUG_BASE = FIX_EARLY_DEBUG_TOP+(ALIGN(SZ_128K, 
PAGE_SIZE)/PAGE_SIZE)-1,
@@ -72,6 +78,7 @@ enum fixed_addresses {
   FIX_IMMR_SIZE,
 #endif
/* FIX_PCIE_MCFG, */
+#endif /* CONFIG_PPC32 */
__end_of_permanent_fixed_addresses,
 
 #define NR_FIX_BTMAPS  (SZ_256K / PAGE_SIZE)
@@ -98,6 +105,8 @@ enum fixed_addresses {
 static inline void __set_fixmap(enum fixed_addresses idx,
phys_addr_t phys, pgprot_t flags)
 {
+   BUILD_BUG_ON(IS_ENABLED(CONFIG_PPC64) && __FIXADDR_SIZE > FIXADDR_SIZE);
+
if (__builtin_constant_p(idx))
BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
else if (WARN_ON(idx >= __end_of_fixed_addresses))
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 6cb8aa357191..57cd3892bfe0 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -6,6 +6,8 @@
  * the ppc64 non-hashed page table.
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -54,7 +56,8 @@
 #define  PHB_IO_END (KERN_IO_START + FULL_IO_SIZE)
 #define IOREMAP_BASE   (PHB_IO_END)
 #define IOREMAP_START  (ioremap_bot)
-#define IOREMAP_END (KERN_VIRT_START + KERN_VIRT_SIZE)
+#define IOREMAP_END (KERN_VIRT_START + KERN_VIRT_SIZE - FIXADDR_SIZE)
+#define FIXADDR_SIZE   SZ_32M
 
 
 /*
-- 
2.25.0



Re: [PATCH] powerpc/legacy_serial: Use early_ioremap()

2021-04-20 Thread Christophe Leroy

Hi Chris,

Le 10/08/2020 à 04:01, Chris Packham a écrit :


On 24/03/20 10:54 am, Chris Packham wrote:

Hi Christophe,

On Wed, 2020-02-05 at 12:03 +, Christophe Leroy wrote:

[0.00] ioremap() called early from
find_legacy_serial_ports+0x3cc/0x474. Use early_ioremap() instead


I was just about to dig into this error message and found your patch. I
applied it to a v5.5 base.


find_legacy_serial_ports() is called early from setup_arch(), before
paging_init(). vmalloc is not available yet, ioremap shouldn't be
used that early.

Use early_ioremap() and switch to a regular ioremap() later.

Signed-off-by: Christophe Leroy 

On my system (Freescale T2080 SOC) this seems to cause a crash/hang in
early boot. Unfortunately because this is affecting the boot console I
don't get any earlyprintk output.


I've been doing a bit more digging into why Christophe's patch didn't
work for me. I noticed the powerpc specific early_ioremap_range()
returns addresses around ioremap_bot. Yet the generic early_ioremap()
uses addresses around FIXADDR_TOP. If I try the following hack I can
make Christophe's patch work

diff --git a/arch/powerpc/include/asm/fixmap.h
b/arch/powerpc/include/asm/fixmap.h
index 2ef155a3c821..7bc2f3f73c8b 100644
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -27,7 +27,7 @@
   #include 
   #define FIXADDR_TOP    (KASAN_SHADOW_START - PAGE_SIZE)
   #else
-#define FIXADDR_TOP    ((unsigned long)(-PAGE_SIZE))
+#define FIXADDR_TOP    (IOREMAP_END - PAGE_SIZE)
   #endif

   /*

I'll admit to being out of my depth. It seems that the generic
early_ioremap() is not quite correctly plumbed in for powerpc.


Yes that's probably true for PPC64.

I see that on PPC32 I had to implement the following changes in order to enable earlier use of 
early_ioremap()


https://github.com/torvalds/linux/commit/925ac141d106b55acbe112a9272f970631a3c082


I see the problem with QEMU using the ppce500 machine. That will allow me to 
investigate it a bit further.


Re: PPC_FPU, ALTIVEC: enable_kernel_fp, put_vr, get_vr

2021-04-19 Thread Christophe Leroy




Le 19/04/2021 à 23:39, Randy Dunlap a écrit :

On 4/19/21 6:16 AM, Michael Ellerman wrote:

Randy Dunlap  writes:



Sure.  I'll post them later today.
They keep FPU and ALTIVEC as independent (build) features.


Those patches look OK.

But I don't think it makes sense to support that configuration, FPU=n
ALTIVEC=y. No one is ever going to make a CPU like that. We have enough
testing surface due to configuration options, without adding artificial
combinations that no one is ever going to use.

IMHO :)

So I'd rather we just make ALTIVEC depend on FPU.


That's rather simple. See below.
I'm doing a bunch of randconfig builds with it now.

---
From: Randy Dunlap 
Subject: [PATCH] powerpc: make ALTIVEC depend PPC_FPU

On a kernel config with ALTIVEC=y and PPC_FPU not set/enabled,
there are build errors:

drivers/cpufreq/pmac32-cpufreq.c:262:2: error: implicit declaration of function 
'enable_kernel_fp' [-Werror,-Wimplicit-function-declaration]
enable_kernel_fp();
../arch/powerpc/lib/sstep.c: In function 'do_vec_load':
../arch/powerpc/lib/sstep.c:637:3: error: implicit declaration of function 
'put_vr' [-Werror=implicit-function-declaration]
  637 |   put_vr(rn, &u.v);
   |   ^~
../arch/powerpc/lib/sstep.c: In function 'do_vec_store':
../arch/powerpc/lib/sstep.c:660:3: error: implicit declaration of function 
'get_vr'; did you mean 'get_oc'? [-Werror=implicit-function-declaration]
  660 |   get_vr(rn, &u.v);
   |   ^~

In theory ALTIVEC is independent of PPC_FPU but in practice nobody
is going to build such a machine, so make ALTIVEC require PPC_FPU
by depending on PPC_FPU.

Signed-off-by: Randy Dunlap 
Reported-by: kernel test robot 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Christophe Leroy 
Cc: Segher Boessenkool 
Cc: l...@intel.com
---
  arch/powerpc/platforms/86xx/Kconfig|1 +
  arch/powerpc/platforms/Kconfig.cputype |2 ++
  2 files changed, 3 insertions(+)

--- linux-next-20210416.orig/arch/powerpc/platforms/86xx/Kconfig
+++ linux-next-20210416/arch/powerpc/platforms/86xx/Kconfig
@@ -4,6 +4,7 @@ menuconfig PPC_86xx
bool "86xx-based boards"
depends on PPC_BOOK3S_32
select FSL_SOC
+   select PPC_FPU
select ALTIVEC
help
  The Freescale E600 SoCs have 74xx cores.
--- linux-next-20210416.orig/arch/powerpc/platforms/Kconfig.cputype
+++ linux-next-20210416/arch/powerpc/platforms/Kconfig.cputype
@@ -186,6 +186,7 @@ config E300C3_CPU
  config G4_CPU
bool "G4 (74xx)"
depends on PPC_BOOK3S_32
+   select PPC_FPU
select ALTIVEC
  
  endchoice

@@ -309,6 +310,7 @@ config PHYS_64BIT
  
  config ALTIVEC

bool "AltiVec Support"
+   depends on PPC_FPU


Shouldn't we do it the other way round? That is, make ALTIVEC select PPC_FPU and avoid the two 
selects that are above?



depends on PPC_BOOK3S_32 || PPC_BOOK3S_64 || (PPC_E500MC && PPC64)
help
  This option enables kernel support for the Altivec extensions to the



Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-19 Thread Christophe Leroy




Le 19/04/2021 à 16:00, Steven Price a écrit :

On 19/04/2021 14:14, Christophe Leroy wrote:



Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :

On 15/04/2021 18:18, Christophe Leroy wrote:

To be honest I don't fully understand why powerpc requires the page_size - it appears to be 
using it purely to find "holes" in the calls to note_page(), but I haven't worked out why such 
holes would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses the page size to detect whether it is 
KASAN-like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what the 
problem was exactly, something around the use of hugepages for kernel memory; it came as part of 
the series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/





Ah, that's useful context. So it looks like powerpc took a different route to reducing the KASAN 
output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that 
from the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Looking at how the generic ptdump code handles KASAN, I'm a bit sceptical.

IIUC, it is checking that kasan_early_shadow_pte is in the same page as the pgtable referred to by 
the PMD entry. But what happens if that PMD entry refers to another pgtable which is inside the 
same page as kasan_early_shadow_pte?


Shouldn't the test be

 if (pmd_page_vaddr(val) == lm_alias(kasan_early_shadow_pte))
 return note_kasan_page_table(walk, addr);


Now I come to look at this code again, I think you're right. On arm64 this doesn't cause a problem - 
page tables are page sized and page aligned, so there couldn't be any non-KASAN pgtables sharing the 
page. Obviously that's not necessarily true of other architectures.


Feel free to add a patch to your series ;)



Ok.

I'll leave that outside of the series; it is not a showstopper, because the early shadow page 
directories are all tagged __bss_page_aligned, so we can't have two of them in the same page, and it 
is really unlikely that we'll have any other statically defined page directory in the same pages either.


And for the special case of powerpc 8xx, which at the moment is the only platform where we have both 
KASAN and HUGEPD, there are only two levels of page directories, so there is no issue.


Christophe


[PATCH 3/3] powerpc/irq: Enhance readability of trap types

2021-04-19 Thread Christophe Leroy
This patch makes use of trap types in irq.c

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/interrupt.h |  1 +
 arch/powerpc/kernel/irq.c| 13 +
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 8970990e3b08..44cde2e129b8 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -23,6 +23,7 @@
 #define INTERRUPT_INST_SEGMENT 0x480
 #define INTERRUPT_TRACE   0xd00
 #define INTERRUPT_H_DATA_STORAGE  0xe00
+#define INTERRUPT_HMI  0xe60
 #define INTERRUPT_H_FAC_UNAVAIL   0xf80
 #ifdef CONFIG_PPC_BOOK3S
 #define INTERRUPT_DOORBELL 0xa00
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 893d3f8d6f47..72cb45393ef2 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -142,7 +142,7 @@ void replay_soft_interrupts(void)
 */
if (IS_ENABLED(CONFIG_PPC_BOOK3S) && (local_paca->irq_happened & 
PACA_IRQ_HMI)) {
local_paca->irq_happened &= ~PACA_IRQ_HMI;
-   regs.trap = 0xe60;
+   regs.trap = INTERRUPT_HMI;
handle_hmi_exception();
if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
hard_irq_disable();
@@ -150,7 +150,7 @@ void replay_soft_interrupts(void)
 
if (local_paca->irq_happened & PACA_IRQ_DEC) {
local_paca->irq_happened &= ~PACA_IRQ_DEC;
-   regs.trap = 0x900;
+   regs.trap = INTERRUPT_DECREMENTER;
timer_interrupt();
if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
hard_irq_disable();
@@ -158,7 +158,7 @@ void replay_soft_interrupts(void)
 
if (local_paca->irq_happened & PACA_IRQ_EE) {
local_paca->irq_happened &= ~PACA_IRQ_EE;
-   regs.trap = 0x500;
+   regs.trap = INTERRUPT_EXTERNAL;
do_IRQ();
if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
hard_irq_disable();
@@ -166,10 +166,7 @@ void replay_soft_interrupts(void)
 
if (IS_ENABLED(CONFIG_PPC_DOORBELL) && (local_paca->irq_happened & 
PACA_IRQ_DBELL)) {
local_paca->irq_happened &= ~PACA_IRQ_DBELL;
-   if (IS_ENABLED(CONFIG_PPC_BOOK3E))
-   regs.trap = 0x280;
-   else
-   regs.trap = 0xa00;
+   regs.trap = INTERRUPT_DOORBELL;
doorbell_exception();
if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
hard_irq_disable();
@@ -178,7 +175,7 @@ void replay_soft_interrupts(void)
/* Book3E does not support soft-masking PMI interrupts */
if (IS_ENABLED(CONFIG_PPC_BOOK3S) && (local_paca->irq_happened & 
PACA_IRQ_PMI)) {
local_paca->irq_happened &= ~PACA_IRQ_PMI;
-   regs.trap = 0xf00;
+   regs.trap = INTERRUPT_PERFMON;
performance_monitor_exception();
if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
hard_irq_disable();
-- 
2.25.0



[PATCH 1/3] powerpc/8xx: Enhance readability of trap types

2021-04-19 Thread Christophe Leroy
This patch makes use of trap types in head_8xx.S

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/interrupt.h | 29 
 arch/powerpc/kernel/head_8xx.S   | 49 ++--
 2 files changed, 47 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index ed2c4042c3d1..cf2c5c3ae716 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -2,13 +2,6 @@
 #ifndef _ASM_POWERPC_INTERRUPT_H
 #define _ASM_POWERPC_INTERRUPT_H
 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
 /* BookE/4xx */
 #define INTERRUPT_CRITICAL_INPUT  0x100
 
@@ -39,9 +32,11 @@
 /* BookE/BookS/4xx/8xx */
 #define INTERRUPT_DATA_STORAGE 0x300
 #define INTERRUPT_INST_STORAGE 0x400
+#define INTERRUPT_EXTERNAL 0x500
 #define INTERRUPT_ALIGNMENT   0x600
 #define INTERRUPT_PROGRAM 0x700
 #define INTERRUPT_SYSCALL 0xc00
+#define INTERRUPT_TRACE 0xd00
 
 /* BookE/BookS/44x */
 #define INTERRUPT_FP_UNAVAIL  0x800
@@ -53,6 +48,24 @@
 #define INTERRUPT_PERFMON 0x0
 #endif
 
+/* 8xx */
+#define INTERRUPT_SOFT_EMU_8xx 0x1000
+#define INTERRUPT_INST_TLB_MISS_8xx 0x1100
+#define INTERRUPT_DATA_TLB_MISS_8xx 0x1200
+#define INTERRUPT_INST_TLB_ERROR_8xx   0x1300
+#define INTERRUPT_DATA_TLB_ERROR_8xx   0x1400
+#define INTERRUPT_DATA_BREAKPOINT_8xx  0x1c00
+#define INTERRUPT_INST_BREAKPOINT_8xx  0x1d00
+
+#ifndef __ASSEMBLY__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
 static inline void nap_adjust_return(struct pt_regs *regs)
 {
 #ifdef CONFIG_PPC_970_NAP
@@ -514,4 +527,6 @@ static inline void interrupt_cond_local_irq_enable(struct 
pt_regs *regs)
local_irq_enable();
 }
 
+#endif /* __ASSEMBLY__ */
+
 #endif /* _ASM_POWERPC_INTERRUPT_H */
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index e3b066703eab..7d445e4342c0 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Value for the bits that have fixed value in RPN entries.
@@ -118,49 +119,49 @@ instruction_counter:
 #endif
 
 /* System reset */
-   EXCEPTION(0x100, Reset, system_reset_exception)
+   EXCEPTION(INTERRUPT_SYSTEM_RESET, Reset, system_reset_exception)
 
 /* Machine check */
-   START_EXCEPTION(0x200, MachineCheck)
-   EXCEPTION_PROLOG 0x200 MachineCheck handle_dar_dsisr=1
+   START_EXCEPTION(INTERRUPT_MACHINE_CHECK, MachineCheck)
+   EXCEPTION_PROLOG INTERRUPT_MACHINE_CHECK MachineCheck handle_dar_dsisr=1
prepare_transfer_to_handler
bl  machine_check_exception
b   interrupt_return
 
 /* External interrupt */
-   EXCEPTION(0x500, HardwareInterrupt, do_IRQ)
+   EXCEPTION(INTERRUPT_EXTERNAL, HardwareInterrupt, do_IRQ)
 
 /* Alignment exception */
-   START_EXCEPTION(0x600, Alignment)
-   EXCEPTION_PROLOG 0x600 Alignment handle_dar_dsisr=1
+   START_EXCEPTION(INTERRUPT_ALIGNMENT, Alignment)
+   EXCEPTION_PROLOG INTERRUPT_ALIGNMENT Alignment handle_dar_dsisr=1
prepare_transfer_to_handler
bl  alignment_exception
REST_NVGPRS(r1)
b   interrupt_return
 
 /* Program check exception */
-   START_EXCEPTION(0x700, ProgramCheck)
-   EXCEPTION_PROLOG 0x700 ProgramCheck
+   START_EXCEPTION(INTERRUPT_PROGRAM, ProgramCheck)
+   EXCEPTION_PROLOG INTERRUPT_PROGRAM ProgramCheck
prepare_transfer_to_handler
bl  program_check_exception
REST_NVGPRS(r1)
b   interrupt_return
 
 /* Decrementer */
-   EXCEPTION(0x900, Decrementer, timer_interrupt)
+   EXCEPTION(INTERRUPT_DECREMENTER, Decrementer, timer_interrupt)
 
 /* System call */
-   START_EXCEPTION(0xc00, SystemCall)
-   SYSCALL_ENTRY   0xc00
+   START_EXCEPTION(INTERRUPT_SYSCALL, SystemCall)
+   SYSCALL_ENTRY   INTERRUPT_SYSCALL
 
 /* Single step - not used on 601 */
-   EXCEPTION(0xd00, SingleStep, single_step_exception)
+   EXCEPTION(INTERRUPT_TRACE, SingleStep, single_step_exception)
 
 /* On the MPC8xx, this is a software emulation interrupt.  It occurs
  * for all unimplemented and illegal instructions.
  */
-   START_EXCEPTION(0x1000, SoftEmu)
-   EXCEPTION_PROLOG 0x1000 SoftEmu
+   START_EXCEPTION(INTERRUPT_SOFT_EMU_8xx, SoftEmu)
+   EXCEPTION_PROLOG INTERRUPT_SOFT_EMU_8xx SoftEmu
prepare_transfer_to_handler
bl  emulation_assist_interrupt
REST_NVGPRS(r1)
@@ -187,7 +188,7 @@ instruction_counter:
 #define INVALIDATE_ADJACENT_PAGES_CPU15(addr, tmp)
 #endif
 
-   START_EXCEPTION(0x1100, InstructionTLBMiss)
+   START_EXCEPTION(INTERRUPT_INST_TLB_MISS_8xx, InstructionTLBMiss)
mtspr   SPRN_SPRG_SCRATCH2, r10
mtspr   SPRN_M_TW, r11
 
@@ -243,7 +244,7 @@ instruction_counter

[PATCH 2/3] powerpc/32s: Enhance readability of trap types

2021-04-19 Thread Christophe Leroy
This patch makes use of trap types in head_book3s_32.S

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/interrupt.h |  6 
 arch/powerpc/kernel/head_book3s_32.S | 43 ++--
 2 files changed, 28 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index cf2c5c3ae716..8970990e3b08 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -27,6 +27,7 @@
 #ifdef CONFIG_PPC_BOOK3S
 #define INTERRUPT_DOORBELL 0xa00
 #define INTERRUPT_PERFMON 0xf00
+#define INTERRUPT_ALTIVEC_UNAVAIL  0xf20
 #endif
 
 /* BookE/BookS/4xx/8xx */
@@ -57,6 +58,11 @@
 #define INTERRUPT_DATA_BREAKPOINT_8xx  0x1c00
 #define INTERRUPT_INST_BREAKPOINT_8xx  0x1d00
 
+/* 603 */
+#define INTERRUPT_INST_TLB_MISS_603 0x1000
+#define INTERRUPT_DATA_LOAD_TLB_MISS_603   0x1100
+#define INTERRUPT_DATA_STORE_TLB_MISS_603  0x1200
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/powerpc/kernel/head_book3s_32.S 
b/arch/powerpc/kernel/head_book3s_32.S
index 18f4ae163f34..065178f19a3d 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "head_32.h"
 
@@ -239,7 +240,7 @@ __secondary_hold_acknowledge:
 /* System reset */
 /* core99 pmac starts the seconary here by changing the vector, and
putting it back to what it was (unknown_async_exception) when done.  */
-   EXCEPTION(0x100, Reset, unknown_async_exception)
+   EXCEPTION(INTERRUPT_SYSTEM_RESET, Reset, unknown_async_exception)
 
 /* Machine check */
 /*
@@ -255,7 +256,7 @@ __secondary_hold_acknowledge:
  * pointer when we take an exception from supervisor mode.)
  * -- paulus.
  */
-   START_EXCEPTION(0x200, MachineCheck)
+   START_EXCEPTION(INTERRUPT_MACHINE_CHECK, MachineCheck)
EXCEPTION_PROLOG_0
 #ifdef CONFIG_PPC_CHRP
mtspr   SPRN_SPRG_SCRATCH2,r1
@@ -276,7 +277,7 @@ __secondary_hold_acknowledge:
b   interrupt_return
 
 /* Data access exception. */
-   START_EXCEPTION(0x300, DataAccess)
+   START_EXCEPTION(INTERRUPT_DATA_STORAGE, DataAccess)
 #ifdef CONFIG_PPC_BOOK3S_604
 BEGIN_MMU_FTR_SECTION
mtspr   SPRN_SPRG_SCRATCH2,r10
@@ -297,7 +298,7 @@ ALT_MMU_FTR_SECTION_END_IFSET(MMU_FTR_HPTE_TABLE)
 #endif
 1: EXCEPTION_PROLOG_0 handle_dar_dsisr=1
EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2 0x300 DataAccess handle_dar_dsisr=1
+   EXCEPTION_PROLOG_2 INTERRUPT_DATA_STORAGE DataAccess handle_dar_dsisr=1
prepare_transfer_to_handler
lwz r5, _DSISR(r11)
andis.  r0, r5, DSISR_DABRMATCH@h
@@ -310,7 +311,7 @@ ALT_MMU_FTR_SECTION_END_IFSET(MMU_FTR_HPTE_TABLE)
 
 
 /* Instruction access exception. */
-   START_EXCEPTION(0x400, InstructionAccess)
+   START_EXCEPTION(INTERRUPT_INST_STORAGE, InstructionAccess)
mtspr   SPRN_SPRG_SCRATCH0,r10
mtspr   SPRN_SPRG_SCRATCH1,r11
mfspr   r10, SPRN_SPRG_THREAD
@@ -330,7 +331,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
andi.   r11, r11, MSR_PR
 
EXCEPTION_PROLOG_1
-   EXCEPTION_PROLOG_2 0x400 InstructionAccess
+   EXCEPTION_PROLOG_2 INTERRUPT_INST_STORAGE InstructionAccess
andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
stw r5, _DSISR(r11)
stw r12, _DAR(r11)
@@ -339,19 +340,19 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
b   interrupt_return
 
 /* External interrupt */
-   EXCEPTION(0x500, HardwareInterrupt, do_IRQ)
+   EXCEPTION(INTERRUPT_EXTERNAL, HardwareInterrupt, do_IRQ)
 
 /* Alignment exception */
-   START_EXCEPTION(0x600, Alignment)
-   EXCEPTION_PROLOG 0x600 Alignment handle_dar_dsisr=1
+   START_EXCEPTION(INTERRUPT_ALIGNMENT, Alignment)
+   EXCEPTION_PROLOG INTERRUPT_ALIGNMENT Alignment handle_dar_dsisr=1
prepare_transfer_to_handler
bl  alignment_exception
REST_NVGPRS(r1)
b   interrupt_return
 
 /* Program check exception */
-   START_EXCEPTION(0x700, ProgramCheck)
-   EXCEPTION_PROLOG 0x700 ProgramCheck
+   START_EXCEPTION(INTERRUPT_PROGRAM, ProgramCheck)
+   EXCEPTION_PROLOG INTERRUPT_PROGRAM ProgramCheck
prepare_transfer_to_handler
bl  program_check_exception
REST_NVGPRS(r1)
@@ -367,7 +368,7 @@ BEGIN_FTR_SECTION
  */
b   ProgramCheck
 END_FTR_SECTION_IFSET(CPU_FTR_FPU_UNAVAILABLE)
-   EXCEPTION_PROLOG 0x800 FPUnavailable
+   EXCEPTION_PROLOG INTERRUPT_FP_UNAVAIL FPUnavailable
beq 1f
bl  load_up_fpu /* if from user, just load it up */
b   fast_exception_return
@@ -379,16 +380,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_FPU_UNAVAILABLE)
 #endif
 
 /* Decrementer */
-   EXCEPTION(0x900, Decrementer, timer_interrupt)
+   EXCEPTION(INTERRUPT_D

Re: [PATCH 2/2] powerpc: add ALTIVEC support to lib/ when PPC_FPU not set

2021-04-19 Thread Christophe Leroy




Le 19/04/2021 à 15:32, Segher Boessenkool a écrit :

Hi!

On Sun, Apr 18, 2021 at 01:17:26PM -0700, Randy Dunlap wrote:

Add ldstfp.o to the Makefile for CONFIG_ALTIVEC and add
externs for get_vr() and put_vr() in lib/sstep.c to fix the
build errors.



  obj-$(CONFIG_PPC_FPU) += ldstfp.o
+obj-$(CONFIG_ALTIVEC)  += ldstfp.o


It is probably a good idea to split ldstfp.S into two, one for each of
the two configuration options?



Or we can build it all the time and #ifdef the FPU part.

Because it contains FPU, ALTIVEC and VSX stuff.
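A rough sketch of that second option (illustrative only; the exact guard boundaries inside ldstfp.S would need checking against the FPU, Altivec and VSX sections it actually contains):

	# arch/powerpc/lib/Makefile
	obj-y				+= ldstfp.o

	/* arch/powerpc/lib/ldstfp.S */
	#ifdef CONFIG_PPC_FPU
		/* get_fpr()/put_fpr() and the FP load/store helpers */
	#endif
	#ifdef CONFIG_ALTIVEC
		/* get_vr()/put_vr() */
	#endif
	#ifdef CONFIG_VSX
		/* get_vsr()/put_vsr() */
	#endif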

Christophe


Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-19 Thread Christophe Leroy




Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :

On 15/04/2021 18:18, Christophe Leroy wrote:

To be honest I don't fully understand why powerpc requires the page_size - it appears to be using 
it purely to find "holes" in the calls to note_page(), but I haven't worked out why such holes 
would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is a 
KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what the 
problem was exactly, something around the use of hugepages for kernel memory, came as part of the 
series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/ 



Ah, that's useful context. So it looks like powerpc took a different route to reducing the KASAN 
output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that from 
the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Looking at how the generic ptdump code handles KASAN, I'm a bit sceptical.

IIUC, it is checking that kasan_early_shadow_pte is in the same page as the pgtable referred by the 
PMD entry. But what happens if that PMD entry is referring another pgtable which is inside the same 
page as kasan_early_shadow_pte ?


Shouldn't the test be

if (pmd_page_vaddr(val) == lm_alias(kasan_early_shadow_pte))
return note_kasan_page_table(walk, addr);
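
For reference, the generic check being discussed (in ptdump_pmd_entry() of mm/ptdump.c) is, from memory, roughly:

	if (pmd_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pte)))
		return note_kasan_page_table(walk, addr);

so the suggestion above would compare the mapped virtual address instead of the page containing it.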


Christophe


[PATCH v2 4/4] powerpc/mm: Convert powerpc to GENERIC_PTDUMP

2021-04-19 Thread Christophe Leroy
This patch converts powerpc to the generic PTDUMP implementation.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig  |   2 +
 arch/powerpc/Kconfig.debug|  30 --
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/mmu_decl.h|   2 +-
 arch/powerpc/mm/ptdump/8xx.c  |   6 +-
 arch/powerpc/mm/ptdump/Makefile   |   9 +-
 arch/powerpc/mm/ptdump/book3s64.c |   6 +-
 arch/powerpc/mm/ptdump/ptdump.c   | 165 --
 arch/powerpc/mm/ptdump/shared.c   |   6 +-
 9 files changed, 68 insertions(+), 160 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 475d77a6ebbe..40259437a28f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -120,6 +120,7 @@ config PPC
select ARCH_32BIT_OFF_T if PPC32
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE
+   select ARCH_HAS_DEBUG_WXif STRICT_KERNEL_RWX
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
@@ -177,6 +178,7 @@ config PPC
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
+   select GENERIC_PTDUMP
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 6342f9da4545..05b1180ea502 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -360,36 +360,6 @@ config FAIL_IOMMU
 
  If you are unsure, say N.
 
-config PPC_PTDUMP
-   bool "Export kernel pagetable layout to userspace via debugfs"
-   depends on DEBUG_KERNEL && DEBUG_FS
-   help
- This option exports the state of the kernel pagetables to a
- debugfs file. This is only useful for kernel developers who are
- working in architecture specific areas of the kernel - probably
- not a good idea to enable this feature in a production kernel.
-
- If you are unsure, say N.
-
-config PPC_DEBUG_WX
-   bool "Warn on W+X mappings at boot"
-   depends on PPC_PTDUMP && STRICT_KERNEL_RWX
-   help
- Generate a warning if any W+X mappings are found at boot.
-
- This is useful for discovering cases where the kernel is leaving
- W+X mappings after applying NX, as such mappings are a security risk.
-
- Note that even if the check fails, your kernel is possibly
- still fine, as W+X mappings are not a security hole in
- themselves, what they do is that they make the exploitation
- of other unfixed kernel bugs easier.
-
- There is no runtime or memory usage effect of this option
- once the kernel has booted up - it's a one time check.
-
- If in doubt, say "Y".
-
 config PPC_FAST_ENDIAN_SWITCH
bool "Deprecated fast endian-switch syscall"
depends on DEBUG_KERNEL && PPC_BOOK3S_64
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index c3df3a8501d4..c90d58aaebe2 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -18,5 +18,5 @@ obj-$(CONFIG_PPC_MM_SLICES)   += slice.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_PPC_COPRO_BASE)   += copro_fault.o
-obj-$(CONFIG_PPC_PTDUMP)   += ptdump/
+obj-$(CONFIG_PTDUMP_CORE)  += ptdump/
 obj-$(CONFIG_KASAN)+= kasan/
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 7dac910c0b21..dd1cabc2ea0f 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -180,7 +180,7 @@ static inline void mmu_mark_rodata_ro(void) { }
 void __init mmu_mapin_immr(void);
 #endif
 
-#ifdef CONFIG_PPC_DEBUG_WX
+#ifdef CONFIG_DEBUG_WX
 void ptdump_check_wx(void);
 #else
 static inline void ptdump_check_wx(void) { }
diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
index 86da2a669680..fac932eb8f9a 100644
--- a/arch/powerpc/mm/ptdump/8xx.c
+++ b/arch/powerpc/mm/ptdump/8xx.c
@@ -75,8 +75,10 @@ static const struct flag_info flag_array[] = {
 };
 
 struct pgtable_level pg_level[5] = {
-   {
-   }, { /* pgd */
+   { /* pgd */
+   .flag   = flag_array,
+   .num= ARRAY_SIZE(flag_array),
+   }, { /* p4d */
.flag   = flag_array,
.num= ARRAY_SIZE(flag_array),
}, { /* pud */
diff --git a/arch/powerpc/mm/ptdump/Makefile b/arch/powerpc/mm/ptdump/Makefile
index 712762be3cb1..4050cbb55acf 100644
--- a/arch/powerpc/mm/ptdump/Makefile
+++ b/arch/powerpc/mm/ptdump/Makefile
@@ -5,5 +5,10 @@ obj-y  += ptdump.o
 obj-$(CONFIG_4xx)  += shared.o
 obj-$(CONFIG_PPC_8xx)  += 8xx.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)   += shared.o
-obj-$(CONFIG_PPC_BOOK3S_32)+= shared.o bat

[PATCH v2 3/4] powerpc/mm: Properly coalesce pages in ptdump

2021-04-19 Thread Christophe Leroy
Commit aaa229529244 ("powerpc/mm: Add physical address to Linux page
table dump") changed range coalescing to only combine ranges that are
both virtually and physically contiguous, in order to avoid erroneous
combination of unrelated mappings in IOREMAP space.

But in the VMALLOC space, mappings almost never have contiguous
physical pages, so the commit mentioned above leads to dumping one
line per page for vmalloc mappings.

Taking into account that vmalloc always leaves a gap between two areas,
we never have two mappings dumped as a single combination even if they
have the exact same flags. The only space that may have encountered
such an issue was the early IOREMAP, which does not use the vmalloc engine.
But previous commits added gaps between early IO mappings, so it is
not an issue anymore.

That commit created some difficulties with KASAN mappings, see
commit cabe8138b23c ("powerpc: dump as a single line areas mapping a
single physical page.") and with huge page, see
commit b00ff6d8c1c3 ("powerpc/ptdump: Properly handle non standard
page size").

So, almost revert commit aaa229529244 to properly coalesce pages
mapped with the same flags as before, only keep the display of the
first physical address of the range, as it can be useful, especially
for IO mappings.

It brings powerpc back to the same level as other architectures and
simplifies the conversion to GENERIC PTDUMP.

With the patch:

---[ kasan shadow mem start ]---
0xf800-0xf8ff  0x070016M   hugerw   present 
  dirty  accessed
0xf900-0xf91f  0x01434000 2M   rpresent 
 accessed
0xf920-0xf95a  0x02104000  3776K   rw   present 
  dirty  accessed
0xfef5c000-0xfeff  0x01434000   656K   rpresent 
 accessed
---[ kasan shadow mem end ]---

Before:

---[ kasan shadow mem start ]---
0xf800-0xf8ff  0x070016M   hugerw   present 
  dirty  accessed
0xf900-0xf91f  0x0143400016K   rpresent 
 accessed
0xf920-0xf9203fff  0x0210400016K   rw   present 
  dirty  accessed
0xf9204000-0xf9207fff  0x0213c00016K   rw   present 
  dirty  accessed
0xf9208000-0xf920bfff  0x0217400016K   rw   present 
  dirty  accessed
0xf920c000-0xf920  0x0218800016K   rw   present 
  dirty  accessed
0xf921-0xf9213fff  0x021dc00016K   rw   present 
  dirty  accessed
0xf9214000-0xf9217fff  0x022216K   rw   present 
  dirty  accessed
0xf9218000-0xf921bfff  0x023c16K   rw   present 
  dirty  accessed
0xf921c000-0xf921  0x023d400016K   rw   present 
  dirty  accessed
0xf922-0xf9227fff  0x023ec00032K   rw   present 
  dirty  accessed
...
0xf93b8000-0xf93e3fff  0x02614000   176K   rw   present 
  dirty  accessed
0xf93e4000-0xf94c3fff  0x027c   896K   rw   present 
  dirty  accessed
0xf94c4000-0xf94c7fff  0x0236c00016K   rw   present 
  dirty  accessed
0xf94c8000-0xf94cbfff  0x041f16K   rw   present 
  dirty  accessed
0xf94cc000-0xf94c  0x029c16K   rw   present 
  dirty  accessed
0xf94d-0xf94d3fff  0x041ec00016K   rw   present 
  dirty  accessed
0xf94d4000-0xf94d7fff  0x0407c00016K   rw   present 
  dirty  accessed
0xf94d8000-0xf94f7fff  0x041c   128K   rw   present 
  dirty  accessed
...
0xf95ac000-0xf95a  0x042b16K   rw   present 
  dirty  accessed
0xfef5c000-0xfeff  0x0143400016K   rpresent 
 accessed
---[ kasan shadow mem end ]---

Signed-off-by: Christophe Leroy 
Cc: Oliver O'Halloran 
---
 arch/powerpc/mm/ptdump/ptdump.c | 22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index aca354fb670b..5062c58b1e5b 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -58,8 +58,6 @@ struct pg_state {
const struct addr_marker *marker;
unsigned long start_address;
unsigned long start_pa;
-   unsigned long last_pa;
-   unsigned long page_size;
unsigned int level;
u64 current_flags;
bool check_wx;
@@ -163,8 +161,6 @@ static void dump_flag_info(struct pg_state *st, const 
struct flag_info
 
 static void dump_addr(struct pg_state *st, unsigned long addr)
 {
-   

[PATCH v2 1/4] mm: pagewalk: Fix walk for hugepage tables

2021-04-19 Thread Christophe Leroy
Pagewalk ignores hugepd entries and walks down the tables
as if they were traditional entries, leading to crazy results.

Add walk_hugepd_range() and use it to walk hugepage tables.

Signed-off-by: Christophe Leroy 
---
v2:
- Add a guard for NULL ops->pte_entry
- Take mm->page_table_lock when walking hugepage table, as suggested by 
follow_huge_pd()
---
 mm/pagewalk.c | 58 ++-
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..9b3db11a4d1d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,6 +58,45 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
return err;
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
+unsigned long end, struct mm_walk *walk, int 
pdshift)
+{
+   int err = 0;
+   const struct mm_walk_ops *ops = walk->ops;
+   int shift = hugepd_shift(*phpd);
+   int page_size = 1 << shift;
+
+   if (!ops->pte_entry)
+   return 0;
+
+   if (addr & (page_size - 1))
+   return 0;
+
+   for (;;) {
+   pte_t *pte;
+
+   spin_lock(&walk->mm->page_table_lock);
+   pte = hugepte_offset(*phpd, addr, pdshift);
+   err = ops->pte_entry(pte, addr, addr + page_size, walk);
+   spin_unlock(&walk->mm->page_table_lock);
+
+   if (err)
+   break;
+   if (addr >= end - page_size)
+   break;
+   addr += page_size;
+   }
+   return err;
+}
+#else
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
+unsigned long end, struct mm_walk *walk, int 
pdshift)
+{
+   return 0;
+}
+#endif
+
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
  struct mm_walk *walk)
 {
@@ -108,7 +147,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
goto again;
}
 
-   err = walk_pte_range(pmd, addr, next, walk);
+   if (is_hugepd(__hugepd(pmd_val(*pmd))))
+   err = walk_hugepd_range((hugepd_t *)pmd, addr, next, 
walk, PMD_SHIFT);
+   else
+   err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
} while (pmd++, addr = next, addr != end);
@@ -157,7 +199,10 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
if (pud_none(*pud))
goto again;
 
-   err = walk_pmd_range(pud, addr, next, walk);
+   if (is_hugepd(__hugepd(pud_val(*pud))))
+   err = walk_hugepd_range((hugepd_t *)pud, addr, next, 
walk, PUD_SHIFT);
+   else
+   err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
} while (pud++, addr = next, addr != end);
@@ -189,7 +234,9 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
if (err)
break;
}
-   if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+   if (is_hugepd(__hugepd(p4d_val(*p4d))))
+   err = walk_hugepd_range((hugepd_t *)p4d, addr, next, 
walk, P4D_SHIFT);
+   else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -224,8 +271,9 @@ static int walk_pgd_range(unsigned long addr, unsigned long 
end,
if (err)
break;
}
-   if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
-   ops->pte_entry)
+   if (is_hugepd(__hugepd(pgd_val(*pgd))))
+   err = walk_hugepd_range((hugepd_t *)pgd, addr, next, 
walk, PGDIR_SHIFT);
+   else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || 
ops->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
-- 
2.25.0



[PATCH v2 0/4] Convert powerpc to GENERIC_PTDUMP

2021-04-19 Thread Christophe Leroy
This series converts powerpc to generic PTDUMP.

For that, we first need to add missing hugepd support
to pagewalk and ptdump.

v2:
- Reworked the pagewalk modification to add locking and check ops->pte_entry
- Modified powerpc early IO mapping to have gaps between mappings
- Removed the logic that checked for contiguous physical memory
- Removed the artificial level calculation in ptdump_pte_entry(), level 4 is ok 
for all.
- Removed page_size argument to note_page()

Christophe Leroy (4):
  mm: pagewalk: Fix walk for hugepage tables
  powerpc/mm: Leave a gap between early allocated IO areas
  powerpc/mm: Properly coalesce pages in ptdump
  powerpc/mm: Convert powerpc to GENERIC_PTDUMP

 arch/powerpc/Kconfig  |   2 +
 arch/powerpc/Kconfig.debug|  30 -
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/ioremap_32.c  |   4 +-
 arch/powerpc/mm/ioremap_64.c  |   2 +-
 arch/powerpc/mm/mmu_decl.h|   2 +-
 arch/powerpc/mm/ptdump/8xx.c  |   6 +-
 arch/powerpc/mm/ptdump/Makefile   |   9 +-
 arch/powerpc/mm/ptdump/book3s64.c |   6 +-
 arch/powerpc/mm/ptdump/ptdump.c   | 187 --
 arch/powerpc/mm/ptdump/shared.c   |   6 +-
 mm/pagewalk.c |  58 -
 12 files changed, 127 insertions(+), 187 deletions(-)

-- 
2.25.0



[PATCH v2 2/4] powerpc/mm: Leave a gap between early allocated IO areas

2021-04-19 Thread Christophe Leroy
The vmalloc system leaves a gap between allocated areas. It helps catch
overflows.

Do the same for IO areas which are allocated with early_ioremap_range()
until slab_is_available().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/ioremap_32.c | 4 ++--
 arch/powerpc/mm/ioremap_64.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/ioremap_32.c b/arch/powerpc/mm/ioremap_32.c
index 743e11384dea..9d13143b8be4 100644
--- a/arch/powerpc/mm/ioremap_32.c
+++ b/arch/powerpc/mm/ioremap_32.c
@@ -70,10 +70,10 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
pgprot_t prot, void *call
 */
pr_warn("ioremap() called early from %pS. Use early_ioremap() 
instead\n", caller);
 
-   err = early_ioremap_range(ioremap_bot - size, p, size, prot);
+   err = early_ioremap_range(ioremap_bot - size - PAGE_SIZE, p, size, 
prot);
if (err)
return NULL;
-   ioremap_bot -= size;
+   ioremap_bot -= size + PAGE_SIZE;
 
return (void __iomem *)ioremap_bot + offset;
 }
diff --git a/arch/powerpc/mm/ioremap_64.c b/arch/powerpc/mm/ioremap_64.c
index ba5cbb0d66bd..3acece00b33e 100644
--- a/arch/powerpc/mm/ioremap_64.c
+++ b/arch/powerpc/mm/ioremap_64.c
@@ -38,7 +38,7 @@ void __iomem *__ioremap_caller(phys_addr_t addr, unsigned 
long size,
return NULL;
 
ret = (void __iomem *)ioremap_bot + offset;
-   ioremap_bot += size;
+   ioremap_bot += size + PAGE_SIZE;
 
return ret;
 }
-- 
2.25.0



Re: mmu.c:undefined reference to `patch__hash_page_A0'

2021-04-18 Thread Christophe Leroy




Le 18/04/2021 à 19:15, Randy Dunlap a écrit :

On 4/18/21 3:43 AM, Christophe Leroy wrote:



Le 18/04/2021 à 02:02, Randy Dunlap a écrit :

HI--

I no longer see this build error.


Fixed by 
https://github.com/torvalds/linux/commit/acdad8fb4a1574323db88f98a38b630691574e16


However:

On 2/27/21 2:24 AM, kernel test robot wrote:

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   3fb6d0e00efc958d01c2f109c8453033a2d96796
commit: 259149cf7c3c6195e6199e045ca988c31d081cab powerpc/32s: Only build hash 
code when CONFIG_PPC_BOOK3S_604 is selected
date:   4 weeks ago
config: powerpc64-randconfig-r013-20210227 (attached as .config)


ktr/lkp, this is a PPC32 .config file that is attached, not PPC64.

Also:


compiler: powerpc-linux-gcc (GCC) 9.3.0


...



I do see this build error:

powerpc-linux-ld: arch/powerpc/boot/wrapper.a(decompress.o): in function 
`partial_decompress':
decompress.c:(.text+0x1f0): undefined reference to `__decompress'

when either
CONFIG_KERNEL_LZO=y
or
CONFIG_KERNEL_LZMA=y

but the build succeeds when either
CONFIG_KERNEL_GZIP=y
or
CONFIG_KERNEL_XZ=y

I guess that is due to arch/powerpc/boot/decompress.c doing this:

#ifdef CONFIG_KERNEL_GZIP
#    include "decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#    include "xz_config.h"
#    include "../../../lib/decompress_unxz.c"
#endif


It would be nice to require one of KERNEL_GZIP or KERNEL_XZ
to be set/enabled (maybe unless a uImage is being built?).



Can you test by 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/a74fce4dfc9fa32da6ce3470bbedcecf795de1ec.1591189069.git.christophe.le...@csgroup.eu/
 ?


Hi Christophe,

I get build errors for both LZO and LZMA:


Ok, the patch is almost 1 year old, I guess there have been changes that break it. Will see if I can 
find some time to look at it.


Christophe


Re: PPC_FPU, ALTIVEC: enable_kernel_fp, put_vr, get_vr

2021-04-18 Thread Christophe Leroy




Le 17/04/2021 à 22:17, Randy Dunlap a écrit :

Hi,

kernel test robot reports:


drivers/cpufreq/pmac32-cpufreq.c:262:2: error: implicit declaration of function 
'enable_kernel_fp' [-Werror,-Wimplicit-function-declaration]

enable_kernel_fp();
^

when
# CONFIG_PPC_FPU is not set
CONFIG_ALTIVEC=y

I see at least one other place that does not handle that
combination well, here:

../arch/powerpc/lib/sstep.c: In function 'do_vec_load':
../arch/powerpc/lib/sstep.c:637:3: error: implicit declaration of function 
'put_vr' [-Werror=implicit-function-declaration]
   637 |   put_vr(rn, &u.v);
   |   ^~
../arch/powerpc/lib/sstep.c: In function 'do_vec_store':
../arch/powerpc/lib/sstep.c:660:3: error: implicit declaration of function 
'get_vr'; did you mean 'get_oc'? [-Werror=implicit-function-declaration]
   660 |   get_vr(rn, &u.v);
   |   ^~


Should the code + Kconfigs/Makefiles handle that kind of
kernel config or should ALTIVEC always mean PPC_FPU as well?


As far as I understand, Altivec is completely independent of FPU in theory. So it should be possible 
to use Altivec without using FPU.


However, until recently, it was not possible to de-activate FPU support on book3s/32. I made it 
possible in order to reduce unnecessary processing on processors like the 832x, which has no FPU.
As far as I can see in cputable.h/.c, 832x is the only book3s/32 without FPU, and it doesn't have 
ALTIVEC either.


So we can in the future ensure that Altivec can be used without FPU support, but for the time being 
I think it is OK to force selection of FPU when selecting ALTIVEC in order to avoid build failures.
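
A minimal sketch of that stop-gap, in Kconfig terms (the select line is the hypothetical addition, the rest of the ALTIVEC entry is left as it is today):

	config ALTIVEC
		bool "AltiVec Support"
		...
		select PPC_FPU	# hypothetical: avoid ALTIVEC=y with PPC_FPU=n builds for now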




I have patches to fix the build errors with the config as
reported but I don't know if that's the right thing to do...



Lets see them.

Christophe


Re: mmu.c:undefined reference to `patch__hash_page_A0'

2021-04-18 Thread Christophe Leroy




Le 18/04/2021 à 02:02, Randy Dunlap a écrit :

HI--

I no longer see this build error.


Fixed by 
https://github.com/torvalds/linux/commit/acdad8fb4a1574323db88f98a38b630691574e16


However:

On 2/27/21 2:24 AM, kernel test robot wrote:

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   3fb6d0e00efc958d01c2f109c8453033a2d96796
commit: 259149cf7c3c6195e6199e045ca988c31d081cab powerpc/32s: Only build hash 
code when CONFIG_PPC_BOOK3S_604 is selected
date:   4 weeks ago
config: powerpc64-randconfig-r013-20210227 (attached as .config)


ktr/lkp, this is a PPC32 .config file that is attached, not PPC64.

Also:


compiler: powerpc-linux-gcc (GCC) 9.3.0


...



I do see this build error:

powerpc-linux-ld: arch/powerpc/boot/wrapper.a(decompress.o): in function 
`partial_decompress':
decompress.c:(.text+0x1f0): undefined reference to `__decompress'

when either
CONFIG_KERNEL_LZO=y
or
CONFIG_KERNEL_LZMA=y

but the build succeeds when either
CONFIG_KERNEL_GZIP=y
or
CONFIG_KERNEL_XZ=y

I guess that is due to arch/powerpc/boot/decompress.c doing this:

#ifdef CONFIG_KERNEL_GZIP
#   include "decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#   include "xz_config.h"
#   include "../../../lib/decompress_unxz.c"
#endif


It would be nice to require one of KERNEL_GZIP or KERNEL_XZ
to be set/enabled (maybe unless a uImage is being built?).



Can you test by 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/a74fce4dfc9fa32da6ce3470bbedcecf795de1ec.1591189069.git.christophe.le...@csgroup.eu/ 
?


Thanks
Christophe


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-17 Thread Christophe Leroy




Le 16/04/2021 à 01:49, Alexei Starovoitov a écrit :

On Thu, Apr 15, 2021 at 8:41 AM Quentin Monnet  wrote:


2021-04-15 16:37 UTC+0200 ~ Daniel Borkmann 

On 4/15/21 11:32 AM, Jianlin Lv wrote:

For debugging JITs, dumping the JITed image to kernel log is discouraged,
"bpftool prog dump jited" is much better way to examine JITed dumps.
This patch gets rid of the code related to bpf_jit_enable=2 mode and
update the proc handler of bpf_jit_enable, also added auxiliary
information to explain how to use bpf_jit_disasm tool after this change.

Signed-off-by: Jianlin Lv 


Hello,

For what it's worth, I have already seen people dump the JIT image in
kernel logs in Qemu VMs running with just a busybox, not for kernel
development, but in a context where building/using bpftool was not
possible.


If building/using bpftool is not possible then majority of selftests won't
be exercised. I don't think such environment is suitable for any kind
of bpf development. Much so for JIT debugging.
While bpf_jit_enable=2 is nothing but the debugging tool for JIT developers.
I'd rather nuke that code instead of carrying it from kernel to kernel.



When I implemented JIT for PPC32, it was extremely helpfull.

As far as I understand, for the time being bpftool is not usable in my environment because it 
doesn't support cross-compilation when the target's endianness differs from the build host's 
endianness, see the discussion at 
https://lore.kernel.org/bpf/21e66a09-514f-f426-b9e2-13baab0b9...@csgroup.eu/


That's right that selftests can't be exercised because they don't build.

The question might be naive, as I didn't investigate much about the replacement of the "bpf_jit_enable=2 
debugging mode" by bpftool: how do we use bpftool exactly for that? Especially when using the BPF 
test module?
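
For context, the workflow the quoted commit message refers to is presumably along these lines:

	bpftool prog show			# list loaded programs and their ids
	bpftool prog dump jited id <ID>		# disassemble the JITed image

— whether that applies to programs loaded by the BPF test module is exactly the open question here.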




Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 17:04, Christophe Leroy a écrit :



Le 16/04/2021 à 16:40, Christophe Leroy a écrit :



Le 16/04/2021 à 15:00, Steven Price a écrit :

On 16/04/2021 12:08, Christophe Leroy wrote:



Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :
To be honest I don't fully understand why powerpc requires the page_size - it appears to be 
using it purely to find "holes" in the calls to note_page(), but I haven't worked out why 
such holes would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is 
a KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what 
the problem was exactly, something around the use of hugepages for kernel memory, came as part 
of the series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/ 







Ah, that's useful context. So it looks like powerpc took a different route to reducing the 
KASAN output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that 
from the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Yes ... and no.

It looks like the generic ptdump handles the case when several pgdir entries points to the same 
kasan_early_shadow_pte. But it doesn't take into account the powerpc case where we have regular 
page tables where several (if not all) PTEs are pointing to the kasan_early_shadow_page .


I'm not sure I follow quite how powerpc is different here. But could you have a similar check for 
PTEs against kasan_early_shadow_pte as the other levels already have?


I'm just worried that page_size isn't well defined in this interface and it's going to cause 
problems in the future.




I'm trying. I reverted the two commits b00ff6d8c and cabe8138.

At the moment, I don't get exactly what I expect: For linear memory I get one line for each 8M 
page whereas before reverting the patches I got one 16M line and one 112M line.


And for KASAN shadow area I get two lines for the 2x 8M pages shadowing linear mem then I get one 
4M line for each PGDIR entry pointing to kasan_early_shadow_pte.


0xf800-0xf87f 0x0700 8M   huge    rw   present
0xf880-0xf8ff 0x0780 8M   huge    rw   present
0xf900-0xf93f 0x0143 4M   r    present

...

0xfec0-0xfeff 0x0143 4M   r    present

Any idea ?




I think the difference from other architectures is here:

 } else if (flag != st->current_flags || level != st->level ||
    addr >= st->marker[1].start_address ||
    pa != st->last_pa + PAGE_SIZE) {


In addition to the checks everyone does, powerpc also checks "pa != st->last_pa + 
PAGE_SIZE".
And it is definitely for that test that the page_size argument had been added.


By replacing that test by (pa - st->start_pa != addr - st->start_address) it works again. So we 
definitely don't need the real page size.
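
A minimal sketch of that change in note_page(), based on the condition quoted above (only the last sub-test differs):

	} else if (flag != st->current_flags || level != st->level ||
		   addr >= st->marker[1].start_address ||
		   pa - st->start_pa != addr - st->start_address) {

i.e. the range is only broken when the physical offset stops tracking the virtual offset, which works for any mapping size without knowing it.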





I see that other architectures except RISCV don't dump the physical address. But even RISCV doesn't 
include that check.


That physical address dump was added by commit aaa229529244 ("powerpc/mm: Add physical address to 
Linux page table dump") [https://github.com/torvalds/linux/commit/aaa2295]


How do other architectures deal with the problem described by the commit log of 
that patch ?

Christophe


Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 16:40, Christophe Leroy a écrit :



Le 16/04/2021 à 15:00, Steven Price a écrit :

On 16/04/2021 12:08, Christophe Leroy wrote:



Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :
To be honest I don't fully understand why powerpc requires the page_size - it appears to be 
using it purely to find "holes" in the calls to note_page(), but I haven't worked out why such 
holes would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is 
a KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what 
the problem was exactly, something around the use of hugepages for kernel memory, came as part 
of the series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/ 






Ah, that's useful context. So it looks like powerpc took a different route to reducing the KASAN 
output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that 
from the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Yes ... and no.

It looks like the generic ptdump handles the case when several pgdir entries points to the same 
kasan_early_shadow_pte. But it doesn't take into account the powerpc case where we have regular 
page tables where several (if not all) PTEs are pointing to the kasan_early_shadow_page .


I'm not sure I follow quite how powerpc is different here. But could you have a similar check for 
PTEs against kasan_early_shadow_pte as the other levels already have?


I'm just worried that page_size isn't well defined in this interface and it's going to cause 
problems in the future.




I'm trying. I reverted the two commits b00ff6d8c and cabe8138.

At the moment, I don't get exactly what I expect: For linear memory I get one line for each 8M page 
whereas before reverting the patches I got one 16M line and one 112M line.


And for KASAN shadow area I get two lines for the 2x 8M pages shadowing linear mem then I get one 4M 
line for each PGDIR entry pointing to kasan_early_shadow_pte.


0xf800-0xf87f 0x0700 8M   huge    rw   present
0xf880-0xf8ff 0x0780 8M   huge    rw   present
0xf900-0xf93f 0x0143 4M   r    present

...

0xfec0-0xfeff 0x0143 4M   r    present

Any idea ?




I think the difference from other architectures is here:

} else if (flag != st->current_flags || level != st->level ||
   addr >= st->marker[1].start_address ||
   pa != st->last_pa + PAGE_SIZE) {


In addition to the checks everyone does, powerpc also checks "pa != st->last_pa + 
PAGE_SIZE".
And it is definitely for that test that the page_size argument had been added.

I see that other architectures except RISCV don't dump the physical address. But even RISCV doesn't 
include that check.


That physical address dump was added by commit aaa229529244 ("powerpc/mm: Add physical address to 
Linux page table dump") [https://github.com/torvalds/linux/commit/aaa2295]


How do other architectures deal with the problem described by the commit log of 
that patch ?

Christophe


Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 15:00, Steven Price a écrit :

On 16/04/2021 12:08, Christophe Leroy wrote:



Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :

On 15/04/2021 18:18, Christophe Leroy wrote:

In order to support large pages on powerpc, notepage()
needs to know the page size of the page.

Add a page_size argument to notepage().

Signed-off-by: Christophe Leroy 
---
  arch/arm64/mm/ptdump.c |  2 +-
  arch/riscv/mm/ptdump.c |  2 +-
  arch/s390/mm/dump_pagetables.c |  3 ++-
  arch/x86/mm/dump_pagetables.c  |  2 +-
  include/linux/ptdump.h |  2 +-
  mm/ptdump.c    | 16 
  6 files changed, 14 insertions(+), 13 deletions(-)


[...]

diff --git a/mm/ptdump.c b/mm/ptdump.c
index da751448d0e4..61cd16afb1c8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -17,7 +17,7 @@ static inline int note_kasan_page_table(struct mm_walk *walk,
  {
  struct ptdump_state *st = walk->private;
-    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]));
+    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]), PAGE_SIZE);


I'm not completely sure what the page_size is going to be used for, but note that KASAN 
presents an interesting case here. We short-cut by detecting it's a KASAN region at a high 
level (PGD/P4D/PUD/PMD) and instead of walking the tree down just call note_page() *once* but 
with level==4 because we know KASAN sets up the page table like that.


However the one call actually covers a much larger region - so while PAGE_SIZE matches the 
level it doesn't match the region covered. AFAICT this will lead to odd results if you enable 
KASAN on powerpc.


Hum  I successfully tested it with KASAN, I now realise that I tested it with 
CONFIG_KASAN_VMALLOC selected. In this situation, since 
https://github.com/torvalds/linux/commit/af3d0a686 we don't have any common shadow page table 
anymore.


I'll test again without CONFIG_KASAN_VMALLOC.



To be honest I don't fully understand why powerpc requires the page_size - it appears to be 
using it purely to find "holes" in the calls to note_page(), but I haven't worked out why such 
holes would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is a 
KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what the 
problem was exactly, something around the use of hugepages for kernel memory, came as part of 
the series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/ 





Ah, that's useful context. So it looks like powerpc took a different route to reducing the KASAN 
output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that 
from the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Yes ... and no.

It looks like the generic ptdump handles the case when several pgdir entries points to the same 
kasan_early_shadow_pte. But it doesn't take into account the powerpc case where we have regular 
page tables where several (if not all) PTEs are pointing to the kasan_early_shadow_page .


I'm not sure I follow quite how powerpc is different here. But could you have a similar check for 
PTEs against kasan_early_shadow_pte as the other levels already have?


I'm just worried that page_size isn't well defined in this interface and it's going to cause 
problems in the future.




I'm trying. I reverted the two commits b00ff6d8c and cabe8138.

At the moment, I don't get exactly what I expect: For linear memory I get one line for each 8M page 
whereas before reverting the patches I got one 16M line and one 112M line.


And for KASAN shadow area I get two lines for the 2x 8M pages shadowing linear mem then I get one 4M 
line for each PGDIR entry pointing to kasan_early_shadow_pte.


0xf800-0xf87f 0x0700 8M   hugerw   present
0xf880-0xf8ff 0x0780 8M   hugerw   present
0xf900-0xf93f 0x0143 4M   rpresent
0xf940-0xf97f 0x0143 4M   rpresent
0xf980-0xf9bf 0x0143 4M   rpresent
0xf9c0-0xf9ff 0x0143 4M   rpresent
0xfa00-0xfa3f 0x0143 4M   rpresent
0xfa40-0xfa7f 0x0143 4M   rpresent
0xfa80-0xfabf 0x0143 4M   rpresent
0xfac0-0xfaff 0x0143 4M   rpresent
0xfb00-0xfb3f 0x0143 4M   rpresent
0xfb40-0xfb7f 0x0143

Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 12:51, Steven Price a écrit :

On 16/04/2021 11:38, Christophe Leroy wrote:



Le 16/04/2021 à 11:28, Steven Price a écrit :

On 15/04/2021 18:18, Christophe Leroy wrote:

In order to support large pages on powerpc, notepage()
needs to know the page size of the page.

Add a page_size argument to notepage().

Signed-off-by: Christophe Leroy 
---
  arch/arm64/mm/ptdump.c |  2 +-
  arch/riscv/mm/ptdump.c |  2 +-
  arch/s390/mm/dump_pagetables.c |  3 ++-
  arch/x86/mm/dump_pagetables.c  |  2 +-
  include/linux/ptdump.h |  2 +-
  mm/ptdump.c    | 16 
  6 files changed, 14 insertions(+), 13 deletions(-)


[...]

diff --git a/mm/ptdump.c b/mm/ptdump.c
index da751448d0e4..61cd16afb1c8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -17,7 +17,7 @@ static inline int note_kasan_page_table(struct mm_walk *walk,
  {
  struct ptdump_state *st = walk->private;
-    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]));
+    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]), PAGE_SIZE);


I'm not completely sure what the page_size is going to be used for, but note that KASAN presents 
an interesting case here. We short-cut by detecting it's a KASAN region at a high level 
(PGD/P4D/PUD/PMD) and instead of walking the tree down just call note_page() *once* but with 
level==4 because we know KASAN sets up the page table like that.


However the one call actually covers a much larger region - so while PAGE_SIZE matches the level 
it doesn't match the region covered. AFAICT this will lead to odd results if you enable KASAN on 
powerpc.


Hum  I successfully tested it with KASAN, I now realise that I tested it with 
CONFIG_KASAN_VMALLOC selected. In this situation, since 
https://github.com/torvalds/linux/commit/af3d0a686 we don't have any common shadow page table 
anymore.


I'll test again without CONFIG_KASAN_VMALLOC.



To be honest I don't fully understand why powerpc requires the page_size - it appears to be using 
it purely to find "holes" in the calls to note_page(), but I haven't worked out why such holes 
would occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is a 
KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what the 
problem was exactly, something around the use of hugepages for kernel memory, came as part of the 
series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/ 



Ah, that's useful context. So it looks like powerpc took a different route to reducing the KASAN 
output to x86.


Given the generic ptdump code has handling for KASAN already it should be possible to drop that from 
the powerpc arch code, which I think means we don't actually need to provide page size to 
notepage(). Hopefully that means more code to delete ;)




Yes ... and no.

It looks like the generic ptdump handles the case when several pgdir entries points to the same 
kasan_early_shadow_pte. But it doesn't take into account the powerpc case where we have regular page 
tables where several (if not all) PTEs are pointing to the kasan_early_shadow_page .


Christophe


Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 11:28, Steven Price a écrit :

On 15/04/2021 18:18, Christophe Leroy wrote:

In order to support large pages on powerpc, notepage()
needs to know the page size of the page.

Add a page_size argument to notepage().

Signed-off-by: Christophe Leroy 
---
  arch/arm64/mm/ptdump.c |  2 +-
  arch/riscv/mm/ptdump.c |  2 +-
  arch/s390/mm/dump_pagetables.c |  3 ++-
  arch/x86/mm/dump_pagetables.c  |  2 +-
  include/linux/ptdump.h |  2 +-
  mm/ptdump.c    | 16 
  6 files changed, 14 insertions(+), 13 deletions(-)


[...]

diff --git a/mm/ptdump.c b/mm/ptdump.c
index da751448d0e4..61cd16afb1c8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -17,7 +17,7 @@ static inline int note_kasan_page_table(struct mm_walk *walk,
  {
  struct ptdump_state *st = walk->private;
-    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]));
+    st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]), PAGE_SIZE);


I'm not completely sure what the page_size is going to be used for, but note that KASAN presents an 
interesting case here. We short-cut by detecting it's a KASAN region at a high level 
(PGD/P4D/PUD/PMD) and instead of walking the tree down just call note_page() *once* but with 
level==4 because we know KASAN sets up the page table like that.


However the one call actually covers a much larger region - so while PAGE_SIZE matches the level it 
doesn't match the region covered. AFAICT this will lead to odd results if you enable KASAN on powerpc.


Hum  I successfully tested it with KASAN, I now realise that I tested it with 
CONFIG_KASAN_VMALLOC selected. In this situation, since 
https://github.com/torvalds/linux/commit/af3d0a686 we don't have any common shadow page table anymore.


I'll test again without CONFIG_KASAN_VMALLOC.



To be honest I don't fully understand why powerpc requires the page_size - it appears to be using it 
purely to find "holes" in the calls to note_page(), but I haven't worked out why such holes would 
occur.


It was indeed introduced for KASAN. We have a first commit 
https://github.com/torvalds/linux/commit/cabe8138 which uses page size to detect whether it is a 
KASAN like stuff.


Then came https://github.com/torvalds/linux/commit/b00ff6d8c as a fix. I can't remember what the 
problem was exactly, something around the use of hugepages for kernel memory, came as part of the 
series 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/cover.1589866984.git.christophe.le...@csgroup.eu/



Christophe


Re: [PATCH] soc: fsl: qe: remove unused function

2021-04-16 Thread Christophe Leroy




Le 16/04/2021 à 08:57, Daniel Axtens a écrit :

Hi Jiapeng,


Fix the following clang warning:


You are not fixing a warning, you are removing a function in order to fix a 
warning ...



drivers/soc/fsl/qe/qe_ic.c:234:29: warning: unused function
'qe_ic_from_irq' [-Wunused-function].


It would be wise to mention that the last users of the function were removed by commit d7c2878cfcfa 
("soc: fsl: qe: remove unused qe_ic_set_* functions")


https://github.com/torvalds/linux/commit/d7c2878cfcfa



Reported-by: Abaci Robot 
Signed-off-by: Jiapeng Chong 
---
  drivers/soc/fsl/qe/qe_ic.c | 5 -
  1 file changed, 5 deletions(-)

diff --git a/drivers/soc/fsl/qe/qe_ic.c b/drivers/soc/fsl/qe/qe_ic.c
index 0390af9..b573712 100644
--- a/drivers/soc/fsl/qe/qe_ic.c
+++ b/drivers/soc/fsl/qe/qe_ic.c
@@ -231,11 +231,6 @@ static inline void qe_ic_write(__be32  __iomem *base, 
unsigned int reg,
qe_iowrite32be(value, base + (reg >> 2));
  }
  
-static inline struct qe_ic *qe_ic_from_irq(unsigned int virq)

-{
-   return irq_get_chip_data(virq);
-}


This seems good to me.

  * We know that this function can't be called directly from outside the
   file, because it is static.

  * The function address isn't used as a function pointer anywhere, so
that means it can't be called from outside the file that way (also
it's inline, which would make using a function pointer unwise!)

  * There's no obvious macros in that file that might construct the name
of the function in a way that is hidden from grep.

All in all, I am fairly confident that the function is indeed not used.

Reviewed-by: Daniel Axtens 

Kind regards,
Daniel


-
  static inline struct qe_ic *qe_ic_from_irq_data(struct irq_data *d)
  {
return irq_data_get_irq_chip_data(d);
--
1.8.3.1


Re: [PATCH] symbol : Make the size of the compile-related array fixed

2021-04-16 Thread Christophe Leroy
Also, the following statement which appears at the end of your mail is puzzling. What can we do with 
your patch if there are such limitations ?


This e-mail and its attachments contain confidential information from OPPO, which is intended only 
for the person or entity whose address is listed above. Any use of the information contained herein 
in any way (including, but not limited to, total or partial disclosure, reproduction, or 
dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and delete it!




Le 16/04/2021 à 08:08, Christophe Leroy a écrit :

Hi,

This mail is unreadable.

Please send your patch as raw text mail, not as attached file.

Thanks
Christophe

Le 16/04/2021 à 05:12, 韩大鹏(Han Dapeng) a écrit :


*OPPO*
*
*
本电子邮件及其附件含有OPPO公司的保密信息,仅限于邮件指明的收件人使用(包含个人及群组)。禁止任何人 
在 未经授权的情况下以任何形式使用。如果您错收了本邮件,请立即以电子邮件通知发件人并删除本邮件及其 
附件。


This e-mail and its attachments contain confidential information from OPPO, which is intended only 
for the person or entity whose address is listed above. Any use of the information contained 
herein in any way (including, but not limited to, total or partial disclosure, reproduction, or 
dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and delete it!




Re: [PATCH] symbol : Make the size of the compile-related array fixed

2021-04-16 Thread Christophe Leroy

Hi,

This mail is unreadable.

Please send your patch as raw text mail, not as attached file.

Thanks
Christophe

Le 16/04/2021 à 05:12, 韩大鹏(Han Dapeng) a écrit :


*OPPO*
*
*
本电子邮件及其附件含有OPPO公司的保密信息,仅限于邮件指明的收件人使用(包含个人及群组)。禁止任何人在 
未经授权的情况下以任何形式使用。如果您错收了本邮件,请立即以电子邮件通知发件人并删除本邮件及其附件。


This e-mail and its attachments contain confidential information from OPPO, which is intended only 
for the person or entity whose address is listed above. Any use of the information contained herein 
in any way (including, but not limited to, total or partial disclosure, reproduction, or 
dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and delete it!




Re: [PATCH v1 1/5] mm: pagewalk: Fix walk for hugepage tables

2021-04-15 Thread Christophe Leroy




Le 16/04/2021 à 00:43, Daniel Axtens a écrit :

Hi Christophe,


Pagewalk ignores hugepd entries and walks down the tables
as if they were traditional entries, leading to crazy results.

Add walk_hugepd_range() and use it to walk hugepage tables.

Signed-off-by: Christophe Leroy 
---
  mm/pagewalk.c | 54 +--
  1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..410a9d8f7572 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,6 +58,32 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
return err;
  }
  
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,

+unsigned long end, struct mm_walk *walk, int 
pdshift)
+{
+   int err = 0;
+#ifdef CONFIG_ARCH_HAS_HUGEPD
+   const struct mm_walk_ops *ops = walk->ops;
+   int shift = hugepd_shift(*phpd);
+   int page_size = 1 << shift;
+
+   if (addr & (page_size - 1))
+   return 0;
+
+   for (;;) {
+   pte_t *pte = hugepte_offset(*phpd, addr, pdshift);
+
+   err = ops->pte_entry(pte, addr, addr + page_size, walk);
+   if (err)
+   break;
+   if (addr >= end - page_size)
+   break;
+   addr += page_size;
+   }


Initially I thought this was a somewhat unintuitive way to structure
this loop, but I see it parallels the structure of walk_pte_range_inner,
so I think the consistency is worth it.

I notice the pte walking code potentially takes some locks: does this
code need to do that?

arch/powerpc/mm/hugetlbpage.c says that hugepds are protected by the
mm->page_table_lock, but I don't think we're taking it in this code.


I'll add it, thanks.
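
A minimal sketch of taking that lock around the hugepte access, matching what ended up in v2 of the patch quoted earlier in this archive:

	spin_lock(&walk->mm->page_table_lock);
	pte = hugepte_offset(*phpd, addr, pdshift);
	err = ops->pte_entry(pte, addr, addr + page_size, walk);
	spin_unlock(&walk->mm->page_table_lock);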




+#endif
+   return err;
+}
+
  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
  struct mm_walk *walk)
  {
@@ -108,7 +134,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
goto again;
}
  
-		err = walk_pte_range(pmd, addr, next, walk);

+   if (is_hugepd(__hugepd(pmd_val(*pmd))))
+   err = walk_hugepd_range((hugepd_t *)pmd, addr, next, 
walk, PMD_SHIFT);
+   else
+   err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
} while (pmd++, addr = next, addr != end);
@@ -157,7 +186,10 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
if (pud_none(*pud))
goto again;
  
-		err = walk_pmd_range(pud, addr, next, walk);

+   if (is_hugepd(__hugepd(pud_val(*pud))))
+   err = walk_hugepd_range((hugepd_t *)pud, addr, next, 
walk, PUD_SHIFT);
+   else
+   err = walk_pmd_range(pud, addr, next, walk);


I'm a bit worried you might end up calling into walk_hugepd_range with
ops->pte_entry == NULL, and then jumping to 0.


You are right, I missed it.
I'll bail out of walk_hugepd_range() when ops->pte_entry is NULL.
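
Something like this at the top of walk_hugepd_range() is enough for that (and is what v2 of the patch does):

	if (!ops->pte_entry)
		return 0;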




static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
  struct mm_walk *walk)
{
...
 pud = pud_offset(p4d, addr);
do {
 ...
 if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
walk->action == ACTION_CONTINUE ||
!(ops->pmd_entry || ops->pte_entry)) <<< THIS CHECK
continue;
 ...
if (is_hugepd(__hugepd(pud_val(*pud))))
err = walk_hugepd_range((hugepd_t *)pud, addr, next, 
walk, PUD_SHIFT);
else
err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
} while (pud++, addr = next, addr != end);

walk_pud_range will proceed if there is _either_ an ops->pmd_entry _or_
an ops->pte_entry, but walk_hugepd_range will call ops->pte_entry
unconditionally.

The same issue applies to walk_{p4d,pgd}_range...

Kind regards,
Daniel



Thanks
Christophe


Re: [PATCH v1 4/5] mm: ptdump: Support hugepd table entries

2021-04-15 Thread Christophe Leroy

Hi Daniel,

Le 16/04/2021 à 01:29, Daniel Axtens a écrit :

Hi Christophe,


With hugepd, page table entries can be at any level
and can be of any size.

Add support for them.

Signed-off-by: Christophe Leroy 
---
  mm/ptdump.c | 17 +++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61cd16afb1c8..6efdb8c15a7d 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -112,11 +112,24 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long 
addr,
  {
struct ptdump_state *st = walk->private;
pte_t val = ptep_get(pte);
+   unsigned long page_size = next - addr;
+   int level;
+
+   if (page_size >= PGDIR_SIZE)
+   level = 0;
+   else if (page_size >= P4D_SIZE)
+   level = 1;
+   else if (page_size >= PUD_SIZE)
+   level = 2;
+   else if (page_size >= PMD_SIZE)
+   level = 3;
+   else
+   level = 4;
  
  	if (st->effective_prot)

-   st->effective_prot(st, 4, pte_val(val));
+   st->effective_prot(st, level, pte_val(val));
  
-	st->note_page(st, addr, 4, pte_val(val), PAGE_SIZE);

+   st->note_page(st, addr, level, pte_val(val), page_size);


It seems to me that passing both level and page_size is a bit redundant,
but I guess it does reduce the impact on each arch's code?


Exactly, as shown above, the level can be re-calculated based on the page size, but it would be an 
unnecessary impact on all architectures and would duplicate the re-calculation of the level whereas 
in most cases we get it for free from the caller.




Kind regards,
Daniel

  
  	return 0;

  }
--
2.25.0


Re: [PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-15 Thread Christophe Leroy




Le 16/04/2021 à 01:12, Daniel Axtens a écrit :

Hi Christophe,


  static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level,
- u64 val)
+ u64 val, unsigned long page_size)


Compilers can warn about unused parameters at -Wextra level.  However,
reading scripts/Makefile.extrawarn it looks like the warning is
explicitly _disabled_ in the kernel at W=1 and not reenabled at W=2 or
W=3. So I guess this is fine...


There are a lot of functions having unused parameters in the kernel, especially the ones that 
are re-implemented by each architecture.





@@ -126,7 +126,7 @@ static int ptdump_hole(unsigned long addr, unsigned long 
next,
  {
struct ptdump_state *st = walk->private;
  
-	st->note_page(st, addr, depth, 0);

+   st->note_page(st, addr, depth, 0, 0);


I know it doesn't matter at this point, but I'm not really thrilled by
the idea of passing 0 as the size here. Doesn't the hole have a known
page size?


The hole has a size for sure, I don't think we can call it a page size:

On powerpc 8xx, we have 4 page sizes: 8M, 512k, 16k and 4k.
A page table will cover 4M areas and will contain pages of size 512k, 16k and 
4k.
A PGD table contains either entries which point to a page table (covering 4M), or two identical 
consecutive entries pointing to the same hugepd which contains a single PTE for an 8M page.


So, if a PGD entry is empty, the hole is 4M, it corresponds to none of the page sizes the 
architecture supports.



But looking at what is done with that size, it can make sense to pass it to notepage() anyway. Let's 
do that.
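
In ptdump_hole() that simply means passing the span of the hole rather than 0 — a sketch, assuming the existing addr/next arguments keep their current meaning:

	st->note_page(st, addr, depth, 0, next - addr);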




  
  	return 0;

  }
@@ -153,5 +153,5 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct 
mm_struct *mm, pgd_t *pgd)
mmap_read_unlock(mm);
  
  	/* Flush out the last page */

-   st->note_page(st, 0, -1, 0);
+   st->note_page(st, 0, -1, 0, 0);


I'm more OK with the idea of passing 0 as the size when the depth is -1
(don't know): if we don't know the depth we conceptually can't know the
page size.

Regards,
Daniel



[PATCH v1 3/5] mm: ptdump: Provide page size to notepage()

2021-04-15 Thread Christophe Leroy
In order to support large pages on powerpc, notepage()
needs to know the page size of the page.

Add a page_size argument to notepage().

Signed-off-by: Christophe Leroy 
---
 arch/arm64/mm/ptdump.c |  2 +-
 arch/riscv/mm/ptdump.c |  2 +-
 arch/s390/mm/dump_pagetables.c |  3 ++-
 arch/x86/mm/dump_pagetables.c  |  2 +-
 include/linux/ptdump.h |  2 +-
 mm/ptdump.c| 16 
 6 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 0e050d76b83a..ea1a1c3a3ea0 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -257,7 +257,7 @@ static void note_prot_wx(struct pg_state *st, unsigned long 
addr)
 }
 
 static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level,
- u64 val)
+ u64 val, unsigned long page_size)
 {
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
static const char units[] = "KMGTPE";
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..0a7f276ba799 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -235,7 +235,7 @@ static void note_prot_wx(struct pg_state *st, unsigned long 
addr)
 }
 
 static void note_page(struct ptdump_state *pt_st, unsigned long addr,
- int level, u64 val)
+ int level, u64 val, unsigned long page_size)
 {
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
u64 pa = PFN_PHYS(pte_pfn(__pte(val)));
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index e40a30647d99..29673c38e773 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -116,7 +116,8 @@ static void note_prot_wx(struct pg_state *st, unsigned long 
addr)
 #endif /* CONFIG_DEBUG_WX */
 }
 
-static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level, u64 val)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level,
+ u64 val, unsigned long page_size)
 {
int width = sizeof(unsigned long) * 2;
static const char units[] = "KMGTPE";
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e1b599ecbbc2..2ec76737c1f1 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -272,7 +272,7 @@ static void effective_prot(struct ptdump_state *pt_st, int 
level, u64 val)
  * print what we collected so far.
  */
 static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level,
- u64 val)
+ u64 val, unsigned long page_size)
 {
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
pgprotval_t new_prot, new_eff;
diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
index 2a3a95586425..3a971fadc95e 100644
--- a/include/linux/ptdump.h
+++ b/include/linux/ptdump.h
@@ -13,7 +13,7 @@ struct ptdump_range {
 struct ptdump_state {
/* level is 0:PGD to 4:PTE, or -1 if unknown */
void (*note_page)(struct ptdump_state *st, unsigned long addr,
- int level, u64 val);
+ int level, u64 val, unsigned long page_size);
void (*effective_prot)(struct ptdump_state *st, int level, u64 val);
const struct ptdump_range *range;
 };
diff --git a/mm/ptdump.c b/mm/ptdump.c
index da751448d0e4..61cd16afb1c8 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -17,7 +17,7 @@ static inline int note_kasan_page_table(struct mm_walk *walk,
 {
struct ptdump_state *st = walk->private;
 
-   st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]));
+   st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]), 
PAGE_SIZE);
 
walk->action = ACTION_CONTINUE;
 
@@ -41,7 +41,7 @@ static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
st->effective_prot(st, 0, pgd_val(val));
 
if (pgd_leaf(val))
-   st->note_page(st, addr, 0, pgd_val(val));
+   st->note_page(st, addr, 0, pgd_val(val), PGDIR_SIZE);
 
return 0;
 }
@@ -62,7 +62,7 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
st->effective_prot(st, 1, p4d_val(val));
 
if (p4d_leaf(val))
-   st->note_page(st, addr, 1, p4d_val(val));
+   st->note_page(st, addr, 1, p4d_val(val), P4D_SIZE);
 
return 0;
 }
@@ -83,7 +83,7 @@ static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
st->effective_prot(st, 2, pud_val(val));
 
if (pud_leaf(val))
-   st->note_page(st, addr, 2, pud_val(val));
+   st->note_page(st, addr, 2, pud_val(val), PUD_SIZE);
 
return 0;
 }
@@ -102,7 +102,7 @@ static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
if (st->effecti

[PATCH v1 2/5] mm: ptdump: Fix build failure

2021-04-15 Thread Christophe Leroy
  CC  mm/ptdump.o
In file included from :
mm/ptdump.c: In function 'ptdump_pte_entry':
././include/linux/compiler_types.h:320:38: error: call to 
'__compiletime_assert_207' declared with attribute error: Unsupported access 
size for {READ,WRITE}_ONCE().
  320 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  |  ^
././include/linux/compiler_types.h:301:4: note: in definition of macro 
'__compiletime_assert'
  301 |prefix ## suffix();\
  |^~
././include/linux/compiler_types.h:320:2: note: in expansion of macro 
'_compiletime_assert'
  320 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  |  ^~~
./include/asm-generic/rwonce.h:36:2: note: in expansion of macro 
'compiletime_assert'
   36 |  compiletime_assert(__native_word(t) || sizeof(t) == 
sizeof(long long), \
  |  ^~
./include/asm-generic/rwonce.h:49:2: note: in expansion of macro 
'compiletime_assert_rwonce_type'
   49 |  compiletime_assert_rwonce_type(x);\
  |  ^~
mm/ptdump.c:114:14: note: in expansion of macro 'READ_ONCE'
  114 |  pte_t val = READ_ONCE(*pte);
  |  ^
make[2]: *** [mm/ptdump.o] Error 1

READ_ONCE() cannot be used for reading PTEs. Use ptep_get()
instead. See commit 481e980a7c19 ("mm: Allow arches to provide ptep_get()")
and commit c0e1c8c22beb ("powerpc/8xx: Provide ptep_get() with 16k pages")
for details.
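
For reference, the generic fallback added by that commit is roughly the
following (sketch from memory, the guard name may differ from the tree):

	/* include/linux/pgtable.h: generic accessor, arches may override it */
	#ifndef __HAVE_ARCH_PTEP_GET
	static inline pte_t ptep_get(pte_t *ptep)
	{
		return READ_ONCE(*ptep);
	}
	#endif

so an architecture like powerpc 8xx with 16k pages can provide its own
ptep_get() that reads the whole PTE consistently, while generic code keeps
a single accessor.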

Fixes: 30d621f6723b ("mm: add generic ptdump")
Cc: Steven Price 
Signed-off-by: Christophe Leroy 
---
 mm/ptdump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/ptdump.c b/mm/ptdump.c
index 4354c1422d57..da751448d0e4 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -111,7 +111,7 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk)
 {
struct ptdump_state *st = walk->private;
-   pte_t val = READ_ONCE(*pte);
+   pte_t val = ptep_get(pte);
 
if (st->effective_prot)
st->effective_prot(st, 4, pte_val(val));
-- 
2.25.0



[PATCH v1 4/5] mm: ptdump: Support hugepd table entries

2021-04-15 Thread Christophe Leroy
With hugepd, page table entries can be at any level
and can be of any size.

Add support for them.

Signed-off-by: Christophe Leroy 
---
 mm/ptdump.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/ptdump.c b/mm/ptdump.c
index 61cd16afb1c8..6efdb8c15a7d 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -112,11 +112,24 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long 
addr,
 {
struct ptdump_state *st = walk->private;
pte_t val = ptep_get(pte);
+   unsigned long page_size = next - addr;
+   int level;
+
+   if (page_size >= PGDIR_SIZE)
+   level = 0;
+   else if (page_size >= P4D_SIZE)
+   level = 1;
+   else if (page_size >= PUD_SIZE)
+   level = 2;
+   else if (page_size >= PMD_SIZE)
+   level = 3;
+   else
+   level = 4;
 
if (st->effective_prot)
-   st->effective_prot(st, 4, pte_val(val));
+   st->effective_prot(st, level, pte_val(val));
 
-   st->note_page(st, addr, 4, pte_val(val), PAGE_SIZE);
+   st->note_page(st, addr, level, pte_val(val), page_size);
 
return 0;
 }
-- 
2.25.0



[PATCH v1 5/5] powerpc/mm: Convert powerpc to GENERIC_PTDUMP

2021-04-15 Thread Christophe Leroy
This patch converts powerpc to the generic PTDUMP implementation.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig  |   2 +
 arch/powerpc/Kconfig.debug|  30 --
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/mmu_decl.h|   2 +-
 arch/powerpc/mm/ptdump/8xx.c  |   6 +-
 arch/powerpc/mm/ptdump/Makefile   |   9 +-
 arch/powerpc/mm/ptdump/book3s64.c |   6 +-
 arch/powerpc/mm/ptdump/ptdump.c   | 161 +-
 arch/powerpc/mm/ptdump/shared.c   |   6 +-
 9 files changed, 68 insertions(+), 156 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 475d77a6ebbe..40259437a28f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -120,6 +120,7 @@ config PPC
select ARCH_32BIT_OFF_T if PPC32
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE
+   select ARCH_HAS_DEBUG_WXif STRICT_KERNEL_RWX
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
@@ -177,6 +178,7 @@ config PPC
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
select GENERIC_PCI_IOMAPif PCI
+   select GENERIC_PTDUMP
select GENERIC_SMP_IDLE_THREAD
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 6342f9da4545..05b1180ea502 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -360,36 +360,6 @@ config FAIL_IOMMU
 
  If you are unsure, say N.
 
-config PPC_PTDUMP
-   bool "Export kernel pagetable layout to userspace via debugfs"
-   depends on DEBUG_KERNEL && DEBUG_FS
-   help
- This option exports the state of the kernel pagetables to a
- debugfs file. This is only useful for kernel developers who are
- working in architecture specific areas of the kernel - probably
- not a good idea to enable this feature in a production kernel.
-
- If you are unsure, say N.
-
-config PPC_DEBUG_WX
-   bool "Warn on W+X mappings at boot"
-   depends on PPC_PTDUMP && STRICT_KERNEL_RWX
-   help
- Generate a warning if any W+X mappings are found at boot.
-
- This is useful for discovering cases where the kernel is leaving
- W+X mappings after applying NX, as such mappings are a security risk.
-
- Note that even if the check fails, your kernel is possibly
- still fine, as W+X mappings are not a security hole in
- themselves, what they do is that they make the exploitation
- of other unfixed kernel bugs easier.
-
- There is no runtime or memory usage effect of this option
- once the kernel has booted up - it's a one time check.
-
- If in doubt, say "Y".
-
 config PPC_FAST_ENDIAN_SWITCH
bool "Deprecated fast endian-switch syscall"
depends on DEBUG_KERNEL && PPC_BOOK3S_64
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index c3df3a8501d4..c90d58aaebe2 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -18,5 +18,5 @@ obj-$(CONFIG_PPC_MM_SLICES)   += slice.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_PPC_COPRO_BASE)   += copro_fault.o
-obj-$(CONFIG_PPC_PTDUMP)   += ptdump/
+obj-$(CONFIG_PTDUMP_CORE)  += ptdump/
 obj-$(CONFIG_KASAN)+= kasan/
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 7dac910c0b21..dd1cabc2ea0f 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -180,7 +180,7 @@ static inline void mmu_mark_rodata_ro(void) { }
 void __init mmu_mapin_immr(void);
 #endif
 
-#ifdef CONFIG_PPC_DEBUG_WX
+#ifdef CONFIG_DEBUG_WX
 void ptdump_check_wx(void);
 #else
 static inline void ptdump_check_wx(void) { }
diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
index 86da2a669680..fac932eb8f9a 100644
--- a/arch/powerpc/mm/ptdump/8xx.c
+++ b/arch/powerpc/mm/ptdump/8xx.c
@@ -75,8 +75,10 @@ static const struct flag_info flag_array[] = {
 };
 
 struct pgtable_level pg_level[5] = {
-   {
-   }, { /* pgd */
+   { /* pgd */
+   .flag   = flag_array,
+   .num= ARRAY_SIZE(flag_array),
+   }, { /* p4d */
.flag   = flag_array,
.num= ARRAY_SIZE(flag_array),
}, { /* pud */
diff --git a/arch/powerpc/mm/ptdump/Makefile b/arch/powerpc/mm/ptdump/Makefile
index 712762be3cb1..4050cbb55acf 100644
--- a/arch/powerpc/mm/ptdump/Makefile
+++ b/arch/powerpc/mm/ptdump/Makefile
@@ -5,5 +5,10 @@ obj-y  += ptdump.o
 obj-$(CONFIG_4xx)  += shared.o
 obj-$(CONFIG_PPC_8xx)  += 8xx.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)   += shared.o
-obj-$(CONFIG_PPC_BOOK3S_32)+= shared.o bat

[PATCH v1 1/5] mm: pagewalk: Fix walk for hugepage tables

2021-04-15 Thread Christophe Leroy
Pagewalk ignores hugepd entries and walks down the tables
as if they were traditional entries, leading to crazy results.

Add walk_hugepd_range() and use it to walk hugepage tables.

Signed-off-by: Christophe Leroy 
---
 mm/pagewalk.c | 54 +--
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..410a9d8f7572 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,6 +58,32 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, 
unsigned long end,
return err;
 }
 
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
+unsigned long end, struct mm_walk *walk, int 
pdshift)
+{
+   int err = 0;
+#ifdef CONFIG_ARCH_HAS_HUGEPD
+   const struct mm_walk_ops *ops = walk->ops;
+   int shift = hugepd_shift(*phpd);
+   int page_size = 1 << shift;
+
+   if (addr & (page_size - 1))
+   return 0;
+
+   for (;;) {
+   pte_t *pte = hugepte_offset(*phpd, addr, pdshift);
+
+   err = ops->pte_entry(pte, addr, addr + page_size, walk);
+   if (err)
+   break;
+   if (addr >= end - page_size)
+   break;
+   addr += page_size;
+   }
+#endif
+   return err;
+}
+
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
  struct mm_walk *walk)
 {
@@ -108,7 +134,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
goto again;
}
 
-   err = walk_pte_range(pmd, addr, next, walk);
+   if (is_hugepd(__hugepd(pmd_val(*pmd))))
+   err = walk_hugepd_range((hugepd_t *)pmd, addr, next, 
walk, PMD_SHIFT);
+   else
+   err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
} while (pmd++, addr = next, addr != end);
@@ -157,7 +186,10 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
if (pud_none(*pud))
goto again;
 
-   err = walk_pmd_range(pud, addr, next, walk);
+   if (is_hugepd(__hugepd(pud_val(*pud))))
+   err = walk_hugepd_range((hugepd_t *)pud, addr, next, 
walk, PUD_SHIFT);
+   else
+   err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
} while (pud++, addr = next, addr != end);
@@ -189,8 +221,13 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
if (err)
break;
}
-   if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
-   err = walk_pud_range(p4d, addr, next, walk);
+   if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) {
+   if (is_hugepd(__hugepd(p4d_val(*p4d))))
+   err = walk_hugepd_range((hugepd_t *)p4d, addr, 
next, walk,
+   P4D_SHIFT);
+   else
+   err = walk_pud_range(p4d, addr, next, walk);
+   }
if (err)
break;
} while (p4d++, addr = next, addr != end);
@@ -225,8 +262,13 @@ static int walk_pgd_range(unsigned long addr, unsigned 
long end,
break;
}
if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
-   ops->pte_entry)
-   err = walk_p4d_range(pgd, addr, next, walk);
+   ops->pte_entry) {
+   if (is_hugepd(__hugepd(pgd_val(*pgd))))
+   err = walk_hugepd_range((hugepd_t *)pgd, addr, 
next, walk,
+   PGDIR_SHIFT);
+   else
+   err = walk_p4d_range(pgd, addr, next, walk);
+   }
if (err)
break;
} while (pgd++, addr = next, addr != end);
-- 
2.25.0



[PATCH v1 0/5] Convert powerpc to GENERIC_PTDUMP

2021-04-15 Thread Christophe Leroy
This series converts powerpc to generic PTDUMP.

For that, we first need to add missing hugepd support
to pagewalk and ptdump.

Christophe Leroy (5):
  mm: pagewalk: Fix walk for hugepage tables
  mm: ptdump: Fix build failure
  mm: ptdump: Provide page size to notepage()
  mm: ptdump: Support hugepd table entries
  powerpc/mm: Convert powerpc to GENERIC_PTDUMP

 arch/arm64/mm/ptdump.c|   2 +-
 arch/powerpc/Kconfig  |   2 +
 arch/powerpc/Kconfig.debug|  30 --
 arch/powerpc/mm/Makefile  |   2 +-
 arch/powerpc/mm/mmu_decl.h|   2 +-
 arch/powerpc/mm/ptdump/8xx.c  |   6 +-
 arch/powerpc/mm/ptdump/Makefile   |   9 +-
 arch/powerpc/mm/ptdump/book3s64.c |   6 +-
 arch/powerpc/mm/ptdump/ptdump.c   | 161 +-
 arch/powerpc/mm/ptdump/shared.c   |   6 +-
 arch/riscv/mm/ptdump.c|   2 +-
 arch/s390/mm/dump_pagetables.c|   3 +-
 arch/x86/mm/dump_pagetables.c |   2 +-
 include/linux/ptdump.h|   2 +-
 mm/pagewalk.c |  54 --
 mm/ptdump.c   |  33 --
 16 files changed, 145 insertions(+), 177 deletions(-)

-- 
2.25.0



Re: [PATCH v13 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-04-15 Thread Christophe Leroy

Hi Nick,

Le 17/03/2021 à 07:24, Nicholas Piggin a écrit :

This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%, due
to vfs hashes being allocated with 2MB pages.

Cc: linuxppc-...@lists.ozlabs.org
Acked-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
---
  .../admin-guide/kernel-parameters.txt |  2 ++
  arch/powerpc/Kconfig  |  1 +
  arch/powerpc/kernel/module.c  | 22 +++
  3 files changed, 21 insertions(+), 4 deletions(-)

--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -8,6 +8,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -87,13 +88,26 @@ int module_finalize(const Elf_Ehdr *hdr,
return 0;
  }
  
-#ifdef MODULES_VADDR

  void *module_alloc(unsigned long size)
  {
+   unsigned long start = VMALLOC_START;
+   unsigned long end = VMALLOC_END;
+
+#ifdef MODULES_VADDR
BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
+   start = MODULES_VADDR;
+   end = MODULES_END;
+#endif
+
+   /*
+* Don't do huge page allocations for modules yet until more testing
+* is done. STRICT_MODULE_RWX may require extra work to support this
+* too.
+*/
  
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL,

-   PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, 
NUMA_NO_NODE,



I think you should add the following in 

#ifndef MODULES_VADDR
#define MODULES_VADDR VMALLOC_START
#define MODULES_END VMALLOC_END
#endif

And leave module_alloc() as is (just removing the enclosing #ifdef MODULES_VADDR and adding the 
VM_NO_HUGE_VMAP  flag)


This would minimise the conflicts with the changes I did in powerpc/next 
reported by Stephen R.


+   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
+   PAGE_KERNEL_EXEC,
+   VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+   NUMA_NO_NODE,
__builtin_return_address(0));
  }
-#endif
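
Something like this, I mean (untested sketch; the exact header that should
carry the fallback defines is still to be decided):

	/* fallback, e.g. in a powerpc pgtable header: */
	#ifndef MODULES_VADDR
	#define MODULES_VADDR	VMALLOC_START
	#define MODULES_END	VMALLOC_END
	#endif

	void *module_alloc(unsigned long size)
	{
		BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);

		return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
					    GFP_KERNEL, PAGE_KERNEL_EXEC,
					    VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
					    NUMA_NO_NODE,
					    __builtin_return_address(0));
	}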



Re: linux-next: manual merge of the akpm-current tree with the powerpc tree

2021-04-15 Thread Christophe Leroy




Le 15/04/2021 à 12:08, Christophe Leroy a écrit :



Le 15/04/2021 à 12:07, Christophe Leroy a écrit :



Le 15/04/2021 à 11:58, Stephen Rothwell a écrit :

Hi all,

On Thu, 15 Apr 2021 19:44:17 +1000 Stephen Rothwell  
wrote:


Today's linux-next merge of the akpm-current tree got a conflict in:

   arch/powerpc/kernel/module.c

between commit:

   2ec13df16704 ("powerpc/modules: Load modules closer to kernel text")

from the powerpc tree and commit:

   4930ba789f8d ("powerpc/64s/radix: enable huge vmalloc mappings")

from the akpm-current tree.

I fixed it up (I think - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/kernel/module.c
index fab84024650c,cdb2d88c54e7..
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@@ -88,29 -88,26 +89,42 @@@ int module_finalize(const Elf_Ehdr *hdr
   return 0;
   }
- #ifdef MODULES_VADDR
  -void *module_alloc(unsigned long size)
  +static __always_inline void *
  +__module_alloc(unsigned long size, unsigned long start, unsigned long end)
   {
  -    unsigned long start = VMALLOC_START;
  -    unsigned long end = VMALLOC_END;
  -
  -#ifdef MODULES_VADDR
  -    BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
  -    start = MODULES_VADDR;
  -    end = MODULES_END;
  -#endif
  -
+ /*
+  * Don't do huge page allocations for modules yet until more testing
+  * is done. STRICT_MODULE_RWX may require extra work to support this
+  * too.
+  */
+
   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
- PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+ PAGE_KERNEL_EXEC,
+ VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+ NUMA_NO_NODE,
   __builtin_return_address(0));
   }
  +
++
  +void *module_alloc(unsigned long size)
  +{
++    unsigned long start = VMALLOC_START;
++    unsigned long end = VMALLOC_END;
  +    unsigned long limit = (unsigned long)_etext - SZ_32M;
  +    void *ptr = NULL;
  +
++#ifdef MODULES_VADDR
  +    BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
++    start = MODULES_VADDR;
++    end = MODULES_END;


The #endif should be here.



  +
  +    /* First try within 32M limit from _etext to avoid branch trampolines */
  +    if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)


Should also use start and end here instead of MODULES_VADDR  and MODULES_END



The cleanest however should be to define MODULES_VADDR and MODULES_END all the time with a fallback 
to VMALLOC_START/VMALLOC_END, to avoid the #ifdef.


The #ifdef was OK when we wanted to define modules_alloc() only when module area was different from 
vmalloc area, but now that we want modules_alloc() at all time, MODULES_VADDR and MODULES_END should 
be defined all the time.






- ptr = __module_alloc(size, limit, MODULES_END);
++    ptr = __module_alloc(size, limit, end);
  +
  +    if (!ptr)
- ptr = __module_alloc(size, MODULES_VADDR, MODULES_END);
++#endif
++    ptr = __module_alloc(size, start, end);
  +
  +    return ptr;
  +}
- #endif


Unfortunately, it also needs this:


Before the #endif is too far.



From: Stephen Rothwell 
Date: Thu, 15 Apr 2021 19:53:58 +1000
Subject: [PATCH] merge fix up for powerpc merge fix

Signed-off-by: Stephen Rothwell 
---
  arch/powerpc/kernel/module.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index d8ab1ad2eb05..c060f99afd4d 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -110,7 +110,9 @@ void *module_alloc(unsigned long size)
  {
  unsigned long start = VMALLOC_START;
  unsigned long end = VMALLOC_END;
+#ifdef MODULES_VADDR
  unsigned long limit = (unsigned long)_etext - SZ_32M;
+#endif
  void *ptr = NULL;
  #ifdef MODULES_VADDR



Re: linux-next: manual merge of the akpm-current tree with the powerpc tree

2021-04-15 Thread Christophe Leroy




Le 15/04/2021 à 12:07, Christophe Leroy a écrit :



Le 15/04/2021 à 11:58, Stephen Rothwell a écrit :

Hi all,

On Thu, 15 Apr 2021 19:44:17 +1000 Stephen Rothwell  
wrote:


Today's linux-next merge of the akpm-current tree got a conflict in:

   arch/powerpc/kernel/module.c

between commit:

   2ec13df16704 ("powerpc/modules: Load modules closer to kernel text")

from the powerpc tree and commit:

   4930ba789f8d ("powerpc/64s/radix: enable huge vmalloc mappings")

from the akpm-current tree.

I fixed it up (I think - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/kernel/module.c
index fab84024650c,cdb2d88c54e7..
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@@ -88,29 -88,26 +89,42 @@@ int module_finalize(const Elf_Ehdr *hdr
   return 0;
   }
- #ifdef MODULES_VADDR
  -void *module_alloc(unsigned long size)
  +static __always_inline void *
  +__module_alloc(unsigned long size, unsigned long start, unsigned long end)
   {
  -    unsigned long start = VMALLOC_START;
  -    unsigned long end = VMALLOC_END;
  -
  -#ifdef MODULES_VADDR
  -    BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
  -    start = MODULES_VADDR;
  -    end = MODULES_END;
  -#endif
  -
+ /*
+  * Don't do huge page allocations for modules yet until more testing
+  * is done. STRICT_MODULE_RWX may require extra work to support this
+  * too.
+  */
+
   return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
- PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+ PAGE_KERNEL_EXEC,
+ VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+ NUMA_NO_NODE,
   __builtin_return_address(0));
   }
  +
++
  +void *module_alloc(unsigned long size)
  +{
++    unsigned long start = VMALLOC_START;
++    unsigned long end = VMALLOC_END;
  +    unsigned long limit = (unsigned long)_etext - SZ_32M;
  +    void *ptr = NULL;
  +
++#ifdef MODULES_VADDR
  +    BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
++    start = MODULES_VADDR;
++    end = MODULES_END;


The #endif should be here.



  +
  +    /* First try within 32M limit from _etext to avoid branch trampolines */
  +    if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)


Should also use start and end here instead of MODULES_VADDR  and MODULES_END


- ptr = __module_alloc(size, limit, MODULES_END);
++    ptr = __module_alloc(size, limit, end);
  +
  +    if (!ptr)
- ptr = __module_alloc(size, MODULES_VADDR, MODULES_END);
++#endif
++    ptr = __module_alloc(size, start, end);
  +
  +    return ptr;
  +}
- #endif


Unfortunately, it also needs this:


Before the #endif is too far.



From: Stephen Rothwell 
Date: Thu, 15 Apr 2021 19:53:58 +1000
Subject: [PATCH] merge fix up for powerpc merge fix

Signed-off-by: Stephen Rothwell 
---
  arch/powerpc/kernel/module.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index d8ab1ad2eb05..c060f99afd4d 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -110,7 +110,9 @@ void *module_alloc(unsigned long size)
  {
  unsigned long start = VMALLOC_START;
  unsigned long end = VMALLOC_END;
+#ifdef MODULES_VADDR
  unsigned long limit = (unsigned long)_etext - SZ_32M;
+#endif
  void *ptr = NULL;
  #ifdef MODULES_VADDR



Re: linux-next: manual merge of the akpm-current tree with the powerpc tree

2021-04-15 Thread Christophe Leroy




Le 15/04/2021 à 11:58, Stephen Rothwell a écrit :

Hi all,

On Thu, 15 Apr 2021 19:44:17 +1000 Stephen Rothwell  
wrote:


Today's linux-next merge of the akpm-current tree got a conflict in:

   arch/powerpc/kernel/module.c

between commit:

   2ec13df16704 ("powerpc/modules: Load modules closer to kernel text")

from the powerpc tree and commit:

   4930ba789f8d ("powerpc/64s/radix: enable huge vmalloc mappings")

from the akpm-current tree.

I fixed it up (I think - see below) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/kernel/module.c
index fab84024650c,cdb2d88c54e7..
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@@ -88,29 -88,26 +89,42 @@@ int module_finalize(const Elf_Ehdr *hdr
return 0;
   }
   
- #ifdef MODULES_VADDR

  -void *module_alloc(unsigned long size)
  +static __always_inline void *
  +__module_alloc(unsigned long size, unsigned long start, unsigned long end)
   {
  - unsigned long start = VMALLOC_START;
  - unsigned long end = VMALLOC_END;
  -
  -#ifdef MODULES_VADDR
  - BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
  - start = MODULES_VADDR;
  - end = MODULES_END;
  -#endif
  -
+   /*
+* Don't do huge page allocations for modules yet until more testing
+* is done. STRICT_MODULE_RWX may require extra work to support this
+* too.
+*/
+
return __vmalloc_node_range(size, 1, start, end, GFP_KERNEL,
-   PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, 
NUMA_NO_NODE,
+   PAGE_KERNEL_EXEC,
+   VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
+   NUMA_NO_NODE,
__builtin_return_address(0));
   }
  +
++
  +void *module_alloc(unsigned long size)
  +{
++  unsigned long start = VMALLOC_START;
++  unsigned long end = VMALLOC_END;
  + unsigned long limit = (unsigned long)_etext - SZ_32M;
  + void *ptr = NULL;
  +
++#ifdef MODULES_VADDR
  + BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
++  start = MODULES_VADDR;
++  end = MODULES_END;


The #endif should be here.



  +
  + /* First try within 32M limit from _etext to avoid branch trampolines */
  + if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)
-   ptr = __module_alloc(size, limit, MODULES_END);
++  ptr = __module_alloc(size, limit, end);
  +
  + if (!ptr)
-   ptr = __module_alloc(size, MODULES_VADDR, MODULES_END);
++#endif
++  ptr = __module_alloc(size, start, end);
  +
  + return ptr;
  +}
- #endif


Unfortunately, it also needs this:


Before the #endif is too far.



From: Stephen Rothwell 
Date: Thu, 15 Apr 2021 19:53:58 +1000
Subject: [PATCH] merge fix up for powerpc merge fix

Signed-off-by: Stephen Rothwell 
---
  arch/powerpc/kernel/module.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index d8ab1ad2eb05..c060f99afd4d 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -110,7 +110,9 @@ void *module_alloc(unsigned long size)
  {
unsigned long start = VMALLOC_START;
unsigned long end = VMALLOC_END;
+#ifdef MODULES_VADDR
unsigned long limit = (unsigned long)_etext - SZ_32M;
+#endif
void *ptr = NULL;
  
  #ifdef MODULES_VADDR




[PATCH] mm: ptdump: Fix build failure

2021-04-15 Thread Christophe Leroy
  CC  mm/ptdump.o
In file included from :
mm/ptdump.c: In function 'ptdump_pte_entry':
././include/linux/compiler_types.h:320:38: error: call to 
'__compiletime_assert_207' declared with attribute error: Unsupported access 
size for {READ,WRITE}_ONCE().
  320 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  |  ^
././include/linux/compiler_types.h:301:4: note: in definition of macro 
'__compiletime_assert'
  301 |prefix ## suffix();\
  |^~
././include/linux/compiler_types.h:320:2: note: in expansion of macro 
'_compiletime_assert'
  320 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
__COUNTER__)
  |  ^~~
./include/asm-generic/rwonce.h:36:2: note: in expansion of macro 
'compiletime_assert'
   36 |  compiletime_assert(__native_word(t) || sizeof(t) == 
sizeof(long long), \
  |  ^~
./include/asm-generic/rwonce.h:49:2: note: in expansion of macro 
'compiletime_assert_rwonce_type'
   49 |  compiletime_assert_rwonce_type(x);\
  |  ^~
mm/ptdump.c:114:14: note: in expansion of macro 'READ_ONCE'
  114 |  pte_t val = READ_ONCE(*pte);
  |  ^
make[2]: *** [mm/ptdump.o] Error 1

READ_ONCE() cannot be used for reading PTEs. Use ptep_get()
instead. See commit 481e980a7c19 ("mm: Allow arches to provide ptep_get()")
and commit c0e1c8c22beb ("powerpc/8xx: Provide ptep_get() with 16k pages")
for details.

Fixes: 30d621f6723b ("mm: add generic ptdump")
Cc: Steven Price 
Signed-off-by: Christophe Leroy 
---
 mm/ptdump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/ptdump.c b/mm/ptdump.c
index 4354c1422d57..da751448d0e4 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -111,7 +111,7 @@ static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk)
 {
struct ptdump_state *st = walk->private;
-   pte_t val = READ_ONCE(*pte);
+   pte_t val = ptep_get(pte);
 
if (st->effective_prot)
st->effective_prot(st, 4, pte_val(val));
-- 
2.25.0



[PATCH v3 3/3] powerpc/atomics: Remove atomic_inc()/atomic_dec() and friends

2021-04-14 Thread Christophe Leroy
Now that atomic_add() and atomic_sub() handle immediate operands,
atomic_inc() and atomic_dec() have no added value compared to the
generic fallback which calls atomic_add(1) and atomic_sub(1).

Also remove atomic_inc_not_zero() which falls back to
atomic_add_unless() which itself falls back to
atomic_fetch_add_unless() which now handles immediate operands.
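
For reference, the generic fallbacks this relies on look roughly like this
(sketch, not the exact atomic-fallback text):

	#ifndef atomic_inc
	static inline void atomic_inc(atomic_t *v)
	{
		atomic_add(1, v);
	}
	#define atomic_inc atomic_inc
	#endif

	#ifndef atomic_dec
	static inline void atomic_dec(atomic_t *v)
	{
		atomic_sub(1, v);
	}
	#define atomic_dec atomic_dec
	#endif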

Signed-off-by: Christophe Leroy 
---
v2: New
---
 arch/powerpc/include/asm/atomic.h | 95 ---
 1 file changed, 95 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index eb1bdf14f67c..00ba5d9e837b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -118,71 +118,6 @@ ATOMIC_OPS(xor, xor, "", K)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-static __inline__ void atomic_inc(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc\n\
-   addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2 \n\
-   bne-1b"
-   : "=" (t), "+m" (v->counter)
-   : "r" (>counter)
-   : "cc", "xer");
-}
-#define atomic_inc atomic_inc
-
-static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc_return_relaxed\n"
-"  addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=" (t), "+m" (v->counter)
-   : "r" (>counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-static __inline__ void atomic_dec(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec\n\
-   addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n\
-   bne-1b"
-   : "=" (t), "+m" (v->counter)
-   : "r" (>counter)
-   : "cc", "xer");
-}
-#define atomic_dec atomic_dec
-
-static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec_return_relaxed\n"
-"  addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=" (t), "+m" (v->counter)
-   : "r" (>counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-#define atomic_inc_return_relaxed atomic_inc_return_relaxed
-#define atomic_dec_return_relaxed atomic_dec_return_relaxed
-
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_cmpxchg_relaxed(v, o, n) \
cmpxchg_relaxed(&((v)->counter), (o), (n))
@@ -252,36 +187,6 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, 
int a, int u)
 }
 #define atomic_fetch_add_unless atomic_fetch_add_unless
 
-/**
- * atomic_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1, so long as @v is non-zero.
- * Returns non-zero if @v was non-zero, and zero otherwise.
- */
-static __inline__ int atomic_inc_not_zero(atomic_t *v)
-{
-   int t1, t2;
-
-   __asm__ __volatile__ (
-   PPC_ATOMIC_ENTRY_BARRIER
-"1:lwarx   %0,0,%2 # atomic_inc_not_zero\n\
-   cmpwi   0,%0,0\n\
-   beq-2f\n\
-   addic   %1,%0,1\n"
-"  stwcx.  %1,0,%2\n\
-   bne-1b\n"
-   PPC_ATOMIC_EXIT_BARRIER
-   "\n\
-2:"
-   : "=" (t1), "=" (t2)
-   : "r" (>counter)
-   : "cc", "xer", "memory");
-
-   return t1;
-}
-#define atomic_inc_not_zero(v) atomic_inc_not_zero((v))
-
 /*
  * Atomically test *v and decrement if it is greater than 0.
  * The function returns the old value of *v minus 1, even if
-- 
2.25.0



[PATCH v3 2/3] powerpc/atomics: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy
Today we get the following code generation for atomic operations:

c001bb2c:   39 20 00 01 li  r9,1
c001bb30:   7d 40 18 28 lwarx   r10,0,r3
c001bb34:   7d 09 50 50 subfr8,r9,r10
c001bb38:   7d 00 19 2d stwcx.  r8,0,r3

c001c7a8:   39 40 00 01 li  r10,1
c001c7ac:   7d 00 18 28 lwarx   r8,0,r3
c001c7b0:   7c ea 42 14 add r7,r10,r8
c001c7b4:   7c e0 19 2d stwcx.  r7,0,r3

By allowing GCC to choose between immediate or regular operation,
we get:

c001bb2c:   7d 20 18 28 lwarx   r9,0,r3
c001bb30:   39 49 ff ff addir10,r9,-1
c001bb34:   7d 40 19 2d stwcx.  r10,0,r3
--
c001c7a4:   7d 40 18 28 lwarx   r10,0,r3
c001c7a8:   39 0a 00 01 addir8,r10,1
c001c7ac:   7d 00 19 2d stwcx.  r8,0,r3

For "and", the dot form has to be used because "andi" doesn't exist.

For logical operations we use unsigned 16 bits immediate.
For arithmetic operations we use signed 16 bits immediate.

On pmac32_defconfig, it reduces the text by approx another 8 kbytes.
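
As a stand-alone illustration of the trick (a sketch only, not the kernel
macro itself): the "%I2" output modifier makes GCC print an "i" suffix when
operand 2 ends up being an immediate, so a single template covers both the
register and the immediate forms:

	/* Sketch: OR a constant or a register value into *p atomically. */
	static inline void atomic_or_sketch(unsigned int mask, unsigned int *p)
	{
		unsigned int t;

		asm volatile(
	"1:	lwarx	%0,0,%3\n"
	"	or%I2	%0,%0,%2\n"	/* emitted as "or" or "ori" */
	"	stwcx.	%0,0,%3\n"
	"	bne-	1b\n"
		: "=&r" (t), "+m" (*p)
		: "rK" (mask), "r" (p)
		: "cc", "memory");
	}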

Signed-off-by: Christophe Leroy 
Acked-by: Segher Boessenkool 
---
v2: Use "addc/addic"
---
 arch/powerpc/include/asm/atomic.h | 56 +++
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 61c6e8b200e8..eb1bdf14f67c 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -37,62 +37,62 @@ static __inline__ void atomic_set(atomic_t *v, int i)
__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"UPD_CONSTR(v->counter) : 
"r"(i));
 }
 
-#define ATOMIC_OP(op, asm_op)  \
+#define ATOMIC_OP(op, asm_op, suffix, sign, ...)   \
 static __inline__ void atomic_##op(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "\n" \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3 \n" \
 "  bne-1b\n"   \
: "=" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (>counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (>counter)   \
+   : "cc", ##__VA_ARGS__); \
 }  \
 
-#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op, suffix, sign, ...)
\
 static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
: "=" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (>counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (>counter)   \
+   : "cc", ##__VA_ARGS__); \
\
return t;   \
 }
 
-#define ATOMIC_FETCH_OP_RELAXED(op, asm_op)\
+#de

[PATCH v3 1/3] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy
Today we get the following code generation for bitops like
set or clear bit:

c0009fe0:   39 40 08 00 li  r10,2048
c0009fe4:   7c e0 40 28 lwarx   r7,0,r8
c0009fe8:   7c e7 53 78 or  r7,r7,r10
c0009fec:   7c e0 41 2d stwcx.  r7,0,r8

c000d568:   39 00 18 00 li  r8,6144
c000d56c:   7c c0 38 28 lwarx   r6,0,r7
c000d570:   7c c6 40 78 andcr6,r6,r8
c000d574:   7c c0 39 2d stwcx.  r6,0,r7

Most set bits are constant on lower 16 bits, so it can easily
be replaced by the "immediate" version of the operation. Allow
GCC to choose between the normal or immediate form.

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' for
when all bits to be cleared are consecutive.

On 64 bits we don't have any equivalent single operation for clearing,
single bits or a few bits, we'd need two 'rldicl' so it is not
worth it, the li/andc sequence is doing the same.

With this patch we get:

c0009fe0:   7d 00 50 28 lwarx   r8,0,r10
c0009fe4:   61 08 08 00 ori r8,r8,2048
c0009fe8:   7d 00 51 2d stwcx.  r8,0,r10

c000d558:   7c e0 40 28 lwarx   r7,0,r8
c000d55c:   54 e7 05 64 rlwinm  r7,r7,0,21,18
c000d560:   7c e0 41 2d stwcx.  r7,0,r8

On pmac32_defconfig, it reduces the text by approx 10 kbytes.
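
As a usage illustration of the mask check below (example values only, not
part of the patch), the rlwinm path is taken when the bits to keep form a
single, possibly wrapped, run of ones:

	/* Sketch: mirrors the condition used in DEFINE_CLROP() below. */
	static inline bool can_use_rlwinm(unsigned long clear_mask)
	{
		return IS_ENABLED(CONFIG_PPC32) &&
		       __builtin_constant_p(clear_mask) &&
		       is_rlwinm_mask_valid(~clear_mask);
	}

	/* can_use_rlwinm(0x0000ff00): true  - bits to keep are one wrapped run of 1s */
	/* can_use_rlwinm(0x00010100): false - bits to clear are not consecutive */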

Signed-off-by: Christophe Leroy 
---
v3:
- Using the mask validation proposed by Segher

v2:
- Use "n" instead of "i" as constraint for the rlwinm mask
- Improve mask verification to handle more than single bit masks

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bitops.h | 89 ---
 1 file changed, 81 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 299ab33505a6..09500c789972 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -71,19 +71,61 @@ static inline void fn(unsigned long mask,   \
__asm__ __volatile__ (  \
prefix  \
 "1:"   PPC_LLARX(%0,0,%3,0) "\n"   \
-   stringify_in_c(op) "%0,%0,%2\n" \
+   #op "%I2 %0,%0,%2\n"\
PPC_STLCX "%0,0,%3\n"   \
"bne- 1b\n" \
: "=" (old), "+m" (*p)\
-   : "r" (mask), "r" (p)   \
+   : "rK" (mask), "r" (p)  \
: "cc", "memory");  \
 }
 
 DEFINE_BITOP(set_bits, or, "")
-DEFINE_BITOP(clear_bits, andc, "")
-DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
+static __always_inline bool is_rlwinm_mask_valid(unsigned long x)
+{
+   if (!x)
+   return false;
+   if (x & 1)
+   x = ~x; // make the mask non-wrapping
+   x += x & -x;// adding the low set bit results in at most one bit set
+
+   return !(x & (x - 1));
+}
+
+#define DEFINE_CLROP(fn, prefix)   \
+static inline void fn(unsigned long mask, volatile unsigned long *_p)  \
+{  \
+   unsigned long old;  \
+   unsigned long *p = (unsigned long *)_p; \
+   \
+   if (IS_ENABLED(CONFIG_PPC32) && \
+   __builtin_constant_p(mask) && is_rlwinm_mask_valid(~mask)) {\
+   asm volatile (  \
+   prefix  \
+   "1:""lwarx  %0,0,%3\n"  \
+   "rlwinm %0,%0,0,%2\n"   \
+   "stwcx. %0,0,%3\n"  \
+   "bne- 1b\n" \
+   : "=" (old), "+m" (*p)\
+   : "n" (~mask), "r" (p)  \
+   : "cc", "memory");  \
+   } else {\
+   asm volatile (  \
+   prefix  \
+   "1:"PPC_LLARX(%0,0,%3,0) &qu

[PATCH v3 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-14 Thread Christophe Leroy
When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_inst_from_kernel_nofault().

Signed-off-by: Christophe Leroy 
---
v3: copy_inst_from_kernel_nofault() instead of copy_from_kernel_nofault_inst()
---
 arch/powerpc/include/asm/inst.h|  3 +--
 arch/powerpc/kernel/align.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c | 22 +++---
 arch/powerpc/lib/inst.c|  3 +--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index a40c3913a4a3..eaf5a6299034 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -177,7 +177,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src);
+int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src);
 
 #endif /* _ASM_POWERPC_INST_H */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index a97d5f1a3905..8f350d0478e6 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -311,7 +311,7 @@ int fix_alignment(struct pt_regs *regs)
CHECK_FULL_REGS(regs);
 
if (is_kernel_addr(regs->nip))
-   r = probe_kernel_read_inst(&instr, (void *)regs->nip);
+   r = copy_inst_from_kernel_nofault(&instr, (void *)regs->nip);
else
r = __get_user_instr(instr, (void __user *)regs->nip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 42761ebec9f7..ffe9537195aa 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -68,7 +68,7 @@ ftrace_modify_code(unsigned long ip, struct ppc_inst old, 
struct ppc_inst new)
 */
 
/* read the text we want to modify */
-   if (probe_kernel_read_inst(&replaced, (void *)ip))
+   if (copy_inst_from_kernel_nofault(&replaced, (void *)ip))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
@@ -130,7 +130,7 @@ __ftrace_make_nop(struct module *mod,
struct ppc_inst op, pop;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -164,7 +164,7 @@ __ftrace_make_nop(struct module *mod,
/* When using -mkernel_profile there is no load to jump over */
pop = ppc_inst(PPC_INST_NOP);
 
-   if (probe_kernel_read_inst(&op, (void *)(ip - 4))) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)(ip - 4))) {
pr_err("Fetching instruction at %lx failed.\n", ip - 4);
return -EFAULT;
}
@@ -197,7 +197,7 @@ __ftrace_make_nop(struct module *mod,
 * Check what is in the next instruction. We can see ld r2,40(r1), but
 * on first pass after boot we will see mflr r0.
 */
-   if (probe_kernel_read_inst(&op, (void *)(ip + 4))) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)(ip + 4))) {
pr_err("Fetching op failed.\n");
return -EFAULT;
}
@@ -349,7 +349,7 @@ static int setup_mcount_compiler_tramp(unsigned long tramp)
return -1;
 
/* New trampoline -- read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)tramp)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)tramp)) {
pr_debug("Fetching opcode failed.\n");
return -1;
}
@@ -399,7 +399,7 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, 
unsigned long addr)
struct ppc_inst op;
 
/* Read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_inst_from_kernel_nofault(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -526,10 +526,10 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
struct module *mod = rec->arch.mod;
 
/* read where this goes */
-   if (probe_kernel_read_inst(op, ip))
+   if (copy_inst_from_kernel_nofault(op, ip))
return -EFAULT;
 
-   if (probe_kernel_read_inst(op + 1, ip + 4))
+   if (copy_inst_from_kernel_nofault(op + 1, ip + 4))
return -EFAULT;
 
if (!expected_nop_sequence(ip, op[0], op[1])) {
@@ -592,7 +592,7 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
unsigned long ip = rec->ip;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip))
+   if (copy_inst_from_kernel_nofault(&op,

[PATCH v3 4/4] powerpc: Move copy_inst_from_kernel_nofault()

2021-04-14 Thread Christophe Leroy
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.

Since then, probe_kernel_read_inst() has been renamed
copy_inst_from_kernel_nofault(). And mm/maccess.c didn't exist at that
time. Today, mm/maccess.c is related to copy_from_kernel_nofault().

Move copy_inst_from_kernel_nofault() into mm/maccess.c.

Signed-off-by: Christophe Leroy 
---
v2: Remove inst.o from Makefile
---
 arch/powerpc/lib/Makefile |  2 +-
 arch/powerpc/lib/inst.c   | 26 --
 arch/powerpc/mm/maccess.c | 21 +
 3 files changed, 22 insertions(+), 27 deletions(-)
 delete mode 100644 arch/powerpc/lib/inst.c

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d4efc182662a..f2c690ee75d1 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -16,7 +16,7 @@ CFLAGS_code-patching.o += -DDISABLE_BRANCH_PROFILING
 CFLAGS_feature-fixups.o += -DDISABLE_BRANCH_PROFILING
 endif
 
-obj-y += alloc.o code-patching.o feature-fixups.o pmem.o inst.o 
test_code-patching.o
+obj-y += alloc.o code-patching.o feature-fixups.o pmem.o test_code-patching.o
 
 ifndef CONFIG_KASAN
 obj-y  +=  string.o memcmp_$(BITS).o
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
deleted file mode 100644
index e554d1357f2f..
--- a/arch/powerpc/lib/inst.c
+++ /dev/null
@@ -1,26 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- *  Copyright 2020, IBM Corporation.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (err)
-   return err;
-   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
-   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
diff --git a/arch/powerpc/mm/maccess.c b/arch/powerpc/mm/maccess.c
index fa9a7a718fc6..a3c30a884076 100644
--- a/arch/powerpc/mm/maccess.c
+++ b/arch/powerpc/mm/maccess.c
@@ -3,7 +3,28 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
return is_kernel_addr((unsigned long)unsafe_src);
 }
+
+int copy_inst_from_kernel_nofault(struct ppc_inst *inst, struct ppc_inst *src)
+{
+   unsigned int val, suffix;
+   int err;
+
+   err = copy_from_kernel_nofault(&val, src, sizeof(val));
+   if (err)
+   return err;
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
+   *inst = ppc_inst_prefix(val, suffix);
+   } else {
+   *inst = ppc_inst(val);
+   }
+   return err;
+}
-- 
2.25.0



[PATCH v3 2/4] powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64

2021-04-14 Thread Christophe Leroy
We have two independent versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.

The PPC32 version is identical to the first part of the PPC64 version.
The remaining part of the PPC64 version is not relevant for PPC32, but
not contradictory, so we can easily have a common function with
the PPC64 part opted out via an IS_ENABLED(CONFIG_PPC64) check.

The only need is to add a version of ppc_inst_prefix() for PPC32.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  2 ++
 arch/powerpc/lib/inst.c | 17 +
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 2902d4e6a363..a40c3913a4a3 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -102,6 +102,8 @@ static inline bool ppc_inst_equal(struct ppc_inst x, struct 
ppc_inst y)
 
 #define ppc_inst(x) ((struct ppc_inst){ .val = x })
 
+#define ppc_inst_prefix(x, y) ppc_inst(x)
+
 static inline bool ppc_inst_prefixed(struct ppc_inst x)
 {
return false;
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index c57b3548de37..0dff3ac2d45f 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC64
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -18,7 +17,7 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
err = copy_from_kernel_nofault(&val, src, sizeof(val));
if (err)
return err;
-   if (get_op(val) == OP_PREFIX) {
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
*inst = ppc_inst_prefix(val, suffix);
} else {
@@ -26,17 +25,3 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
}
return err;
 }
-#else /* !CONFIG_PPC64 */
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-#endif /* CONFIG_PPC64 */
-- 
2.25.0



[PATCH v3 1/4] powerpc: Remove probe_user_read_inst()

2021-04-14 Thread Christophe Leroy
Its name comes from the former probe_user_read() function.
That function is now called copy_from_user_nofault().

probe_user_read_inst() uses copy_from_user_nofault() to read only
a few bytes, which is suboptimal.

It does the same as get_user_inst() but in addition disables
page faults.

But on the other hand, it is not used for the time being. So remove it
for now. If one day it is really needed, we can give it a new name
more in line with today's naming, and implement it using get_user_inst()

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  3 ---
 arch/powerpc/lib/inst.c | 31 ---
 2 files changed, 34 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 19e18af2fac9..2902d4e6a363 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -175,9 +175,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip);
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src);
 
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index 9cc17eb62462..c57b3548de37 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -9,24 +9,6 @@
 #include 
 
 #ifdef CONFIG_PPC64
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (err)
-   return err;
-   if (get_op(val) == OP_PREFIX) {
-   err = copy_from_user_nofault(&suffix, (void __user *)nip + 4, 
4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -45,19 +27,6 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
return err;
 }
 #else /* !CONFIG_PPC64 */
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
-- 
2.25.0



Re: [PATCH v2 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-14 Thread Christophe Leroy




Le 14/04/2021 à 07:23, Aneesh Kumar K.V a écrit :

Christophe Leroy  writes:


When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().


At first glance I read it as copy from kernel nofault instruction.
How about copy_inst_from_kernel_nofault()?


Yes good idea.

Christophe


Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread Christophe Leroy




Le 14/04/2021 à 14:24, Segher Boessenkool a écrit :

On Wed, Apr 14, 2021 at 12:01:21PM +1000, Nicholas Piggin wrote:

Would be nice if we could let the compiler deal with it all...

static inline unsigned long lr(unsigned long *mem)
{
 unsigned long val;

 /*
  * This doesn't clobber memory but want to avoid memory operations
  * moving ahead of it
  */
 asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : "memory");

 return val;
}


(etc.)

That can not work reliably: the compiler can put random instructions
between the larx and stcx. this way, and you then do not have guaranteed
forward progress anymore.  It can put the two in different routines
(after inlining and other interprocedural optimisations), duplicate
them, make a different number of copies of them, etc.

Nothing of that is okay if you want to guarantee forward progress on all
implementations, and also not if you want to have good performance
everywhere (or anywhere even).  Unfortunately you have to write all
larx/stcx. loops as one block of assembler, so that you know exactly
what instructions will end up in your binary.

If you don't, it will fail mysteriously after random recompilations, or
have performance degradations, etc.  You don't want to go there :-)



Could the kernel use GCC builtin atomic functions instead ?

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
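
For instance something like this (completely untested sketch), letting GCC 
emit the lwarx/stwcx. loop itself:

	static inline int atomic_add_return_relaxed_builtin(int a, int *p)
	{
		/* GCC generates the larx/stcx. retry loop on powerpc. */
		return __atomic_add_fetch(p, a, __ATOMIC_RELAXED);
	}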




Re: [PATCH] init: consolidate trap_init()

2021-04-14 Thread Christophe Leroy




Le 14/04/2021 à 10:58, Jisheng Zhang a écrit :

Many architectures implement the trap_init() as NOP, since there is
no such default for trap_init(), this empty stub is duplicated among
these architectures. Provide a generic but weak NOP implementation
to drop the empty stubs of trap_init() in these architectures.


You define the weak function in the __init section.

Most but not all architectures had it in the __init section.

The remaining ones may not have it defined in the __init section. For instance, look at the one in the 
alpha architecture.


Have you checked that this is not a problem? It would be good to say something about it in the commit 
description.





Signed-off-by: Jisheng Zhang 
---
  arch/arc/kernel/traps.c  |  5 -
  arch/arm/kernel/traps.c  |  5 -
  arch/h8300/kernel/traps.c| 13 -
  arch/hexagon/kernel/traps.c  |  4 
  arch/nds32/kernel/traps.c|  5 -
  arch/nios2/kernel/traps.c|  5 -
  arch/openrisc/kernel/traps.c |  5 -
  arch/parisc/kernel/traps.c   |  4 
  arch/powerpc/kernel/traps.c  |  5 -
  arch/riscv/kernel/traps.c|  5 -
  arch/um/kernel/trap.c|  4 
  init/main.c  |  2 ++
  12 files changed, 2 insertions(+), 60 deletions(-)

diff --git a/init/main.c b/init/main.c
index 53b278845b88..4bdbe2928530 100644
--- a/init/main.c
+++ b/init/main.c
@@ -790,6 +790,8 @@ static inline void initcall_debug_enable(void)
  }
  #endif
  
+void __init __weak trap_init(void) { }

+


I think in a C file we don't try to save space as much as in a header file.

I would prefer something like:


void __init __weak trap_init(void)
{
}



  /* Report memory auto-initialization states for this boot. */
  static void __init report_meminit(void)
  {



Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-14 Thread Christophe Leroy




Le 14/04/2021 à 07:59, Anshuman Khandual a écrit :



On 4/14/21 10:52 AM, Christophe Leroy wrote:



Le 14/04/2021 à 04:54, Anshuman Khandual a écrit :

Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
for those platforms which would override generic default FIRST_USER_ADDRESS
value 0UL. This makes it much cleaner with reduced code.

Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Cc: x...@kernel.org
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
   arch/alpha/include/asm/pgtable.h | 1 -
   arch/arc/include/asm/pgtable.h   | 6 --
   arch/arm/Kconfig | 1 +
   arch/arm64/include/asm/pgtable.h | 2 --
   arch/csky/include/asm/pgtable.h  | 1 -
   arch/hexagon/include/asm/pgtable.h   | 3 ---
   arch/ia64/include/asm/pgtable.h  | 1 -
   arch/m68k/include/asm/pgtable_mm.h   | 1 -
   arch/microblaze/include/asm/pgtable.h    | 2 --
   arch/mips/include/asm/pgtable-32.h   | 1 -
   arch/mips/include/asm/pgtable-64.h   | 1 -
   arch/nds32/Kconfig   | 1 +
   arch/nios2/include/asm/pgtable.h | 2 --
   arch/openrisc/include/asm/pgtable.h  | 1 -
   arch/parisc/include/asm/pgtable.h    | 2 --
   arch/powerpc/include/asm/book3s/pgtable.h    | 1 -
   arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
   arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
   arch/riscv/include/asm/pgtable.h | 2 --
   arch/s390/include/asm/pgtable.h  | 2 --
   arch/sh/include/asm/pgtable.h    | 2 --
   arch/sparc/include/asm/pgtable_32.h  | 1 -
   arch/sparc/include/asm/pgtable_64.h  | 3 ---
   arch/um/include/asm/pgtable-2level.h | 1 -
   arch/um/include/asm/pgtable-3level.h | 1 -
   arch/x86/include/asm/pgtable_types.h | 2 --
   arch/xtensa/include/asm/pgtable.h    | 1 -
   include/linux/mm.h   | 4 
   mm/Kconfig   | 4 
   29 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..47098ccd715e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
     void init_mm_internals(void);
   +#ifndef ARCH_HAS_FIRST_USER_ADDRESS


I guess you didn't test it . :)


In fact I did :) Though just booted it on arm64 and cross compiled on
multiple others platforms.



should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS


Right, meant that instead.




+#define FIRST_USER_ADDRESS    0UL
+#endif


But why do we need a config option at all for that ?

Why not just:

#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS    0UL
#endif


This sounds simpler. But just wondering, would not there be any possibility
of build problems due to compilation sequence between arch and generic code ?



For sure it has to be addressed carefully, but there is already a lot of stuff like that around
pgtable.h


For instance, pte_offset_kernel() has a generic definition in linux/pgtable.h based on whether it
is already defined or not.


Taking into account that FIRST_USER_ADDRESS is today in the architectures's asm/pgtables.h, I think 
putting the fallback definition in linux/pgtable.h would do the trick.


Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-13 Thread Christophe Leroy




On 14/04/2021 at 04:54, Anshuman Khandual wrote:

Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
for those platforms which would override generic default FIRST_USER_ADDRESS
value 0UL. This makes it much cleaner with reduced code.

Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Cc: x...@kernel.org
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
  arch/alpha/include/asm/pgtable.h | 1 -
  arch/arc/include/asm/pgtable.h   | 6 --
  arch/arm/Kconfig | 1 +
  arch/arm64/include/asm/pgtable.h | 2 --
  arch/csky/include/asm/pgtable.h  | 1 -
  arch/hexagon/include/asm/pgtable.h   | 3 ---
  arch/ia64/include/asm/pgtable.h  | 1 -
  arch/m68k/include/asm/pgtable_mm.h   | 1 -
  arch/microblaze/include/asm/pgtable.h| 2 --
  arch/mips/include/asm/pgtable-32.h   | 1 -
  arch/mips/include/asm/pgtable-64.h   | 1 -
  arch/nds32/Kconfig   | 1 +
  arch/nios2/include/asm/pgtable.h | 2 --
  arch/openrisc/include/asm/pgtable.h  | 1 -
  arch/parisc/include/asm/pgtable.h| 2 --
  arch/powerpc/include/asm/book3s/pgtable.h| 1 -
  arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
  arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
  arch/riscv/include/asm/pgtable.h | 2 --
  arch/s390/include/asm/pgtable.h  | 2 --
  arch/sh/include/asm/pgtable.h| 2 --
  arch/sparc/include/asm/pgtable_32.h  | 1 -
  arch/sparc/include/asm/pgtable_64.h  | 3 ---
  arch/um/include/asm/pgtable-2level.h | 1 -
  arch/um/include/asm/pgtable-3level.h | 1 -
  arch/x86/include/asm/pgtable_types.h | 2 --
  arch/xtensa/include/asm/pgtable.h| 1 -
  include/linux/mm.h   | 4 
  mm/Kconfig   | 4 
  29 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..47098ccd715e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
  
  void init_mm_internals(void);
  
+#ifndef ARCH_HAS_FIRST_USER_ADDRESS


I guess you didn't test it. :)

should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS


+#define FIRST_USER_ADDRESS 0UL
+#endif


But why do we need a config option at all for that ?

Why not just:

#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS  0UL
#endif


+
  #ifndef CONFIG_NEED_MULTIPLE_NODES/* Don't use mapnrs, do it properly */
  extern unsigned long max_mapnr;
  
diff --git a/mm/Kconfig b/mm/Kconfig

index 24c045b24b95..373fbe377075 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -806,6 +806,10 @@ config VMAP_PFN
  
  config ARCH_USES_HIGH_VMA_FLAGS

bool
+
+config ARCH_HAS_FIRST_USER_ADDRESS
+   bool
+
  config ARCH_HAS_PKEYS
bool
  



[PATCH v2 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-13 Thread Christophe Leroy
When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h|  3 +--
 arch/powerpc/kernel/align.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c | 22 +++---
 arch/powerpc/lib/inst.c|  3 +--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index a40c3913a4a3..a8ab0715f50e 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -177,7 +177,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src);
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src);
 
 #endif /* _ASM_POWERPC_INST_H */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index a97d5f1a3905..df3b55fec27d 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -311,7 +311,7 @@ int fix_alignment(struct pt_regs *regs)
CHECK_FULL_REGS(regs);
 
if (is_kernel_addr(regs->nip))
-   r = probe_kernel_read_inst(&instr, (void *)regs->nip);
+   r = copy_from_kernel_nofault_inst(&instr, (void *)regs->nip);
else
r = __get_user_instr(instr, (void __user *)regs->nip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 42761ebec9f7..9daa4eb812ce 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -68,7 +68,7 @@ ftrace_modify_code(unsigned long ip, struct ppc_inst old, 
struct ppc_inst new)
 */
 
/* read the text we want to modify */
-   if (probe_kernel_read_inst(&replaced, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&replaced, (void *)ip))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
@@ -130,7 +130,7 @@ __ftrace_make_nop(struct module *mod,
struct ppc_inst op, pop;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -164,7 +164,7 @@ __ftrace_make_nop(struct module *mod,
/* When using -mkernel_profile there is no load to jump over */
pop = ppc_inst(PPC_INST_NOP);
 
-   if (probe_kernel_read_inst(&op, (void *)(ip - 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip - 4))) {
pr_err("Fetching instruction at %lx failed.\n", ip - 4);
return -EFAULT;
}
@@ -197,7 +197,7 @@ __ftrace_make_nop(struct module *mod,
 * Check what is in the next instruction. We can see ld r2,40(r1), but
 * on first pass after boot we will see mflr r0.
 */
-   if (probe_kernel_read_inst(&op, (void *)(ip + 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip + 4))) {
pr_err("Fetching op failed.\n");
return -EFAULT;
}
@@ -349,7 +349,7 @@ static int setup_mcount_compiler_tramp(unsigned long tramp)
return -1;
 
/* New trampoline -- read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)tramp)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)tramp)) {
pr_debug("Fetching opcode failed.\n");
return -1;
}
@@ -399,7 +399,7 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, 
unsigned long addr)
struct ppc_inst op;
 
/* Read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -526,10 +526,10 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
struct module *mod = rec->arch.mod;
 
/* read where this goes */
-   if (probe_kernel_read_inst(op, ip))
+   if (copy_from_kernel_nofault_inst(op, ip))
return -EFAULT;
 
-   if (probe_kernel_read_inst(op + 1, ip + 4))
+   if (copy_from_kernel_nofault_inst(op + 1, ip + 4))
return -EFAULT;
 
if (!expected_nop_sequence(ip, op[0], op[1])) {
@@ -592,7 +592,7 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
unsigned long ip = rec->ip;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip))
return -EFAULT;
 
/* It should be pointing 

[PATCH v2 4/4] powerpc: Move copy_from_kernel_nofault_inst()

2021-04-13 Thread Christophe Leroy
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.

Since then, probe_kernel_read_inst() has been renamed
copy_from_kernel_nofault_inst(). And mm/maccess.c didn't exist at that
time. Today, mm/maccess.c is related to copy_from_kernel_nofault().

Move copy_from_kernel_nofault_inst() into mm/maccess.c

Signed-off-by: Christophe Leroy 
---
v2: Remove inst.o from Makefile
---
 arch/powerpc/lib/Makefile |  2 +-
 arch/powerpc/lib/inst.c   | 26 --
 arch/powerpc/mm/maccess.c | 21 +
 3 files changed, 22 insertions(+), 27 deletions(-)
 delete mode 100644 arch/powerpc/lib/inst.c

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d4efc182662a..f2c690ee75d1 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -16,7 +16,7 @@ CFLAGS_code-patching.o += -DDISABLE_BRANCH_PROFILING
 CFLAGS_feature-fixups.o += -DDISABLE_BRANCH_PROFILING
 endif
 
-obj-y += alloc.o code-patching.o feature-fixups.o pmem.o inst.o 
test_code-patching.o
+obj-y += alloc.o code-patching.o feature-fixups.o pmem.o test_code-patching.o
 
 ifndef CONFIG_KASAN
 obj-y  +=  string.o memcmp_$(BITS).o
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
deleted file mode 100644
index ec7f6bae8b3c..
--- a/arch/powerpc/lib/inst.c
+++ /dev/null
@@ -1,26 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- *  Copyright 2020, IBM Corporation.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (err)
-   return err;
-   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
-   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
diff --git a/arch/powerpc/mm/maccess.c b/arch/powerpc/mm/maccess.c
index fa9a7a718fc6..e75e74c52a8a 100644
--- a/arch/powerpc/mm/maccess.c
+++ b/arch/powerpc/mm/maccess.c
@@ -3,7 +3,28 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
return is_kernel_addr((unsigned long)unsafe_src);
 }
+
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
+{
+   unsigned int val, suffix;
+   int err;
+
+   err = copy_from_kernel_nofault(&val, src, sizeof(val));
+   if (err)
+   return err;
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
+   *inst = ppc_inst_prefix(val, suffix);
+   } else {
+   *inst = ppc_inst(val);
+   }
+   return err;
+}
-- 
2.25.0



[PATCH v2 2/4] powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64

2021-04-13 Thread Christophe Leroy
We have two independent versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.

The PPC32 version is identical to the first part of the PPC64 version.
The remaining part of the PPC64 version is not relevant for PPC32, but
not contradictory, so we can easily have a common function with
the PPC64 part opted out via an IS_ENABLED(CONFIG_PPC64).

The only need is to add a version of ppc_inst_prefix() for PPC32.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  2 ++
 arch/powerpc/lib/inst.c | 17 +
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 2902d4e6a363..a40c3913a4a3 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -102,6 +102,8 @@ static inline bool ppc_inst_equal(struct ppc_inst x, struct 
ppc_inst y)
 
 #define ppc_inst(x) ((struct ppc_inst){ .val = x })
 
+#define ppc_inst_prefix(x, y) ppc_inst(x)
+
 static inline bool ppc_inst_prefixed(struct ppc_inst x)
 {
return false;
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index c57b3548de37..0dff3ac2d45f 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC64
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -18,7 +17,7 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
err = copy_from_kernel_nofault(&val, src, sizeof(val));
if (err)
return err;
-   if (get_op(val) == OP_PREFIX) {
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
*inst = ppc_inst_prefix(val, suffix);
} else {
@@ -26,17 +25,3 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
}
return err;
 }
-#else /* !CONFIG_PPC64 */
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-#endif /* CONFIG_PPC64 */
-- 
2.25.0



[PATCH v2 1/4] powerpc: Remove probe_user_read_inst()

2021-04-13 Thread Christophe Leroy
Its name comes from former probe_user_read() function.
That function is now called copy_from_user_nofault().

probe_user_read_inst() uses copy_from_user_nofault() to read only
a few bytes. It is suboptimal.

It does the same as get_user_inst() but in addition disables
page faults.

But on the other hand, it is not used for the time being. So remove it
for now. If one day it is really needed, we can give it a new name
more in line with today's naming, and implement it using get_user_inst()
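If it ever comes back, one possible shape (purely hypothetical sketch, the name
and details are illustrative only and not part of this series) would be to wrap
the existing helper with page faults disabled:

int copy_inst_from_user_nofault(struct ppc_inst *inst, struct ppc_inst __user *nip)
{
	int err;

	pagefault_disable();
	err = __get_user_instr(*inst, nip);	/* reuse the get_user_instr() helper */
	pagefault_enable();

	return err;
}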

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  3 ---
 arch/powerpc/lib/inst.c | 31 ---
 2 files changed, 34 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 19e18af2fac9..2902d4e6a363 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -175,9 +175,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip);
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src);
 
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index 9cc17eb62462..c57b3548de37 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -9,24 +9,6 @@
 #include 
 
 #ifdef CONFIG_PPC64
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (err)
-   return err;
-   if (get_op(val) == OP_PREFIX) {
-   err = copy_from_user_nofault(&suffix, (void __user *)nip + 4, 
4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -45,19 +27,6 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
return err;
 }
 #else /* !CONFIG_PPC64 */
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
-- 
2.25.0



[PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-04-13 Thread Christophe Leroy
Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
flexibility to GCC.

For that add an entry to the exception table so that
program_check_exception() knows where to resume execution
after a WARNING.
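As an illustration of the mechanism only (a simplified sketch, not the actual
macro: the patch emits a trap plus an exception table entry rather than a
branch), asm goto lets the inline asm name a C label it may resume at, so GCC
can lay the cold path out of line:

static inline bool my_warn_on(unsigned long cond)
{
	asm goto("cmplwi	%0,0\n\t"
		 "bne-	%l[warned]"
		 : : "r" (cond) : "cc" : warned);
	return false;		/* likely, straight-line path */
warned:
	return true;		/* cold path */
}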

Here are two examples. The first one is done on PPC32 (which
benefits from the previous patch), the second is on PPC64.

unsigned long test(struct pt_regs *regs)
{
int ret;

WARN_ON(regs->msr & MSR_PR);

return regs->gpr[3];
}

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

03a8 :
 3a8:   81 23 00 84 lwz r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne 3bc 
 3b4:   80 63 00 0c lwz r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twuir0,0
 3c0:   80 63 00 0c lwz r3,12(r3)
 3c4:   4e 80 00 20 blr

0bf0 <.test9w>:
 bf0:   7c 89 00 74 cntlzd  r9,r4
 bf4:   79 29 d1 82 rldicl  r9,r9,58,6
 bf8:   0b 09 00 00 tdnei   r9,0
 bfc:   2c 24 00 00 cmpdi   r4,0
 c00:   41 82 00 0c beq c0c <.test9w+0x1c>
 c04:   7c 63 23 92 divdu   r3,r3,r4
 c08:   4e 80 00 20 blr

 c0c:   38 60 00 00 li  r3,0
 c10:   4e 80 00 20 blr

After the patch:

03a8 :
 3a8:   81 23 00 84 lwz r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne 3bc 
 3b4:   80 63 00 0c lwz r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twuir0,0

0c50 <.test9w>:
 c50:   7c 89 00 74 cntlzd  r9,r4
 c54:   79 29 d1 82 rldicl  r9,r9,58,6
 c58:   0b 09 00 00 tdnei   r9,0
 c5c:   7c 63 23 92 divdu   r3,r3,r4
 c60:   4e 80 00 20 blr

 c70:   38 60 00 00 li  r3,0
 c74:   4e 80 00 20 blr

In the first example, we see GCC doesn't need to duplicate what
happens after the trap.

In the second example, we see that GCC doesn't need to emit a test
and a branch in the likely path in addition to the trap.

We've got some WARN_ON() in the .softirqentry.text section, so that section
needs to be added to OTHER_TEXT_SECTIONS in modpost.c

Signed-off-by: Christophe Leroy 
---
v2: Fix build failure when CONFIG_BUG is not selected.
---
 arch/powerpc/include/asm/book3s/64/kup.h |  2 +-
 arch/powerpc/include/asm/bug.h   | 54 
 arch/powerpc/include/asm/extable.h   | 14 ++
 arch/powerpc/include/asm/ppc_asm.h   | 11 +
 arch/powerpc/kernel/entry_64.S   |  2 +-
 arch/powerpc/kernel/exceptions-64e.S |  2 +-
 arch/powerpc/kernel/misc_32.S|  2 +-
 arch/powerpc/kernel/traps.c  |  9 +++-
 scripts/mod/modpost.c|  2 +-
 9 files changed, 72 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 9700da3a4093..a22839cba32e 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -90,7 +90,7 @@
/* Prevent access to userspace using any key values */
LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED)
 999:   tdne\gpr1, \gpr2
-   EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
+   EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67)
 #endif
 .endm
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 101dea4eec8d..e22dc503fb2f 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -4,6 +4,7 @@
 #ifdef __KERNEL__
 
 #include 
+#include 
 
 #ifdef CONFIG_BUG
 
@@ -30,6 +31,11 @@
 .endm
 #endif /* verbose */
 
+.macro EMIT_WARN_ENTRY addr,file,line,flags
+   EX_TABLE(\addr,\addr+4)
+   EMIT_BUG_ENTRY \addr,\file,\line,\flags
+.endm
+
 #else /* !__ASSEMBLY__ */
 /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and
sizeof(struct bug_entry), respectively */
@@ -58,6 +64,16 @@
  "i" (sizeof(struct bug_entry)),   \
  ##__VA_ARGS__)
 
+#define WARN_ENTRY(insn, flags, label, ...)\
+   asm_volatile_goto(  \
+   "1: " insn "\n" \
+   EX_TABLE(1b, %l[label]) \
+   _EMIT_BUG_ENTRY \
+   : : "i" (__FILE__), "i" (__LINE__), \
+ "i&

[PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-04-13 Thread Christophe Leroy
powerpc BUG_ON() and WARN_ON() are based on using twnei instruction.

For catching simple conditions like a variable having value 0, this
is efficient because it does the test and the trap at the same time.
But most conditions used with BUG_ON or WARN_ON are more complex and
force GCC to format the condition into a 0 or 1 value in a register.
This will usually require 2 to 3 instructions.

The most efficient solution would be to use __builtin_trap() because
GCC is able to optimise the use of the different trap instructions
based on the requested condition, but this is complex if not
impossible for the following reasons:
- __builtin_trap() is a non-recoverable instruction, so it can't be
used for WARN_ON
- Knowing which line of code generated the trap would require the
analysis of DWARF information. This is not a feature we have today.

As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops")
the way WARN_ON() is implemented is suboptimal. That commit also
mentions an issue with 'long long' condition. It fixed it for
WARN_ON() but the same problem still exists today with BUG_ON() on
PPC32. It will be fixed by using the generic implementation.

By using the generic implementation, gcc will naturally generate a
branch to the unconditional trap generated by BUG().

As modern powerpc cores implement zero-cycle branches,
that's even more efficient.

And for the functions using WARN_ON() and its return, the test
on return from WARN_ON() is now also used for the WARN_ON() itself.

On PPC64 we don't want it because we want to be able to use CFAR
register to track how we entered the code that trapped. The CFAR
register would be clobbered by the branch.

A simple test function:

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

046c :
 46c:   7c 89 00 34 cntlzw  r9,r4
 470:   55 29 d9 7e rlwinm  r9,r9,27,5,31
 474:   0f 09 00 00 twnei   r9,0
 478:   2c 04 00 00 cmpwi   r4,0
 47c:   41 82 00 0c beq 488 
 480:   7c 63 23 96 divwu   r3,r3,r4
 484:   4e 80 00 20 blr

 488:   38 60 00 00 li  r3,0
 48c:   4e 80 00 20 blr

After the patch:

0468 :
 468:   2c 04 00 00 cmpwi   r4,0
 46c:   41 82 00 0c beq 478 
 470:   7c 63 23 96 divwu   r3,r3,r4
 474:   4e 80 00 20 blr

 478:   0f e0 00 00 twuir0,0
 47c:   38 60 00 00 li  r3,0
 480:   4e 80 00 20 blr

So we see before the patch we need 3 instructions on the likely path
to handle the WARN_ON(). With the patch the trap goes on the unlikely
path.

See below the difference at the entry of system_call_exception where
we have several BUG_ON(), although less impressive.

With the patch:

 :
   0:   81 6a 00 84 lwz r11,132(r10)
   4:   90 6a 00 88 stw r3,136(r10)
   8:   71 60 00 02 andi.   r0,r11,2
   c:   41 82 00 70 beq 7c 
  10:   71 60 40 00 andi.   r0,r11,16384
  14:   41 82 00 6c beq 80 
  18:   71 6b 80 00 andi.   r11,r11,32768
  1c:   41 82 00 68 beq 84 
  20:   94 21 ff e0 stwur1,-32(r1)
  24:   93 e1 00 1c stw r31,28(r1)
  28:   7d 8c 42 e6 mftbr12
...
  7c:   0f e0 00 00 twuir0,0
  80:   0f e0 00 00 twuir0,0
  84:   0f e0 00 00 twuir0,0

Without the patch:

 :
   0:   94 21 ff e0 stwur1,-32(r1)
   4:   93 e1 00 1c stw r31,28(r1)
   8:   90 6a 00 88 stw r3,136(r10)
   c:   81 6a 00 84 lwz r11,132(r10)
  10:   69 60 00 02 xorir0,r11,2
  14:   54 00 ff fe rlwinm  r0,r0,31,31,31
  18:   0f 00 00 00 twnei   r0,0
  1c:   69 60 40 00 xorir0,r11,16384
  20:   54 00 97 fe rlwinm  r0,r0,18,31,31
  24:   0f 00 00 00 twnei   r0,0
  28:   69 6b 80 00 xorir11,r11,32768
  2c:   55 6b 8f fe rlwinm  r11,r11,17,31,31
  30:   0f 0b 00 00 twnei   r11,0
  34:   7d 8c 42 e6 mftbr12

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index d1635ffbb179..101dea4eec8d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -68,7 +68,11 @@
BUG_ENTRY("twi 31, 0, 0", 0);   \
unreachable();  \
 } while (0)
+#define HAVE_ARCH_BUG
+
+#define __WARN_FLAGS(flags) BUG_ENTRY("twi 3

[PATCH v2 3/3] powerpc/atomics: Remove atomic_inc()/atomic_dec() and friends

2021-04-13 Thread Christophe Leroy
Now that atomic_add() and atomic_sub() handle immediate operands,
atomic_inc() and atomic_dec() have no added value compared to the
generic fallback which calls atomic_add(1) and atomic_sub(1).

Also remove atomic_inc_not_zero(), which falls back to
atomic_add_unless(), which itself falls back to
atomic_fetch_add_unless(), which now handles immediate operands.
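For reference, the generic fallbacks this relies on look roughly like this
(simplified from the generated include/linux/atomic-fallback.h):

static inline bool atomic_add_unless(atomic_t *v, int a, int u)
{
	return atomic_fetch_add_unless(v, a, u) != u;
}

static inline bool atomic_inc_not_zero(atomic_t *v)
{
	return atomic_add_unless(v, 1, 0);
}

so once atomic_fetch_add_unless() handles immediate operands, the whole chain
benefits for free.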

Signed-off-by: Christophe Leroy 
---
v2: New
---
 arch/powerpc/include/asm/atomic.h | 95 ---
 1 file changed, 95 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index eb1bdf14f67c..00ba5d9e837b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -118,71 +118,6 @@ ATOMIC_OPS(xor, xor, "", K)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-static __inline__ void atomic_inc(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc\n\
-   addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2 \n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_inc atomic_inc
-
-static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc_return_relaxed\n"
-"  addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-static __inline__ void atomic_dec(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec\n\
-   addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_dec atomic_dec
-
-static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec_return_relaxed\n"
-"  addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-#define atomic_inc_return_relaxed atomic_inc_return_relaxed
-#define atomic_dec_return_relaxed atomic_dec_return_relaxed
-
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_cmpxchg_relaxed(v, o, n) \
cmpxchg_relaxed(&((v)->counter), (o), (n))
@@ -252,36 +187,6 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, 
int a, int u)
 }
 #define atomic_fetch_add_unless atomic_fetch_add_unless
 
-/**
- * atomic_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1, so long as @v is non-zero.
- * Returns non-zero if @v was non-zero, and zero otherwise.
- */
-static __inline__ int atomic_inc_not_zero(atomic_t *v)
-{
-   int t1, t2;
-
-   __asm__ __volatile__ (
-   PPC_ATOMIC_ENTRY_BARRIER
-"1:lwarx   %0,0,%2 # atomic_inc_not_zero\n\
-   cmpwi   0,%0,0\n\
-   beq-2f\n\
-   addic   %1,%0,1\n"
-"  stwcx.  %1,0,%2\n\
-   bne-1b\n"
-   PPC_ATOMIC_EXIT_BARRIER
-   "\n\
-2:"
-   : "=&r" (t1), "=&r" (t2)
-   : "r" (&v->counter)
-   : "cc", "xer", "memory");
-
-   return t1;
-}
-#define atomic_inc_not_zero(v) atomic_inc_not_zero((v))
-
 /*
  * Atomically test *v and decrement if it is greater than 0.
  * The function returns the old value of *v minus 1, even if
-- 
2.25.0



[PATCH v2 1/3] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy
Today we get the following code generation for bitops like
set or clear bit:

c0009fe0:   39 40 08 00 li  r10,2048
c0009fe4:   7c e0 40 28 lwarx   r7,0,r8
c0009fe8:   7c e7 53 78 or  r7,r7,r10
c0009fec:   7c e0 41 2d stwcx.  r7,0,r8

c000d568:   39 00 18 00 li  r8,6144
c000d56c:   7c c0 38 28 lwarx   r6,0,r7
c000d570:   7c c6 40 78 andcr6,r6,r8
c000d574:   7c c0 39 2d stwcx.  r6,0,r7

Most set bits are constant on lower 16 bits, so it can easily
be replaced by the "immediate" version of the operation. Allow
GCC to choose between the normal or immediate form.

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' for
when all bits to be cleared are consecutive. For this we detect
the number of transitions from 0 to 1 in the mask. This is done
by anding the mask with its complement rotated left by 1 bit. If this
operation provides a number which is a power of 2, it means there
is only one transition from 0 to 1 in the number, so all 1 bits are
consecutive.
Can't use rol32() which is not defined yet, so do a
raw ((x << 1) | (x >> 31)). For the power of 2, can't use
is_power_of_2() for the same reason, but it can also be easily encoded
as  (mask & (mask - 1)) and even the 0 case which is not a power of
two is acceptable for us.
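As a quick illustration (a stand-alone user-space sketch, not kernel code),
the check described above behaves as follows:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* True when all 1 bits of x form a single run (wrapping allowed), i.e.
 * when those bits can be cleared with a single rlwinm. */
static bool is_rlwinm_mask_valid(uint32_t x)
{
	x = x & ~((x << 1) | (x >> 31));	/* flag the 0 -> 1 transitions */

	return !(x & (x - 1));			/* at most one transition */
}

int main(void)
{
	printf("%d\n", is_rlwinm_mask_valid(0x00000ff0));	/* 1: one run of bits */
	printf("%d\n", is_rlwinm_mask_valid(0x80000001));	/* 1: one run, wrapping */
	printf("%d\n", is_rlwinm_mask_valid(0x00000505));	/* 0: several runs */
	return 0;
}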

On 64 bits we don't have any equivalent single operation, we'd need
two 'rldicl' so it is not worth it.

With this patch we get:

c0009fe0:   7d 00 50 28 lwarx   r8,0,r10
c0009fe4:   61 08 08 00 ori r8,r8,2048
c0009fe8:   7d 00 51 2d stwcx.  r8,0,r10

c000d558:   7c e0 40 28 lwarx   r7,0,r8
c000d55c:   54 e7 05 64 rlwinm  r7,r7,0,21,18
c000d560:   7c e0 41 2d stwcx.  r7,0,r8

On pmac32_defconfig, it reduces the text by approx 10 kbytes.

Signed-off-by: Christophe Leroy 
---
v2:
- Use "n" instead of "i" as constraint for the rlwinm mask
- Improve mask verification to handle more than single bit masks
---
 arch/powerpc/include/asm/bitops.h | 85 ---
 1 file changed, 77 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 299ab33505a6..baa7666c1094 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -71,19 +71,57 @@ static inline void fn(unsigned long mask,   \
__asm__ __volatile__ (  \
prefix  \
 "1:"   PPC_LLARX(%0,0,%3,0) "\n"   \
-   stringify_in_c(op) "%0,%0,%2\n" \
+   #op "%I2 %0,%0,%2\n"\
PPC_STLCX "%0,0,%3\n"   \
"bne- 1b\n" \
: "=&r" (old), "+m" (*p)\
-   : "r" (mask), "r" (p)   \
+   : "rK" (mask), "r" (p)  \
: "cc", "memory");  \
 }
 
 DEFINE_BITOP(set_bits, or, "")
-DEFINE_BITOP(clear_bits, andc, "")
-DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
+static __always_inline bool is_rlwinm_mask_valid(unsigned long x)
+{
+   x = x & ~((x << 1) | (x >> 31));/* Flag transitions from 0 to 1 */
+
+   return !(x & (x - 1));  /* Is there only one transition */
+}
+
+#define DEFINE_CLROP(fn, prefix)   \
+static inline void fn(unsigned long mask, volatile unsigned long *_p)  \
+{  \
+   unsigned long old;  \
+   unsigned long *p = (unsigned long *)_p; \
+   \
+   if (IS_ENABLED(CONFIG_PPC32) && \
+   __builtin_constant_p(mask) && is_rlwinm_mask_valid(mask)) { \
+   asm volatile (  \
+   prefix  \
+   "1:""lwarx  %0,0,%3\n"  \
+   "rlwinm %0,%0,0,%2\n"   \
+   "stwcx. %0,0,%3\n"  \
+   "bne- 1b\n" \
+   : "=&r" (old), "+m" (*p)\
+   : "n" (~mask), "r" (p)  \
+ 

[PATCH v2 2/3] powerpc/atomics: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy
Today we get the following code generation for atomic operations:

c001bb2c:   39 20 00 01 li  r9,1
c001bb30:   7d 40 18 28 lwarx   r10,0,r3
c001bb34:   7d 09 50 50 subfr8,r9,r10
c001bb38:   7d 00 19 2d stwcx.  r8,0,r3

c001c7a8:   39 40 00 01 li  r10,1
c001c7ac:   7d 00 18 28 lwarx   r8,0,r3
c001c7b0:   7c ea 42 14 add r7,r10,r8
c001c7b4:   7c e0 19 2d stwcx.  r7,0,r3

By allowing GCC to choose between immediate or regular operation,
we get:

c001bb2c:   7d 20 18 28 lwarx   r9,0,r3
c001bb30:   39 49 ff ff addir10,r9,-1
c001bb34:   7d 40 19 2d stwcx.  r10,0,r3
--
c001c7a4:   7d 40 18 28 lwarx   r10,0,r3
c001c7a8:   39 0a 00 01 addir8,r10,1
c001c7ac:   7d 00 19 2d stwcx.  r8,0,r3

For "and", the dot form has to be used because "andi" doesn't exist.

For logical operations we use unsigned 16 bits immediate.
For arithmetic operations we use signed 16 bits immediate.
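To illustrate the trick (a sketch only, not the kernel macros): the constraint
alternatives let GCC pick either a register or an immediate, and the "%I"
output modifier appends "i" to the mnemonic when the immediate alternative is
chosen. "K" is an unsigned 16-bit constant, "I" a signed one, and "b" keeps
r0 out of the addi source operand, which would otherwise be read as literal 0:

static inline unsigned long or_reg_or_imm(unsigned long x, unsigned long m)
{
	asm("or%I1 %0,%0,%1" : "+r" (x) : "rK" (m));	/* "or" or "ori" */
	return x;
}

static inline long add_reg_or_imm(long x, long a)
{
	asm("add%I1 %0,%0,%1" : "+b" (x) : "rI" (a));	/* "add" or "addi" */
	return x;
}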

On pmac32_defconfig, it reduces the text by approx another 8 kbytes.

Signed-off-by: Christophe Leroy 
Acked-by: Segher Boessenkool 
---
v2: Use "addc/addic"
---
 arch/powerpc/include/asm/atomic.h | 56 +++
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 61c6e8b200e8..eb1bdf14f67c 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -37,62 +37,62 @@ static __inline__ void atomic_set(atomic_t *v, int i)
__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"UPD_CONSTR(v->counter) : 
"r"(i));
 }
 
-#define ATOMIC_OP(op, asm_op)  \
+#define ATOMIC_OP(op, asm_op, suffix, sign, ...)   \
 static __inline__ void atomic_##op(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "\n" \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3 \n" \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
 }  \
 
-#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op, suffix, sign, ...)
\
 static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
\
return t;   \
 }
 
-#define ATOMIC_FETCH_OP_RELAXED(op, asm_op)\
+#de

Re: [PATCH v1 2/2] powerpc/atomics: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy




On 13/04/2021 at 00:08, Segher Boessenkool wrote:

Hi!

On Thu, Apr 08, 2021 at 03:33:45PM +, Christophe Leroy wrote:

+#define ATOMIC_OP(op, asm_op, dot, sign)   \
  static __inline__ void atomic_##op(int a, atomic_t *v)
\
  { \
int t;  \
\
__asm__ __volatile__(   \
  "1:  lwarx   %0,0,%3 # atomic_" #op "\n"  \
-   #asm_op " %0,%2,%0\n" \
+   #asm_op "%I2" dot " %0,%0,%2\n" \
  "stwcx.  %0,0,%3 \n"\
  "bne-1b\n"  \
-   : "=&r" (t), "+m" (v->counter)   \
-   : "r" (a), "r" (&v->counter) \
+   : "=&r" (t), "+m" (v->counter)   \
+   : "r"#sign (a), "r" (&v->counter)\
: "cc");  \
  } \


You need "b" (instead of "r") only for "addi".  You can use "addic"
instead, which clobbers XER[CA], but *all* inline asm does, so that is
not a downside here (it is also not slower on any CPU that matters).


@@ -238,14 +238,14 @@ static __inline__ int atomic_fetch_add_unless(atomic_t 
*v, int a, int u)
  "1:  lwarx   %0,0,%1 # atomic_fetch_add_unless\n\
cmpw0,%0,%3 \n\
beq 2f \n\
-   add %0,%2,%0 \n"
+   add%I2  %0,%0,%2 \n"
  "stwcx.  %0,0,%1 \n\
bne-1b \n"
PPC_ATOMIC_EXIT_BARRIER
-" subf%0,%2,%0 \n\
+" sub%I2  %0,%0,%2 \n\
  2:"
-   : "=&r" (t)
-   : "r" (&v->counter), "r" (a), "r" (u)
+   : "=&r" (t)
+   : "r" (&v->counter), "rI" (a), "r" (u)
: "cc", "memory");


Same here.


Yes, I thought about addic; at first I didn't find a solution because I forgot
the matching 'addc'.

Now with the couple addc/addic it works well.

Thanks



Nice patches!

Acked-by: Segher Boessenkool 



Christophe


Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy




On 12/04/2021 at 23:54, Segher Boessenkool wrote:

Hi!

On Thu, Apr 08, 2021 at 03:33:44PM +, Christophe Leroy wrote:

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' for
when all bits to be cleared are consecutive.


Also on 64-bits, as long as both the top and bottom bits are in the low
32-bit half (for 32 bit mode, it can wrap as well).


Yes. But here we are talking about clearing a few bits, all other ones must remain unchanged. An 
rlwinm on PPC64 will always clear the upper part, which is unlikely to be what we want.





For the time being only
handle the single bit case, which we detect by checking whether the
mask is a power of two.


You could look at rs6000_is_valid_mask in GCC:
   
<https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/rs6000/rs6000.c;h=48b8efd732b251c059628096314848305deb0c0b;hb=HEAD#l11148>
used by rs6000_is_valid_and_mask immediately after it.  You probably
want to allow only rlwinm in your case, and please note this checks if
something is a valid mask, not the inverse of a valid mask (as you
want here).


This check looks more complex than what I need. It is used for both rlw... and rld..., and it 
calculates the operands. The only thing I need is to validate the mask.
I found a way: by anding the mask with the complement of itself rotated left by 1 bit, we
identify the transitions from 0 to 1. If the result is a power of 2, it means there's only one
transition, so the mask is as expected.


So I did that in v2.




So yes this is pretty involved :-)

Your patch looks good btw.  But please use "n", not "i", as constraint?


Done.

Christophe


[PATCH 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-04-12 Thread Christophe Leroy
Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
flexibility to GCC.

For that add an entry to the exception table so that
program_check_exception() knows where to resume execution
after a WARNING.

Here are two examples. The first one is done on PPC32 (which
benefits from the previous patch), the second is on PPC64.

unsigned long test(struct pt_regs *regs)
{
int ret;

WARN_ON(regs->msr & MSR_PR);

return regs->gpr[3];
}

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

03a8 :
 3a8:   81 23 00 84 lwz r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne 3bc 
 3b4:   80 63 00 0c lwz r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twuir0,0
 3c0:   80 63 00 0c lwz r3,12(r3)
 3c4:   4e 80 00 20 blr

0bf0 <.test9w>:
 bf0:   7c 89 00 74 cntlzd  r9,r4
 bf4:   79 29 d1 82 rldicl  r9,r9,58,6
 bf8:   0b 09 00 00 tdnei   r9,0
 bfc:   2c 24 00 00 cmpdi   r4,0
 c00:   41 82 00 0c beq c0c <.test9w+0x1c>
 c04:   7c 63 23 92 divdu   r3,r3,r4
 c08:   4e 80 00 20 blr

 c0c:   38 60 00 00 li  r3,0
 c10:   4e 80 00 20 blr

After the patch:

03a8 :
 3a8:   81 23 00 84 lwz r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne 3bc 
 3b4:   80 63 00 0c lwz r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twuir0,0

0c50 <.test9w>:
 c50:   7c 89 00 74 cntlzd  r9,r4
 c54:   79 29 d1 82 rldicl  r9,r9,58,6
 c58:   0b 09 00 00 tdnei   r9,0
 c5c:   7c 63 23 92 divdu   r3,r3,r4
 c60:   4e 80 00 20 blr

 c70:   38 60 00 00 li  r3,0
 c74:   4e 80 00 20 blr

In the first example, we see GCC doesn't need to duplicate what
happens after the trap.

In the second example, we see that GCC doesn't need to emit a test
and a branch in the likely path in addition to the trap.

We've got some WARN_ON() in the .softirqentry.text section, so that section
needs to be added to OTHER_TEXT_SECTIONS in modpost.c

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/kup.h |  2 +-
 arch/powerpc/include/asm/bug.h   | 51 +++-
 arch/powerpc/include/asm/extable.h   | 14 +++
 arch/powerpc/include/asm/ppc_asm.h   | 11 +
 arch/powerpc/kernel/entry_64.S   |  2 +-
 arch/powerpc/kernel/exceptions-64e.S |  2 +-
 arch/powerpc/kernel/misc_32.S|  2 +-
 arch/powerpc/kernel/traps.c  |  9 -
 scripts/mod/modpost.c|  2 +-
 9 files changed, 69 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 9700da3a4093..a22839cba32e 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -90,7 +90,7 @@
/* Prevent access to userspace using any key values */
LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED)
 999:   tdne\gpr1, \gpr2
-   EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
+   EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67)
 #endif
 .endm
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 101dea4eec8d..d92afdbd4449 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -4,6 +4,7 @@
 #ifdef __KERNEL__
 
 #include 
+#include 
 
 #ifdef CONFIG_BUG
 
@@ -30,6 +31,11 @@
 .endm
 #endif /* verbose */
 
+.macro EMIT_WARN_ENTRY addr,file,line,flags
+   EX_TABLE(\addr,\addr+4)
+   EMIT_BUG_ENTRY \addr,\file,\line,\flags
+.endm
+
 #else /* !__ASSEMBLY__ */
 /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and
sizeof(struct bug_entry), respectively */
@@ -58,6 +64,16 @@
  "i" (sizeof(struct bug_entry)),   \
  ##__VA_ARGS__)
 
+#define WARN_ENTRY(insn, flags, label, ...)\
+   asm_volatile_goto(  \
+   "1: " insn "\n" \
+   EX_TABLE(1b, %l[label]) \
+   _EMIT_BUG_ENTRY \
+   : : "i" (__FILE__), "i" (__LINE__), \
+ "i" (flags),  \
+ "i&quo

[PATCH 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-04-12 Thread Christophe Leroy
powerpc BUG_ON() and WARN_ON() are based on using twnei instruction.

For catching simple conditions like a variable having value 0, this
is efficient because it does the test and the trap at the same time.
But most conditions used with BUG_ON or WARN_ON are more complex and
force GCC to format the condition into a 0 or 1 value in a register.
This will usually require 2 to 3 instructions.

The most efficient solution would be to use __builtin_trap() because
GCC is able to optimise the use of the different trap instructions
based on the requested condition, but this is complex if not
impossible for the following reasons:
- __builtin_trap() is a non-recoverable instruction, so it can't be
used for WARN_ON
- Knowing which line of code generated the trap would require the
analysis of DWARF information. This is not a feature we have today.

As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops")
the way WARN_ON() is implemented is suboptimal. That commit also
mentions an issue with 'long long' condition. It fixed it for
WARN_ON() but the same problem still exists today with BUG_ON() on
PPC32. It will be fixed by using the generic implementation.

By using the generic implementation, gcc will naturally generate a
branch to the unconditional trap generated by BUG().

As modern powerpc cores implement zero-cycle branches,
that's even more efficient.

And for the functions using WARN_ON() and its return, the test
on return from WARN_ON() is now also used for the WARN_ON() itself.

On PPC64 we don't want it because we want to be able to use CFAR
register to track how we entered the code that trapped. The CFAR
register would be clobbered by the branch.

A simple test function:

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

046c :
 46c:   7c 89 00 34 cntlzw  r9,r4
 470:   55 29 d9 7e rlwinm  r9,r9,27,5,31
 474:   0f 09 00 00 twnei   r9,0
 478:   2c 04 00 00 cmpwi   r4,0
 47c:   41 82 00 0c beq 488 
 480:   7c 63 23 96 divwu   r3,r3,r4
 484:   4e 80 00 20 blr

 488:   38 60 00 00 li  r3,0
 48c:   4e 80 00 20 blr

After the patch:

0468 :
 468:   2c 04 00 00 cmpwi   r4,0
 46c:   41 82 00 0c beq 478 
 470:   7c 63 23 96 divwu   r3,r3,r4
 474:   4e 80 00 20 blr

 478:   0f e0 00 00 twuir0,0
 47c:   38 60 00 00 li  r3,0
 480:   4e 80 00 20 blr

So we see before the patch we need 3 instructions on the likely path
to handle the WARN_ON(). With the patch the trap goes on the unlikely
path.

See below the difference at the entry of system_call_exception where
we have several BUG_ON(), although less impressive.

With the patch:

 :
   0:   81 6a 00 84 lwz r11,132(r10)
   4:   90 6a 00 88 stw r3,136(r10)
   8:   71 60 00 02 andi.   r0,r11,2
   c:   41 82 00 70 beq 7c 
  10:   71 60 40 00 andi.   r0,r11,16384
  14:   41 82 00 6c beq 80 
  18:   71 6b 80 00 andi.   r11,r11,32768
  1c:   41 82 00 68 beq 84 
  20:   94 21 ff e0 stwur1,-32(r1)
  24:   93 e1 00 1c stw r31,28(r1)
  28:   7d 8c 42 e6 mftbr12
...
  7c:   0f e0 00 00 twuir0,0
  80:   0f e0 00 00 twuir0,0
  84:   0f e0 00 00 twuir0,0

Without the patch:

 :
   0:   94 21 ff e0 stwur1,-32(r1)
   4:   93 e1 00 1c stw r31,28(r1)
   8:   90 6a 00 88 stw r3,136(r10)
   c:   81 6a 00 84 lwz r11,132(r10)
  10:   69 60 00 02 xorir0,r11,2
  14:   54 00 ff fe rlwinm  r0,r0,31,31,31
  18:   0f 00 00 00 twnei   r0,0
  1c:   69 60 40 00 xorir0,r11,16384
  20:   54 00 97 fe rlwinm  r0,r0,18,31,31
  24:   0f 00 00 00 twnei   r0,0
  28:   69 6b 80 00 xorir11,r11,32768
  2c:   55 6b 8f fe rlwinm  r11,r11,17,31,31
  30:   0f 0b 00 00 twnei   r11,0
  34:   7d 8c 42 e6 mftbr12

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index d1635ffbb179..101dea4eec8d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -68,7 +68,11 @@
BUG_ENTRY("twi 31, 0, 0", 0);   \
unreachable();  \
 } while (0)
+#define HAVE_ARCH_BUG
+
+#define __WARN_FLAGS(flags) BUG_ENTRY("twi 3

[PATCH 3/3] powerpc/ebpf32: Use standard function call for functions within 32M distance

2021-04-12 Thread Christophe Leroy
If the target of a function call is within 32 Mbytes distance, use a
standard function call with 'bl' instead of the 'lis/ori/mtlr/blrl' sequence.

In the first pass, no memory has been allocated yet and the code
position is not known yet (image pointer is NULL). This pass is there
to calculate the amount of memory to allocate for the EBPF code, so
assume the 4-instruction sequence is required, so that enough memory
is allocated.
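For reference, the 32 Mbytes come from the 'bl' encoding: the I-form branch
carries a 24-bit word displacement (a signed 26-bit byte offset), so a direct
call only works within roughly +/- 32 MB of the call site. A minimal sketch of
that check:

static bool bl_reachable(unsigned long from, unsigned long to)
{
	long rel = (long)to - (long)from;

	return rel >= -0x2000000 && rel < 0x2000000;
}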

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/ppc-opcode.h |  1 +
 arch/powerpc/net/bpf_jit.h|  3 +++
 arch/powerpc/net/bpf_jit_comp32.c | 16 +++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 5b60020dc1f4..ac41776661e9 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -265,6 +265,7 @@
 #define PPC_INST_ORI   0x60000000
 #define PPC_INST_ORIS  0x64000000
 #define PPC_INST_BRANCH   0x48000000
+#define PPC_INST_BL    0x48000001
 #define PPC_INST_BRANCH_COND   0x40800000
 
 /* Prefixes */
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 776abef4d2a0..99fad093f43e 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -26,6 +26,9 @@
 /* Long jump; (unconditional 'branch') */
 #define PPC_JMP(dest)  EMIT(PPC_INST_BRANCH |\
 (((dest) - (ctx->idx * 4)) & 0x03fffffc))
+/* blr; (unconditional 'branch' with link) to absolute address */
+#define PPC_BL_ABS(dest)   EMIT(PPC_INST_BL |\
+(((dest) - (unsigned long)(image + 
ctx->idx)) & 0x03fffffc))
 /* "cond" here covers BO:BI fields. */
 #define PPC_BCC_SHORT(cond, dest)  EMIT(PPC_INST_BRANCH_COND |   \
 (((cond) & 0x3ff) << 16) |   \
diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index ef21b09df76e..bbb16099e8c7 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -187,11 +187,17 @@ void bpf_jit_build_epilogue(u32 *image, struct 
codegen_context *ctx)
 
 void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, u64 
func)
 {
-   /* Load function address into r0 */
-   EMIT(PPC_RAW_LIS(__REG_R0, IMM_H(func)));
-   EMIT(PPC_RAW_ORI(__REG_R0, __REG_R0, IMM_L(func)));
-   EMIT(PPC_RAW_MTLR(__REG_R0));
-   EMIT(PPC_RAW_BLRL());
+   s32 rel = (s32)func - (s32)(image + ctx->idx);
+
+   if (image && rel < 0x2000000 && rel >= -0x2000000) {
+   PPC_BL_ABS(func);
+   } else {
+   /* Load function address into r0 */
+   EMIT(PPC_RAW_LIS(__REG_R0, IMM_H(func)));
+   EMIT(PPC_RAW_ORI(__REG_R0, __REG_R0, IMM_L(func)));
+   EMIT(PPC_RAW_MTLR(__REG_R0));
+   EMIT(PPC_RAW_BLRL());
+   }
 }
 
 static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, 
u32 out)
-- 
2.25.0



[PATCH 2/3] powerpc/ebpf32: Rework 64 bits shifts to avoid tests and branches

2021-04-12 Thread Christophe Leroy
Re-implement BPF_ALU64 | BPF_{LSH/RSH/ARSH} | BPF_X with branchless
implementation copied from misc_32.S.
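The trick relies on the slw/srw semantics: they shift by the low 5 bits of the
amount and return 0 whenever the bit with value 32 is set in the amount. A
user-space model of the emitted left-shift sequence (illustration only):

static unsigned int slw(unsigned int x, unsigned int s) { return (s & 32) ? 0 : x << (s & 31); }
static unsigned int srw(unsigned int x, unsigned int s) { return (s & 32) ? 0 : x >> (s & 31); }

/* (hi:lo) <<= n, for n = 0..63, with no branch on n */
static void lsh64(unsigned int *hi, unsigned int *lo, unsigned int n)
{
	unsigned int t0 = srw(*lo, 32 - n);	/* subfic + srw */
	unsigned int t1 = slw(*lo, n + 32);	/* addi + slw   */

	*hi = slw(*hi, n) | t0 | t1;
	*lo = slw(*lo, n);
}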

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/net/bpf_jit_comp32.c | 39 +++
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index ca6fe1583460..ef21b09df76e 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -548,16 +548,15 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
EMIT(PPC_RAW_SLW(dst_reg, dst_reg, src_reg));
break;
case BPF_ALU64 | BPF_LSH | BPF_X: /* dst <<= src; */
-   EMIT(PPC_RAW_ADDIC_DOT(__REG_R0, src_reg, -32));
-   PPC_BCC_SHORT(COND_LT, (ctx->idx + 4) * 4);
-   EMIT(PPC_RAW_SLW(dst_reg_h, dst_reg, __REG_R0));
-   EMIT(PPC_RAW_LI(dst_reg, 0));
-   PPC_JMP((ctx->idx + 6) * 4);
+   bpf_set_seen_register(ctx, tmp_reg);
EMIT(PPC_RAW_SUBFIC(__REG_R0, src_reg, 32));
EMIT(PPC_RAW_SLW(dst_reg_h, dst_reg_h, src_reg));
+   EMIT(PPC_RAW_ADDI(tmp_reg, src_reg, 32));
EMIT(PPC_RAW_SRW(__REG_R0, dst_reg, __REG_R0));
-   EMIT(PPC_RAW_SLW(dst_reg, dst_reg, src_reg));
+   EMIT(PPC_RAW_SLW(tmp_reg, dst_reg, tmp_reg));
EMIT(PPC_RAW_OR(dst_reg_h, dst_reg_h, __REG_R0));
+   EMIT(PPC_RAW_SLW(dst_reg, dst_reg, src_reg));
+   EMIT(PPC_RAW_OR(dst_reg_h, dst_reg_h, tmp_reg));
break;
case BPF_ALU | BPF_LSH | BPF_K: /* (u32) dst <<= (u32) imm */
if (!imm)
@@ -585,16 +584,15 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
EMIT(PPC_RAW_SRW(dst_reg, dst_reg, src_reg));
break;
case BPF_ALU64 | BPF_RSH | BPF_X: /* dst >>= src */
-   EMIT(PPC_RAW_ADDIC_DOT(__REG_R0, src_reg, -32));
-   PPC_BCC_SHORT(COND_LT, (ctx->idx + 4) * 4);
-   EMIT(PPC_RAW_SRW(dst_reg, dst_reg_h, __REG_R0));
-   EMIT(PPC_RAW_LI(dst_reg_h, 0));
-   PPC_JMP((ctx->idx + 6) * 4);
-   EMIT(PPC_RAW_SUBFIC(0, src_reg, 32));
+   bpf_set_seen_register(ctx, tmp_reg);
+   EMIT(PPC_RAW_SUBFIC(__REG_R0, src_reg, 32));
EMIT(PPC_RAW_SRW(dst_reg, dst_reg, src_reg));
+   EMIT(PPC_RAW_ADDI(tmp_reg, src_reg, 32));
EMIT(PPC_RAW_SLW(__REG_R0, dst_reg_h, __REG_R0));
-   EMIT(PPC_RAW_SRW(dst_reg_h, dst_reg_h, src_reg));
+   EMIT(PPC_RAW_SRW(tmp_reg, dst_reg_h, tmp_reg));
EMIT(PPC_RAW_OR(dst_reg, dst_reg, __REG_R0));
+   EMIT(PPC_RAW_SRW(dst_reg_h, dst_reg_h, src_reg));
+   EMIT(PPC_RAW_OR(dst_reg, dst_reg, tmp_reg));
break;
case BPF_ALU | BPF_RSH | BPF_K: /* (u32) dst >>= (u32) imm */
if (!imm)
@@ -622,16 +620,17 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
EMIT(PPC_RAW_SRAW(dst_reg_h, dst_reg, src_reg));
break;
case BPF_ALU64 | BPF_ARSH | BPF_X: /* (s64) dst >>= src */
-   EMIT(PPC_RAW_ADDIC_DOT(__REG_R0, src_reg, -32));
-   PPC_BCC_SHORT(COND_LT, (ctx->idx + 4) * 4);
-   EMIT(PPC_RAW_SRAW(dst_reg, dst_reg_h, __REG_R0));
-   EMIT(PPC_RAW_SRAWI(dst_reg_h, dst_reg_h, 31));
-   PPC_JMP((ctx->idx + 6) * 4);
-   EMIT(PPC_RAW_SUBFIC(0, src_reg, 32));
+   bpf_set_seen_register(ctx, tmp_reg);
+   EMIT(PPC_RAW_SUBFIC(__REG_R0, src_reg, 32));
EMIT(PPC_RAW_SRW(dst_reg, dst_reg, src_reg));
EMIT(PPC_RAW_SLW(__REG_R0, dst_reg_h, __REG_R0));
-   EMIT(PPC_RAW_SRAW(dst_reg_h, dst_reg_h, src_reg));
+   EMIT(PPC_RAW_ADDI(tmp_reg, src_reg, 32));
EMIT(PPC_RAW_OR(dst_reg, dst_reg, __REG_R0));
+   EMIT(PPC_RAW_RLWINM(__REG_R0, tmp_reg, 0, 26, 26));
+   EMIT(PPC_RAW_SRAW(tmp_reg, dst_reg_h, tmp_reg));
+   EMIT(PPC_RAW_SRAW(dst_reg_h, dst_reg_h, src_reg));
+   EMIT(PPC_RAW_SLW(tmp_reg, tmp_reg, __REG_R0));
+   EMIT(PPC_RAW_OR(dst_reg, dst_reg, tmp_reg));
   

[PATCH 1/3] powerpc/ebpf32: Fix comment on BPF_ALU{64} | BPF_LSH | BPF_K

2021-04-12 Thread Christophe Leroy
Replace <<== by <<=

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/net/bpf_jit_comp32.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index 003843273b43..ca6fe1583460 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -559,12 +559,12 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, 
struct codegen_context *
EMIT(PPC_RAW_SLW(dst_reg, dst_reg, src_reg));
EMIT(PPC_RAW_OR(dst_reg_h, dst_reg_h, __REG_R0));
break;
-   case BPF_ALU | BPF_LSH | BPF_K: /* (u32) dst <<== (u32) imm */
+   case BPF_ALU | BPF_LSH | BPF_K: /* (u32) dst <<= (u32) imm */
if (!imm)
break;
EMIT(PPC_RAW_SLWI(dst_reg, dst_reg, imm));
break;
-   case BPF_ALU64 | BPF_LSH | BPF_K: /* dst <<== imm */
+   case BPF_ALU64 | BPF_LSH | BPF_K: /* dst <<= imm */
if (imm < 0)
return -EINVAL;
if (!imm)
-- 
2.25.0



[PATCH v1 4/4] powerpc: Move copy_from_kernel_nofault_inst()

2021-04-12 Thread Christophe Leroy
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.

Since then, probe_kernel_read_inst() has been renamed
copy_from_kernel_nofault_inst(). And mm/maccess.c didn't exist at that
time. Today, mm/maccess.c is related to copy_from_kernel_nofault().

Move copy_from_kernel_nofault_inst() into mm/maccess.c

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/lib/inst.c   | 26 --
 arch/powerpc/mm/maccess.c | 21 +
 2 files changed, 21 insertions(+), 26 deletions(-)
 delete mode 100644 arch/powerpc/lib/inst.c

diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
deleted file mode 100644
index ec7f6bae8b3c..
--- a/arch/powerpc/lib/inst.c
+++ /dev/null
@@ -1,26 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- *  Copyright 2020, IBM Corporation.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (err)
-   return err;
-   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
-   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
diff --git a/arch/powerpc/mm/maccess.c b/arch/powerpc/mm/maccess.c
index fa9a7a718fc6..e75e74c52a8a 100644
--- a/arch/powerpc/mm/maccess.c
+++ b/arch/powerpc/mm/maccess.c
@@ -3,7 +3,28 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
return is_kernel_addr((unsigned long)unsafe_src);
 }
+
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
+{
+   unsigned int val, suffix;
+   int err;
+
+   err = copy_from_kernel_nofault(&val, src, sizeof(val));
+   if (err)
+   return err;
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
+   *inst = ppc_inst_prefix(val, suffix);
+   } else {
+   *inst = ppc_inst(val);
+   }
+   return err;
+}
-- 
2.25.0



[PATCH v1 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-12 Thread Christophe Leroy
When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h|  3 +--
 arch/powerpc/kernel/align.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c | 22 +++---
 arch/powerpc/lib/inst.c|  3 +--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index a40c3913a4a3..a8ab0715f50e 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -177,7 +177,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src);
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src);
 
 #endif /* _ASM_POWERPC_INST_H */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index a97d5f1a3905..df3b55fec27d 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -311,7 +311,7 @@ int fix_alignment(struct pt_regs *regs)
CHECK_FULL_REGS(regs);
 
if (is_kernel_addr(regs->nip))
-   r = probe_kernel_read_inst(&instr, (void *)regs->nip);
+   r = copy_from_kernel_nofault_inst(&instr, (void *)regs->nip);
else
r = __get_user_instr(instr, (void __user *)regs->nip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 42761ebec9f7..9daa4eb812ce 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -68,7 +68,7 @@ ftrace_modify_code(unsigned long ip, struct ppc_inst old, 
struct ppc_inst new)
 */
 
/* read the text we want to modify */
-   if (probe_kernel_read_inst(&replaced, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&replaced, (void *)ip))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
@@ -130,7 +130,7 @@ __ftrace_make_nop(struct module *mod,
struct ppc_inst op, pop;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -164,7 +164,7 @@ __ftrace_make_nop(struct module *mod,
/* When using -mkernel_profile there is no load to jump over */
pop = ppc_inst(PPC_INST_NOP);
 
-   if (probe_kernel_read_inst(&op, (void *)(ip - 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip - 4))) {
pr_err("Fetching instruction at %lx failed.\n", ip - 4);
return -EFAULT;
}
@@ -197,7 +197,7 @@ __ftrace_make_nop(struct module *mod,
 * Check what is in the next instruction. We can see ld r2,40(r1), but
 * on first pass after boot we will see mflr r0.
 */
-   if (probe_kernel_read_inst(&op, (void *)(ip + 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip + 4))) {
pr_err("Fetching op failed.\n");
return -EFAULT;
}
@@ -349,7 +349,7 @@ static int setup_mcount_compiler_tramp(unsigned long tramp)
return -1;
 
/* New trampoline -- read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)tramp)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)tramp)) {
pr_debug("Fetching opcode failed.\n");
return -1;
}
@@ -399,7 +399,7 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, unsigned long addr)
struct ppc_inst op;
 
/* Read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -526,10 +526,10 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
struct module *mod = rec->arch.mod;
 
/* read where this goes */
-   if (probe_kernel_read_inst(op, ip))
+   if (copy_from_kernel_nofault_inst(op, ip))
return -EFAULT;
 
-   if (probe_kernel_read_inst(op + 1, ip + 4))
+   if (copy_from_kernel_nofault_inst(op + 1, ip + 4))
return -EFAULT;
 
if (!expected_nop_sequence(ip, op[0], op[1])) {
@@ -592,7 +592,7 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
unsigned long ip = rec->ip;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip))
return -EFAULT;
 
/* It should be pointing 

[PATCH v1 2/4] powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64

2021-04-12 Thread Christophe Leroy
We have two independent versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.

The PPC32 version is identical to the first part of the PPC64 version.
The remaining part of the PPC64 version is not relevant for PPC32, but
does not conflict with it either, so we can easily have a common
function with the PPC64-only part compiled out via
IS_ENABLED(CONFIG_PPC64).

The only missing piece is a PPC32 version of ppc_inst_prefix().
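
As an illustration (not part of the patch), with the PPC32 stub
ppc_inst_prefix(x, y) == ppc_inst(x) added below, IS_ENABLED(CONFIG_PPC64)
evaluates to 0 and the compiler drops the suffix read, so on PPC32 the
common function boils down to the same code as the PPC32-only version
removed below. The read_inst_ppc32() name is illustrative only:

static int read_inst_ppc32(struct ppc_inst *inst, struct ppc_inst *src)
{
	unsigned int val;
	int err;

	err = copy_from_kernel_nofault(&val, src, sizeof(val));
	if (!err)
		*inst = ppc_inst(val);	/* ppc_inst_prefix(val, suffix) == ppc_inst(val) here */

	return err;
}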

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  2 ++
 arch/powerpc/lib/inst.c | 17 +
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 2902d4e6a363..a40c3913a4a3 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -102,6 +102,8 @@ static inline bool ppc_inst_equal(struct ppc_inst x, struct ppc_inst y)
 
 #define ppc_inst(x) ((struct ppc_inst){ .val = x })
 
+#define ppc_inst_prefix(x, y) ppc_inst(x)
+
 static inline bool ppc_inst_prefixed(struct ppc_inst x)
 {
return false;
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index c57b3548de37..0dff3ac2d45f 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -8,7 +8,6 @@
#include <asm/inst.h>
#include <asm/ppc-opcode.h>
 
-#ifdef CONFIG_PPC64
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -18,7 +17,7 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
err = copy_from_kernel_nofault(&val, src, sizeof(val));
if (err)
return err;
-   if (get_op(val) == OP_PREFIX) {
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
*inst = ppc_inst_prefix(val, suffix);
} else {
@@ -26,17 +25,3 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
}
return err;
 }
-#else /* !CONFIG_PPC64 */
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-#endif /* CONFIG_PPC64 */
-- 
2.25.0


