Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-15 Thread Mark Rutland
On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > +/**
> > + * enum execmem_type - types of executable memory ranges
> > + *
> > + * There are several subsystems that allocate executable memory.
> > + * Architectures define different restrictions on placement,
> > + * permissions, alignment and other parameters for memory that can be used
> > + * by these subsystems.
> > + * Types in this enum identify subsystems that allocate executable memory
> > + * and let architectures define parameters for ranges suitable for
> > + * allocations by each subsystem.
> > + *
> > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > + * are not explcitly defined.
> > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > + * @EXECMEM_KPROBES: parameters for kprobes
> > + * @EXECMEM_FTRACE: parameters for ftrace
> > + * @EXECMEM_BPF: parameters for BPF
> > + * @EXECMEM_TYPE_MAX:
> > + */
> > +enum execmem_type {
> > +   EXECMEM_DEFAULT,
> > +   EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > +   EXECMEM_KPROBES,
> > +   EXECMEM_FTRACE,
> > +   EXECMEM_BPF,
> > +   EXECMEM_TYPE_MAX,
> > +};
> 
> Can we please get a break-down of how all these types are actually
> different from one another?
> 
> I'm thinking some platforms have a tiny immediate space (arm64 comes to
> mind) and has less strict placement constraints for some of them?

Yeah, and really I'd *much* rather deal with that in arch code, as I have said
several times.

For arm64 we have two basic restrictions:

1) Direct branches can go +/-128M
   We can expand this range by having direct branches go to PLTs, at a
   performance cost.

2) PREL32 relocations can go +/-2G
   We cannot expand this further.

* We don't need to allocate memory for ftrace. We do not use trampolines.

* Kprobes XOL areas don't care about either of those; we don't place any
  PC-relative instructions in those. Maybe we want to in future.

* Modules care about both; we'd *prefer* to place them within +/-128M of all
  other kernel/module code, but if there's no space we can use PLTs and expand
  that to +/-2G. Since modules can reference other modules, that ends up
  actually being halved, and modules have to fit within some 2G window that
  also covers the kernel.

* I'm not sure about BPF's requirements; it seems happy doing the same as
  modules.

So if we *must* use a common execmem allocator, what we'd really want is our own
types, e.g.

EXECMEM_ANYWHERE
EXECMEM_NOPLT
EXECMEM_PREL32

... and then we use those in arch code to implement module_alloc() and friends.
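
To make that concrete, here's a purely illustrative sketch of how the arch side
could describe those windows and allocate from them when implementing
module_alloc() and friends. Everything below (the names, fields and prot/flag
choices) is made up for illustration only, not proposed code:

  /* Illustrative sketch only -- not proposed kernel code. */
  enum arm64_execmem_kind {
          ARM64_EXECMEM_ANYWHERE,         /* no PC-relative constraints     */
          ARM64_EXECMEM_NOPLT,            /* +/-128M of kernel/module text  */
          ARM64_EXECMEM_PREL32,           /* within the +/-2G PREL32 window */
          ARM64_EXECMEM_NR,
  };

  struct execmem_window {
          unsigned long start;
          unsigned long end;
  };

  /* Filled in at boot, relative to the kernel image. */
  static struct execmem_window arm64_execmem_windows[ARM64_EXECMEM_NR];

  static void *arm64_execmem_alloc(enum arm64_execmem_kind kind, size_t size)
  {
          const struct execmem_window *w = &arm64_execmem_windows[kind];

          return __vmalloc_node_range(size, PAGE_SIZE, w->start, w->end,
                                      GFP_KERNEL, PAGE_KERNEL_ROX, 0,
                                      NUMA_NO_NODE,
                                      __builtin_return_address(0));
  }

module_alloc() would then try the NOPLT window first and only fall back to the
PREL32 window when module PLTs are available.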

Mark.


Re: [PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:47AM +, Ryan Roberts wrote:
> Hi All,
> 
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. The change benefits arm64, but there is some (very) minor
> refactoring for x86 to enable its integration with core-mm.

I've looked over each of the arm64-specific patches, and those all seem good to
me. I've thrown my local Syzkaller instance at the series, and I'll shout if
that hits anything that's not clearly a latent issue prior to this series.

The other bits also look good to me, so FWIW, for the series as a whole:

Acked-by: Mark Rutland 

Mark.


Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:32:05AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, such can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 26 +
>  arch/arm64/mm/contpte.c  | 64 
>  2 files changed, 90 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 8310875133ff..401087e8a43d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1185,6 +1185,8 @@ extern void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>   * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
>   * a private implementation detail of the public ptep API (see below).
>   */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
>  extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte);
>  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> @@ -1206,6 +1208,29 @@ extern int contpte_ptep_set_access_flags(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
>  
> +static __always_inline void contpte_try_fold(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * Only bother trying if both the virtual and physical addresses are
> +  * aligned and correspond to the last entry in a contig range. The core
> +  * code mostly modifies ranges from low to high, so this is the likely
> +  * the last modification in the contig range, so a good time to fold.
> +  * We can't fold special mappings, because there is no associated folio.
> +  */
> +
> + const unsigned long contmask = CONT_PTES - 1;
> + bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
> +
> + if (unlikely(valign)) {
> + bool palign = (pte_pfn(pte) & contmask) == contmask;
> +
> + if (unlikely(palign &&
> + pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
> + __contpte_try_fold(mm, addr, ptep, pte);
> + }
> +}
> +
>  static __always_inline void contpte_try_unfold(struct mm_struct *mm,
>   unsigned long addr, pte_t *ptep, pte_t pte)
>  {
> @@ -1286,6 +1311,7 @@ static __always_inline void set_ptes(struct mm_struct 
> *mm, unsigned long addr,
>   if (likely(nr == 1)) {
>   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>   __set_ptes(mm, addr, ptep, pte, 1);
> + contpte_try_fold(mm, addr, ptep, pte);
>   } else {
>   contpte_set_ptes(mm, addr, ptep, pte, nr);
>   }
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 50e0173dc5ee..16788f07716d 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -73,6 +73,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned 
> long addr,
>   __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>  }
>  
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * We have already checked that the virtual and pysical addresses are
> +  * correctly aligned for a contpte mapping in contpte_try_fold() so the
> +  * remaining checks are to ensure that the contpte range is fully
> +  * covered by a single folio, and ensure that all the ptes are valid
> +  * with contiguous PFNs and matching prots. We ignore the state of the
> +  * access and dirty bits for the purpose of deciding if its a contiguous
> +  * range; the folding process will generate a single contpte entry which
> +  * has a single access and dirty bit. Tho

Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:32:01AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 67 
>  arch/arm64/mm/contpte.c  | 17 
>  2 files changed, 84 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 8643227c318b..a8f1a35e3086 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct 
> mm_struct *mm,
>   return pte;
>  }
>  
> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long 
> addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + __ptep_get_and_clear(mm, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +
> +static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>  static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> @@ -1160,6 +1191,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t 
> orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte, unsigned int nr);
> +extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full);
> +extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1253,6 +1289,35 @@ static inline void pte_clear(struct mm_struct *mm,
>   __pte_clear(mm, addr, ptep);
>  }
>  
> +#define clear_full_ptes clear_full_ptes
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + contpte_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +}
> +
> +#define get_and_clear_full_ptes get_and_clear_full_ptes
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte;
> +
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +
> + return pte;
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>   uns

Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:32:00AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 61 ++--
>  arch/arm64/mm/contpte.c  | 38 
>  2 files changed, 89 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 831099cfc96b..8643227c318b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
> mm_struct *mm,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -/*
> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> - */
> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> - unsigned long address, pte_t *ptep)
> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + pte_t pte)
>  {
> - pte_t old_pte, pte;
> + pte_t old_pte;
>  
> - pte = __ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_wrprotect(pte);
> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
> *mm,
>   } while (pte_val(pte) != pte_val(old_pte));
>  }
>  
> +/*
> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> + */
> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep)
> +{
> + ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
> +}
> +
> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long 
> address,
> + pte_t *ptep, unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + __ptep_set_wrprotect(mm, address, ptep);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> @@ -1149,6 +1164,8 @@ extern int contpte_ptep_test_and_clear_young(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr);
>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
> @@ -1268,12 +1285,35 @@ static inline int ptep_clear_flush_young(struct 
> vm_area_struct *vma,
>   return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
> +#define wrprotect_ptes wrprotect_ptes
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> +{
> + if (likely(nr == 1)) {
> + /*
> +  * Optimization: wrprotect_ptes() can only be called for present
> +  * ptes so we only need to check contig bit as condition for
> +

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While, a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Acked-by: Ard Biesheuvel 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/Kconfig   |   9 +
>  arch/arm64/include/asm/pgtable.h | 167 ++
>  arch/arm64/mm/Makefile   |   1 +
>  arch/arm64/mm/contpte.c  | 285 +++
>  include/linux/efi.h  |   5 +
>  5 files changed, 467 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e8275a40afbd..5a7ac1f37bdc 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2229,6 +2229,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>   select UNWIND_TABLES
>   select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> + bool "Contiguous PTE mappings for user memory" if EXPERT
> + depends on TRANSPARENT_HUGEPAGE
> + default y
> + help
> +   When enabled, user mappings are configured using the PTE contiguous
> +   bit, for any mappings that meet the size and alignment requirements.
> +   This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 7336d40a893a..831099cfc96b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t 
> phys)
>   */
>  #define pte_valid_not_user(pte) \
>   ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | 
> PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)  (pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1128,6 +1132,167 @@ extern void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t old_pte, pte_t new_pte);
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in 
> ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
> + * a private implementation detail of the public ptep API (see bel

Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/tlbflush.h | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 1deb5d789c2e..3b0e8248e1a4 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -422,7 +422,7 @@ do {  
> \
>  #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>   __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, 
> kvm_lpa2_is_enabled());
>  
> -static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>unsigned long start, unsigned long end,
>unsigned long stride, bool last_level,
>int tlb_level)
> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct 
> vm_area_struct *vma,
>   __flush_tlb_range_op(vae1is, start, pages, stride, asid,
>tlb_level, true, lpa2_is_enabled());
>  
> - dsb(ish);
>   mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>  }
>  
> +static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +  unsigned long start, unsigned long end,
> +  unsigned long stride, bool last_level,
> +  int tlb_level)
> +{
> + __flush_tlb_range_nosync(vma, start, end, stride,
> +  last_level, tlb_level);
> + dsb(ish);
> +}
> +
>  static inline void flush_tlb_range(struct vm_area_struct *vma,
>  unsigned long start, unsigned long end)
>  {
> -- 
> 2.25.1
> 
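
To illustrate the intended usage pattern with a hypothetical caller (this is
not the forthcoming contpte code, and the stride/level arguments below are
only examples): batch up the range invalidation(s) and issue a single
trailing DSB.

  static inline void flush_user_range_deferred_sync(struct vm_area_struct *vma,
                                                    unsigned long start,
                                                    unsigned long end)
  {
          /* Queue the TLBIs for the range, but don't wait for completion yet. */
          __flush_tlb_range_nosync(vma, start, end, PAGE_SIZE, true, 3);

          /* ... potentially more maintenance for other ranges here ... */

          /* A single barrier completes all the invalidations issued above. */
          dsb(ish);
  }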


Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, The
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 83 +---
>  arch/arm64/kernel/efi.c  |  4 +-
>  arch/arm64/kernel/mte.c  |  2 +-
>  arch/arm64/kvm/guest.c   |  2 +-
>  arch/arm64/mm/fault.c| 12 ++---
>  arch/arm64/mm/fixmap.c   |  4 +-
>  arch/arm64/mm/hugetlbpage.c  | 40 +++
>  arch/arm64/mm/kasan_init.c   |  6 +--
>  arch/arm64/mm/mmu.c  | 14 +++---
>  arch/arm64/mm/pageattr.c |  6 +--
>  arch/arm64/mm/trans_pgd.c|  6 +--
>  11 files changed, 93 insertions(+), 86 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 9a2df85eb493..7336d40a893a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | 
> pgprot_val(prot))
>  
>  #define pte_none(pte)(!pte_val(pte))
> -#define pte_clear(mm,addr,ptep)  set_pte(ptep, __pte(0))
> +#define __pte_clear(mm, addr, ptep) \
> + __set_pte(ptep, __pte(0))
>  #define pte_page(pte)(pfn_to_page(pte_pfn(pte)))
>  
>  /*
> @@ -137,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   * so that we don't erroneously return false for pages that have been
>   * remapped as PROT_NONE but are yet to be flushed from the TLB.
>   * Note that we can't make any assumptions based on the state of the access
> - * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
> + * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
>   * TLB.
>   */
>  #define pte_accessible(mm, pte)  \
> @@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>   return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>  }
>  
> -static inline void set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>  {
>   WRITE_ONCE(*ptep, pte);
>  
> @@ -275,8 +276,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>   }
>  }
>  
> -#define ptep_get ptep_get
> -static inline pte_t ptep_get(pte_t *ptep)
> +static inline pte_t __ptep_get(pte_t *ptep)
>  {
>   return READ_ONCE(*ptep);
>  }
> @@ -308,7 +308,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>   if (!IS_ENABLED(CONFIG_DEBUG_VM))
>   return;
>  
> - old_pte = ptep_get(ptep);
> + old_pte = __ptep_get(ptep);
>  
>   if (!pte_valid(old_pte) || !pte_valid(pte))
>   return;
> @@ -317,7 +317,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>  
>   /*
>* Check for potential race with hardware updates of the pte
> -  * (ptep_set_access_flags safely changes valid ptes without going
> +  * (__ptep_set_access_flags safely changes valid ptes without going
>* through an invalid entry).
>*/
>   VM_WARN_ONCE(!pte_young(pte),
> @@ -363,23 +363,22 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
> long nr)
>   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>  }
>  
> -static inline void set_ptes(struct mm_struct *mm,
> - unsigned long __always_unused addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void __set_ptes(struct mm_struct *mm,
> +   unsigned long __always_unused addr,
> +   pte_t *ptep, pte_t pte, 

Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/mm/hugetlbpage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 27f6160890d1..48e8b429879d 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -229,7 +229,7 @@ static void clear_flush(struct mm_struct *mm,
>   unsigned long i, saddr = addr;
>  
>   for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
> - ptep_clear(mm, addr, ptep);
> + ptep_get_and_clear(mm, addr, ptep);
>  
>   flush_tlb_range(&vma, saddr, addr);
>  }
> -- 
> 2.25.1
> 


Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h |  2 +-
>  arch/arm64/kernel/mte.c  |  2 +-
>  arch/arm64/kvm/guest.c   |  2 +-
>  arch/arm64/mm/fault.c|  2 +-
>  arch/arm64/mm/hugetlbpage.c  | 10 +-
>  5 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index de034ca40bad..9a2df85eb493 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1084,7 +1084,7 @@ static inline void arch_swap_restore(swp_entry_t entry, 
> struct folio *folio)
>  #endif /* CONFIG_ARM64_MTE */
>  
>  /*
> - * On AArch64, the cache coherency is handled via the set_pte_at() function.
> + * On AArch64, the cache coherency is handled via the set_ptes() function.
>   */
>  static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index a41ef3213e1e..59bfe2e96f8f 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
>   /*
>* If the page content is identical but at least one of the pages is
>* tagged, return non-zero to avoid KSM merging. If only one of the
> -  * pages is tagged, set_pte_at() may zero or change the tags of the
> +  * pages is tagged, set_ptes() may zero or change the tags of the
>* other page via mte_sync_tags().
>*/
>   if (page_mte_tagged(page1) || page_mte_tagged(page2))
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index aaf1d4939739..6e0df623c8e9 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>   } else {
>   /*
>* Only locking to serialise with a concurrent
> -  * set_pte_at() in the VMM but still overriding the
> +  * set_ptes() in the VMM but still overriding the
>* tags, hence ignoring the return value.
>*/
>   try_page_mte_tagging(page);
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index a254761fa1bd..3235e23309ec 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
>   *
>   * It needs to cope with hardware update of the accessed/dirty state by other
>   * agents in the system and can safely skip the __sync_icache_dcache() call 
> as,
> - * like set_pte_at(), the PTE is never changed from no-exec to exec here.
> + * like set_ptes(), the PTE is never changed from no-exec to exec here.
>   *
>   * Returns whether or not the PTE actually changed.
>   */
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 2892f925ed66..27f6160890d1 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -247,12 +247,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned 
> long addr,
>  
>   if (!pte_present(pte)) {
>   for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
> - set_pte_at(mm, addr, ptep, pte);
> + set_ptes(mm, addr, ptep, pte, 1);
>   return;
>   }
>  
>   if (!pte_cont(pte)) {
> - set_pte_at(mm, addr, ptep, pte);
> + set_ptes(mm, addr, ptep, pte, 1);
>   return;
>   }
>  
> @@ -263,7 +263,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
> addr,
>   clear_flush(mm, addr, ptep, pgsize, ncontig);
>  
>   for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> - set_pte_

Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 12 +---
>  arch/arm64/kernel/efi.c  |  2 +-
>  arch/arm64/mm/fault.c|  4 ++--
>  arch/arm64/mm/hugetlbpage.c  |  6 +++---
>  arch/arm64/mm/kasan_init.c   |  2 +-
>  arch/arm64/mm/mmu.c  | 12 ++--
>  arch/arm64/mm/pageattr.c |  4 ++--
>  arch/arm64/mm/trans_pgd.c|  2 +-
>  8 files changed, 25 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index b6d3e9e0a946..de034ca40bad 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -275,6 +275,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>   }
>  }
>  
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> + return READ_ONCE(*ptep);
> +}
> +
>  extern void __sync_icache_dcache(pte_t pteval);
>  bool pgattr_change_is_safe(u64 old, u64 new);
>  
> @@ -302,7 +308,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>   if (!IS_ENABLED(CONFIG_DEBUG_VM))
>   return;
>  
> - old_pte = READ_ONCE(*ptep);
> + old_pte = ptep_get(ptep);
>  
>   if (!pte_valid(old_pte) || !pte_valid(pte))
>   return;
> @@ -904,7 +910,7 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
>  {
>   pte_t old_pte, pte;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_mkold(pte);
> @@ -986,7 +992,7 @@ static inline void ptep_set_wrprotect(struct mm_struct 
> *mm, unsigned long addres
>  {
>   pte_t old_pte, pte;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_wrprotect(pte);
> diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
> index 0228001347be..d0e08e93b246 100644
> --- a/arch/arm64/kernel/efi.c
> +++ b/arch/arm64/kernel/efi.c
> @@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned 
> long addr, void *data)
>  {
>   struct set_perm_data *spd = data;
>   const efi_memory_desc_t *md = spd->md;
> - pte_t pte = READ_ONCE(*ptep);
> + pte_t pte = ptep_get(ptep);
>  
>   if (md->attribute & EFI_MEMORY_RO)
>   pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..a254761fa1bd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
>   if (!ptep)
>   break;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   pr_cont(", pte=%016llx", pte_val(pte));
>   pte_unmap(ptep);
>   } while(0);
> @@ -214,7 +214,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
> pte_t entry, int dirty)
>  {
>   pteval_t old_pteval, pteval;
> - pte_t pte = READ_ONCE(*ptep);
> + pte_t pte = ptep_get(ptep);
>  
>   if (pte_same(pte, entry))
>   return 0;
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 6720ec8d50e7..2892f925ed66 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -485,7 +485,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>   size_t pgsize;
>   pte_t pte;
>  
> - if (!pte_cont(READ_ONCE(*ptep))) {
> + if (!pte_cont(ptep_get(ptep))) {
>   ptep_set_wrprotect(mm, addr, ptep);
>   return;
>   }
> @@ -510,7 +510,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>   size_t pgsize;
>   int ncontig;
>  
> - if (!pte_cont(READ_ONCE(*ptep)))
> + if (!pte_cont(ptep_get(ptep)))
>   return ptep_clear_flush(vma, addr, ptep);
>  
>   ncontig = find_num_contig(mm, addr, ptep, );
> @@ -543,7 +543,7 @@ pte_t huge_pt

Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Mark Rutland
On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 52d0b0a763f1..b6d3e9e0a946 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
>   return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
>  }
>  
> -#define pte_next_pfn pte_next_pfn
> -static inline pte_t pte_next_pfn(pte_t pte)
> +#define pte_advance_pfn pte_advance_pfn
> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>  {
> - return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
> + return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>  }
>  
>  static inline void set_ptes(struct mm_struct *mm,
> @@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
>   if (--nr == 0)
>   break;
>   ptep++;
> - pte = pte_next_pfn(pte);
> + pte = pte_advance_pfn(pte, 1);
>   }
>  }
>  #define set_ptes set_ptes
> -- 
> 2.25.1
> 


Re: [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings

2024-02-13 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:56AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, such can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an individual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 
> ---
>  arch/arm64/include/asm/pgtable.h | 26 +
>  arch/arm64/mm/contpte.c  | 64 
>  2 files changed, 90 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index cdc310880a3b..d3357fe4eb89 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1192,6 +1192,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t 
> *ptep, pte_t pte);
>   * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
>   * a private implementation detail of the public ptep API (see below).
>   */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
>  extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte);
>  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> @@ -1213,6 +1215,29 @@ extern int contpte_ptep_set_access_flags(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
>  
> +static __always_inline void contpte_try_fold(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * Only bother trying if both the virtual and physical addresses are
> +  * aligned and correspond to the last entry in a contig range. The core
> +  * code mostly modifies ranges from low to high, so this is the likely
> +  * the last modification in the contig range, so a good time to fold.
> +  * We can't fold special mappings, because there is no associated folio.
> +  */
> +
> + const unsigned long contmask = CONT_PTES - 1;
> + bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
> +
> + if (unlikely(valign)) {
> + bool palign = (pte_pfn(pte) & contmask) == contmask;
> +
> + if (unlikely(palign &&
> + pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
> + __contpte_try_fold(mm, addr, ptep, pte);
> + }
> +}
> +
>  static __always_inline void contpte_try_unfold(struct mm_struct *mm,
>   unsigned long addr, pte_t *ptep, pte_t pte)
>  {
> @@ -1287,6 +1312,7 @@ static __always_inline void set_ptes(struct mm_struct 
> *mm, unsigned long addr,
>   if (likely(nr == 1)) {
>   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>   __set_ptes(mm, addr, ptep, pte, 1);
> + contpte_try_fold(mm, addr, ptep, pte);
>   } else {
>   contpte_set_ptes(mm, addr, ptep, pte, nr);
>   }
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 80346108450b..2c7dafd0552a 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -67,6 +67,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned 
> long addr,
>   __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>  }
>  
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * We have already checked that the virtual and pysical addresses are
> +  * correctly aligned for a contpte mapping in contpte_try_fold() so the
> +  * remaining checks are to ensure that the contpte range is fully
> +  * covered by a single folio, and ensure that all the ptes are valid
> +  * with contiguous PFNs and matching prots. We ignore the state of the
> +  * access and dirty bits for the purpose of deciding if its a contiguous
> +  * range; the folding process will generate a single contpte entry which
> +  * has a single access and dirty bit. Those 2 bits are the logical OR of
> +  * their respective bits in the constituent pte entries. In order to
> +  * ensure the contpte range is covered by a single folio, we must
> +  * recover the folio from the pfn, but special mappings don't have a
> +  * folio backing them. Fortunately contpte_try_fold() already checked
> +  * that the pte is 

Re: [PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf

2024-02-13 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:55AM +, Ryan Roberts wrote:
> As set_ptes() and wrprotect_ptes() become a bit more complex, the
> compiler may choose not to inline them. But this is critical for fork()
> performance. So mark the functions, along with contpte_try_unfold()
> which is called by them, as __always_inline. This is worth ~1% on the
> fork() microbenchmark with order-0 folios (the common case).
> 
> Signed-off-by: Ryan Roberts 

I have no strong feelings either way on this, so:

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 353ea67b5d75..cdc310880a3b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1213,8 +1213,8 @@ extern int contpte_ptep_set_access_flags(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
>  
> -static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long 
> addr,
> - pte_t *ptep, pte_t pte)
> +static __always_inline void contpte_try_unfold(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, pte_t pte)
>  {
>   if (unlikely(pte_valid_cont(pte)))
>   __contpte_try_unfold(mm, addr, ptep, pte);
> @@ -1279,7 +1279,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>  }
>  
>  #define set_ptes set_ptes
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +static __always_inline void set_ptes(struct mm_struct *mm, unsigned long 
> addr,
>   pte_t *ptep, pte_t pte, unsigned int nr)
>  {
>   pte = pte_mknoncont(pte);
> @@ -1361,8 +1361,8 @@ static inline int ptep_clear_flush_young(struct 
> vm_area_struct *vma,
>  }
>  
>  #define wrprotect_ptes wrprotect_ptes
> -static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> - pte_t *ptep, unsigned int nr)
> +static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int 
> nr)
>  {
>   if (likely(nr == 1)) {
>   /*
> -- 
> 2.25.1
> 


Re: [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()

2024-02-13 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:54AM +, Ryan Roberts wrote:
> When core code iterates over a range of ptes and calls ptep_get() for
> each of them, if the range happens to cover contpte mappings, the number
> of pte reads becomes amplified by a factor of the number of PTEs in a
> contpte block. This is because for each call to ptep_get(), the
> implementation must read all of the ptes in the contpte block to which
> it belongs to gather the access and dirty bits.
> 
> This causes a hotspot for fork(), as well as operations that unmap
> memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
> can fix this by implementing pte_batch_hint() which allows their
> iterators to skip getting the contpte tail ptes when gathering the batch
> of ptes to operate on. This results in the number of PTE reads returning
> to 1 per pte.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index ad04adb7b87f..353ea67b5d75 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct 
> *mm, unsigned long addr,
>   __contpte_try_unfold(mm, addr, ptep, pte);
>  }
>  
> +#define pte_batch_hint pte_batch_hint
> +static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
> +{
> + if (!pte_valid_cont(pte))
> + return 1;
> +
> + return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
> +}
> +
>  /*
>   * The below functions constitute the public API that arm64 presents to the
>   * core-mm to manipulate PTE entries within their page tables (or at least 
> this
> -- 
> 2.25.1
> 
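
A quick worked example of the hint arithmetic above. Since ptes are 8 bytes,
((unsigned long)ptep >> 3) is the entry index within its page table, and
masking with (CONT_PTES - 1) gives the index within the contpte block; the
hint is then the number of entries left in that block. Assuming a 4K granule,
where CONT_PTES is 16:

  /*
   * Assumed: 4K pages, CONT_PTES == 16, sizeof(pte_t) == 8.
   *
   * ptep points at entry 5 of its contpte block:
   *   ((unsigned long)ptep >> 3) & (CONT_PTES - 1)  == 5
   *   hint = CONT_PTES - 5                          == 11
   *
   * i.e. entries 5..15 (11 ptes) can be treated as one batch; a
   * non-contiguous pte always returns a hint of 1.
   */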


Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-13 Thread Mark Rutland
On Tue, Feb 13, 2024 at 04:48:50PM +, Ryan Roberts wrote:
> On 13/02/2024 16:43, Mark Rutland wrote:
> > On Fri, Feb 02, 2024 at 08:07:52AM +, Ryan Roberts wrote:

> >> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long 
> >> addr,
> >> +  pte_t *ptep, unsigned int nr, int full)
> >> +{
> >> +  for (;;) {
> >> +  __ptep_get_and_clear(mm, addr, ptep);
> >> +  if (--nr == 0)
> >> +  break;
> >> +  ptep++;
> >> +  addr += PAGE_SIZE;
> >> +  }
> >> +}
> > 
> > The loop construct is a bit odd; can't this be:
> 
> I found it a little odd at first, but it's avoiding the ptep and addr 
> increments
> the last time through the loop. It's the preferred pattern for these functions 
> in
> core-mm. See default set_ptes(), wrprotect_ptes(), clear_full_ptes() in
> include/linux/pgtable.h.
> 
> So I'd prefer to leave it as is so that we match them. What do you think?

That's fair enough; I'm happy with it as-is.

Mark.


Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-13 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:52AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 
> ---
>  arch/arm64/include/asm/pgtable.h | 67 
>  arch/arm64/mm/contpte.c  | 17 
>  2 files changed, 84 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index c07f0d563733..ad04adb7b87f 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct 
> mm_struct *mm,
>   return pte;
>  }
>  
> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long 
> addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + __ptep_get_and_clear(mm, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}

The loop construct is a bit odd; can't this be:

while (nr--) {
__ptep_get_and_clear(mm, addr, ptep);
ptep++;
addr += PAGE_SIZE;
}

... or:

do {
__ptep_get_and_clear(mm, addr, ptep);
ptep++;
addr += PAGE_SIZE;
} while (--nr);

... ?

Otherwise, this looks good to me.

Mark.

> +
> +static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>  static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> @@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t 
> orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte, unsigned int nr);
> +extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full);
> +extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
>   __pte_clear(mm, addr, ptep);
>  }
>  
> +#define clear_full_ptes clear_full_ptes
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + contpte_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +}
> +
> +#define get_and_clear_full_ptes get_and_clear_full_ptes
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte;
> +
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +
> + return pte;
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>   

Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-13 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:51AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/latest/

Minor nit, but it'd be better to refer to a specific revision of the document,
e.g.

  https://developer.arm.com/documentation/102105/ja-07/

That way people can see the specific version of the text you were referring to
even if that changes later, and it means the link is still useful when D21194
gets merged into the ARM ARM and dropped from the known issues doc.

> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 
> ---
>  arch/arm64/include/asm/pgtable.h | 61 ++--
>  arch/arm64/mm/contpte.c  | 35 ++
>  2 files changed, 86 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 34892a95403d..c07f0d563733 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
> mm_struct *mm,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -/*
> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> - */
> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> - unsigned long address, pte_t *ptep)
> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + pte_t pte)
>  {
> - pte_t old_pte, pte;
> + pte_t old_pte;
>  
> - pte = __ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_wrprotect(pte);
> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
> *mm,
>   } while (pte_val(pte) != pte_val(old_pte));
>  }
>  
> +/*
> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> + */
> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep)
> +{
> + ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
> +}
> +
> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long 
> address,
> + pte_t *ptep, unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + __ptep_set_wrprotect(mm, address, ptep);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> @@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr);
>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
> @@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct 
> vm_area_struct *vma,
>   return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
> +#define wrprotect_ptes wrprotect_ptes
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> +{
> + if (likely(nr == 1)) {
> + /*
> +  * Optimization: wrprotect_ptes() can only be called for present
> +  * ptes so we only need to check contig bit as condition for
> +  * unfold, 

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Mark Rutland
On Mon, Feb 12, 2024 at 12:59:57PM +, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
> > Hi Ryan,

[...]

> >> +static inline void set_pte(pte_t *ptep, pte_t pte)
> >> +{
> >> +  /*
> >> +   * We don't have the mm or vaddr so cannot unfold contig entries (since
> >> +   * it requires tlb maintenance). set_pte() is not used in core code, so
> >> +   * this should never even be called. Regardless do our best to service
> >> +   * any call and emit a warning if there is any attempt to set a pte on
> >> +   * top of an existing contig range.
> >> +   */
> >> +  pte_t orig_pte = __ptep_get(ptep);
> >> +
> >> +  WARN_ON_ONCE(pte_valid_cont(orig_pte));
> >> +  __set_pte(ptep, pte_mknoncont(pte));
> >> +}
> >> +
> >> +#define set_ptes set_ptes
> >> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> >> +  pte_t *ptep, pte_t pte, unsigned int nr)
> >> +{
> >> +  pte = pte_mknoncont(pte);
> > 
> > Why do we have to clear the contiguous bit here? Is that for the same 
> > reason as
> > set_pte(), or do we expect callers to legitimately call this with the
> > contiguous bit set in 'pte'?
> > 
> > I think you explained this to me in-person, and IIRC we don't expect 
> > callers to
> > go set the bit themselves, but since it 'leaks' out to them via 
> > __ptep_get() we
> > have to clear it here to defer the decision of whether to set/clear it when
> > modifying entries. It would be nice if we could have a description of 
> > why/when
> > we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit 
> is
> private to the architecture code and is never set directly by core code. If 
> the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just unconditionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...).
> That way, the core code is guaranteed never to see a pte with the PTE_CONT
> bit set and can therefore never accidentally pass such a pte into a setter
> function. However, there is existing functionality that relies on being able
> to get a pte, then pass it to pte_leaf_size(), an arch function that checks
> the PTE_CONT bit to determine how big the leaf is. This is used in
> perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.

Great, thanks!
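
For reference, the arm64 helper in question looks roughly like the below
(paraphrased; the exact definition in arch/arm64/include/asm/pgtable.h is
authoritative), which is why stripping PTE_CONT in the getters would make perf
report PAGE_SIZE even for contpte-mapped memory:

#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)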

[...]

> >> +static inline bool mm_is_user(struct mm_struct *mm)
> >> +{
> >> +  /*
> >> +   * Don't attempt to apply the contig bit to kernel mappings, because
> >> +   * dynamically adding/removing the contig bit can cause page faults.
> >> +   * These racing faults are ok for user space, since they get serialized
> >> +   * on the PTL. But kernel mappings can't tolerate faults.
> >> +   */
> >> +  return mm != &init_mm;
> >> +}
> > 
> > We also have the efi_mm as a non-user mm, though I don't think we manipulate
> > that while it is live, and I'm not sure if that needs any special handling.
> 
> Well, we never need this function in the hot (order-0 folio) path, so I think
> I could add a check for efi_mm here without a performance implication. It's
> probably safest to explicitly exclude it? What do you think?

That sounds ok to me.

Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
mm (e.g. because of the way we construct that), I'd be happy with a comment.

Probably best to Cc Ard for whatever we do here.
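
For the record, the explicit-exclusion variant being discussed is a sketch
along the following lines (it assumes efi_mm is reachable via <linux/efi.h>
and CONFIG_EFI; whether the extra compare is acceptable on the relevant paths
is the open question above):

#include <linux/efi.h>
#include <linux/mm_types.h>

static inline bool mm_is_user(struct mm_struct *mm)
{
	/*
	 * Neither the kernel nor the EFI runtime mappings can tolerate the
	 * transient faults that adding/removing the contiguous bit can
	 * generate, so never treat them as candidates for contpte.
	 */
	return mm != &init_mm && mm != &efi_mm;
}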

> >> +static inline pte_t *contpte_align_down(pte_t *ptep)
> >> +{
> >> +  return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> > 
> > I think this can be:
> > 
> > static inline pte_t *contpte_align_down(pte_t *ptep)
> > {
> > return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> > }
> 
> Yep - that's much less ugly - thanks!
> 
> > 
> >> +
> >> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> >> +  pte_t *ptep, pte_t pte)
> >> +{
> >> +  struct vm_area_struct vma = TLB_

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Mark Rutland
Hi Ryan,

Overall this looks pretty good; I have a bunch of minor comments below, and a
bigger question on the way ptep_get_lockless() works.

On Fri, Feb 02, 2024 at 08:07:50AM +, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 
> ---
>  arch/arm64/Kconfig   |   9 +
>  arch/arm64/include/asm/pgtable.h | 161 ++
>  arch/arm64/mm/Makefile   |   1 +
>  arch/arm64/mm/contpte.c  | 283 +++
>  4 files changed, 454 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d86d7f4758b5..1442e8ed95b6 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>   select UNWIND_TABLES
>   select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> + bool "Contiguous PTE mappings for user memory" if EXPERT
> + depends on TRANSPARENT_HUGEPAGE
> + default y
> + help
> +   When enabled, user mappings are configured using the PTE contiguous
> +   bit, for any mappings that meet the size and alignment requirements.
> +   This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 7dc6b68ee516..34892a95403d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t 
> phys)
>   */
>  #define pte_valid_not_user(pte) \
>   ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | 
> PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)  (pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t 
> *ptep, pte_t pte);
>  #define vmemmap_update_pte vmemmap_update_pte
>  #endif
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in 
> ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr);

Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-08 Thread Mark Rutland
On Fri, Feb 02, 2024 at 08:07:31AM +, Ryan Roberts wrote:
> Hi All,

Hi Ryan,

I assume this is the same as your 'features/granule_perf/contpte-lkml_v' branch
on https://gitlab.arm.com/linux-arm/linux-rr/

I've taken a quick look, and I have a few initial/superficial comments before
digging into the detail on the important changes.

> Patch Layout
> 
> 
> In this version, I've split the patches to better show each optimization:
> 
>   - 1-2:mm prep: misc code and docs cleanups

I'm not confident enough to comment on patch 2, but these look reasonable to
me.

>   - 3-8:mm,arm,arm64,powerpc,x86 prep: Replace pte_next_pfn() with more
> general pte_advance_pfn()

These look fine to me.

>   - 9-18:   arm64 prep: Refactor ptep helpers into new layer

The result of patches 9-17 looks good to me, but the intermediate stages where
only some functions are converted are a bit odd, and it's a bit painful to
review since you need to skip ahead a few patches to see the end result and
tell that the conversions are consistent and complete.

IMO it'd be easier for review if that were three patches:

1) Convert READ_ONCE() -> ptep_get()
2) Convert set_pte_at() -> set_ptes()
3) All the "New layer" renames and addition of the trivial wrappers

Patch 18 looks fine to me.

>   - 19: functional contpte implementation
>   - 20-25:  various optimizations on top of the contpte implementation

I'll try to dig into these over the next few days.

Mark.


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-30 Thread Mark Rutland
On Tue, Jan 30, 2024 at 06:57:24PM +0800, Tong Tiangen wrote:
> On 2024/1/30 1:51, Mark Rutland wrote:
> > On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:

> > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > index 55f6455a8284..312932dc100b 100644
> > > --- a/arch/arm64/mm/fault.c
> > > +++ b/arch/arm64/mm/fault.c
> > > @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long 
> > > esr, struct pt_regs *regs)
> > >   return 1; /* "fault" */
> > >   }
> > > +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> > > +  struct pt_regs *regs, int sig, int code)
> > > +{
> > > + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> > > + return false;
> > > +
> > > + if (user_mode(regs))
> > > + return false;
> > 
> > This function is called "arm64_do_kernel_sea"; surely the caller should 
> > *never*
> > call this for a SEA taken from user mode?
> 
> In do_sea(), the processing logic is as follows:
>   do_sea()
>   {
> [...]
> if (user_mode(regs) && apei_claim_sea(regs) == 0) {
>return 0;
> }
> [...]
> //[1]
> if (!arm64_do_kernel_sea()) {
>arm64_notify_die();
> }
>   }
> 
> [1] It is still possible for user_mode() to reach here. If it does, it
>  indicates that the impact caused by the memory error could not be
>  processed correctly by apei_claim_sea().
> 
> 
> In this case, only arm64_notify_die() can be used. This also maintains
> the original logic of user_mode() processing.

My point is that either:

(a) The name means that this should *only* be called for SEAs from a kernel
context, and the caller should be responsible for ensuring that.

(b) The name is misleading, and the 'kernel' part should be removed from the
name.

I prefer (a), and if you head down that route it's clear that you can get rid
of a bunch of redundant logic and remove the need for do_kernel_sea(), anyway,
e.g.

| static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
| {
| const struct fault_info *inf = esr_to_fault_info(esr);
| bool claimed = apei_claim_sea(regs) == 0;
| unsigned long siaddr;
| 
| if (claimed) {
| if (user_mode(regs)) {
| /*  
|  * APEI claimed this as a firmware-first notification.
|  * Some processing deferred to task_work before
|  * ret_to_user().
|  */
| return 0;
| } else {
| /*
|  * TODO: explain why this is correct.
|  */
| if ((current->flags & PF_KTHREAD) &&
| fixup_exception_mc(regs))
| return 0;
| }
| }
| 
| if (esr & ESR_ELx_FnV) {
| siaddr = 0;
| } else {
| /*  
|  * The architecture specifies that the tag bits of FAR_EL1 are
|  * UNKNOWN for synchronous external aborts. Mask them out now
|  * so that userspace doesn't see them.
|  */
| siaddr  = untagged_addr(far);
| }   
| arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
| 
| return 0;
| }

> > > +
> > > + if (apei_claim_sea(regs) < 0)
> > > + return false;
> > > +
> > > + if (!fixup_exception_mc(regs))
> > > + return false;
> > > +
> > > + if (current->flags & PF_KTHREAD)
> > > + return true;
> > 
> > I think this needs a comment; why do we allow kthreads to go on, yet kill 
> > user
> > threads? What about helper threads (e.g. for io_uring)?
> 
> If a memory error occurs in a kernel thread, the problem is more
> serious than in a user thread. As a result, related kernel
> functions, such as khugepaged, cannot run properly. A kernel panic
> is the better choice in that case.
> 
> Therefore, the processing scope of this framework is limited to user
> threads.

That's reasonable, but needs to be explained in a comment.

Also, as above, I think you haven't considered helper threads (e.g. io_uring),
which don't have PF_KTHREAD set but do have PF_USER_WORKER set. I suspect those
need the same treatment as kthreads.
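
A sketch of what that check could look like (illustrative only; the helper
name below is made up, and the surrounding flow is the arm64_do_kernel_sea()
from the patch):

#include <linux/sched.h>

/*
 * Sketch: io_uring helper threads have PF_USER_WORKER set (but not
 * PF_KTHREAD) and probably need to be treated the same way as kthreads here.
 */
static bool sea_in_unkillable_context(void)
{
	return !!(current->flags & (PF_KTHREAD | PF_USER_WORKER));
}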

> > > + set_thread_esr(0, esr);
> > 
> > Why do we set the ESR to 0?
> 
> The purpose is to reuse the logic of arm64_notify_die() and set the
> following parameters before sending signals to users:
>   current->thread.fault_address = 0;
>   current->thread.fault_code = err;

Ok, but there's no need to open-code that.

As per my above example, please continue to use the existing call to
arm64_notify_die() rather than open-coding bits of it.

Mark.


Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-30 Thread Mark Rutland
On Tue, Jan 30, 2024 at 07:14:35PM +0800, Tong Tiangen wrote:
> On 2024/1/30 1:43, Mark Rutland wrote:
> > On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
> > Further, this change will also silently fixup unexpected kernel faults if we
> > pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.
> 
> I think this is better than panicking the kernel, because the real bug
> belongs to the user process. Even if the wrong pointer is
> passed, the page corresponding to the wrong pointer has a memory
> error.

I think you have misunderstood my point; I'm talking about the case of a bad
kernel pointer *without* a memory error.

For example, consider some buggy code such as:

void __user *uptr = some_valid_user_pointer;
void *kptr = NULL; // or any other bad pointer

ret = copy_to_user(uptr, kptr, size);
if (ret)
return -EFAULT;

Before this patch, when copy_to_user() attempted to load from NULL it would
fault, there would be no fixup handler for the LDR, and the kernel would die(),
reporting the bad kernel access.

After this patch (which adds fixup handlers to all the LDR*s in
copy_to_user()), the fault (which is *not* a memory error) would be handled by
the fixup handler, and copy_to_user() would return an error without *any*
indication of the horrible kernel bug.

This will hide kernel bugs, which will make those harder to identify and fix,
and will also potentially make it easier to exploit the kernel: if the user
somehow gains control of the kernel pointer, they can rely on the fixup handler
returning an error, and can scan through memory rather than dying as soon as
they pass a bad pointer.

> In addition, the panic information contains necessary information
> for users to check.

There is no panic() in the case I am describing.

> > So NAK to this change as-is; likewise for the addition of USER() to other 
> > ldr*
> > macros in copy_from_user.S and the addition of USER() str* macros in
> > copy_to_user.S.
> > 
> > If we want to handle memory errors on some kaccesses, we need a new 
> > EX_TYPE_*
> > separate from the usual EX_TYPE_KACESS_ERR_ZERO that means "handle memory
> > errors, but treat other faults as fatal". That should come with a rationale 
> > and
> > explanation of why it's actually useful.
> 
> This makes sense. Add kaccess types that can be processed properly.
> 
> > 
> > [...]
> > 
> > > diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> > > index 478e639f8680..28ec35e3d210 100644
> > > --- a/arch/arm64/mm/extable.c
> > > +++ b/arch/arm64/mm/extable.c
> > > @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
> > >   if (!ex)
> > >   return false;
> > > - /*
> > > -  * This is not complete, More Machine check safe extable type can
> > > -  * be processed here.
> > > -  */
> > > + switch (ex->type) {
> > > + case EX_TYPE_UACCESS_ERR_ZERO:
> > > + return ex_handler_uaccess_err_zero(ex, regs);
> > > + }
> > 
> > Please fold this part into the prior patch, and start off with *only*
> > handling
> > errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think 
> > that
> > change would be relatively uncontroversial, and it would be much easier to
> > build atop that.
> 
> OK, the two patches will be merged in the next release.

Thanks.

Mark.


Re: [PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()

2024-01-30 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:51PM +0800, Tong Tiangen wrote:
> Currently, many scenarios that can tolerate memory errors when copying page
> have been supported in the kernel[1][2][3], all of which are implemented by
> copy_mc_[user]_highpage(). arm64 should also support this mechanism.
> 
> Due to mte, arm64 needs to have its own copy_mc_[user]_highpage()
> architecture implementation, macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
> __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.
> 
> Add a new helper copy_mc_page() which provides a machine-check-safe page
> copy implementation. The copy_mc_page() in copy_mc_page.S largely borrows
> from copy_page() in copy_page.S; the main difference is that copy_mc_page()
> adds an extable entry to every load/store insn to support machine check safe
> copying.
> 
> Add new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO which used in
> copy_mc_page().
> 
> [1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults")
> [2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory")
> [3]6b970599e807 ("mm: hwpoison: support recovery from 
> ksm_might_need_to_copy()")
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h | 15 ++
>  arch/arm64/include/asm/assembler.h   |  4 ++
>  arch/arm64/include/asm/mte.h |  5 ++
>  arch/arm64/include/asm/page.h| 10 
>  arch/arm64/lib/Makefile  |  2 +
>  arch/arm64/lib/copy_mc_page.S| 78 
>  arch/arm64/lib/mte.S | 27 ++
>  arch/arm64/mm/copypage.c | 66 ---
>  arch/arm64/mm/extable.c  |  7 +--
>  include/linux/highmem.h  |  8 +++
>  10 files changed, 213 insertions(+), 9 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_mc_page.S
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 980d1dd8e1a3..819044fefbe7 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -10,6 +10,7 @@
>  #define EX_TYPE_UACCESS_ERR_ZERO 2
>  #define EX_TYPE_KACCESS_ERR_ZERO 3
>  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   4
> +#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO5
>  
>  /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
>  #define EX_DATA_REG_ERR_SHIFT0
> @@ -51,6 +52,16 @@
>  #define _ASM_EXTABLE_UACCESS(insn, fixup)\
>   _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
>  
> +#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero)   \
> + __ASM_EXTABLE_RAW(insn, fixup,  \
> +   EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\
> +   ( \
> + EX_DATA_REG(ERR, err) | \
> + EX_DATA_REG(ZERO, zero) \
> +   ))
> +
> +#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup)   
> \
> + _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr)
>  /*
>   * Create an exception table entry for uaccess `insn`, which will branch to 
> `fixup`
>   * when an unhandled fault is taken.
> @@ -59,6 +70,10 @@
>   _ASM_EXTABLE_UACCESS(\insn, \fixup)
>   .endm
>  
> + .macro  _asm_extable_copy_mc_page, insn, fixup
> + _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup)
> + .endm
> +

This should share a common EX_TYPE_ with the other "kaccess where memory error
is handled but other faults are fatal" cases.

>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index 513787e43329..e1d8ce155878 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -154,6 +154,10 @@ lr   .reqx30 // link register
>  #define CPU_LE(code...) code
>  #endif
>  
> +#define CPY_MC(l, x...)  \
> +:   x;   \
> + _asm_extable_copy_mc_pageb, l
> +
>  /*
>   * Define a macro that constructs a 64-bit value by concatenating two
>   * 32-bit registers. Note that on big endian systems the order of the
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 91fbd5c8a391..9cdded082dd4 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  void mte_zero_clear_page_tags(void *addr);
>  void mte_sync_tags(pte_t pte, unsigned int nr_pages);
>  void mte_copy_page_tags(void *kto, const void *kfrom);
> +int mte_copy_mc_page_tags(void *kto, const void *kfrom);
>  void mte_thread_init_user(void);
>  void mte_thread_switch(struct task_struct *next);
>  void mte_cpu_setup(void);
> @@ -128,6 +129,10 @@ static inline void 

Re: [PATCH v10 6/6] arm64: introduce copy_mc_to_kernel() implementation

2024-01-30 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:52PM +0800, Tong Tiangen wrote:
> The copy_mc_to_kernel() helper is a memory copy implementation that handles
> source exceptions. It can be used in memory copy scenarios that tolerate
> hardware memory errors (e.g. pmem_read/dax_copy_to_iter).
> 
> Currently, only x86 and ppc support this helper; now that arm64 supports the
> machine check safe framework, we introduce a copy_mc_to_kernel()
> implementation.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/string.h  |   5 +
>  arch/arm64/include/asm/uaccess.h |  21 +++
>  arch/arm64/lib/Makefile  |   2 +-
>  arch/arm64/lib/memcpy_mc.S   | 257 +++
>  mm/kasan/shadow.c|  12 ++
>  5 files changed, 296 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/lib/memcpy_mc.S

Looking at the diffstat and code, this duplicates arch/arm64/lib/memcpy.S with
a few annotations. Duplicating that code is not maintainable, and so we cannot
take this as-is.

If you want a version that can handle faults, it *must* be written such that
the code is shared with the regular memcpy. That could be done by using macros
to instantiate two copies (one with fault handling, the other without).

It would also be very helpful to see *any* indication that this has been
tested, which is sorely lacking in the series as-is.

Mark.

> diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
> index 3a3264ff47b9..995b63c26e99 100644
> --- a/arch/arm64/include/asm/string.h
> +++ b/arch/arm64/include/asm/string.h
> @@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
>  extern void *memcpy(void *, const void *, __kernel_size_t);
>  extern void *__memcpy(void *, const void *, __kernel_size_t);
>  
> +#define __HAVE_ARCH_MEMCPY_MC
> +extern int memcpy_mcs(void *, const void *, __kernel_size_t);
> +extern int __memcpy_mcs(void *, const void *, __kernel_size_t);
> +
>  #define __HAVE_ARCH_MEMMOVE
>  extern void *memmove(void *, const void *, __kernel_size_t);
>  extern void *__memmove(void *, const void *, __kernel_size_t);
> @@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t 
> cnt);
>   */
>  
>  #define memcpy(dst, src, len) __memcpy(dst, src, len)
> +#define memcpy_mcs(dst, src, len) __memcpy_mcs(dst, src, len)
>  #define memmove(dst, src, len) __memmove(dst, src, len)
>  #define memset(s, c, n) __memset(s, c, n)
>  
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 14be5000c5a0..61e28ef2112a 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -425,4 +425,25 @@ static inline size_t probe_subpage_writeable(const char 
> __user *uaddr,
>  
>  #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
>  
> +#ifdef CONFIG_ARCH_HAS_COPY_MC
> +/**
> + * copy_mc_to_kernel - memory copy that handles source exceptions
> + *
> + * @dst: destination address
> + * @src: source address
> + * @len: number of bytes to copy
> + *
> + * Return 0 for success, or #size if there was an exception.
> + */
> +static inline unsigned long __must_check
> +copy_mc_to_kernel(void *to, const void *from, unsigned long size)
> +{
> + int ret;
> +
> + ret = memcpy_mcs(to, from, size);
> + return (ret == -EFAULT) ? size : 0;
> +}
> +#define copy_mc_to_kernel copy_mc_to_kernel
> +#endif
> +
>  #endif /* __ASM_UACCESS_H */
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index a2fd865b816d..899d6ae9698c 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -3,7 +3,7 @@ lib-y := clear_user.o delay.o copy_from_user.o
> \
>  copy_to_user.o copy_page.o   \
>  clear_page.o csum.o insn.o memchr.o memcpy.o \
>  memset.o memcmp.o strcmp.o strncmp.o strlen.o\
> -strnlen.o strchr.o strrchr.o tishift.o
> +strnlen.o strchr.o strrchr.o tishift.o memcpy_mc.o
>  
>  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
>  obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
> new file mode 100644
> index ..7076b500d154
> --- /dev/null
> +++ b/arch/arm64/lib/memcpy_mc.S
> @@ -0,0 +1,257 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2012-2021, Arm Limited.
> + *
> + * Adapted from the original at:
> + * 
> https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
> + */
> +
> +#include 
> +#include 
> +
> +/* Assumptions:
> + *
> + * ARMv8-a, AArch64, unaligned accesses.
> + *
> + */
> +
> +#define L(label) .L ## label
> +
> +#define dstinx0
> +#define src  x1
> +#define countx2
> +#define dst  x3
> +#define srcend   x4
> +#define dstend   x5
> +#define A_l  x6
> +#define A_lw w6
> +#define A_h  x7
> +#define B_l  x8
> +#define B_lw w8
> +#define B_h  x9
> 

Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-29 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
> If a user process's memory access fails due to a hardware memory error, only
> the relevant processes are affected, so it is more reasonable to kill the user
> process and isolate the corrupt page than to panic the kernel.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/lib/copy_from_user.S | 10 +-
>  arch/arm64/lib/copy_to_user.S   | 10 +-
>  arch/arm64/mm/extable.c |  8 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..1bf676e9201d 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>   .endm
>  
>   .macro strb1 reg, ptr, val
> - strb \reg, [\ptr], \val
> + USER(9998f, strb \reg, [\ptr], \val)
>   .endm

This is a store to *kernel* memory, not user memory. It should not be marked
with USER().

I understand that you *might* want to handle memory errors on these stores, but
the commit message doesn't describe that and the associated trade-off. For
example, consider that when a copy_from_user fails we'll try to zero the
remaining buffer via memset(); so if a STR* instruction in copy_from_user
faulted, upon handling the fault we'll immediately try to fix that up with some
more stores which will also fault, but won't get fixed up, leading to a panic()
anyway...

Further, this change will also silently fixup unexpected kernel faults if we
pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.

So NAK to this change as-is; likewise for the addition of USER() to other ldr*
macros in copy_from_user.S and the addition of USER() to str* macros in
copy_to_user.S.

If we want to handle memory errors on some kaccesses, we need a new EX_TYPE_*
separate from the usual EX_TYPE_KACCESS_ERR_ZERO that means "handle memory
errors, but treat other faults as fatal". That should come with a rationale and
explanation of why it's actually useful.
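
As a sketch of the shape I mean (the type name and value below are made up,
and the handler simply reuses the existing err/zero register encoding rather
than defining anything new):

/* Hypothetical: kernel access where *only* memory errors may be fixed up. */
#define EX_TYPE_KACCESS_MEM_ERR		5

/* In the memory-error fixup path (e.g. arch/arm64/mm/extable.c): */
bool fixup_exception_memory_error(struct pt_regs *regs)
{
	const struct exception_table_entry *ex;

	ex = search_exception_tables(instruction_pointer(regs));
	if (!ex)
		return false;

	switch (ex->type) {
	case EX_TYPE_UACCESS_ERR_ZERO:
	case EX_TYPE_KACCESS_MEM_ERR:
		/* Both encode the err/zero registers the same way. */
		return ex_handler_uaccess_err_zero(ex, regs);
	}

	/* Any other kernel fault remains fatal. */
	return false;
}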

[...]

> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 478e639f8680..28ec35e3d210 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
>   if (!ex)
>   return false;
>  
> - /*
> -  * This is not complete, More Machine check safe extable type can
> -  * be processed here.
> -  */
> + switch (ex->type) {
> + case EX_TYPE_UACCESS_ERR_ZERO:
> + return ex_handler_uaccess_err_zero(ex, regs);
> + }

Please fold this part into the prior patch, and start off with *only* handling
errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think that
change would be relatively uncontroversial, and it would be much easier to
build atop that.

Thanks,
Mark.


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-29 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:
> For the arm64 kernel, when it processes hardware memory errors for
> synchronous notifications (do_sea()), if the error is consumed within the
> kernel, the current handling is to panic. However, that is not optimal.
> 
> Take uaccess for example: if the uaccess operation fails due to a memory
> error, only the user process will be affected. Killing the user process and
> isolating the corrupt page is a better choice.
> 
> This patch only enables the machine check error safe framework and adds an
> exception fixup before the kernel panics in do_sea().
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/extable.h |  1 +
>  arch/arm64/mm/extable.c  | 16 
>  arch/arm64/mm/fault.c| 29 -
>  4 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index aa7c1d435139..2cc34b5e7abb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -20,6 +20,7 @@ config ARM64
>   select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>   select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLE
> diff --git a/arch/arm64/include/asm/extable.h 
> b/arch/arm64/include/asm/extable.h
> index 72b0e71cc3de..f80ebd0addfd 100644
> --- a/arch/arm64/include/asm/extable.h
> +++ b/arch/arm64/include/asm/extable.h
> @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
>  #endif /* !CONFIG_BPF_JIT */
>  
>  bool fixup_exception(struct pt_regs *regs);
> +bool fixup_exception_mc(struct pt_regs *regs);
>  #endif
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 228d681a8715..478e639f8680 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs)
>  
>   BUG();
>  }
> +
> +bool fixup_exception_mc(struct pt_regs *regs)

Can we please replace 'mc' with something like 'memory_error' ?

There's no "machine check" on arm64, and 'mc' is opaque regardless.

> +{
> + const struct exception_table_entry *ex;
> +
> + ex = search_exception_tables(instruction_pointer(regs));
> + if (!ex)
> + return false;
> +
> + /*
> +  * This is not complete, More Machine check safe extable type can
> +  * be processed here.
> +  */
> +
> + return false;
> +}

As with my comment on the subsequent patch, I'd much prefer that we handle
EX_TYPE_UACCESS_ERR_ZERO from the outset.



> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..312932dc100b 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>   return 1; /* "fault" */
>  }
>  
> +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> +  struct pt_regs *regs, int sig, int code)
> +{
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> + return false;
> +
> + if (user_mode(regs))
> + return false;

This function is called "arm64_do_kernel_sea"; surely the caller should *never*
call this for a SEA taken from user mode?

> +
> + if (apei_claim_sea(regs) < 0)
> + return false;
> +
> + if (!fixup_exception_mc(regs))
> + return false;
> +
> + if (current->flags & PF_KTHREAD)
> + return true;

I think this needs a comment; why do we allow kthreads to go on, yet kill user
threads? What about helper threads (e.g. for io_uring)?

> +
> + set_thread_esr(0, esr);

Why do we set the ESR to 0?

Mark.

> + arm64_force_sig_fault(sig, code, addr,
> + "Uncorrected memory error on access to user memory\n");
> +
> + return true;
> +}
> +
>  static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
>  {
>   const struct fault_info *inf;
> @@ -755,7 +780,9 @@ static int do_sea(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>*/
>   siaddr  = untagged_addr(far);
>   }
> - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
> +
> + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code))
> + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, 
> esr);
>  
>   return 0;
>  }
> -- 
> 2.25.1
> 


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" 
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On the
> other hand, this is probably the right time to at least start thinking about
> synchronization, at least to the extent that it might make us want to change
> this API.  (I'm not at all saying that this series should require changes --
> I'm just saying that this is a good time to think about how this should
> work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually intelligently
> and efficiently maps new text into an address space: mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC and
> jump to it with minimal synchronization (just the standard implicit ordering
> in the kernel that populates the pages before setting up the PTEs and
> whatever user synchronization is needed to avoid jumping into the mapping
> before mmap() finishes).  It works across CPUs, and the only possible way
> userspace can screw it up (for a read-only mapping of read-only text, anyway)
> is to jump to the mapping too early, in which case userspace gets a page
> fault.  Incoherence is impossible, and no one needs to "serialize" (in the
> SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on the
> kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to
> map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V
> has very developer-unfriendly icache management, but I don't remember the
> details.)

That's my understanding too, with a couple of details:

1) After the copy we perform and complete all the data + instruction cache
   maintenance *before* marking the mapping as executable.

2) Even *after* the mapping is marked executable, a thread could take a
   spurious fault on an instruction fetch for the new instructions. One way to
   think about this is that the CPU attempted to speculate the instructions
   earlier, saw that the mapping was faulting, and placed a "generate a fault
   here" operation into its pipeline to generate that later.

   The CPU pipeline/OoO-engine/whatever is effectively a transient cache for
   operations in-flight which is only ever "invalidated" by a
   context-synchronization-event (akin to an x86 serializing effect).

   We're only guaranteed to have a new instruction fetch (from the I-cache into
   the CPU pipeline) after the next context synchronization event (akin to an
   x86 serializing effect), and luckily our exception entry/exit is
   architecturally guaranteed to provide that (unless we explicitly opt out via
   a control bit).

I know we're a bit lax with that 
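
To make the ordering in (1) concrete, a simplified sketch (generic helper
names, no error handling; it deliberately glosses over exactly which
maintenance primitive is used, and assumes 'buf' starts out writable and
non-executable):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/string.h>
#include <asm/cacheflush.h>

/*
 * Sketch only: copy the new instructions, finish D-cache/I-cache maintenance,
 * and only then make the mapping executable. Each CPU still needs a context
 * synchronization event before it is guaranteed to fetch the new code.
 */
static void publish_text(void *buf, const void *insns, size_t size)
{
	memcpy(buf, insns, size);

	/* Clean D-cache / invalidate I-cache for the range before going live. */
	flush_icache_range((unsigned long)buf, (unsigned long)buf + size);

	/* Now flip the permissions; nothing could fetch stale code earlier. */
	set_memory_x((unsigned long)buf, PAGE_ALIGN(size) >> PAGE_SHIFT);
}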

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > 
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)" 
> > >> >
> > >> > module_alloc() is used everywhere as a mean to allocate memory for 
> > >> > code.
> > >> >
> > >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation to the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() 
> > >> > APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intention semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must 
> > >> > reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased 
> > >> > to
> > >> >   execmem_text_alloc().
> > >> >
> > >> 
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when the actually populate the allocated memory with
> > >> code?  See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> > 
> > Hmm.
> > 
> > On the one hand, there's nothing wrong with factoring out common code. On
> > the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API.  (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> > 
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one think in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> > 
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes).  It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault.  Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> > 
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> > 
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded.  But not in production because it's mostly not used in
> > production.)
> > 
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable.  Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> > 
> > For eBPF- and module- like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 26, 2023 at 11:54:02AM +0200, Puranjay Mohan wrote:
> On Mon, Jun 26, 2023 at 8:13 AM Song Liu  wrote:
> >
> > On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> >  wrote:
> > >
> > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > > >>
> > > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > > >> >> > From: "Mike Rapoport (IBM)" 
> > > > > >> >> >
> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate 
> > > > > >> >> > memory for code.
> > > > > >> >> >
> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > > > >> >> > subsystems
> > > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF 
> > > > > >> >> > to modules
> > > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > > >> >> >
> > > > > >> >> > Several architectures override module_alloc() because of 
> > > > > >> >> > various
> > > > > >> >> > constraints where the executable memory can be located and 
> > > > > >> >> > this causes
> > > > > >> >> > additional obstacles for improvements of code allocation.
> > > > > >> >> >
> > > > > >> >> > Start splitting code allocation from modules by introducing
> > > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > > > >> >> > jit_free() APIs.
> > > > > >> >> >
> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are 
> > > > > >> >> > wrappers for
> > > > > >> >> > module_alloc() and execmem_free() and jit_free() are 
> > > > > >> >> > replacements of
> > > > > >> >> > module_memfree() to allow updating all call sites to use the 
> > > > > >> >> > new APIs.
> > > > > >> >> >
> > > > > >> >> > The intention semantics for new allocation APIs:
> > > > > >> >> >
> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > > > >> >> > must reside
> > > > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > > > >> >> > generated
> > > > > >> >> >   code that is restricted by relative addressing.
> > > > > >> >> >
> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > > > >> >> > generated code
> > > > > >> >> >   when there are no restrictions for the code placement. For
> > > > > >> >> >   architectures that require that any code is within certain 
> > > > > >> >> > distance
> > > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > > > >> >> > aliased to
> > > > > >> >> >   execmem_text_alloc().
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Is there anything in this series to help users do the 
> > > > > >> >> appropriate
> > > > > >> >> synchronization when the actually populate the allocated memory 
> > > > > >> >> with
> > > > > >> >> code?  See here, for example:
> > > > > >> >
> > > > > >> > This series only factors out the executable allocations from 
> > > > > >> > modules and
> > > > > >> > puts them in a central place.
> > > > > >> > Anything else would go on top after this lands.
> > > > > >>
> > > > > >> Hmm.
> > > > > >>
> > > > > >> On the one hand, there's nothing wrong with factoring out common 
> > > > > >> code. On
> > > > > >> the other hand, this is probably the right time to at least start
> > > > > >> thinking about synchronization, at least to the extent that it 
> > > > > >> might make
> > > > > >> us want to change this API.  (I'm not at all saying that this 
> > > > > >> series
> > > > > >> should require changes -- I'm just saying that this is a good time 
> > > > > >> to
> > > > > >> think about how this should work.)
> > > > > >>
> > > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > > > >> actually
> > > > > >> look like the one think in the Linux ecosystem that actually
> > > > > >> intelligently and efficiently maps new text into an address space:
> > > > > >> mmap().
> > > > > >>
> > > > > >> On x86, you can mmap() an existing file full of executable code 
> > > > > >> PROT_EXEC
> > > > > >> and jump to it with minimal synchronization (just the standard 
> > > > > >> implicit
> > > > > >> ordering in the kernel that populates the pages before setting up 
> > > > > >> the
> > > > > >> PTEs and whatever user synchronization is needed to avoid jumping 
> > > > > >> into
> > > > > >> the mapping before mmap() finishes).  It works across CPUs, and 
> > > > > >> the only
> > > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > > >> read-only text, anyway) is to jump to the mapping too early, in 
> > > > > >> which
> > > > > >> case userspace 

Re: [PATCH v5 15/18] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs

2023-06-12 Thread Mark Rutland
On Mon, Jun 12, 2023 at 06:55:37AM -0700, Doug Anderson wrote:
> Mark,
> 
> On Mon, Jun 12, 2023 at 3:33 AM Mark Rutland  wrote:
> >
> > On Fri, May 19, 2023 at 10:18:39AM -0700, Douglas Anderson wrote:
> > > On arm64, NMI support needs to be detected at runtime. Add a weak
> > > function to the perf hardlockup detector so that an architecture can
> > > implement it to detect whether NMIs are available.
> > >
> > > Signed-off-by: Douglas Anderson 
> > > ---
> > > While I won't object to this patch landing, I consider it part of the
> > > arm64 perf hardlockup effort. I would be OK with the earlier patches
> > > in the series landing and then not landing ${SUBJECT} patch nor
> > > anything else later.
> >
> > FWIW, everything prior to this looks fine to me, so I reckon it'd be worth
> > splitting the series here and getting the buddy lockup detector in first, to
> > avoid a log-jam on all the subsequent NMI bits.
> 
> I think the whole series has already landed in Andrew's tree,
> including the arm64 "perf" lockup detector bits. I saw all the
> notifications from Andrew go through over the weekend that they were
> moved from an "unstable" branch to a "stable" one and I see them at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/log/?h=mm-nonmm-stable
> 
> When I first saw Andrew land the arm64 perf lockup detector bits in
> his unstable branch several weeks ago, I sent a private message to the
> arm64 maintainers (yourself included) to make sure you were aware of
> it and that it hadn't been caught in mail filters. I got the
> impression that everything was OK. Is that not the case?

Sorry; I'm slowly catching up with a backlog of email, and I'm just behind.

Feel free to ignore this; sorry for the noise!

If we spot anything going wrong in testing we can look at fixing those up.

Thanks,
Mark.


Re: [PATCH v5 15/18] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs

2023-06-12 Thread Mark Rutland
On Fri, May 19, 2023 at 10:18:39AM -0700, Douglas Anderson wrote:
> On arm64, NMI support needs to be detected at runtime. Add a weak
> function to the perf hardlockup detector so that an architecture can
> implement it to detect whether NMIs are available.
> 
> Signed-off-by: Douglas Anderson 
> ---
> While I won't object to this patch landing, I consider it part of the
> arm64 perf hardlockup effort. I would be OK with the earlier patches
> in the series landing and then not landing ${SUBJECT} patch nor
> anything else later.

FWIW, everything prior to this looks fine to me, so I reckon it'd be worth
splitting the series here and getting the buddy lockup detector in first, to
avoid a log-jam on all the subsequent NMI bits.

Thanks,
Mark.

> I'll also note that, as an alternative to this, it would be nice if we
> could figure out how to make perf_event_create_kernel_counter() fail
> on arm64 if NMIs aren't available. Maybe we could add a "must_use_nmi"
> element to "struct perf_event_attr"?
> 
> (no changes since v4)
> 
> Changes in v4:
> - ("Add a weak function for an arch to detect ...") new for v4.
> 
>  include/linux/nmi.h|  1 +
>  kernel/watchdog_perf.c | 12 +++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 47db14e7da52..eb616fc07c85 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -210,6 +210,7 @@ static inline bool trigger_single_cpu_backtrace(int cpu)
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
>  u64 hw_nmi_get_sample_period(int watchdog_thresh);
> +bool arch_perf_nmi_is_available(void);
>  #endif
>  
>  #if defined(CONFIG_HARDLOCKUP_CHECK_TIMESTAMP) && \
> diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
> index 349fcd4d2abc..8ea00c4a24b2 100644
> --- a/kernel/watchdog_perf.c
> +++ b/kernel/watchdog_perf.c
> @@ -234,12 +234,22 @@ void __init hardlockup_detector_perf_restart(void)
>   }
>  }
>  
> +bool __weak __init arch_perf_nmi_is_available(void)
> +{
> + return true;
> +}
> +
>  /**
>   * watchdog_hardlockup_probe - Probe whether NMI event is available at all
>   */
>  int __init watchdog_hardlockup_probe(void)
>  {
> - int ret = hardlockup_detector_event_create();
> + int ret;
> +
> + if (!arch_perf_nmi_is_available())
> + return -ENODEV;
> +
> + ret = hardlockup_detector_event_create();
>  
>   if (ret) {
>   pr_info("Perf NMI watchdog permanently disabled\n");
> -- 
> 2.40.1.698.g37aff9b760-goog
> 
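
For illustration, an arm64 implementation of the weak function might end up
looking something like the sketch below; whether the arm_pmu driver exposes a
suitable query (the name used here is a guess) is part of the open question:

#include <linux/init.h>
#include <linux/nmi.h>
#include <linux/perf/arm_pmu.h>

/* Sketch: only claim hardlockup-detector support when the PMU irq is an NMI. */
bool __init arch_perf_nmi_is_available(void)
{
	/*
	 * The perf event can be created even when the PMU interrupt is a
	 * regular IRQ, but then the hard lockup detector would never fire,
	 * so report support only when the irq was requested as an NMI.
	 */
	return arm_pmu_irq_is_nmi();
}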


Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Mark Rutland
On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it 
> > > > can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA 
> > > > space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions 
> > > > that can be
> > > > provided independently, even if those happen to use common 
> > > > infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators,
> > but I do think that we want wrappers that can at least pass distinct
> > start+end parameters to a common allocator, and for arm64's modules code I'd
> > expect that we'd keep the range fallback logic out of the common allocator,
> > and just call it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation 
> > > > > must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > > that
> > > > > returns a pointer to that structure. If an architecture does not 
> > > > > implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using 
> > > > common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > > makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory, and this is exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> the modules perspective would be a better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-02 Thread Mark Rutland
On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > For a while I have wanted to give kprobes its own allocator so that it can 
> > work
> > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > the modules area.
> > 
> > Given that, I think these should have their own allocator functions that 
> > can be
> > provided independently, even if those happen to use common infrastructure.
> 
> How much memory can kprobes conceivably use? I think we also want to try
> to push back on combinatorial new allocators, if we can.

That depends on who's using it, and how (e.g. via BPF).

To be clear, I'm not necessarily asking for entirely different allocators, but
I do think that we want wrappers that can at least pass distinct start+end
parameters to a common allocator, and for arm64's modules code I'd expect that
we'd keep the range fallback logic out of the common allocator, and just call
it twice.
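
To make that concrete, a wrapper along these lines is what I have in mind. This is only a sketch: jit_text_alloc_range() stands in for a hypothetical common allocator that takes an explicit range, and the window arithmetic is simplified.

void *module_alloc(unsigned long size)
{
	void *p;

	/* Prefer the +/-128M window around the kernel text... */
	p = jit_text_alloc_range(size, module_alloc_base,
				 module_alloc_base + MODULES_VSIZE);

	/* ...and fall back to a 2G window, relying on PLTs for branches. */
	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
		p = jit_text_alloc_range(size, module_alloc_base,
					 module_alloc_base + SZ_2G);

	return p;
}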

> > > Several architectures override module_alloc() because of various
> > > constraints where the executable memory can be located and this causes
> > > additional obstacles for improvements of code allocation.
> > > 
> > > This set splits code allocation from modules by introducing
> > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > sites of module_alloc() and module_memfree() with the new APIs and
> > > implements core text and related allocation in a central place.
> > > 
> > > Instead of architecture specific overrides for module_alloc(), the
> > > architectures that require non-default behaviour for text allocation must
> > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > returns a pointer to that structure. If an architecture does not implement
> > > jit_alloc_arch_params(), the defaults compatible with the current
> > > modules::module_alloc() are used.
> > 
> > As above, I suspect that each of the callsites should probably be using 
> > common
> > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > makes
> > sense, since the parameters for each case may need to be distinct.
> 
> I don't see how that follows. The whole point of function parameters is
> that they may be different :)

What I mean is that jit_alloc_arch_params() tries to aggregate common
parameters, but they aren't actually common (e.g. the actual start+end range
for allocation).

> Can you give more detail on what parameters you need? If the only extra
> parameter is just "does this allocation need to live close to kernel
> text", that's not that big of a deal.

My thinking was that we at least need the start + end for each caller. That
might be it, tbh.

Thanks,
Mark.


Re: [PATCH 00/13] mm: jit/text allocator

2023-06-01 Thread Mark Rutland
Hi Mike,

On Thu, Jun 01, 2023 at 01:12:44PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> module_alloc() is used everywhere as a means to allocate memory for code.
> 
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.

I agree this is a problem, and one key issue here is that these can have
different requirements. For example, on arm64 we need modules to be placed
within a 128M or 2G window containing the kernel, whereas it would be safe for
the kprobes XOL area to be placed arbitrarily far from the kernel image (since
we don't allow PC-relative insns to be stepped out-of-line). Likewise arm64
doesn't have ftrace trampolines, and DIRECT_CALL trampolines can safely be
placed arbitrarily far from the kernel image.

For a while I have wanted to give kprobes its own allocator so that it can work
even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
the modules area.

Given that, I think these should have their own allocator functions that can be
provided independently, even if those happen to use common infrastructure.
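
For example, kprobes XOL slots could be served by something like the sketch below (roughly what arm64 already does for alloc_insn_page(), give or take details), with no dependency on the modules area at all:

void *alloc_insn_page(void)
{
	/* XOL slots can live anywhere in vmalloc space; no module window needed. */
	return __vmalloc_node_range(PAGE_SIZE, PAGE_SIZE, VMALLOC_START,
				    VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
				    __builtin_return_address(0));
}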

> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> This set splits code allocation from modules by introducing
> jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> sites of module_alloc() and module_memfree() with the new APIs and
> implements core text and related allocation in a central place.
> 
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> returns a pointer to that structure. If an architecture does not implement
> jit_alloc_arch_params(), the defaults compatible with the current
> modules::module_alloc() are used.

As above, I suspect that each of the callsites should probably be using common
infrastructure, but I don't think that a single jit_alloc_arch_params() makes
sense, since the parameters for each case may need to be distinct.

> The new jitalloc infrastructure allows decoupling of kprobes and ftrace
> from modules, and most importantly it enables ROX allocations for
> executable memory.
> 
> A centralized infrastructure for code allocation allows future
> optimizations for allocations of executable memory, caching large pages for
> better iTLB performance and providing sub-page allocations for users that
> only need small jit code snippets.

This sounds interesting, but I think this can be achieved without requiring a
single jit_alloc_arch_params() shared by all users?

Thanks,
Mark.

> 
> patches 1-5: split out the code allocation from modules and arch
> patch 6: add dedicated API for data allocations with constraints similar to
> code allocations
> patches 7-9: decouple dynamic ftrace and kprobes form CONFIG_MODULES
> patches 10-13: enable ROX allocations for executable memory on x86
> 
> Mike Rapoport (IBM) (11):
>   nios2: define virtual address space for modules
>   mm: introduce jit_text_alloc() and use it instead of module_alloc()
>   mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
>   mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
>   module, jitalloc: drop module_alloc
>   mm/jitalloc: introduce jit_data_alloc()
>   x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
>   arch: make jitalloc setup available regardless of CONFIG_MODULES
>   kprobes: remove dependcy on CONFIG_MODULES
>   modules, jitalloc: prepare to allocate executable memory as ROX
>   x86/jitalloc: make memory allocated for code ROX
> 
> Song Liu (2):
>   ftrace: Add swap_func to ftrace_process_locs()
>   x86/jitalloc: prepare to allocate exectuatble memory as ROX
> 
>  arch/Kconfig |   5 +-
>  arch/arm/kernel/module.c |  32 --
>  arch/arm/mm/init.c   |  35 ++
>  arch/arm64/kernel/module.c   |  47 
>  arch/arm64/mm/init.c |  42 +++
>  arch/loongarch/kernel/module.c   |   6 -
>  arch/loongarch/mm/init.c |  16 +++
>  arch/mips/kernel/module.c|   9 --
>  arch/mips/mm/init.c  |  19 
>  arch/nios2/include/asm/pgtable.h |   5 +-
>  arch/nios2/kernel/module.c   |  24 ++--
>  arch/parisc/kernel/module.c  |  11 --
>  arch/parisc/mm/init.c|  21 +++-
>  arch/powerpc/kernel/kprobes.c|   4 +-
>  arch/powerpc/kernel/module.c |  37 ---
>  arch/powerpc/mm/mem.c|  41 +++
>  arch/riscv/kernel/module.c   |  10 --
>  arch/riscv/mm/init.c |  18 +++
>  arch/s390/kernel/ftrace.c|   4 +-
>  arch/s390/kernel/kprobes.c   |   4 +-
>  

Re: [PATCH v8 00/10] arm64: Add framework to turn an IPI as NMI

2023-05-10 Thread Mark Rutland
On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote:
> Hi,

Hi Doug,

> On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  
> wrote:
> > This is an attempt to resurrect Sumit's old patch series [1] that
> > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> > also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> > could find was v7, so I called this series v8. I haven't copied all of
> > his old changelogs here, but you can find them from the link.
> >
> > Since v7, I have:
> > * Addressed the small amount of feedback that was there for v7.
> > * Rebased.
> > * Added a new patch that prevents us from spamming the logs with idle
> >   tasks.
> > * Added an extra patch to gracefully fall back to regular IPIs if
> >   pseudo-NMIs aren't there.
> >
> > Since there appear to be a few different patches series related to
> > being able to use NMIs to get stack traces of crashed systems, let me
> > try to organize them to the best of my understanding:
> >
> > a) This series. On its own, a) will (among other things) enable stack
> >traces of all running processes with the soft lockup detector if
> >you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On
> >its own, a) doesn't give a hard lockup detector.
> >
> > b) A different recently-posted series [2] that adds a hard lockup
> >detector based on perf. On its own, b) gives a stack crawl of the
> >locked up CPU but no stack crawls of other CPUs (even if they're
> >locked too). Together with a) + b) we get everything (full lockup
> >detect, full ability to get stack crawls).
> >
> > c) The old Android "buddy" hard lockup detector [3] that I'm
> >considering trying to upstream. If b) lands then I believe c) would
> >be redundant (at least for arm64). c) on its own is really only
> >useful on arm64 for platforms that can print CPU_DBGPCSR somehow
> >(see [4]). a) + c) is roughly as good as a) + b).

> It's been 3 weeks and I haven't heard a peep on this series. That
> means nobody has any objections and it's all good to land, right?
> Right? :-P

FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI
support (and fixing that requires an overhaul of our DAIF / IRQ flag
management, which I've been chipping away at for a number of releases), so I
hadn't looked at this in detail yet because the foundations are still somewhat
dodgy.

I appreciate that this has been around for a while, and it's on my queue to
look at.

Thanks,
Mark.


Re: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread Mark Rutland
On Wed, Apr 05, 2023 at 09:37:04AM -0700, Dave Hansen wrote:
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
> 
> I feel like I'm missing some context.
> 
> What are the actual end user visible effects of this series?  Is there a
> measurable decrease in perf overhead?  Why go to all this trouble for
> perf?  Who else will use local_try_cmpxchg()?

Overall, the theory is that it can generate slightly better code (e.g. by
reusing the flags on x86). In practice, that might be in the noise, but as
demonstrated in prior postings the code generation is no worse than before.

From my perspective, the more important part is that this aligns local_t with
the other atomic*_t APIs, which all have ${atomictype}_try_cmpxchg(), and for
consistency/legibility/maintainability it's nice to be able to use the same
code patterns, e.g.

	${inttype} new, old = ${atomictype}_read(ptr);
	do {
		...
		new = do_something_with(old);
	} while (!${atomictype}_try_cmpxchg(ptr, &old, new));
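
As a concrete (and purely hypothetical) instance of that shape, a capped counter update could be written as:

static inline long counter_add_capped(atomic_long_t *counter, long inc, long cap)
{
	long new, old = atomic_long_read(counter);

	do {
		if (old + inc > cap)
			return old;	/* leave the counter as-is */
		new = old + inc;
	} while (!atomic_long_try_cmpxchg(counter, &old, new));

	return new;
}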

> I'm all for improving things, and perf is an important user.  But, if
> the goal here is improving performance, it would be nice to see at least
> a stab at quantifying the performance delta.

IIUC, Steve's original request for local_try_cmpxchg() was a combination of a
theoretical performance benefit and a more general preference to use
try_cmpxchg() for consistency / better structure of the source code:

  https://lore.kernel.org/lkml/20230301131831.6c8d4...@gandalf.local.home/

I agree it'd be nice to have performance figures, but I think those would only
need to demonstrate a lack of a regression rather than a performance
improvement, and I think it's fairly clear from eyeballing the generated
instructions that a regression isn't likely.

Thanks,
Mark.


Re: [PATCH v2 2/5] locking/generic: Wire up local{,64}_try_cmpxchg

2023-04-11 Thread Mark Rutland
On Wed, Apr 05, 2023 at 04:17:07PM +0200, Uros Bizjak wrote:
> Implement generic support for local{,64}_try_cmpxchg.
> 
> Redirect to the atomic_ family of functions when the target
> does not provide its own local.h definitions.
> 
> For 64-bit targets, implement local64_try_cmpxchg and
> local64_cmpxchg using typed C wrappers that call local_
> family of functions and provide additional checking
> of their input arguments.
> 
> Cc: Arnd Bergmann 
> Signed-off-by: Uros Bizjak 

Acked-by: Mark Rutland 

Mark.

> ---
>  include/asm-generic/local.h   |  1 +
>  include/asm-generic/local64.h | 12 +++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/local.h b/include/asm-generic/local.h
> index fca7f1d84818..7f97018df66f 100644
> --- a/include/asm-generic/local.h
> +++ b/include/asm-generic/local.h
> @@ -42,6 +42,7 @@ typedef struct
>  #define local_inc_return(l) atomic_long_inc_return(&(l)->a)
>  
>  #define local_cmpxchg(l, o, n) atomic_long_cmpxchg((&(l)->a), (o), (n))
> +#define local_try_cmpxchg(l, po, n) atomic_long_try_cmpxchg((&(l)->a), (po), 
> (n))
>  #define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
>  #define local_add_unless(l, _a, u) atomic_long_add_unless((&(l)->a), (_a), 
> (u))
>  #define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
> diff --git a/include/asm-generic/local64.h b/include/asm-generic/local64.h
> index 765be0b7d883..14963a7a6253 100644
> --- a/include/asm-generic/local64.h
> +++ b/include/asm-generic/local64.h
> @@ -42,7 +42,16 @@ typedef struct {
>  #define local64_sub_return(i, l) local_sub_return((i), (&(l)->a))
>  #define local64_inc_return(l)local_inc_return(&(l)->a)
>  
> -#define local64_cmpxchg(l, o, n) local_cmpxchg((&(l)->a), (o), (n))
> +static inline s64 local64_cmpxchg(local64_t *l, s64 old, s64 new)
> +{
> + return local_cmpxchg(&l->a, old, new);
> +}
> +
> +static inline bool local64_try_cmpxchg(local64_t *l, s64 *old, s64 new)
> +{
> + return local_try_cmpxchg(&l->a, (long *)old, new);
> +}
> +
>  #define local64_xchg(l, n)   local_xchg((&(l)->a), (n))
>  #define local64_add_unless(l, _a, u) local_add_unless((&(l)->a), (_a), (u))
>  #define local64_inc_not_zero(l)  local_inc_not_zero(&(l)->a)
> @@ -81,6 +90,7 @@ typedef struct {
>  #define local64_inc_return(l)atomic64_inc_return(&(l)->a)
>  
>  #define local64_cmpxchg(l, o, n) atomic64_cmpxchg((&(l)->a), (o), (n))
> +#define local64_try_cmpxchg(l, po, n) atomic64_try_cmpxchg((&(l)->a), (po), 
> (n))
>  #define local64_xchg(l, n)   atomic64_xchg((&(l)->a), (n))
>  #define local64_add_unless(l, _a, u) atomic64_add_unless((&(l)->a), (_a), 
> (u))
>  #define local64_inc_not_zero(l)  atomic64_inc_not_zero(&(l)->a)
> -- 
> 2.39.2
> 


Re: [PATCH v2 1/5] locking/atomic: Add generic try_cmpxchg{,64}_local support

2023-04-11 Thread Mark Rutland
On Wed, Apr 05, 2023 at 04:17:06PM +0200, Uros Bizjak wrote:
> Add generic support for try_cmpxchg{,64}_local and their fallbacks.
> 
> These provide the generic try_cmpxchg_local family of functions
> from the arch_ prefixed version, also adding explicit instrumentation.
> 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Boqun Feng 
> Cc: Mark Rutland 
> Signed-off-by: Uros Bizjak 

Acked-by: Mark Rutland 

Mark.

> ---
>  include/linux/atomic/atomic-arch-fallback.h | 24 -
>  include/linux/atomic/atomic-instrumented.h  | 20 -
>  scripts/atomic/gen-atomic-fallback.sh   |  4 
>  scripts/atomic/gen-atomic-instrumented.sh   |  2 +-
>  4 files changed, 47 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/atomic/atomic-arch-fallback.h 
> b/include/linux/atomic/atomic-arch-fallback.h
> index 77bc5522e61c..36c92851cdee 100644
> --- a/include/linux/atomic/atomic-arch-fallback.h
> +++ b/include/linux/atomic/atomic-arch-fallback.h
> @@ -217,6 +217,28 @@
>  
>  #endif /* arch_try_cmpxchg64_relaxed */
>  
> +#ifndef arch_try_cmpxchg_local
> +#define arch_try_cmpxchg_local(_ptr, _oldp, _new) \
> +({ \
> + typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + ___r = arch_cmpxchg_local((_ptr), ___o, (_new)); \
> + if (unlikely(___r != ___o)) \
> + *___op = ___r; \
> + likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg_local */
> +
> +#ifndef arch_try_cmpxchg64_local
> +#define arch_try_cmpxchg64_local(_ptr, _oldp, _new) \
> +({ \
> + typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + ___r = arch_cmpxchg64_local((_ptr), ___o, (_new)); \
> + if (unlikely(___r != ___o)) \
> + *___op = ___r; \
> + likely(___r == ___o); \
> +})
> +#endif /* arch_try_cmpxchg64_local */
> +
>  #ifndef arch_atomic_read_acquire
>  static __always_inline int
>  arch_atomic_read_acquire(const atomic_t *v)
> @@ -2456,4 +2478,4 @@ arch_atomic64_dec_if_positive(atomic64_t *v)
>  #endif
>  
>  #endif /* _LINUX_ATOMIC_FALLBACK_H */
> -// b5e87bdd5ede61470c29f7a7e4de781af3770f09
> +// 1f49bd4895a4b7a5383906649027205c52ec80ab
> diff --git a/include/linux/atomic/atomic-instrumented.h 
> b/include/linux/atomic/atomic-instrumented.h
> index 7a139ec030b0..14a9212cc987 100644
> --- a/include/linux/atomic/atomic-instrumented.h
> +++ b/include/linux/atomic/atomic-instrumented.h
> @@ -2066,6 +2066,24 @@ atomic_long_dec_if_positive(atomic_long_t *v)
>   arch_sync_cmpxchg(__ai_ptr, __VA_ARGS__); \
>  })
>  
> +#define try_cmpxchg_local(ptr, oldp, ...) \
> +({ \
> + typeof(ptr) __ai_ptr = (ptr); \
> + typeof(oldp) __ai_oldp = (oldp); \
> + instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> + instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> + arch_try_cmpxchg_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
> +#define try_cmpxchg64_local(ptr, oldp, ...) \
> +({ \
> + typeof(ptr) __ai_ptr = (ptr); \
> + typeof(oldp) __ai_oldp = (oldp); \
> + instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> + instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
> + arch_try_cmpxchg64_local(__ai_ptr, __ai_oldp, __VA_ARGS__); \
> +})
> +
>  #define cmpxchg_double(ptr, ...) \
>  ({ \
>   typeof(ptr) __ai_ptr = (ptr); \
> @@ -2083,4 +2101,4 @@ atomic_long_dec_if_positive(atomic_long_t *v)
>  })
>  
>  #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */
> -// 764f741eb77a7ad565dc8d99ce2837d5542e8aee
> +// 456e206c7e4e681126c482e4edcc6f46921ac731
> diff --git a/scripts/atomic/gen-atomic-fallback.sh 
> b/scripts/atomic/gen-atomic-fallback.sh
> index 3a07695e3c89..6e853f0dad8d 100755
> --- a/scripts/atomic/gen-atomic-fallback.sh
> +++ b/scripts/atomic/gen-atomic-fallback.sh
> @@ -225,6 +225,10 @@ for cmpxchg in "cmpxchg" "cmpxchg64"; do
>   gen_try_cmpxchg_fallbacks "${cmpxchg}"
>  done
>  
> +for cmpxchg in "cmpxchg_local" "cmpxchg64_local"; do
> + gen_try_cmpxchg_fallback "${cmpxchg}" ""
> +done
> +
>  grep '^[a-z]' "$1" | while read name meta args; do
>   gen_proto "${meta}" "${name}" "atomic" "int" ${args}
>  done
> diff --git a/scripts/atomic/gen-atomic-instrumented.sh 
> b/scripts/atomic/gen-atomic-instrumented.sh
> index 77c06526a574..c8165e9431bf 100755
> --- a/scripts/atomic/gen-atomic-instrumented.sh
> +++ b/scripts/atomic/gen-atomic-instrumented.sh
> @@ -173,7 +173,7 @@ for xchg in "xchg" "cmpxchg" "cmpxchg64" "try_cmpxchg" 
> "try_cmpxchg64"; do
>   done
>  done
>  
> -for xchg in "cmpxchg_local" "cmpxchg64_local" "sync_cmpxchg"; do
> +for xchg in "cmpxchg_local" "cmpxchg64_local" "sync_cmpxchg" 
> "try_cmpxchg_local" "try_cmpxchg64_local" ; do
>   gen_xchg "${xchg}" "" ""
>   printf "\n"
>  done
> -- 
> 2.39.2
> 


Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-04-04 Thread Mark Rutland
On Tue, Apr 04, 2023 at 02:24:38PM +0200, Uros Bizjak wrote:
> On Mon, Apr 3, 2023 at 12:19 PM Mark Rutland  wrote:
> >
> > On Sun, Mar 26, 2023 at 09:28:38PM +0200, Uros Bizjak wrote:
> > > On Fri, Mar 24, 2023 at 5:33 PM Mark Rutland  wrote:
> > > >
> > > > On Fri, Mar 24, 2023 at 04:14:22PM +, Mark Rutland wrote:
> > > > > On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > > > > > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > > > > > Cast _oldp to the type of _ptr to avoid 
> > > > > > > > incompatible-pointer-types warning.
> > > > > > >
> > > > > > > Can you give an example of where we are passing an incompatible 
> > > > > > > pointer?
> > > > > >
> > > > > > An example is patch 10/10 from the series, which will fail without
> > > > > > this fix when fallback code is used. We have:
> > > > > >
> > > > > > -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > > > > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > > > > >
> > > > > > where rb->head is defined as:
> > > > > >
> > > > > > typedef struct {
> > > > > >atomic_long_t a;
> > > > > > } local_t;
> > > > > >
> > > > > > while offset is defined as 'unsigned long'.
> > > > >
> > > > > Ok, but that's because we're doing the wrong thing to start with.
> > > > >
> > > > > Since local_t is defined in terms of atomic_long_t, we should define 
> > > > > the
> > > > > generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). 
> > > > > We'll still
> > > > > have a mismatch between 'long *' and 'unsigned long *', but then we 
> > > > > can fix
> > > > > that in the callsite:
> > > > >
> > > > >   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))
> > > >
> > > > Sorry, that should be:
> > > >
> > > > while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))
> > >
> > > The fallbacks are a bit more complicated than above, and are different
> > > from atomic_try_cmpxchg.
> > >
> > > Please note in patch 2/10, the falbacks when arch_try_cmpxchg_local
> > > are not defined call arch_cmpxchg_local. Also in patch 2/10,
> > > try_cmpxchg_local is introduced, where it calls
> > > arch_try_cmpxchg_local. Targets (and generic code) simply define (e.g.
> > > :
> > >
> > > #define local_cmpxchg(l, o, n) \
> > >(cmpxchg_local(&((l)->a.counter), (o), (n)))
> > > +#define local_try_cmpxchg(l, po, n) \
> > > +   (try_cmpxchg_local(&((l)->a.counter), (po), (n)))
> > >
> > > which is part of the local_t API. Targets should either define all
> > > these #defines, or none. There are no partial fallbacks as is the case
> > > with atomic_t.
> >
> > Whether or not there are fallbacks is immaterial.
> >
> > In those cases, architectures can just as easily write C wrappers, e.g.
> >
> > long local_cmpxchg(local_t *l, long old, long new)
> > {
> > return cmpxchg_local(&l->a.counter, old, new);
> > }
> >
> > long local_try_cmpxchg(local_t *l, long *old, long new)
> > {
> > return try_cmpxchg_local(&l->a.counter, old, new);
> > }
> 
> Please find attached the complete prototype patch that implements the
> above suggestion.
> 
> The patch includes:
> - implementation of instrumented try_cmpxchg{,64}_local definitions
> - corresponding arch_try_cmpxchg{,64}_local fallback definitions
> - generic local{,64}_try_cmpxchg (and local{,64}_cmpxchg) C wrappers
> 
> - x86 specific local_try_cmpxchg (and local_cmpxchg) C wrappers
> - x86 specific arch_try_cmpxchg_local definition
> 
> - kernel/events/ring_buffer.c change to test local_try_cmpxchg
> implementation and illustrate the transition
> - arch/x86/events/core.c change to test local64_try_cmpxchg
> implementation and illustrate the transition
> 
> The definition of atomic_long_t is different for 64-bit and 32-bit
> targets (s64 vs int), so target specific C wrappers have to use

Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-04-03 Thread Mark Rutland
On Sun, Mar 26, 2023 at 09:28:38PM +0200, Uros Bizjak wrote:
> On Fri, Mar 24, 2023 at 5:33 PM Mark Rutland  wrote:
> >
> > On Fri, Mar 24, 2023 at 04:14:22PM +0000, Mark Rutland wrote:
> > > On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > > > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  
> > > > wrote:
> > > > >
> > > > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > > > Cast _oldp to the type of _ptr to avoid incompatible-pointer-types 
> > > > > > warning.
> > > > >
> > > > > Can you give an example of where we are passing an incompatible 
> > > > > pointer?
> > > >
> > > > An example is patch 10/10 from the series, which will fail without
> > > > this fix when fallback code is used. We have:
> > > >
> > > > -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > > >
> > > > where rb->head is defined as:
> > > >
> > > > typedef struct {
> > > >atomic_long_t a;
> > > > } local_t;
> > > >
> > > > while offset is defined as 'unsigned long'.
> > >
> > > Ok, but that's because we're doing the wrong thing to start with.
> > >
> > > Since local_t is defined in terms of atomic_long_t, we should define the
> > > generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). We'll 
> > > still
> > > have a mismatch between 'long *' and 'unsigned long *', but then we can 
> > > fix
> > > that in the callsite:
> > >
> > >   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))
> >
> > Sorry, that should be:
> >
> > while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))
> 
> The fallbacks are a bit more complicated than above, and are different
> from atomic_try_cmpxchg.
> 
> Please note in patch 2/10, the falbacks when arch_try_cmpxchg_local
> are not defined call arch_cmpxchg_local. Also in patch 2/10,
> try_cmpxchg_local is introduced, where it calls
> arch_try_cmpxchg_local. Targets (and generic code) simply define (e.g.
> :
> 
> #define local_cmpxchg(l, o, n) \
>(cmpxchg_local(&((l)->a.counter), (o), (n)))
> +#define local_try_cmpxchg(l, po, n) \
> +   (try_cmpxchg_local(&((l)->a.counter), (po), (n)))
> 
> which is part of the local_t API. Targets should either define all
> these #defines, or none. There are no partial fallbacks as is the case
> with atomic_t.

Whether or not there are fallbacks is immaterial.

In those cases, architectures can just as easily write C wrappers, e.g.

long local_cmpxchg(local_t *l, long old, long new)
{
return cmpxchg_local(&l->a.counter, old, new);
}

long local_try_cmpxchg(local_t *l, long *old, long new)
{
return try_cmpxchg_local(&l->a.counter, old, new);
}

> The core of the local_t API is in the local.h header. If the target
> doesn't define its own local.h header, then asm-generic/local.h is
> used that does exactly what you propose above regarding the usage of
> atomic functions.
> 
> OTOH, when the target defines its own local.h, then the above
> target-dependent #define path applies. The target should define its
> own arch_try_cmpxchg_local, otherwise a "generic" target-dependent
> fallback that calls target arch_cmpxchg_local applies. In the case of
> x86, patch 9/10 enables new instruction by defining
> arch_try_cmpxchg_local.
> 
> FYI, the patch sequence is carefully chosen so that x86 also exercises
> fallback code between different patches in the series.
> 
> Targets are free to define local_t to whatever they like, but for some
> reason they all define it to:
> 
> typedef struct {
> atomic_long_t a;
> } local_t;

Yes, which is why I used atomic_long() above.

> so they have to dig the variable out of the struct like:
> 
> #define local_cmpxchg(l, o, n) \
>  (cmpxchg_local(&((l)->a.counter), (o), (n)))
> 
> Regarding the mismatch of 'long *' vs 'unsigned long *': x86
> target-specific code does for try_cmpxchg:
> 
> #define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock) \
> ({ \
> bool success; \
> __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \
> __typeof__(*(_ptr)) __old = *_old; \
> __typeof__(*(_ptr)) __new = (_new); \
> 
> so, it *does* cast the "old" pointer to the type of "ptr". The generic
> code does *not*. This difference is dangerous, since the compilation
> of some code involving try_cmpxchg will compil

Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-03-24 Thread Mark Rutland
On Fri, Mar 24, 2023 at 04:14:22PM +, Mark Rutland wrote:
> On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> > On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  wrote:
> > >
> > > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > > Cast _oldp to the type of _ptr to avoid incompatible-pointer-types 
> > > > warning.
> > >
> > > Can you give an example of where we are passing an incompatible pointer?
> > 
> > An example is patch 10/10 from the series, which will fail without
> > this fix when fallback code is used. We have:
> > 
> > -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> > +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> > 
> > where rb->head is defined as:
> > 
> > typedef struct {
> >atomic_long_t a;
> > } local_t;
> > 
> > while offset is defined as 'unsigned long'.
> 
> Ok, but that's because we're doing the wrong thing to start with.
> 
> Since local_t is defined in terms of atomic_long_t, we should define the
> generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). We'll still
> have a mismatch between 'long *' and 'unsigned long *', but then we can fix
> that in the callsite:
> 
>   while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))

Sorry, that should be:

while (!local_try_cmpxchg(&rb->head, (long *)&offset, head))

The fundamental thing I'm trying to say is that the
atomic/atomic64/atomic_long/local/local64 APIs should be type-safe, and for
their try_cmpxchg() implementations, the type signature should be:

${atomictype}_try_cmpxchg(${atomictype} *ptr, ${inttype} *old, 
${inttype} new)
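
Spelled out per API, that would be (illustrative prototypes only; the generated headers are authoritative, and a bool return is assumed):

bool atomic_try_cmpxchg(atomic_t *ptr, int *old, int new);
bool atomic64_try_cmpxchg(atomic64_t *ptr, s64 *old, s64 new);
bool atomic_long_try_cmpxchg(atomic_long_t *ptr, long *old, long new);
bool local_try_cmpxchg(local_t *ptr, long *old, long new);
bool local64_try_cmpxchg(local64_t *ptr, s64 *old, s64 new);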

Thanks,
Mark.


Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-03-24 Thread Mark Rutland
On Fri, Mar 24, 2023 at 04:43:32PM +0100, Uros Bizjak wrote:
> On Fri, Mar 24, 2023 at 3:13 PM Mark Rutland  wrote:
> >
> > On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> > > Cast _oldp to the type of _ptr to avoid incompatible-pointer-types 
> > > warning.
> >
> > Can you give an example of where we are passing an incompatible pointer?
> 
> An example is patch 10/10 from the series, which will fail without
> this fix when fallback code is used. We have:
> 
> -   } while (local_cmpxchg(&rb->head, offset, head) != offset);
> +   } while (!local_try_cmpxchg(&rb->head, &offset, head));
> 
> where rb->head is defined as:
> 
> typedef struct {
>atomic_long_t a;
> } local_t;
> 
> while offset is defined as 'unsigned long'.

Ok, but that's because we're doing the wrong thing to start with.

Since local_t is defined in terms of atomic_long_t, we should define the
generic local_try_cmpxchg() in terms of atomic_long_try_cmpxchg(). We'll still
have a mismatch between 'long *' and 'unsigned long *', but then we can fix
that in the callsite:

while (!local_try_cmpxchg(&rb->head, &(long *)offset, head))

... which then won't silently mask issues elsewhere, and will be consistent
with all the other atomic APIs.
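
A minimal sketch of that suggested generic definition (asm-generic flavour, assuming local_t keeps its atomic_long_t member) would be:

static inline bool local_try_cmpxchg(local_t *l, long *old, long new)
{
	return atomic_long_try_cmpxchg(&l->a, old, new);
}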

Thanks,
Mark.

> 
> The assignment in existing try_cmpxchg template:
> 
> typeof(*(_ptr)) *___op = (_oldp)
> 
> will trigger an initialization from an incompatible pointer type error.
> 
> Please note that x86 avoids this issue by a cast in its
> target-dependent definition:
> 
> #define __raw_try_cmpxchg(_ptr, _pold, _new, size, lock)\
> ({  \
>bool success;   \
>__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);  \
>__typeof__(*(_ptr)) __old = *_old;  \
>__typeof__(*(_ptr)) __new = (_new); \
> 
> so, the warning/error will trigger only in the fallback code.
> 
> > That sounds indicative of a bug in the caller, but maybe I'm missing some
> > reason this is necessary due to some indirection.
> >
> > > Fixes: 29f006fdefe6 ("asm-generic/atomic: Add try_cmpxchg() fallbacks")
> >
> > I'm not sure that this needs a fixes tag. Does anything go wrong today, or 
> > only
> > later in this series?
> 
> The patch at [1] triggered a build error in posix_acl.c/__get.acl due
> to the same problem. The compilation for x86 target was OK, because
> x86 defines target-specific arch_try_cmpxchg, but the compilation
> broke for targets that revert to generic support. Please note that
> this specific problem was recently fixed in a different way [2], but
> the issue with the fallback remains.
> 
> [1] https://lore.kernel.org/lkml/20220714173819.13312-1-ubiz...@gmail.com/
> [2] https://lore.kernel.org/lkml/20221201160103.76012-1-ubiz...@gmail.com/
> 
> Uros.


Re: [PATCH 01/10] locking/atomic: Add missing cast to try_cmpxchg() fallbacks

2023-03-24 Thread Mark Rutland
On Sun, Mar 05, 2023 at 09:56:19PM +0100, Uros Bizjak wrote:
> Cast _oldp to the type of _ptr to avoid incompatible-pointer-types warning.

Can you give an example of where we are passing an incompatible pointer?

That sounds indicative of a bug in the caller, but maybe I'm missing some
reason this is necessary due to some indirection.

> Fixes: 29f006fdefe6 ("asm-generic/atomic: Add try_cmpxchg() fallbacks")

I'm not sure that this needs a fixes tag. Does anything go wrong today, or only
later in this series?

Thanks,
Mark.

> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Boqun Feng 
> Cc: Mark Rutland 
> Signed-off-by: Uros Bizjak 
> ---
>  include/linux/atomic/atomic-arch-fallback.h | 18 +-
>  scripts/atomic/gen-atomic-fallback.sh   |  2 +-
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/atomic/atomic-arch-fallback.h 
> b/include/linux/atomic/atomic-arch-fallback.h
> index 77bc5522e61c..19debd501ee7 100644
> --- a/include/linux/atomic/atomic-arch-fallback.h
> +++ b/include/linux/atomic/atomic-arch-fallback.h
> @@ -87,7 +87,7 @@
>  #ifndef arch_try_cmpxchg
>  #define arch_try_cmpxchg(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -98,7 +98,7 @@
>  #ifndef arch_try_cmpxchg_acquire
>  #define arch_try_cmpxchg_acquire(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg_acquire((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -109,7 +109,7 @@
>  #ifndef arch_try_cmpxchg_release
>  #define arch_try_cmpxchg_release(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg_release((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -120,7 +120,7 @@
>  #ifndef arch_try_cmpxchg_relaxed
>  #define arch_try_cmpxchg_relaxed(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg_relaxed((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -157,7 +157,7 @@
>  #ifndef arch_try_cmpxchg64
>  #define arch_try_cmpxchg64(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg64((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -168,7 +168,7 @@
>  #ifndef arch_try_cmpxchg64_acquire
>  #define arch_try_cmpxchg64_acquire(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg64_acquire((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -179,7 +179,7 @@
>  #ifndef arch_try_cmpxchg64_release
>  #define arch_try_cmpxchg64_release(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg64_release((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -190,7 +190,7 @@
>  #ifndef arch_try_cmpxchg64_relaxed
>  #define arch_try_cmpxchg64_relaxed(_ptr, _oldp, _new) \
>  ({ \
> - typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
> + typeof(*(_ptr)) *___op = (typeof(_ptr))(_oldp), ___o = *___op, ___r; \
>   ___r = arch_cmpxchg64_relaxed((_ptr), ___o, (_new)); \
>   if (unlikely(___r != ___o)) \
>   *___op = ___r; \
> @@ -2456,4 +2456,4 @@ arch_atomic64_dec_if_positive(atomic64_t *v)
>  #endif
>  
>  #endif /* _LINUX_ATOMIC_FALLBACK_H */
> -// b5e87bdd5ede61470c29f7a7e4de781af3770f09
> +// 1b4d4c82ae653389cd1538d5b07170267d9b3837
> diff --git a/scripts/atomic/gen-atomic-fallback.sh 
> b/scripts/atomic/gen-atomic-fallback.sh
> index 3a07695e3c89..39f447161108 100755
> --- a/scripts/atomic/gen-atomic-fallback.sh
> +++ b/scripts/atomic/gen-atomic-fal

Re: [PATCH v2 04/24] arm64/cpu: Mark cpu_die() __noreturn

2023-02-15 Thread Mark Rutland
On Tue, Feb 14, 2023 at 09:13:08AM +0100, Philippe Mathieu-Daudé wrote:
> On 14/2/23 08:05, Josh Poimboeuf wrote:
> > cpu_die() doesn't return.  Annotate it as such.  By extension this also
> > makes arch_cpu_idle_dead() noreturn.
> > 
> > Signed-off-by: Josh Poimboeuf 
> > ---
> >   arch/arm64/include/asm/smp.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
> > index fc55f5a57a06..5733a31bab08 100644
> > --- a/arch/arm64/include/asm/smp.h
> > +++ b/arch/arm64/include/asm/smp.h
> > @@ -100,7 +100,7 @@ static inline void arch_send_wakeup_ipi_mask(const 
> > struct cpumask *mask)
> >   extern int __cpu_disable(void);
> >   extern void __cpu_die(unsigned int cpu);
> > -extern void cpu_die(void);
> > +extern void __noreturn cpu_die(void);
> >   extern void cpu_die_early(void);
> 
> Shouldn't cpu_operations::cpu_die() be declared noreturn first?

The cpu_die() function ends with a BUG(), and so does not return, even if a
cpu_operations::cpu_die() function that it calls erroneously returned.

We *could* mark cpu_operations::cpu_die() as noreturn, but I'd prefer that we
did not so that the compiler doesn't optimize away the BUG() which is there to
catch such erroneous returns.
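
In other words, the shape being relied upon is roughly the following (simplified sketch, not the exact arm64 code):

void __noreturn cpu_die(void)
{
	unsigned int cpu = smp_processor_id();
	const struct cpu_operations *ops = get_cpu_ops(cpu);

	/* ops->cpu_die() is deliberately *not* __noreturn... */
	ops->cpu_die(cpu);

	/* ...so an erroneous return is caught here rather than optimized away. */
	BUG();
}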

That said, could we please add __noreturn to the implementation of cpu_die() in
arch/arm64/kernel/smp.c? i.e. the fixup below.

With that fixup:

Acked-by: Mark Rutland 

Mark.

>8
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index ffc5d76cf695..a98a76f7c1c6 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -361,7 +361,7 @@ void __cpu_die(unsigned int cpu)
  * Called from the idle thread for the CPU which has been shutdown.
  *
  */
-void cpu_die(void)
+void __noreturn cpu_die(void)
 {
unsigned int cpu = smp_processor_id();
const struct cpu_operations *ops = get_cpu_ops(cpu);


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 02:21:40PM +, Sudeep Holla wrote:
> On Tue, Jan 17, 2023 at 01:16:21PM +0000, Mark Rutland wrote:
> > On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> > > 
> > > > I'm sorry to have to bear some bad news on that front. :(
> > > 
> > > Moo, something had to give..
> > > 
> > > 
> > > > IIUC what's happening here is the PSCI cpuidle driver has entered idle 
> > > > and RCU
> > > > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > > > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > > > trace_hardirqs_off() and the RCU usage.
> > > 
> > > Right, strictly speaking not needed at this point, IRQs should have been
> > > traced off a long time ago.
> > 
> > True, but there are some other calls around here that *might* end up 
> > invoking
> > RCU stuff (e.g. the MTE code).
> > 
> > That all needs a noinstr cleanup too, which I'll sort out as a follow-up.
> > 
> > > > I think we need RCU to be watching all the way down to cpu_suspend(), 
> > > > and it's
> > > > cpu_suspend() that should actually enter/exit idle context. That and we 
> > > > need to
> > > > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > > > 
> > > > I'm not sure whether 32-bit will have a similar issue or not.
> > > 
> > > I'm not seeing 32bit or Risc-V have similar issues here, but who knows,
> > > maybe I missed something.
> > 
> > I reckon if they do, the core changes here give us the infrastructure to fix
> > them if/when we get reports.
> > 
> > > In any case, the below ought to cure the ARM64 case and remove that last
> > > known RCU_NONIDLE() user as a bonus.
> > 
> > The below works for me testing on a Juno R1 board with PSCI, using 
> > defconfig +
> > CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + 
> > CONFIG_DEBUG_ATOMIC_SLEEP=y.
> > I'm not sure how to test the LPI / FFH part, but it looks good to me.
> > 
> > FWIW:
> > 
> > Reviewed-by: Mark Rutland 
> > Tested-by: Mark Rutland 
> > 
> > Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
> > options above?
> > 
> 
> Not sure if I have messed up something in my mail setup, but I did reply
> earlier.

Sorry, that was my bad; I had been drafting my reply for a while and forgot to
re-check prior to sending.

> I did test both the DT/cpuidle-psci driver and the ACPI/LPI+FFH driver
> with the fix Peter sent. I was seeing the same splat as you in both DT and
> ACPI boot, which the patch fixed. I used the same config as described by
> you above.

Perfect; thanks!

Mark.


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 16, 2023 at 04:59:04PM +0000, Mark Rutland wrote:
> 
> > I'm sorry to have to bear some bad news on that front. :(
> 
> Moo, something had to give..
> 
> 
> > IIUC what's happening here is the PSCI cpuidle driver has entered idle and 
> > RCU
> > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > trace_hardirqs_off() and the RCU usage.
> 
> Right, strictly speaking not needed at this point, IRQs should have been
> traced off a long time ago.

True, but there are some other calls around here that *might* end up invoking
RCU stuff (e.g. the MTE code).

That all needs a noinstr cleanup too, which I'll sort out as a follow-up.

> > I think we need RCU to be watching all the way down to cpu_suspend(), and 
> > it's
> > cpu_suspend() that should actually enter/exit idle context. That and we 
> > need to
> > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > 
> > I'm not sure whether 32-bit will have a similar issue or not.
> 
> I'm not seeing 32bit or Risc-V have similar issues here, but who knows,
> maybe I missed something.

I reckon if they do, the core changes here give us the infrastructure to fix
them if/when we get reports.

> In any case, the below ought to cure the ARM64 case and remove that last
> known RCU_NONIDLE() user as a bonus.

The below works for me testing on a Juno R1 board with PSCI, using defconfig +
CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + CONFIG_DEBUG_ATOMIC_SLEEP=y.
I'm not sure how to test the LPI / FFH part, but it looks good to me.

FWIW:

Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland 

Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
options above?

Thanks,
Mark.

> 
> ---
> diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c
> index 41974a1a229a..42e19fff40ee 100644
> --- a/arch/arm64/kernel/cpuidle.c
> +++ b/arch/arm64/kernel/cpuidle.c
> @@ -67,10 +67,10 @@ __cpuidle int acpi_processor_ffh_lpi_enter(struct 
> acpi_lpi_state *lpi)
>   u32 state = lpi->address;
>  
>   if (ARM64_LPI_IS_RETENTION_STATE(lpi->arch_flags))
> - return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(psci_cpu_suspend_enter,
> + return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(psci_cpu_suspend_enter,
>   lpi->index, state);
>   else
> - return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter,
> + return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter,
>lpi->index, state);
>  }
>  #endif
> diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
> index e7163f31f716..0fbdf5fe64d8 100644
> --- a/arch/arm64/kernel/suspend.c
> +++ b/arch/arm64/kernel/suspend.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -104,6 +105,10 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>* From this point debug exceptions are disabled to prevent
>* updates to mdscr register (saved and restored along with
>* general purpose registers) from kernel debuggers.
> +  *
> +  * Strictly speaking the trace_hardirqs_off() here is superfluous,
> +  * hardirqs should be firmly off by now. This really ought to use
> +  * something like raw_local_daif_save().
>*/
>   flags = local_daif_save();
>  
> @@ -120,6 +125,8 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   arm_cpuidle_save_irq_context();
>  
> + ct_cpuidle_enter();
> +
>   if (__cpu_suspend_enter()) {
>   /* Call the suspend finisher */
>   ret = fn(arg);
> @@ -133,8 +140,11 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   if (!ret)
>   ret = -EOPNOTSUPP;
> +
> + ct_cpuidle_exit();
>   } else {
> - RCU_NONIDLE(__cpu_suspend_exit());
> + ct_cpuidle_exit();
> + __cpu_suspend_exit();
>   }
>  
>   arm_cpuidle_restore_irq_context();
> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
> index 4fc4e0381944..312a34ef28dc 100644
> --- a/drivers/cpuidle/cpuidle-psci.c
> +++ b/drivers/cpuidle/cpuidle-psci.c
> @@ -69,16 +69,12 @@ static __cpuidle int 
> __psci_enter_domain_idle_state(struct cpuidle_device *dev,
>   else
>   pm_runtime_put_sync_

Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-16 Thread Mark Rutland
On Thu, Jan 12, 2023 at 08:43:14PM +0100, Peter Zijlstra wrote:
> Hi All!

Hi Peter,

> The (hopefully) final respin of cpuidle vs rcu cleanup patches. Barring any
> objections I'll be queueing these patches in tip/sched/core in the next few
> days.

I'm sorry to have to bear some bad news on that front. :(

I just had a go at testing this on a Juno dev board, using your queue.git
sched/idle branch and defconfig + CONFIG_PROVE_LOCKING=y +
CONFIG_DEBUG_LOCKDEP=y + CONFIG_DEBUG_ATOMIC_SLEEP=y.

With that I consistently see RCU splats at boot time (log below).

| =
| WARNING: suspicious RCU usage
| 6.2.0-rc3-00051-gced9b6eecb31 #1 Not tainted
| -
| include/trace/events/ipi.h:19 suspicious rcu_dereference_check() usage!
| 
| other info that might help us debug this:
| 
| 
| rcu_scheduler_active = 2, debug_locks = 1
| RCU used illegally from extended quiescent state!
| no locks held by swapper/0/0.
| 
| stack backtrace:
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-rc3-00051-gced9b6eecb31 #1
| Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development 
Platform, BIOS EDK II May 16 2021
| Call trace:
|  dump_backtrace.part.0+0xe4/0xf0
|  show_stack+0x18/0x30
|  dump_stack_lvl+0x98/0xd4
|  dump_stack+0x18/0x34
|  lockdep_rcu_suspicious+0xf8/0x10c
|  trace_ipi_raise+0x1a8/0x1b0
|  arch_irq_work_raise+0x4c/0x70
|  __irq_work_queue_local+0x48/0x80
|  irq_work_queue+0x50/0x80
|  __wake_up_klogd.part.0+0x98/0xe0
|  defer_console_output+0x20/0x30
|  vprintk+0x98/0xf0
|  _printk+0x5c/0x84
|  lockdep_rcu_suspicious+0x34/0x10c
|  trace_lock_acquire+0x174/0x180
|  lock_acquire+0x3c/0x8c
|  _raw_spin_lock_irqsave+0x70/0x150
|  down_trylock+0x18/0x50
|  __down_trylock_console_sem+0x3c/0xd0
|  console_trylock+0x28/0x90
|  vprintk_emit+0x11c/0x354
|  vprintk_default+0x38/0x4c
|  vprintk+0xd4/0xf0
|  _printk+0x5c/0x84
|  lockdep_rcu_suspicious+0x34/0x10c
|  printk_sprint+0x238/0x240
|  vprintk_store+0x32c/0x4b0
|  vprintk_emit+0x104/0x354
|  vprintk_default+0x38/0x4c
|  vprintk+0xd4/0xf0
|  _printk+0x5c/0x84
|  lockdep_rcu_suspicious+0x34/0x10c
|  trace_irq_disable+0x1ac/0x1b0
|  trace_hardirqs_off+0xe8/0x110
|  cpu_suspend+0x4c/0xfc
|  psci_cpu_suspend_enter+0x58/0x6c
|  psci_enter_idle_state+0x70/0x170
|  cpuidle_enter_state+0xc4/0x464
|  cpuidle_enter+0x38/0x50
|  do_idle+0x230/0x2c0
|  cpu_startup_entry+0x24/0x30
|  rest_init+0x110/0x190
|  arch_post_acpi_subsys_init+0x0/0x18
|  start_kernel+0x6f8/0x738
|  __primary_switched+0xbc/0xc4

IIUC what's happening here is the PSCI cpuidle driver has entered idle and RCU
is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
local_daif_*() helpers poke lockdep and tracing, hence the call to
trace_hardirqs_off() and the RCU usage.

I think we need RCU to be watching all the way down to cpu_suspend(), and it's
cpu_suspend() that should actually enter/exit idle context. That and we need to
make cpu_suspend() and the low-level PSCI invocation noinstr.

I'm not sure whether 32-bit will have a similar issue or not.

I'm surprised no-one else who has tested has seen this; I suspect people
haven't enabled lockdep and friends. :/

Thanks,
Mark. 


Re: [PATCH v2 33/44] ftrace: WARN on rcuidle

2022-10-04 Thread Mark Rutland
On Mon, Sep 19, 2022 at 12:00:12PM +0200, Peter Zijlstra wrote:
> CONFIG_GENERIC_ENTRY disallows any and all tracing when RCU isn't
> enabled.
> 
> XXX if s390 (the only other GENERIC_ENTRY user as of this writing)
> isn't comfortable with this, we could switch to
> HAVE_NOINSTR_VALIDATION which is x86_64 only atm.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/tracepoint.h |   13 -
>  kernel/trace/trace.c   |3 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -178,6 +178,16 @@ static inline struct tracepoint *tracepo
>  #endif /* CONFIG_HAVE_STATIC_CALL */
>  
>  /*
> + * CONFIG_GENERIC_ENTRY archs are expected to have sanitized entry and idle
> + * code that disallow any/all tracing/instrumentation when RCU isn't 
> watching.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +#define RCUIDLE_COND(rcuidle)(rcuidle)
> +#else
> +#define RCUIDLE_COND(rcuidle)(rcuidle && in_nmi())
> +#endif

Could we make this depend on ARCH_WANTS_NO_INSTR instead?

That'll allow arm64 to check this even though we're not using the generic entry
code (and there's lots of work necessary to make that possible...).
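
A sketch of what that alternative could look like (assuming the existing CONFIG_ARCH_WANTS_NO_INSTR symbol is the right gate here):

/*
 * Archs with GENERIC_ENTRY or ARCH_WANTS_NO_INSTR are expected to have
 * sanitized entry/idle code, so any rcuidle tracepoint use is a bug there.
 */
#if defined(CONFIG_GENERIC_ENTRY) || defined(CONFIG_ARCH_WANTS_NO_INSTR)
#define RCUIDLE_COND(rcuidle)	(rcuidle)
#else
#define RCUIDLE_COND(rcuidle)	(rcuidle && in_nmi())
#endif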

Thanks,
Mark.

> +
> +/*
>   * it_func[0] is never NULL because there is at least one element in the 
> array
>   * when the array itself is non NULL.
>   */
> @@ -189,7 +199,8 @@ static inline struct tracepoint *tracepo
>   return; \
>   \
>   /* srcu can't be used from NMI */   \
> - WARN_ON_ONCE(rcuidle && in_nmi());  \
> + if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
> + return; \
>   \
>   /* keep srcu and sched-rcu usage consistent */  \
>   preempt_disable_notrace();  \
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3104,6 +3104,9 @@ void __trace_stack(struct trace_array *t
>   return;
>   }
>  
> + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY)))
> + return;
> +
>   /*
>* When an NMI triggers, RCU is enabled via ct_nmi_enter(),
>* but if the above rcu_is_watching() failed, then the NMI
> 
> 


Re: [PATCH -next v5 2/8] arm64: extable: make uaaccess helper use extable type EX_TYPE_UACCESS_ERR_ZERO

2022-06-20 Thread Mark Rutland
On Mon, Jun 20, 2022 at 10:13:41PM +0800, Tong Tiangen wrote:
> 
> 
> > On 2022/6/20 17:10, Mark Rutland wrote:
> > On Mon, Jun 20, 2022 at 10:59:12AM +0800, Tong Tiangen wrote:
> > > On 2022/6/18 20:40, Mark Rutland wrote:
> > > The following errors are reported during compilation:
> > > [...]
> > > arch/arm64/lib/clear_user.S:45: Error: invalid operands (*ABS* and *UND*
> > > sections) for `<<'
> > > [...]
> > 
> > As above, I'm not seeing this.
> > 
> > This suggests that the EX_DATA_REG() macro is going wrong somehow. Assuming 
> > the
> > operand types correspond to the LHS and RHS of the expression, this would 
> > mean
> > the GPR number is defined, but the REG value is not, and I can't currently 
> > see
> > how that can happen.
 
> Now I can compile successfully with both versions 9.4.0 and 11.2.0.
> 
> I must have made a mistake. There is no problem using your implementation.
> I will send a new version these days.

No problem; thanks for confirming!

Mark.


Re: [PATCH -next v5 2/8] arm64: extable: make uaaccess helper use extable type EX_TYPE_UACCESS_ERR_ZERO

2022-06-20 Thread Mark Rutland
On Mon, Jun 20, 2022 at 10:59:12AM +0800, Tong Tiangen wrote:
> On 2022/6/18 20:40, Mark Rutland wrote:
> > On Sat, Jun 18, 2022 at 04:42:06PM +0800, Tong Tiangen wrote:
> > > > > > diff --git a/arch/arm64/include/asm/asm-extable.h
> > > > > > b/arch/arm64/include/asm/asm-extable.h
> > > > > > index 56ebe183e78b..9c94ac1f082c 100644
> > > > > > --- a/arch/arm64/include/asm/asm-extable.h
> > > > > > +++ b/arch/arm64/include/asm/asm-extable.h
> > > > > > @@ -28,6 +28,14 @@
> > > > > >    __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_FIXUP, 0)
> > > > > >    .endm
> > > > > > +/*
> > > > > > + * Create an exception table entry for uaccess `insn`, which
> > > > > > will branch to `fixup`
> > > > > > + * when an unhandled fault is taken.
> > > > > > + * ex->data = ~0 means both reg_err and reg_zero are set to 
> > > > > > wzr(x31).
> > > > > > + */
> > > > > > +    .macro  _asm_extable_uaccess, insn, fixup
> > > > > > +    __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_ERR_ZERO, ~0)
> > > > > > +    .endm
> > > > > 
> > > > > I'm not too keen on using `~0` here, since that also sets other bits
> > > > > in the
> > > > > data field, and it's somewhat opaque.
> > > > > 
> > > > > How painful is it to generate the data fields as with the C version
> > > > > of this
> > > > > macro, so that we can pass in wzr explicitly for the two sub-fields?
> > > > > 
> > > > > Other than that, this looks good to me.
> > > > > 
> > > > > Thanks,
> > > > > Mark.
> > > > 
> > > > ok, will fix next version.
> > > > 
> > > > Thanks,
> > > > Tong.
> > > 
> > > I tried to using data filelds as with C version, but here assembly code we
> > > can not using operator such as << and |, if we use lsl and orr 
> > > instructions,
> > > the gpr will be occupied.
> > > 
> > > So how about using 0x3ff directly here? it means err register and zero
> > > register both set to x31.
> > 
> > I had a go at implementing this, and it seems simple enough. Please see:
> > 
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/extable/asm-uaccess
> > 
> 
> I made the following modifications, and the other parts are based on your
> implementation:
> 
> arch/arm64/include/asm/asm-extable.h
> [...]
> .macro  _asm_extable_uaccess, insn, fixup
> _ASM_EXTABLE_UACCESS(\insn, \fixup)
> .endm
> [...]

I also made this same change locally when testing, and when building with GCC
11.1.0 or LLVM 14.0.0 I see no problems; the result is as expected:

| [mark@lakrids:~/src/linux]% usekorg 11.1.0 make ARCH=arm64 
CROSS_COMPILE=aarch64-linux- defconfig
| *** Default configuration is based on 'defconfig'
| #
| # No change to .config
| #
| [mark@lakrids:~/src/linux]% usekorg 11.1.0 make ARCH=arm64 
CROSS_COMPILE=aarch64-linux- -j50 arch/arm64/lib/
|   CALLscripts/atomic/check-atomics.sh
|   CC  arch/arm64/kernel/asm-offsets.s
|   CALLscripts/checksyscalls.sh
|   AS  arch/arm64/kernel/vdso/note.o
|   AS  arch/arm64/kernel/vdso/sigreturn.o
|   LD  arch/arm64/kernel/vdso/vdso.so.dbg
|   VDSOSYM include/generated/vdso-offsets.h
|   OBJCOPY arch/arm64/kernel/vdso/vdso.so
| make[2]: Nothing to be done for 'arch/arm64/lib/'.
|   AS  arch/arm64/lib/clear_page.o
|   AS  arch/arm64/lib/clear_user.o
|   AS  arch/arm64/lib/copy_from_user.o
|   AS  arch/arm64/lib/copy_page.o
|   AS  arch/arm64/lib/copy_to_user.o
|   CC  arch/arm64/lib/csum.o
|   CC  arch/arm64/lib/delay.o
|   AS  arch/arm64/lib/memchr.o
|   AS  arch/arm64/lib/memcmp.o
|   AS  arch/arm64/lib/memcpy.o
|   AS  arch/arm64/lib/memset.o
|   AS  arch/arm64/lib/strchr.o
|   AS  arch/arm64/lib/strcmp.o
|   AS  arch/arm64/lib/strlen.o
|   AS  arch/arm64/lib/strncmp.o
|   AS  arch/arm64/lib/strnlen.o
|   AS  arch/arm64/lib/strrchr.o
|   AS  arch/arm64/lib/tishift.o
|   AS  arch/arm64/lib/crc32.o
|   AS  arch/arm64/lib/mte.o
|   CC [M]  arch/arm64/lib/xor-neon.o
|   AR  arch/arm64/lib/built-in.a
|   AR  arch/arm64/lib/lib.a
| [mark@lakrids:~/src/linux]% usekorg 12.1.0 aarch64-linux-objdump -j 
__ex_table -D arch/arm64/lib/clear_user.o
| 
| arch/arm64/lib/clear_user.o: file format elf64-littleaarch64
| 
| 
| Disassembly of 

Re: [PATCH -next v5 6/8] arm64: add support for machine check error safe

2022-06-18 Thread Mark Rutland
On Sat, Jun 18, 2022 at 05:18:55PM +0800, Tong Tiangen wrote:
> 在 2022/6/17 16:55, Mark Rutland 写道:
> > On Sat, May 28, 2022 at 06:50:54AM +, Tong Tiangen wrote:
> > > +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> > > +  struct pt_regs *regs, int sig, int code)
> > > +{
> > > + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> > > + return false;
> > > +
> > > + if (user_mode(regs) || !current->mm)
> > > + return false;
> > 
> > What's the `!current->mm` check for?
> 
> At first, I considered that only user processes have the opportunity to
> recover when they trigger memory error.
> 
> But it seems that this restriction is unreasonable. When the kernel thread
> triggers memory error, it can also be recovered. for instance:
> 
> https://lore.kernel.org/linux-mm/20220527190731.322722-1-jiaqi...@google.com/
> 
> And i think if(!current->mm) shoud be added below:
> 
> if(!current->mm) {
>   set_thread_esr(0, esr);
>   arm64_force_sig_fault(...);
> }
> return true;

Why does 'current->mm' have anything to do with this, though?

There can be kernel threads with `current->mm` set in unusual circumstances
(and there's a lot of kernel code out there which handles that wrong), so if
you want to treat user tasks differently, we should be doing something like
checking PF_KTHREAD, or adding something like an is_user_task() helper.
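
e.g. something like the below (a rough sketch; is_user_task() doesn't exist
today, the name is only for illustration):

        static inline bool is_user_task(struct task_struct *tsk)
        {
                /* kernel threads are PF_KTHREAD regardless of tsk->mm */
                return !(tsk->flags & PF_KTHREAD);
        }

... with arm64_do_kernel_sea() doing:

        if (user_mode(regs) || !is_user_task(current))
                return false;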

[...]

> > > +
> > > + if (apei_claim_sea(regs) < 0)
> > > + return false;
> > > +
> > > + if (!fixup_exception_mc(regs))
> > > + return false;
> > 
> > I thought we still wanted to signal the task in this case? Or do you expect 
> > to
> > add that into `fixup_exception_mc()` ?
> 
> Yeah, here return false and will signal to task in do_sea() ->
> arm64_notify_die().

I mean when we do the fixup.

I thought the idea was to apply the fixup (to stop the kernel from crashing),
but still to deliver a fatal signal to the user task since we can't do what the
user task asked us to.

> > > +
> > > + set_thread_esr(0, esr);
> > 
> > Why are we not setting the address? Is that deliberate, or an oversight?
> 
> Here set fault_address to 0, i refer to the logic of arm64_notify_die().
> 
> void arm64_notify_die(...)
> {
>  if (user_mode(regs)) {
>  WARN_ON(regs != current_pt_regs());
>  current->thread.fault_address = 0;
>  current->thread.fault_code = err;
> 
>  arm64_force_sig_fault(signo, sicode, far, str);
>  } else {
>  die(str, regs, err);
>  }
> }
> 
> I don't know exactly why and do you know why arm64_notify_die() did this? :)

To be honest, I don't know, and that looks equally suspicious to me.

Looking at the git history, that was added in commit:

  9141300a5884b57c ("arm64: Provide read/write fault information in compat 
signal handlers")

... so maybe Catalin recalls why.

Perhaps the assumption is just that this will be fatal and so unimportant? ...
but in that case the same logic would apply to the ESR value, so it's not clear
to me.

Mark.


Re: [PATCH -next v5 2/8] arm64: extable: make uaaccess helper use extable type EX_TYPE_UACCESS_ERR_ZERO

2022-06-18 Thread Mark Rutland
On Sat, Jun 18, 2022 at 04:42:06PM +0800, Tong Tiangen wrote:
> > > > diff --git a/arch/arm64/include/asm/asm-extable.h
> > > > b/arch/arm64/include/asm/asm-extable.h
> > > > index 56ebe183e78b..9c94ac1f082c 100644
> > > > --- a/arch/arm64/include/asm/asm-extable.h
> > > > +++ b/arch/arm64/include/asm/asm-extable.h
> > > > @@ -28,6 +28,14 @@
> > > >   __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_FIXUP, 0)
> > > >   .endm
> > > > +/*
> > > > + * Create an exception table entry for uaccess `insn`, which
> > > > will branch to `fixup`
> > > > + * when an unhandled fault is taken.
> > > > + * ex->data = ~0 means both reg_err and reg_zero is set to wzr(x31).
> > > > + */
> > > > +    .macro  _asm_extable_uaccess, insn, fixup
> > > > +    __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_ERR_ZERO, ~0)
> > > > +    .endm
> > > 
> > > I'm not too keen on using `~0` here, since that also sets other bits
> > > in the
> > > data field, and it's somewhat opaque.
> > > 
> > > How painful is it to generate the data fields as with the C version
> > > of this
> > > macro, so that we can pass in wzr explicitly for the two sub-fields?
> > > 
> > > Other than that, this looks good to me.
> > > 
> > > Thanks,
> > > Mark.
> > 
> > ok, will fix next version.
> > 
> > Thanks,
> > Tong.
> 
> I tried to using data filelds as with C version, but here assembly code we
> can not using operator such as << and |, if we use lsl and orr instructions,
> the gpr will be occupied.
> 
> So how about using 0x3ff directly here? it means err register and zero
> register both set to x31.

I had a go at implementing this, and it seems simple enough. Please see:

  
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/extable/asm-uaccess

Mark.


Re: [PATCH -next v5 7/8] arm64: add uaccess to machine check safe

2022-06-18 Thread Mark Rutland
On Sat, Jun 18, 2022 at 05:27:45PM +0800, Tong Tiangen wrote:
> 
> 
> 在 2022/6/17 17:06, Mark Rutland 写道:
> > On Sat, May 28, 2022 at 06:50:55AM +, Tong Tiangen wrote:
> > > If user access fail due to hardware memory error, only the relevant
> > > processes are affected, so killing the user process and isolate the
> > > error page with hardware memory errors is a more reasonable choice
> > > than kernel panic.
> > > 
> > > Signed-off-by: Tong Tiangen 
> > 
> > > ---
> > >   arch/arm64/lib/copy_from_user.S | 8 
> > >   arch/arm64/lib/copy_to_user.S   | 8 
> > 
> > All of these changes are to the *kernel* accesses performed as part of copy
> > to/from user, and have nothing to do with userspace, so it does not make 
> > sense
> > to mark these as UACCESS.
> 
> You have a point. so there is no need to modify copy_from/to_user.S in this
> patch set.

Cool, thanks. If this patch just has the extable change, that's fine by me.

> > Do we *actually* need to recover from failues on these accesses? Looking at
> > _copy_from_user(), the kernel will immediately follow this up with a 
> > memset()
> > to the same address which will be fatal anyway, so this is only punting the
> > failure for a few instructions.
> 
> If recovery success, The task will be killed and there will be no subsequent
> memset().

I don't think that's true.

IIUC per the last patch, in the exception handler we'll apply the fixup then
force a signal. That doesn't kill the task immediately, and we'll return from
the exception handler back into the original context (with the fixup applied).

The structure of copy_from_user() is 

copy_from_user(to, from, n) {
_copy_from_user(to, from, n) {
res = n;
res = raw_copy_from_user(to, from, n);
if (res) 
memset(to + (n - res), 0, res);
}
}

So when the fixup is applied and res indicates that the copy terminated early,
there is an unconditional memset() before the fatal signal is handled in the
return to userspace path.

> > If we really need to recover from certain accesses to kernel memory we 
> > should
> > add a new EX_TYPE_KACCESS_ERR_ZERO_MC or similar, but we need a strong
> > rationale as to why that's useful. As things stand I do not believe it makes
> > sense for copy to/from user specifically.

[...]

> > > diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> > > index c301dcf6335f..8ca8d9639f9f 100644
> > > --- a/arch/arm64/mm/extable.c
> > > +++ b/arch/arm64/mm/extable.c
> > > @@ -86,10 +86,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
> > >   if (!ex)
> > >   return false;
> > > - /*
> > > -  * This is not complete, More Machine check safe extable type can
> > > -  * be processed here.
> > > -  */
> > > + switch (ex->type) {
> > > + case EX_TYPE_UACCESS_ERR_ZERO:
> > > + return ex_handler_uaccess_err_zero(ex, regs);
> > > + }
> > 
> > This addition specifically makes sense to me, so can you split this into a 
> > separate patch?
> 
> According to my understanding of the above, only the modification of
> extable.c is retained.
> 
> So what do you mean which part is made into a separate patch?

As above, if you just retain the extable.c changes, that's fine by me.

Thanks,
Mark.


Re: [PATCH -next v5 7/8] arm64: add uaccess to machine check safe

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:55AM +, Tong Tiangen wrote:
> If user access fail due to hardware memory error, only the relevant
> processes are affected, so killing the user process and isolate the
> error page with hardware memory errors is a more reasonable choice
> than kernel panic.
> 
> Signed-off-by: Tong Tiangen 

> ---
>  arch/arm64/lib/copy_from_user.S | 8 
>  arch/arm64/lib/copy_to_user.S   | 8 

All of these changes are to the *kernel* accesses performed as part of copy
to/from user, and have nothing to do with userspace, so it does not make sense
to mark these as UACCESS.

Do we *actually* need to recover from failues on these accesses? Looking at
_copy_from_user(), the kernel will immediately follow this up with a memset()
to the same address which will be fatal anyway, so this is only punting the
failure for a few instructions.

If we really need to recover from certain accesses to kernel memory we should
add a new EX_TYPE_KACCESS_ERR_ZERO_MC or similar, but we need a strong
rationale as to why that's useful. As things stand I do not believe it makes
sense for copy to/from user specifically.

>  arch/arm64/mm/extable.c | 8 
>  3 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..402dd48a4f93 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>   .endm
>  
>   .macro strb1 reg, ptr, val
> - strb \reg, [\ptr], \val
> + USER(9998f, strb \reg, [\ptr], \val)
>   .endm
>  
>   .macro ldrh1 reg, ptr, val
> @@ -33,7 +33,7 @@
>   .endm
>  
>   .macro strh1 reg, ptr, val
> - strh \reg, [\ptr], \val
> + USER(9998f, strh \reg, [\ptr], \val)
>   .endm
>  
>   .macro ldr1 reg, ptr, val
> @@ -41,7 +41,7 @@
>   .endm
>  
>   .macro str1 reg, ptr, val
> - str \reg, [\ptr], \val
> + USER(9998f, str \reg, [\ptr], \val)
>   .endm
>  
>   .macro ldp1 reg1, reg2, ptr, val
> @@ -49,7 +49,7 @@
>   .endm
>  
>   .macro stp1 reg1, reg2, ptr, val
> - stp \reg1, \reg2, [\ptr], \val
> + USER(9998f, stp \reg1, \reg2, [\ptr], \val)
>   .endm
>  
>  end  .reqx5
> diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S
> index 802231772608..4134bdb3a8b0 100644
> --- a/arch/arm64/lib/copy_to_user.S
> +++ b/arch/arm64/lib/copy_to_user.S
> @@ -20,7 +20,7 @@
>   *   x0 - bytes not copied
>   */
>   .macro ldrb1 reg, ptr, val
> - ldrb  \reg, [\ptr], \val
> + USER(9998f, ldrb  \reg, [\ptr], \val)
>   .endm
>  
>   .macro strb1 reg, ptr, val
> @@ -28,7 +28,7 @@
>   .endm
>  
>   .macro ldrh1 reg, ptr, val
> - ldrh  \reg, [\ptr], \val
> + USER(9998f, ldrh  \reg, [\ptr], \val)
>   .endm
>  
>   .macro strh1 reg, ptr, val
> @@ -36,7 +36,7 @@
>   .endm
>  
>   .macro ldr1 reg, ptr, val
> - ldr \reg, [\ptr], \val
> + USER(9998f, ldr \reg, [\ptr], \val)
>   .endm
>  
>   .macro str1 reg, ptr, val
> @@ -44,7 +44,7 @@
>   .endm
>  
>   .macro ldp1 reg1, reg2, ptr, val
> - ldp \reg1, \reg2, [\ptr], \val
> + USER(9998f, ldp \reg1, \reg2, [\ptr], \val)
>   .endm
>  
>   .macro stp1 reg1, reg2, ptr, val
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index c301dcf6335f..8ca8d9639f9f 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -86,10 +86,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
>   if (!ex)
>   return false;
>  
> - /*
> -  * This is not complete, More Machine check safe extable type can
> -  * be processed here.
> -  */
> + switch (ex->type) {
> + case EX_TYPE_UACCESS_ERR_ZERO:
> + return ex_handler_uaccess_err_zero(ex, regs);
> + }

This addition specifically makes sense to me, so can you split this into a 
separate patch?

Thanks,
Mark.


Re: [PATCH -next v5 6/8] arm64: add support for machine check error safe

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:54AM +, Tong Tiangen wrote:
> During the processing of arm64 kernel hardware memory errors(do_sea()), if
> the errors is consumed in the kernel, the current processing is panic.
> However, it is not optimal.
> 
> Take uaccess for example, if the uaccess operation fails due to memory
> error, only the user process will be affected, kill the user process
> and isolate the user page with hardware memory errors is a better choice.
> 
> This patch only enable machine error check framework, it add exception
> fixup before kernel panic in do_sea() and only limit the consumption of
> hardware memory errors in kernel mode triggered by user mode processes.
> If fixup successful, panic can be avoided.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/extable.h |  1 +
>  arch/arm64/mm/extable.c  | 17 +
>  arch/arm64/mm/fault.c| 27 ++-
>  4 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index aaeb70358979..a3b12ff0cd7f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -19,6 +19,7 @@ config ARM64
>   select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>   select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLE
> diff --git a/arch/arm64/include/asm/extable.h 
> b/arch/arm64/include/asm/extable.h
> index 72b0e71cc3de..f80ebd0addfd 100644
> --- a/arch/arm64/include/asm/extable.h
> +++ b/arch/arm64/include/asm/extable.h
> @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
>  #endif /* !CONFIG_BPF_JIT */
>  
>  bool fixup_exception(struct pt_regs *regs);
> +bool fixup_exception_mc(struct pt_regs *regs);
>  #endif
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 228d681a8715..c301dcf6335f 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -9,6 +9,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  static inline unsigned long
>  get_ex_fixup(const struct exception_table_entry *ex)
> @@ -76,3 +77,19 @@ bool fixup_exception(struct pt_regs *regs)
>  
>   BUG();
>  }
> +
> +bool fixup_exception_mc(struct pt_regs *regs)
> +{
> + const struct exception_table_entry *ex;
> +
> + ex = search_exception_tables(instruction_pointer(regs));
> + if (!ex)
> + return false;
> +
> + /*
> +  * This is not complete, More Machine check safe extable type can
> +  * be processed here.
> +  */
> +
> + return false;
> +}
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c5e11768e5c1..b262bd282a89 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -696,6 +696,29 @@ static int do_bad(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>   return 1; /* "fault" */
>  }
>  
> +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> +  struct pt_regs *regs, int sig, int code)
> +{
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> + return false;
> +
> + if (user_mode(regs) || !current->mm)
> + return false;

What's the `!current->mm` check for?

> +
> + if (apei_claim_sea(regs) < 0)
> + return false;
> +
> + if (!fixup_exception_mc(regs))
> + return false;

I thought we still wanted to signal the task in this case? Or do you expect to
add that into `fixup_exception_mc()` ?

> +
> + set_thread_esr(0, esr);

Why are we not setting the address? Is that deliberate, or an oversight?

> +
> + arm64_force_sig_fault(sig, code, addr,
> + "Uncorrected hardware memory error in kernel-access\n");

I think the wording here is misleading since we don't expect to recover from
accesses to kernel memory, and would be better as something like:

"Uncorrected memory error on access to user memory\n"

Thanks,
Mark.

> +
> + return true;
> +}
> +
>  static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
>  {
>   const struct fault_info *inf;
> @@ -721,7 +744,9 @@ static int do_sea(unsigned long far, unsigned long esr, 
> struct pt_regs *regs)
>*/
>   siaddr  = untagged_addr(far);
>   }
> - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
> +
> + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code))
> + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, 
> esr);
>  
>   return 0;
>  }
> -- 
> 2.25.1
> 


Re: [PATCH -next v5 4/8] arm64: extable: cleanup redundant extable type EX_TYPE_FIXUP

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:52AM +, Tong Tiangen wrote:
> Currently, extable type EX_TYPE_FIXUP is no place to use, We can safely
> remove it.
> 
> Suggested-by: Mark Rutland 
> Signed-off-by: Tong Tiangen 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/asm-extable.h | 20 
>  arch/arm64/mm/extable.c  |  9 -
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index d01bd94cc4c2..1f2974467273 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -3,11 +3,10 @@
>  #define __ASM_ASM_EXTABLE_H
>  
>  #define EX_TYPE_NONE 0
> -#define EX_TYPE_FIXUP1
> -#define EX_TYPE_BPF  2
> -#define EX_TYPE_UACCESS_ERR_ZERO 3
> -#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   4
> -#define EX_TYPE_KACCESS_ERR_ZERO 5
> +#define EX_TYPE_BPF  1
> +#define EX_TYPE_UACCESS_ERR_ZERO 2
> +#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   3
> +#define EX_TYPE_KACCESS_ERR_ZERO 4
>  
>  #ifdef __ASSEMBLY__
>  
> @@ -20,14 +19,6 @@
>   .short  (data); \
>   .popsection;
>  
> -/*
> - * Create an exception table entry for `insn`, which will branch to `fixup`
> - * when an unhandled fault is taken.
> - */
> - .macro  _asm_extable, insn, fixup
> - __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_FIXUP, 0)
> - .endm
> -
>  /*
>   * Create an exception table entry for uaccess `insn`, which will branch to 
> `fixup`
>   * when an unhandled fault is taken.
> @@ -62,9 +53,6 @@
>   ".short (" data ")\n"   \
>   ".popsection\n"
>  
> -#define _ASM_EXTABLE(insn, fixup) \
> - __ASM_EXTABLE_RAW(#insn, #fixup, __stringify(EX_TYPE_FIXUP), "0")
> -
>  #define EX_DATA_REG_ERR_SHIFT0
>  #define EX_DATA_REG_ERR  GENMASK(4, 0)
>  #define EX_DATA_REG_ZERO_SHIFT   5
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 056591e5ca80..228d681a8715 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -16,13 +16,6 @@ get_ex_fixup(const struct exception_table_entry *ex)
>   return ((unsigned long)&ex->fixup + ex->fixup);
>  }
>  
> -static bool ex_handler_fixup(const struct exception_table_entry *ex,
> -  struct pt_regs *regs)
> -{
> - regs->pc = get_ex_fixup(ex);
> - return true;
> -}
> -
>  static bool ex_handler_uaccess_err_zero(const struct exception_table_entry 
> *ex,
>   struct pt_regs *regs)
>  {
> @@ -72,8 +65,6 @@ bool fixup_exception(struct pt_regs *regs)
>   return false;
>  
>   switch (ex->type) {
> - case EX_TYPE_FIXUP:
> - return ex_handler_fixup(ex, regs);
>   case EX_TYPE_BPF:
>   return ex_handler_bpf(ex, regs);
>   case EX_TYPE_UACCESS_ERR_ZERO:
> -- 
> 2.25.1
> 


Re: [PATCH -next v5 3/8] arm64: extable: move _cond_extable to _cond_uaccess_extable

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:51AM +, Tong Tiangen wrote:
> Currently, We use _cond_extable for cache maintenance uaccess helper
> caches_clean_inval_user_pou(), so this should be moved over to
> EX_TYPE_UACCESS_ERR_ZERO and rename _cond_extable to _cond_uaccess_extable
> for clarity.
> 
> Suggested-by: Mark Rutland 
> Signed-off-by: Tong Tiangen 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/asm-extable.h | 6 +++---
>  arch/arm64/include/asm/assembler.h   | 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 9c94ac1f082c..d01bd94cc4c2 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -40,9 +40,9 @@
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
>   */
> - .macro  _cond_extable, insn, fixup
> - .ifnc   \fixup,
> - _asm_extable\insn, \fixup
> + .macro  _cond_uaccess_extable, insn, fixup
> + .ifnc   \fixup,
> + _asm_extable_uaccess\insn, \fixup
>   .endif
>   .endm
>  
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index 8c5a61aeaf8e..dc422fa437c2 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -423,7 +423,7 @@ alternative_endif
>   b.lo.Ldcache_op\@
>   dsb \domain
>  
> - _cond_extable .Ldcache_op\@, \fixup
> + _cond_uaccess_extable .Ldcache_op\@, \fixup
>   .endm
>  
>  /*
> @@ -462,7 +462,7 @@ alternative_endif
>   dsb ish
>   isb
>  
> - _cond_extable .Licache_op\@, \fixup
> + _cond_uaccess_extable .Licache_op\@, \fixup
>   .endm
>  
>  /*
> -- 
> 2.25.1
> 


Re: [PATCH -next v5 2/8] arm64: extable: make uaaccess helper use extable type EX_TYPE_UACCESS_ERR_ZERO

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:50AM +, Tong Tiangen wrote:
> Currnetly, the extable type used by __arch_copy_from/to_user() is
> EX_TYPE_FIXUP. In fact, It is more clearly to use meaningful
> EX_TYPE_UACCESS_*.
> 
> Suggested-by: Mark Rutland 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h |  8 
>  arch/arm64/include/asm/asm-uaccess.h | 12 ++--
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 56ebe183e78b..9c94ac1f082c 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -28,6 +28,14 @@
>   __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_FIXUP, 0)
>   .endm
>  
> +/*
> + * Create an exception table entry for uaccess `insn`, which will branch to 
> `fixup`
> + * when an unhandled fault is taken.
> + * ex->data = ~0 means both reg_err and reg_zero is set to wzr(x31).
> + */
> + .macro  _asm_extable_uaccess, insn, fixup
> + __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_ERR_ZERO, ~0)
> + .endm

I'm not too keen on using `~0` here, since that also sets other bits in the
data field, and it's somewhat opaque.

How painful is it to generate the data fields as with the C version of this
macro, so that we can pass in wzr explicitly for the two sub-fields?
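
Roughly what I have in mind is mirroring the C macro on the assembly side,
e.g. (an untested sketch; it assumes the EX_DATA_REG_*_SHIFT definitions and
the .L__gpr_num_* symbols from <asm/gpr-num.h> are visible to the assembly
side of <asm/asm-extable.h>):

        #define EX_DATA_REG(reg, gpr)                                   \
                (.L__gpr_num_##gpr << EX_DATA_REG_##reg##_SHIFT)

        #define _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, zero)   \
                __ASM_EXTABLE_RAW(insn, fixup,                          \
                                  EX_TYPE_UACCESS_ERR_ZERO,             \
                                  (EX_DATA_REG(ERR, err) |              \
                                   EX_DATA_REG(ZERO, zero)))

                .macro  _asm_extable_uaccess, insn, fixup
                _ASM_EXTABLE_UACCESS_ERR_ZERO(\insn, \fixup, wzr, wzr)
                .endm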

Other than that, this looks good to me.

Thanks,
Mark.

>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/asm-uaccess.h 
> b/arch/arm64/include/asm/asm-uaccess.h
> index 0557af834e03..75b211c98dea 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -61,7 +61,7 @@ alternative_else_nop_endif
>  
>  #define USER(l, x...)\
>  9999:x;  \
> - _asm_extable9999b, l
> + _asm_extable_uaccess9999b, l
>  
>  /*
>   * Generate the assembly for LDTR/STTR with exception table entries.
> @@ -73,8 +73,8 @@ alternative_else_nop_endif
>  8889:ldtr\reg2, [\addr, #8];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extable8888b,\l;
> - _asm_extable8889b,\l;
> + _asm_extable_uaccess8888b, \l;
> + _asm_extable_uaccess8889b, \l;
>   .endm
>  
>   .macro user_stp l, reg1, reg2, addr, post_inc
> @@ -82,14 +82,14 @@ alternative_else_nop_endif
>  8889:sttr\reg2, [\addr, #8];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extable8888b,\l;
> - _asm_extable8889b,\l;
> + _asm_extable_uaccess8888b,\l;
> + _asm_extable_uaccess8889b,\l;
>   .endm
>  
>   .macro user_ldst l, inst, reg, addr, post_inc
>  8888:\inst   \reg, [\addr];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extable8888b,\l;
> + _asm_extable_uaccess8888b, \l;
>   .endm
>  #endif
> -- 
> 2.25.1
> 


Re: [PATCH -next v5 1/8] arm64: extable: add new extable type EX_TYPE_KACCESS_ERR_ZERO support

2022-06-17 Thread Mark Rutland
On Sat, May 28, 2022 at 06:50:49AM +, Tong Tiangen wrote:
> Currently, The extable type EX_TYPE_UACCESS_ERR_ZERO is used by
> __get/put_kernel_nofault(), but those helpers are not uaccess type, so we
> add a new extable type EX_TYPE_KACCESS_ERR_ZERO which can be used by
> __get/put_kernel_no_fault().
> 
> This is also to prepare for distinguishing the two types in machine check
> safe process.
> 
> Suggested-by: Mark Rutland 
> Signed-off-by: Tong Tiangen 

This looks good to me, so modulo one nit below:

Acked-by: Mark Rutland 

> ---
>  arch/arm64/include/asm/asm-extable.h | 13 
>  arch/arm64/include/asm/uaccess.h | 94 ++--
>  arch/arm64/mm/extable.c  |  1 +
>  3 files changed, 61 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index c39f2437e08e..56ebe183e78b 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -7,6 +7,7 @@
>  #define EX_TYPE_BPF  2
>  #define EX_TYPE_UACCESS_ERR_ZERO 3
>  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   4
> +#define EX_TYPE_KACCESS_ERR_ZERO 5

Could we please renumber this so the UACCESS and KACCESS definitions are next
to one another, i.e.

#define EX_TYPE_BPF 2
#define EX_TYPE_UACCESS_ERR_ZERO3
#define EX_TYPE_KACCESS_ERR_ZERO4
#define EX_TYPE_LOAD_UNALIGNED_ZEROPAD  5

Thanks,
Mark.

>  
>  #ifdef __ASSEMBLY__
>  
> @@ -73,9 +74,21 @@
>   EX_DATA_REG(ZERO, zero) \
> ")")
>  
> +#define _ASM_EXTABLE_KACCESS_ERR_ZERO(insn, fixup, err, zero)
> \
> + __DEFINE_ASM_GPR_NUMS   \
> + __ASM_EXTABLE_RAW(#insn, #fixup,\
> +   __stringify(EX_TYPE_KACCESS_ERR_ZERO),\
> +   "("   \
> + EX_DATA_REG(ERR, err) " | " \
> + EX_DATA_REG(ZERO, zero) \
> +   ")")
> +
>  #define _ASM_EXTABLE_UACCESS_ERR(insn, fixup, err)   \
>   _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, wzr)
>  
> +#define _ASM_EXTABLE_KACCESS_ERR(insn, fixup, err)   \
> + _ASM_EXTABLE_KACCESS_ERR_ZERO(insn, fixup, err, wzr)
> +
>  #define EX_DATA_REG_DATA_SHIFT   0
>  #define EX_DATA_REG_DATA GENMASK(4, 0)
>  #define EX_DATA_REG_ADDR_SHIFT   5
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 63f9c828f1a7..2fc9f0861769 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -232,34 +232,34 @@ static inline void __user *__uaccess_mask_ptr(const 
> void __user *ptr)
>   * The "__xxx_error" versions set the third argument to -EFAULT if an error
>   * occurs, and leave it unchanged on success.
>   */
> -#define __get_mem_asm(load, reg, x, addr, err)   
> \
> +#define __get_mem_asm(load, reg, x, addr, err, type) \
>   asm volatile(   \
>   "1: " load "" reg "1, [%2]\n"   \
>   "2:\n"  \
> - _ASM_EXTABLE_UACCESS_ERR_ZERO(1b, 2b, %w0, %w1) \
> + _ASM_EXTABLE_##type##ACCESS_ERR_ZERO(1b, 2b, %w0, %w1)  \
>   : "+r" (err), "=&r" (x) \
>   : "r" (addr))
>  
> -#define __raw_get_mem(ldr, x, ptr, err)  
> \
> -do { \
> - unsigned long __gu_val; \
> - switch (sizeof(*(ptr))) {   \
> - case 1: \
> - __get_mem_asm(ldr "b", "%w", __gu_val, (ptr), (err));   \
> - break;  \
> - case 2: \
> - __get_mem_asm(ldr "h", "%w", __gu_val, (ptr), (err));   \
> - break;  \
> - case 4: \
> - __get_mem_asm(ldr, "%w", 

Re: [PATCH 00/36] cpuidle,rcu: Cleanup the mess

2022-06-14 Thread Mark Rutland
On Tue, Jun 14, 2022 at 06:58:30PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2022 at 12:19:29PM +0100, Mark Rutland wrote:
> > On Wed, Jun 08, 2022 at 04:27:23PM +0200, Peter Zijlstra wrote:
> > > Hi All! (omg so many)
> > 
> > Hi Peter,
> > 
> > Sorry for the delay; my plate has also been rather full recently. I'm 
> > beginning
> > to page this in now.
> 
> No worries; we all have too much to do ;-)
> 
> > > These here few patches mostly clear out the utter mess that is cpuidle vs 
> > > rcuidle.
> > > 
> > > At the end of the ride there's only 2 real RCU_NONIDLE() users left
> > > 
> > >   arch/arm64/kernel/suspend.c:
> > > RCU_NONIDLE(__cpu_suspend_exit());
> > >   drivers/perf/arm_pmu.c: RCU_NONIDLE(armpmu_start(event, 
> > > PERF_EF_RELOAD));
> > 
> > The latter of these is necessary because apparently PM notifiers are called
> > with RCU not watching. Is that still the case today (or at the end of this
> > series)? If so, that feels like fertile land for more issues (yaey...). If 
> > not,
> > we should be able to drop this.
> 
> That should be fixed; fingers crossed :-)

Cool; I'll try to give that a spin when I'm sat next to some relevant hardware. 
:)

> > >   kernel/cfi.c:   RCU_NONIDLE({
> > > 
> > > (the CFI one is likely dead in the kCFI rewrite) and there's only a hand 
> > > full
> > > of trace_.*_rcuidle() left:
> > > 
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_preempt_enable_rcuidle(a0, a1);
> > >   kernel/trace/trace_preemptirq.c:
> > > trace_preempt_disable_rcuidle(a0, a1);
> > > 
> > > All of them are in 'deprecated' code that is unused for GENERIC_ENTRY.
> > I think those are also unused on arm64 too?
> > 
> > If not, I can go attack that.
> 
> My grep spots:
> 
> arch/arm64/kernel/entry-common.c:   trace_hardirqs_on();
> arch/arm64/include/asm/daifflags.h: trace_hardirqs_off();
> arch/arm64/include/asm/daifflags.h: trace_hardirqs_off();

Ah; I hadn't realised those used trace_.*_rcuidle() behind the scenes.

That affects local_irq_{enable,disable,restore}() too (which is what the
daifflags.h bits are emulating), and also the generic entry code's
irqentry_exit().

So it feels to me like we should be fixing those more generally? e.g. say that
with a new STRICT_ENTRY[_RCU], we can only call trace_hardirqs_{on,off}() with
RCU watching, and alter the definition of those?
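
e.g. something like the below (a rough sketch only; STRICT_ENTRY_RCU is an
invented name, and the real trace_hardirqs_{on,off}() have more to them than
this):

        /* in trace_hardirqs_on(), and similarly for _off() */
        if (IS_ENABLED(CONFIG_STRICT_ENTRY_RCU))
                trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);
        else
                trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);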

> The _on thing should be replaced with something like:
> 
>   trace_hardirqs_on_prepare();
>   lockdep_hardirqs_on_prepare();
>   instrumentation_end();
>   rcu_irq_exit();
>   lockdep_hardirqs_on(CALLER_ADDR0);
> 
> (as I think you know, since you have some of that already). And
> something similar for the _off thing, but with _off_finish().

Sure; I knew that was necessary f

Re: [PATCH 14/36] cpuidle: Fix rcu_idle_*() usage

2022-06-14 Thread Mark Rutland
On Tue, Jun 14, 2022 at 06:40:53PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2022 at 01:41:13PM +0100, Mark Rutland wrote:
> > On Wed, Jun 08, 2022 at 04:27:37PM +0200, Peter Zijlstra wrote:
> > > --- a/kernel/time/tick-broadcast.c
> > > +++ b/kernel/time/tick-broadcast.c
> > > @@ -622,9 +622,13 @@ struct cpumask *tick_get_broadcast_onesh
> > >   * to avoid a deep idle transition as we are about to get the
> > >   * broadcast IPI right away.
> > >   */
> > > -int tick_check_broadcast_expired(void)
> > > +noinstr int tick_check_broadcast_expired(void)
> > >  {
> > > +#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
> > > + return arch_test_bit(smp_processor_id(), 
> > > cpumask_bits(tick_broadcast_force_mask));
> > > +#else
> > >   return cpumask_test_cpu(smp_processor_id(), tick_broadcast_force_mask);
> > > +#endif
> > >  }
> > 
> > This is somewhat not-ideal. :/
> 
> I'll say.
> 
> > Could we unconditionally do the arch_test_bit() variant, with a comment, or
> > does that not exist in some cases?
> 
> Loads of build errors ensued, which is how I ended up with this mess ...

Yaey :(

I see the same is true for the thread flag manipulation too.

I'll take a look and see if we can layer things so that we can use the arch_*()
helpers and wrap those consistently so that we don't have to check the CPP
guard.
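
e.g. something like the below (a rough, untested sketch; the name is invented,
and this just hides the CPP guard in one place rather than at each caller):

        static __always_inline bool
        noinstr_test_bit(long nr, const volatile unsigned long *addr)
        {
        #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
                return arch_test_bit(nr, addr);
        #else
                return test_bit(nr, addr);
        #endif
        }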

Ideally we'd have a better language that allows us to make some
context-sensitive decisions, then we could hide all this gunk in the lower
levels with something like:

if (!THIS_IS_A_NOINSTR_FUNCTION()) {
explicit_instrumentation(...);
}

... ho hum.

Mark.


Re: [PATCH 15/36] cpuidle,cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()

2022-06-14 Thread Mark Rutland
On Tue, Jun 14, 2022 at 06:42:14PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2022 at 05:13:16PM +0100, Mark Rutland wrote:
> > On Wed, Jun 08, 2022 at 04:27:38PM +0200, Peter Zijlstra wrote:
> > > All callers should still have RCU enabled.
> > 
> > IIUC with that true we should be able to drop the RCU_NONIDLE() from
> > drivers/perf/arm_pmu.c, as we only needed that for an invocation via a pm
> > notifier.
> > 
> > I should be able to give that a spin on some hardware.
> > 
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > ---
> > >  kernel/cpu_pm.c |9 -
> > >  1 file changed, 9 deletions(-)
> > > 
> > > --- a/kernel/cpu_pm.c
> > > +++ b/kernel/cpu_pm.c
> > > @@ -30,16 +30,9 @@ static int cpu_pm_notify(enum cpu_pm_eve
> > >  {
> > >   int ret;
> > >  
> > > - /*
> > > -  * This introduces a RCU read critical section, which could be
> > > -  * disfunctional in cpu idle. Copy RCU_NONIDLE code to let RCU know
> > > -  * this.
> > > -  */
> > > - rcu_irq_enter_irqson();
> > >   rcu_read_lock();
>   ret = raw_notifier_call_chain(&cpu_pm_notifier.chain, event, NULL);
> > >   rcu_read_unlock();
> > > - rcu_irq_exit_irqson();
> > 
> > To make this easier to debug, is it worth adding an assertion that RCU is
> > watching here? e.g.
> > 
> > RCU_LOCKDEP_WARN(!rcu_is_watching(),
> >  "cpu_pm_notify() used illegally from EQS");
> > 
> 
> My understanding is that rcu_read_lock() implies something along those
> lines when PROVE_RCU.

Ah, duh. Given that:

Acked-by: Mark Rutland 

Mark.


Re: [PATCH 25/36] time/tick-broadcast: Remove RCU_NONIDLE usage

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:48PM +0200, Peter Zijlstra wrote:
> No callers left that have already disabled RCU.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Acked-by: Mark Rutland 

Mark.

> ---
>  kernel/time/tick-broadcast-hrtimer.c |   29 -
>  1 file changed, 12 insertions(+), 17 deletions(-)
> 
> --- a/kernel/time/tick-broadcast-hrtimer.c
> +++ b/kernel/time/tick-broadcast-hrtimer.c
> @@ -56,25 +56,20 @@ static int bc_set_next(ktime_t expires,
>* hrtimer callback function is currently running, then
>* hrtimer_start() cannot move it and the timer stays on the CPU on
>* which it is assigned at the moment.
> +  */
> + hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
> + /*
> +  * The core tick broadcast mode expects bc->bound_on to be set
> +  * correctly to prevent a CPU which has the broadcast hrtimer
> +  * armed from going deep idle.
>*
> -  * As this can be called from idle code, the hrtimer_start()
> -  * invocation has to be wrapped with RCU_NONIDLE() as
> -  * hrtimer_start() can call into tracing.
> +  * As tick_broadcast_lock is held, nothing can change the cpu
> +  * base which was just established in hrtimer_start() above. So
> +  * the below access is safe even without holding the hrtimer
> +  * base lock.
>*/
> - RCU_NONIDLE( {
> - hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
> - /*
> -  * The core tick broadcast mode expects bc->bound_on to be set
> -  * correctly to prevent a CPU which has the broadcast hrtimer
> -  * armed from going deep idle.
> -  *
> -  * As tick_broadcast_lock is held, nothing can change the cpu
> -  * base which was just established in hrtimer_start() above. So
> -  * the below access is safe even without holding the hrtimer
> -  * base lock.
> -  */
> - bc->bound_on = bctimer.base->cpu_base->cpu;
> - } );
> + bc->bound_on = bctimer.base->cpu_base->cpu;
> +
>   return 0;
>  }
>  
> 
> 


Re: [PATCH 23/36] arm64,smp: Remove trace_.*_rcuidle() usage

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:46PM +0200, Peter Zijlstra wrote:
> Ever since commit d3afc7f12987 ("arm64: Allow IPIs to be handled as
> normal interrupts") this function is called in regular IRQ context.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

[adding Marc since he authored that commit]

Makes sense to me:

  Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/kernel/smp.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -865,7 +865,7 @@ static void do_handle_IPI(int ipinr)
>   unsigned int cpu = smp_processor_id();
>  
>   if ((unsigned)ipinr < NR_IPI)
> - trace_ipi_entry_rcuidle(ipi_types[ipinr]);
> + trace_ipi_entry(ipi_types[ipinr]);
>  
>   switch (ipinr) {
>   case IPI_RESCHEDULE:
> @@ -914,7 +914,7 @@ static void do_handle_IPI(int ipinr)
>   }
>  
>   if ((unsigned)ipinr < NR_IPI)
> - trace_ipi_exit_rcuidle(ipi_types[ipinr]);
> + trace_ipi_exit(ipi_types[ipinr]);
>  }
>  
>  static irqreturn_t ipi_handler(int irq, void *data)
> 
> 


Re: [PATCH 20/36] arch/idle: Change arch_cpu_idle() IRQ behaviour

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:43PM +0200, Peter Zijlstra wrote:
> Current arch_cpu_idle() is called with IRQs disabled, but will return
> with IRQs enabled.
> 
> However, the very first thing the generic code does after calling
> arch_cpu_idle() is raw_local_irq_disable(). This means that
> architectures that can idle with IRQs disabled end up doing a
> pointless 'enable-disable' dance.
> 
> Therefore, push this IRQ disabling into the idle function, meaning
> that those architectures can avoid the pointless IRQ state flipping.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Nice!

  Acked-by: Mark Rutland  [arm64]

Mark.

> ---
>  arch/alpha/kernel/process.c  |1 -
>  arch/arc/kernel/process.c|3 +++
>  arch/arm/kernel/process.c|1 -
>  arch/arm/mach-gemini/board-dt.c  |3 ++-
>  arch/arm64/kernel/idle.c |1 -
>  arch/csky/kernel/process.c   |1 -
>  arch/csky/kernel/smp.c   |2 +-
>  arch/hexagon/kernel/process.c|1 -
>  arch/ia64/kernel/process.c   |1 +
>  arch/microblaze/kernel/process.c |1 -
>  arch/mips/kernel/idle.c  |8 +++-
>  arch/nios2/kernel/process.c  |1 -
>  arch/openrisc/kernel/process.c   |1 +
>  arch/parisc/kernel/process.c |2 --
>  arch/powerpc/kernel/idle.c   |5 ++---
>  arch/riscv/kernel/process.c  |1 -
>  arch/s390/kernel/idle.c  |1 -
>  arch/sh/kernel/idle.c|1 +
>  arch/sparc/kernel/leon_pmc.c |4 
>  arch/sparc/kernel/process_32.c   |1 -
>  arch/sparc/kernel/process_64.c   |3 ++-
>  arch/um/kernel/process.c |1 -
>  arch/x86/coco/tdx/tdx.c  |3 +++
>  arch/x86/kernel/process.c|   15 ---
>  arch/xtensa/kernel/process.c |1 +
>  kernel/sched/idle.c  |2 --
>  26 files changed, 28 insertions(+), 37 deletions(-)
> 
> --- a/arch/alpha/kernel/process.c
> +++ b/arch/alpha/kernel/process.c
> @@ -57,7 +57,6 @@ EXPORT_SYMBOL(pm_power_off);
>  void arch_cpu_idle(void)
>  {
>   wtint(0);
> - raw_local_irq_enable();
>  }
>  
>  void arch_cpu_idle_dead(void)
> --- a/arch/arc/kernel/process.c
> +++ b/arch/arc/kernel/process.c
> @@ -114,6 +114,8 @@ void arch_cpu_idle(void)
>   "sleep %0   \n"
>   :
>   :"I"(arg)); /* can't be "r" has to be embedded const */
> +
> + raw_local_irq_disable();
>  }
>  
>  #else/* ARC700 */
> @@ -122,6 +124,7 @@ void arch_cpu_idle(void)
>  {
>   /* sleep, but enable both set E1/E2 (levels of interrupts) before 
> committing */
>   __asm__ __volatile__("sleep 0x3 \n");
> + raw_local_irq_disable();
>  }
>  
>  #endif
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -78,7 +78,6 @@ void arch_cpu_idle(void)
>   arm_pm_idle();
>   else
>   cpu_do_idle();
> - raw_local_irq_enable();
>  }
>  
>  void arch_cpu_idle_prepare(void)
> --- a/arch/arm/mach-gemini/board-dt.c
> +++ b/arch/arm/mach-gemini/board-dt.c
> @@ -42,8 +42,9 @@ static void gemini_idle(void)
>*/
>  
>   /* FIXME: Enabling interrupts here is racy! */
> - local_irq_enable();
> + raw_local_irq_enable();
>   cpu_do_idle();
> + raw_local_irq_disable();
>  }
>  
>  static void __init gemini_init_machine(void)
> --- a/arch/arm64/kernel/idle.c
> +++ b/arch/arm64/kernel/idle.c
> @@ -42,5 +42,

Re: [PATCH 16/36] rcu: Fix rcu_idle_exit()

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:39PM +0200, Peter Zijlstra wrote:
> Current rcu_idle_exit() is terminally broken because it uses
> local_irq_{save,restore}(), which are traced which uses RCU.
> 
> However, now that all the callers are sure to have IRQs disabled, we
> can remove these calls.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Acked-by: Paul E. McKenney 

Acked-by: Mark Rutland 

Mark.

> ---
>  kernel/rcu/tree.c |9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -659,7 +659,7 @@ static noinstr void rcu_eqs_enter(bool u
>   * If you add or remove a call to rcu_idle_enter(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_idle_enter(void)
> +void noinstr rcu_idle_enter(void)
>  {
>   lockdep_assert_irqs_disabled();
>   rcu_eqs_enter(false);
> @@ -896,13 +896,10 @@ static void noinstr rcu_eqs_exit(bool us
>   * If you add or remove a call to rcu_idle_exit(), be sure to test with
>   * CONFIG_RCU_EQS_DEBUG=y.
>   */
> -void rcu_idle_exit(void)
> +void noinstr rcu_idle_exit(void)
>  {
> - unsigned long flags;
> -
> - local_irq_save(flags);
> + lockdep_assert_irqs_disabled();
>   rcu_eqs_exit(false);
> - local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_exit);
>  
> 
> 


Re: [PATCH 15/36] cpuidle,cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:38PM +0200, Peter Zijlstra wrote:
> All callers should still have RCU enabled.

IIUC with that true we should be able to drop the RCU_NONIDLE() from
drivers/perf/arm_pmu.c, as we only needed that for an invocation via a pm
notifier.
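
i.e. something like the below (untested; in cpu_pm_pmu_setup(), IIRC):

        -       RCU_NONIDLE(armpmu_start(event, PERF_EF_RELOAD));
        +       armpmu_start(event, PERF_EF_RELOAD);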

I should be able to give that a spin on some hardware.

> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/cpu_pm.c |9 -
>  1 file changed, 9 deletions(-)
> 
> --- a/kernel/cpu_pm.c
> +++ b/kernel/cpu_pm.c
> @@ -30,16 +30,9 @@ static int cpu_pm_notify(enum cpu_pm_eve
>  {
>   int ret;
>  
> - /*
> -  * This introduces a RCU read critical section, which could be
> -  * disfunctional in cpu idle. Copy RCU_NONIDLE code to let RCU know
> -  * this.
> -  */
> - rcu_irq_enter_irqson();
>   rcu_read_lock();
>   ret = raw_notifier_call_chain(&cpu_pm_notifier.chain, event, NULL);
>   rcu_read_unlock();
> - rcu_irq_exit_irqson();

To make this easier to debug, is it worth adding an assertion that RCU is
watching here? e.g.

RCU_LOCKDEP_WARN(!rcu_is_watching(),
 "cpu_pm_notify() used illegally from EQS");

>  
>   return notifier_to_errno(ret);
>  }
> @@ -49,11 +42,9 @@ static int cpu_pm_notify_robust(enum cpu
>   unsigned long flags;
>   int ret;
>  
> - rcu_irq_enter_irqson();
>   raw_spin_lock_irqsave(&cpu_pm_notifier.lock, flags);
>   ret = raw_notifier_call_chain_robust(&cpu_pm_notifier.chain, event_up, 
> event_down, NULL);
>   raw_spin_unlock_irqrestore(&cpu_pm_notifier.lock, flags);
> - rcu_irq_exit_irqson();


... and likewise here?

Thanks,
Mark.

>  
>   return notifier_to_errno(ret);
>  }
> 
> 


Re: [PATCH 14/36] cpuidle: Fix rcu_idle_*() usage

2022-06-14 Thread Mark Rutland
On Wed, Jun 08, 2022 at 04:27:37PM +0200, Peter Zijlstra wrote:
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -622,9 +622,13 @@ struct cpumask *tick_get_broadcast_onesh
>   * to avoid a deep idle transition as we are about to get the
>   * broadcast IPI right away.
>   */
> -int tick_check_broadcast_expired(void)
> +noinstr int tick_check_broadcast_expired(void)
>  {
> +#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
> + return arch_test_bit(smp_processor_id(), 
> cpumask_bits(tick_broadcast_force_mask));
> +#else
>   return cpumask_test_cpu(smp_processor_id(), tick_broadcast_force_mask);
> +#endif
>  }

This is somewhat not-ideal. :/

Could we unconditionally do the arch_test_bit() variant, with a comment, or
does that not exist in some cases?
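i.e. (sketch; this assumes arch_test_bit() is available on every
configuration, which is exactly the open question):

	noinstr int tick_check_broadcast_expired(void)
	{
		/*
		 * Open-coded non-instrumented test, since this can be
		 * called from a noinstr section.
		 */
		return arch_test_bit(smp_processor_id(),
				     cpumask_bits(tick_broadcast_force_mask));
	}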

Thanks,
Mark.


Re: [PATCH 00/36] cpuidle,rcu: Cleanup the mess

2022-06-14 Thread Mark Rutland


On Wed, Jun 08, 2022 at 04:27:23PM +0200, Peter Zijlstra wrote:
> Hi All! (omg so many)

Hi Peter,

Sorry for the delay; my plate has also been rather full recently. I'm beginning
to page this in now.

> These here few patches mostly clear out the utter mess that is cpuidle vs 
> rcuidle.
> 
> At the end of the ride there's only 2 real RCU_NONIDLE() users left
> 
>   arch/arm64/kernel/suspend.c:RCU_NONIDLE(__cpu_suspend_exit());
>   drivers/perf/arm_pmu.c: RCU_NONIDLE(armpmu_start(event, 
> PERF_EF_RELOAD));

The latter of these is necessary because apparently PM notifiers are called
with RCU not watching. Is that still the case today (or at the end of this
series)? If so, that feels like fertile land for more issues (yaey...). If not,
we should be able to drop this.

I can go dig into that some more.

>   kernel/cfi.c:   RCU_NONIDLE({
> 
> (the CFI one is likely dead in the kCFI rewrite) and there's only a hand full
> of trace_.*_rcuidle() left:
> 
>   kernel/trace/trace_preemptirq.c:
> trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>   kernel/trace/trace_preemptirq.c:
> trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
>   kernel/trace/trace_preemptirq.c:
> trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
>   kernel/trace/trace_preemptirq.c:
> trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
>   kernel/trace/trace_preemptirq.c:
> trace_preempt_enable_rcuidle(a0, a1);
>   kernel/trace/trace_preemptirq.c:
> trace_preempt_disable_rcuidle(a0, a1);
> 
> All of them are in 'deprecated' code that is unused for GENERIC_ENTRY.

I think those are also unused on arm64?

If not, I can go attack that.

> I've touched a _lot_ of code that I can't test and likely broken some of it :/
> In particular, the whole ARM cpuidle stuff was quite involved with OMAP being
> the absolute 'winner'.
> 
> I'm hoping Mark can help me sort the remaining ARM64 bits as he moves that to
> GENERIC_ENTRY.

Moving to GENERIC_ENTRY as a whole is going to take a tonne of work
(refactoring both arm64 and the generic portion to be more amenable to each
other), but we can certainly move closer to that for the bits that matter here.

Maybe we want a STRICT_ENTRY option, which we could select regardless of
GENERIC_ENTRY, to get rid of all the deprecated stuff and make that easier.

> I've also got a note that says ARM64 can probably do a WFE based
> idle state and employ TIF_POLLING_NRFLAG to avoid some IPIs.

Possibly; I'm not sure how much of a win that'll be given that by default we'll
have a ~10KHz WFE wakeup from the timer, but we could take a peek.

Thanks,
Mark.


Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

2022-05-26 Thread Mark Rutland
On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> Linus Torvalds  writes:
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak  wrote:
> >>
> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> >> x86 CMPXCHG instruction returns success in ZF flag, so this
> >> change saves a compare after cmpxchg (and related move instruction
> >> in front of cmpxchg). The main loop of lockref_get improves from:
> >
> > Ack on this one regardless of the 32-bit x86 question.
> >
> > HOWEVER.
> >
> > I'd like other architectures to pipe up too, because I think right now
> > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > of operations natively, and I think the generic fallback for when it
> > is missing might be kind of nasty.
> >
> > Maybe it ends up generating ok code, but it's also possible that it
> > just didn't matter when it was only used in one place in the
> > scheduler.
> 
> This patch seems to generate slightly *better* code on powerpc.
> 
> I see one register-to-register move that gets shifted slightly later, so
> that it's skipped on the path that returns directly via the SUCCESS
> case.

FWIW, I see the same on arm64; a register-to-register move gets moved out of
the success path. That changes the register allocation, resulting in one
fewer move, but otherwise the code generation is the same.

Thanks,
Mark.


Re: [PATCH -next v4 3/7] arm64: add support for machine check error safe

2022-05-26 Thread Mark Rutland
On Thu, May 26, 2022 at 11:36:41AM +0800, Tong Tiangen wrote:
> 
> 
> > On 2022/5/25 16:30, Mark Rutland wrote:
> > On Thu, May 19, 2022 at 02:29:54PM +0800, Tong Tiangen wrote:
> > > 
> > > 
> > > > On 2022/5/13 23:26, Mark Rutland wrote:
> > > > On Wed, Apr 20, 2022 at 03:04:14AM +, Tong Tiangen wrote:
> > > > > During the processing of arm64 kernel hardware memory 
> > > > > errors(do_sea()), if
> > > > > the errors is consumed in the kernel, the current processing is panic.
> > > > > However, it is not optimal.
> > > > > 
> > > > > Take uaccess for example, if the uaccess operation fails due to memory
> > > > > error, only the user process will be affected, kill the user process
> > > > > and isolate the user page with hardware memory errors is a better 
> > > > > choice.
> > > > 
> > > > Conceptually, I'm fine with the idea of constraining what we do for a
> > > > true uaccess, but I don't like the implementation of this at all, and I
> > > > think we first need to clean up the arm64 extable usage to clearly
> > > > distinguish a uaccess from another access.
> > > 
> > > OK,using EX_TYPE_UACCESS and this extable type could be recover, this is
> > > more reasonable.
> > 
> > Great.
> > 
> > > For EX_TYPE_UACCESS_ERR_ZERO, today we use it for kernel accesses in a
> > > couple of cases, such as
> > > get_user/futex/__user_cache_maint()/__user_swpX_asm(),
> > 
> > Those are all user accesses.
> > 
> > However, __get_kernel_nofault() and __put_kernel_nofault() use
> > EX_TYPE_UACCESS_ERR_ZERO by way of __{get,put}_mem_asm(), so we'd need to
> > refactor that code to split the user/kernel cases higher up the callchain.
> > 
> > > your suggestion is:
> > > get_user continues to use EX_TYPE_UACCESS_ERR_ZERO and the other cases use
> > > new type EX_TYPE_FIXUP_ERR_ZERO?
> > 
> > Yes, that's the rough shape. We could make the latter 
> > EX_TYPE_KACCESS_ERR_ZERO
> > to be clearly analogous to EX_TYPE_UACCESS_ERR_ZERO, and with that I 
> > susepct we
> > could remove EX_TYPE_FIXUP.
> > 
> > Thanks,
> > Mark.
> According to your suggestion, i think the definition is like this:
> 
> #define EX_TYPE_NONE0
> #define EX_TYPE_FIXUP   1--> delete
> #define EX_TYPE_BPF 2
> #define EX_TYPE_UACCESS_ERR_ZERO3
> #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD  4
> #define EX_TYPE_UACCESS   xx   --> add
> #define EX_TYPE_KACCESS_ERR_ZEROxx   --> add
> [The value defined by the macro here is temporary]

Almost; you don't need to add EX_TYPE_UACCESS here, as you can use
EX_TYPE_UACCESS_ERR_ZERO for that.

We already have:

| #define _ASM_EXTABLE_UACCESS_ERR(insn, fixup, err)\
| _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, wzr)

... and we can add:

| #define _ASM_EXTABLE_UACCESS(insn, fixup) \
| _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)


... and maybe we should use 'xzr' rather than 'wzr' for clarity.
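i.e. something like the below (assuming the xzr alias is wired up in the
gpr-num handling the same way wzr is):

| #define _ASM_EXTABLE_UACCESS(insn, fixup)			\
| 	_ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, xzr, xzr)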

> There are two points to modify:
> 
> 1、_get_kernel_nofault() and __put_kernel_nofault()  using
> EX_TYPE_KACCESS_ERR_ZERO, Other positions using EX_TYPE_UACCESS_ERR_ZERO
> keep unchanged.

That sounds right to me. This will require refactoring __raw_{get,put}_mem()
and __{get,put}_mem_asm().
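To make the direction concrete, one possible shape (purely illustrative; the
extra parameter and its spelling are a guess at the shape, not a finished
implementation) would be to have __get_mem_asm() take the access kind and
paste it into the extable annotation, so the uaccess wrappers pass U and
__get_kernel_nofault() passes K:

| #define __get_mem_asm(load, reg, x, addr, err, type)			\
| 	asm volatile(							\
| 	"1: " load " " reg "1, [%2]\n"					\
| 	"2:\n"								\
| 	_ASM_EXTABLE_##type##ACCESS_ERR_ZERO(1b, 2b, %w0, %w1)		\
| 	: "+r" (err), "=&r" (x)						\
| 	: "r" (addr))

... with the same treatment for __put_mem_asm().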

> 2、delete EX_TYPE_FIXUP.
> 
> There is no doubt about others. As for EX_TYPE_FIXUP, I think it needs to be
> retained, _cond_extable(EX_TYPE_FIXUP) is still in use in assembler.h.

We use _cond_extable for cache maintenance uaccesses, so those should be moved
over to EX_TYPE_UACCESS_ERR_ZERO. We can rename _cond_extable to
_cond_uaccess_extable for clarity.

That will require restructuring asm-extable.h a bit. If that turns out to be
painful I'm happy to take a look.

Thanks,
Mark.


Re: [PATCH -next v4 3/7] arm64: add support for machine check error safe

2022-05-25 Thread Mark Rutland
On Thu, May 19, 2022 at 02:29:54PM +0800, Tong Tiangen wrote:
> 
> 
> On 2022/5/13 23:26, Mark Rutland wrote:
> > On Wed, Apr 20, 2022 at 03:04:14AM +, Tong Tiangen wrote:
> > > During the processing of arm64 kernel hardware memory errors(do_sea()), if
> > > the errors is consumed in the kernel, the current processing is panic.
> > > However, it is not optimal.
> > > 
> > > Take uaccess for example, if the uaccess operation fails due to memory
> > > error, only the user process will be affected, kill the user process
> > > and isolate the user page with hardware memory errors is a better choice.
> > 
> > Conceptually, I'm fine with the idea of constraining what we do for a
> > true uaccess, but I don't like the implementation of this at all, and I
> > think we first need to clean up the arm64 extable usage to clearly
> > distinguish a uaccess from another access.
> 
> OK,using EX_TYPE_UACCESS and this extable type could be recover, this is
> more reasonable.

Great.

> For EX_TYPE_UACCESS_ERR_ZERO, today we use it for kernel accesses in a
> couple of cases, such as
> get_user/futex/__user_cache_maint()/__user_swpX_asm(), 

Those are all user accesses.

However, __get_kernel_nofault() and __put_kernel_nofault() use
EX_TYPE_UACCESS_ERR_ZERO by way of __{get,put}_mem_asm(), so we'd need to
refactor that code to split the user/kernel cases higher up the callchain.

> your suggestion is:
> get_user continues to use EX_TYPE_UACCESS_ERR_ZERO and the other cases use
> new type EX_TYPE_FIXUP_ERR_ZERO?

Yes, that's the rough shape. We could name the latter EX_TYPE_KACCESS_ERR_ZERO
so that it's clearly analogous to EX_TYPE_UACCESS_ERR_ZERO, and with that I
suspect we could remove EX_TYPE_FIXUP.

Thanks,
Mark.


Re: [PATCH -next v4 7/7] arm64: add cow to machine check safe

2022-05-13 Thread Mark Rutland
On Wed, Apr 20, 2022 at 03:04:18AM +, Tong Tiangen wrote:
> In the cow(copy on write) processing, the data of the user process is
> copied, when hardware memory error is encountered during copy, only the
> relevant processes are affected, so killing the user process and isolate
> the user page with hardware memory errors is a more reasonable choice than
> kernel panic.

There are plenty of other places we'll access user pages via a kernel
alias (e.g. when performing IO), so why is this special?

To be clear, I am not entirely averse to this, but it seems like this is
being done because it's easy to do rather than necessarily being all
that useful, and I'm not keen on having to duplicate a bunch of logic
for this.

> Add new helper copy_page_mc() which provide a page copy implementation with
> machine check safe. At present, only used in cow. In future, we can expand
> more scenes. As long as the consequences of page copy failure are not
> fatal(eg: only affect user process), we can use this helper.
> 
> The copy_page_mc() in copy_page_mc.S is largely borrows from copy_page()
> in copy_page.S and the main difference is copy_page_mc() add extable entry
> to every load/store insn to support machine check safe. largely to keep the
> patch simple. If needed those optimizations can be folded in.
> 
> Add new extable type EX_TYPE_COPY_PAGE_MC which used in copy_page_mc().
> 
> This type only be processed in fixup_exception_mc(), The reason is that
> copy_page_mc() is consistent with copy_page() except machine check safe is
> considered, and copy_page() do not need to consider exception fixup.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h |  5 ++
>  arch/arm64/include/asm/page.h| 10 
>  arch/arm64/lib/Makefile  |  2 +
>  arch/arm64/lib/copy_page_mc.S| 86 
>  arch/arm64/mm/copypage.c | 36 ++--
>  arch/arm64/mm/extable.c  |  2 +
>  include/linux/highmem.h  |  8 +++
>  mm/memory.c  |  2 +-
>  8 files changed, 144 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_page_mc.S
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 80410899a9ad..74c056ddae15 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -14,6 +14,7 @@
>  /* _MC indicates that can fixup from machine check errors */
>  #define EX_TYPE_UACCESS_MC   5
>  #define EX_TYPE_UACCESS_MC_ERR_ZERO  6
> +#define EX_TYPE_COPY_PAGE_MC 7
>  
>  #ifdef __ASSEMBLY__
>  
> @@ -42,6 +43,10 @@
>   __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_MC, 0)
>   .endm
>  
> + .macro  _asm_extable_copy_page_mc, insn, fixup
> + __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_COPY_PAGE_MC, 0)
> + .endm
> +
>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 993a27ea6f54..832571a7dddb 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -29,6 +29,16 @@ void copy_user_highpage(struct page *to, struct page *from,
>  void copy_highpage(struct page *to, struct page *from);
>  #define __HAVE_ARCH_COPY_HIGHPAGE
>  
> +#ifdef CONFIG_ARCH_HAS_COPY_MC
> +extern void copy_page_mc(void *to, const void *from);
> +void copy_highpage_mc(struct page *to, struct page *from);
> +#define __HAVE_ARCH_COPY_HIGHPAGE_MC
> +
> +void copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct vm_area_struct *vma);
> +#define __HAVE_ARCH_COPY_USER_HIGHPAGE_MC
> +#endif
> +
>  struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   unsigned long vaddr);
>  #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index 29490be2546b..0d9f292ef68a 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -15,6 +15,8 @@ endif
>  
>  lib-$(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) += uaccess_flushcache.o
>  
> +lib-$(CONFIG_ARCH_HAS_COPY_MC) += copy_page_mc.o
> +
>  obj-$(CONFIG_CRC32) += crc32.o
>  
>  obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
> diff --git a/arch/arm64/lib/copy_page_mc.S b/arch/arm64/lib/copy_page_mc.S
> new file mode 100644
> index ..655161363dcf
> --- /dev/null
> +++ b/arch/arm64/lib/copy_page_mc.S
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define CPY_MC(l, x...)  \
> +:   x;   \
> + _asm_extable_copy_page_mcb, l
> +
> +/*
> + * Copy a page from src to dest (both are page 

Re: [PATCH -next v4 6/7] arm64: add {get, put}_user to machine check safe

2022-05-13 Thread Mark Rutland
On Wed, Apr 20, 2022 at 03:04:17AM +, Tong Tiangen wrote:
> Add {get, put}_user() to machine check safe.
> 
> If get/put fail due to hardware memory error, only the relevant processes
> are affected, so killing the user process and isolate the user page with
> hardware memory errors is a more reasonable choice than kernel panic.
> 
> Add new extable type EX_TYPE_UACCESS_MC_ERR_ZERO which can be used for
> uaccess that can be recovered from hardware memory errors. The difference
> from EX_TYPE_UACCESS_MC is that this type also sets additional two target
> register which save error code and value needs to be set zero.

Why does this need to be in any way distinct from the existing
EX_TYPE_UACCESS_ERR_ZERO ?

Other than the case where we currently (ab)use that for
copy_{to,from}_kernel_nofault(), where do we *not* want to use
EX_TYPE_UACCESS_ERR_ZERO and *not* recover from a memory error?

Thanks,
Mark.

> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h | 14 ++
>  arch/arm64/include/asm/uaccess.h |  4 ++--
>  arch/arm64/mm/extable.c  |  4 
>  3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 75b2c00e9523..80410899a9ad 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -13,6 +13,7 @@
>  
>  /* _MC indicates that can fixup from machine check errors */
>  #define EX_TYPE_UACCESS_MC   5
> +#define EX_TYPE_UACCESS_MC_ERR_ZERO  6
>  
>  #ifdef __ASSEMBLY__
>  
> @@ -78,6 +79,15 @@
>  #define EX_DATA_REG(reg, gpr)
> \
>   "((.L__gpr_num_" #gpr ") << " __stringify(EX_DATA_REG_##reg##_SHIFT) ")"
>  
> +#define _ASM_EXTABLE_UACCESS_MC_ERR_ZERO(insn, fixup, err, zero) 
> \
> + __DEFINE_ASM_GPR_NUMS   
> \
> + __ASM_EXTABLE_RAW(#insn, #fixup,
> \
> +   __stringify(EX_TYPE_UACCESS_MC_ERR_ZERO), 
> \
> +   "("   
> \
> + EX_DATA_REG(ERR, err) " | " 
> \
> + EX_DATA_REG(ZERO, zero) 
> \
> +   ")")
> +
>  #define _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, zero)
> \
>   __DEFINE_ASM_GPR_NUMS   \
>   __ASM_EXTABLE_RAW(#insn, #fixup,\
> @@ -90,6 +100,10 @@
>  #define _ASM_EXTABLE_UACCESS_ERR(insn, fixup, err)   \
>   _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, err, wzr)
>  
> +
> +#define _ASM_EXTABLE_UACCESS_MC_ERR(insn, fixup, err)
> \
> + _ASM_EXTABLE_UACCESS_MC_ERR_ZERO(insn, fixup, err, wzr)
> +
>  #define EX_DATA_REG_DATA_SHIFT   0
>  #define EX_DATA_REG_DATA GENMASK(4, 0)
>  #define EX_DATA_REG_ADDR_SHIFT   5
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index e8dce0cc5eaa..e41b47df48b0 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -236,7 +236,7 @@ static inline void __user *__uaccess_mask_ptr(const void 
> __user *ptr)
>   asm volatile(   \
>   "1: " load "" reg "1, [%2]\n"   \
>   "2:\n"  \
> - _ASM_EXTABLE_UACCESS_ERR_ZERO(1b, 2b, %w0, %w1) \
> + _ASM_EXTABLE_UACCESS_MC_ERR_ZERO(1b, 2b, %w0, %w1)  \
>   : "+r" (err), "=" (x) \
>   : "r" (addr))
>  
> @@ -325,7 +325,7 @@ do {  
> \
>   asm volatile(   \
>   "1: " store "   " reg "1, [%2]\n"   \
>   "2:\n"  \
> - _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)   \
> + _ASM_EXTABLE_UACCESS_MC_ERR(1b, 2b, %w0)\
>   : "+r" (err)\
>   : "r" (x), "r" (addr))
>  
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 525876c3ebf4..1023ccdb2f89 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -88,6 +88,7 @@ bool fixup_exception(struct pt_regs *regs)
>   case EX_TYPE_BPF:
>   return ex_handler_bpf(ex, regs);
>   case EX_TYPE_UACCESS_ERR_ZERO:
> + case EX_TYPE_UACCESS_MC_ERR_ZERO:
>   return ex_handler_uaccess_err_zero(ex, regs);
>   case EX_TYPE_LOAD_UNALIGNED_ZEROPAD:
>   return 

Re: [PATCH -next v4 5/7] arm64: mte: Clean up user tag accessors

2022-05-13 Thread Mark Rutland
On Wed, Apr 20, 2022 at 03:04:16AM +, Tong Tiangen wrote:
> From: Robin Murphy 
> 
> Invoking user_ldst to explicitly add a post-increment of 0 is silly.
> Just use a normal USER() annotation and save the redundant instruction.
> 
> Signed-off-by: Robin Murphy 
> Reviewed-by: Tong Tiangen 

When posting someone else's patch, you need to add your own
Signed-off-by tag. Please see:

  
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin

That said, the patch itself looks sane, and matches its original posting
at:

  
https://lore.kernel.org/linux-arm-kernel/38c6d4b5-a3db-5c3e-02e7-39875edb3...@arm.com/

So:

  Acked-by: Mark Rutland 

Catalin, are you happy to pick up this patch as a cleanup?

Thanks,
Mark.

> ---
>  arch/arm64/lib/mte.S | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> index 8590af3c98c0..eeb9e45bcce8 100644
> --- a/arch/arm64/lib/mte.S
> +++ b/arch/arm64/lib/mte.S
> @@ -93,7 +93,7 @@ SYM_FUNC_START(mte_copy_tags_from_user)
>   mov x3, x1
>   cbz x2, 2f
>  1:
> - user_ldst 2f, ldtrb, w4, x1, 0
> +USER(2f, ldtrb   w4, [x1])
>   lsl x4, x4, #MTE_TAG_SHIFT
>   stg x4, [x0], #MTE_GRANULE_SIZE
>   add x1, x1, #1
> @@ -120,7 +120,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
>  1:
>   ldg x4, [x1]
>   ubfxx4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
> - user_ldst 2f, sttrb, w4, x0, 0
> +USER(2f, sttrb   w4, [x0])
>   add x0, x0, #1
>   add x1, x1, #MTE_GRANULE_SIZE
>   subsx2, x2, #1
> -- 
> 2.25.1
> 


Re: [PATCH -next v4 4/7] arm64: add copy_{to, from}_user to machine check safe

2022-05-13 Thread Mark Rutland
On Wed, Apr 20, 2022 at 03:04:15AM +, Tong Tiangen wrote:
> Add copy_{to, from}_user() to machine check safe.
> 
> If copy fail due to hardware memory error, only the relevant processes are
> affected, so killing the user process and isolate the user page with
> hardware memory errors is a more reasonable choice than kernel panic.
> 
> Add new extable type EX_TYPE_UACCESS_MC which can be used for uaccess that
> can be recovered from hardware memory errors.

I don't understand why we need this.

If we apply EX_TYPE_UACCESS consistently to *all* user accesses, and
*only* to user accesses, that would *always* indicate that we can
recover, and that seems much simpler to deal with.

Today we use EX_TYPE_UACCESS_ERR_ZERO for kernel accesses in a couple of
cases, which we should clean up, and we use EX_TYPE_FIXUP for a couple
of user accesses, but those could easily be converted over.
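With that cleanup done, the machine-check fixup path could be little more than
the below (a sketch of the shape only, reusing the existing
ex_handler_uaccess_err_zero() handler; not a finished implementation):

	bool fixup_exception_mc(struct pt_regs *regs)
	{
		const struct exception_table_entry *ex;

		ex = search_exception_tables(instruction_pointer(regs));
		if (!ex)
			return false;

		switch (ex->type) {
		case EX_TYPE_UACCESS_ERR_ZERO:
			/* A true uaccess; safe to recover from a memory error. */
			return ex_handler_uaccess_err_zero(ex, regs);
		}

		return false;
	}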

> The x16 register is used to save the fixup type in copy_xxx_user which
> used extable type EX_TYPE_UACCESS_MC.

Why x16?

How is this intended to be consumed, and why is that behaviour different
from any *other* fault?

Mark.

> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h | 14 ++
>  arch/arm64/include/asm/asm-uaccess.h | 15 ++-
>  arch/arm64/lib/copy_from_user.S  | 18 +++---
>  arch/arm64/lib/copy_to_user.S| 18 +++---
>  arch/arm64/mm/extable.c  | 18 ++
>  5 files changed, 60 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index c39f2437e08e..75b2c00e9523 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -2,12 +2,18 @@
>  #ifndef __ASM_ASM_EXTABLE_H
>  #define __ASM_ASM_EXTABLE_H
>  
> +#define FIXUP_TYPE_NORMAL0
> +#define FIXUP_TYPE_MC1
> +
>  #define EX_TYPE_NONE 0
>  #define EX_TYPE_FIXUP1
>  #define EX_TYPE_BPF  2
>  #define EX_TYPE_UACCESS_ERR_ZERO 3
>  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   4
>  
> +/* _MC indicates that can fixup from machine check errors */
> +#define EX_TYPE_UACCESS_MC   5
> +
>  #ifdef __ASSEMBLY__
>  
>  #define __ASM_EXTABLE_RAW(insn, fixup, type, data)   \
> @@ -27,6 +33,14 @@
>   __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_FIXUP, 0)
>   .endm
>  
> +/*
> + * Create an exception table entry for `insn`, which will branch to `fixup`
> + * when an unhandled fault(include sea fault) is taken.
> + */
> + .macro  _asm_extable_uaccess_mc, insn, fixup
> + __ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_MC, 0)
> + .endm
> +
>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/asm-uaccess.h 
> b/arch/arm64/include/asm/asm-uaccess.h
> index 0557af834e03..6c23c138e1fc 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -63,6 +63,11 @@ alternative_else_nop_endif
>  :x;  \
>   _asm_extableb, l
>  
> +
> +#define USER_MC(l, x...) \
> +:x;  \
> + _asm_extable_uaccess_mc b, l
> +
>  /*
>   * Generate the assembly for LDTR/STTR with exception table entries.
>   * This is complicated as there is no post-increment or pair versions of the
> @@ -73,8 +78,8 @@ alternative_else_nop_endif
>  8889:ldtr\reg2, [\addr, #8];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extableb,\l;
> - _asm_extable8889b,\l;
> + _asm_extable_uaccess_mc b, \l;
> + _asm_extable_uaccess_mc 8889b, \l;
>   .endm
>  
>   .macro user_stp l, reg1, reg2, addr, post_inc
> @@ -82,14 +87,14 @@ alternative_else_nop_endif
>  8889:sttr\reg2, [\addr, #8];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extableb,\l;
> - _asm_extable8889b,\l;
> + _asm_extable_uaccess_mc b,\l;
> + _asm_extable_uaccess_mc 8889b,\l;
>   .endm
>  
>   .macro user_ldst l, inst, reg, addr, post_inc
>  :\inst   \reg, [\addr];
>   add \addr, \addr, \post_inc;
>  
> - _asm_extableb,\l;
> + _asm_extable_uaccess_mc b, \l;
>   .endm
>  #endif
> diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
> index 34e317907524..480cc5ac0a8d 100644
> --- a/arch/arm64/lib/copy_from_user.S
> +++ b/arch/arm64/lib/copy_from_user.S
> @@ -25,7 +25,7 @@
>   .endm
>  
>   .macro strb1 reg, ptr, val
> - strb \reg, [\ptr], \val
> + USER_MC(9998f, strb \reg, [\ptr], \val)
>   .endm
>  
>   .macro 

Re: [PATCH -next v4 3/7] arm64: add support for machine check error safe

2022-05-13 Thread Mark Rutland
On Wed, Apr 20, 2022 at 03:04:14AM +, Tong Tiangen wrote:
> During the processing of arm64 kernel hardware memory errors(do_sea()), if
> the errors is consumed in the kernel, the current processing is panic.
> However, it is not optimal.
> 
> Take uaccess for example, if the uaccess operation fails due to memory
> error, only the user process will be affected, kill the user process
> and isolate the user page with hardware memory errors is a better choice.

Conceptually, I'm fine with the idea of constraining what we do for a
true uaccess, but I don't like the implementation of this at all, and I
think we first need to clean up the arm64 extable usage to clearly
distinguish a uaccess from another access.

> This patch only enable machine error check framework, it add exception
> fixup before kernel panic in do_sea() and only limit the consumption of
> hardware memory errors in kernel mode triggered by user mode processes.
> If fixup successful, panic can be avoided.
> 
> Consistent with PPC/x86, it is implemented by CONFIG_ARCH_HAS_COPY_MC.
> 
> Also add copy_mc_to_user() in include/linux/uaccess.h, this helper is
> called when CONFIG_ARCH_HAS_COPOY_MC is open.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/extable.h |  1 +
>  arch/arm64/mm/extable.c  | 17 +
>  arch/arm64/mm/fault.c| 27 ++-
>  include/linux/uaccess.h  |  9 +
>  5 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d9325dd95eba..012e38309955 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -19,6 +19,7 @@ config ARM64
>   select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
>   select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
>   select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
>   select ARCH_HAS_CURRENT_STACK_POINTER
>   select ARCH_HAS_DEBUG_VIRTUAL
>   select ARCH_HAS_DEBUG_VM_PGTABLE
> diff --git a/arch/arm64/include/asm/extable.h 
> b/arch/arm64/include/asm/extable.h
> index 72b0e71cc3de..f80ebd0addfd 100644
> --- a/arch/arm64/include/asm/extable.h
> +++ b/arch/arm64/include/asm/extable.h
> @@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
>  #endif /* !CONFIG_BPF_JIT */
>  
>  bool fixup_exception(struct pt_regs *regs);
> +bool fixup_exception_mc(struct pt_regs *regs);
>  #endif
> diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> index 489455309695..4f0083a550d4 100644
> --- a/arch/arm64/mm/extable.c
> +++ b/arch/arm64/mm/extable.c
> @@ -9,6 +9,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  static inline unsigned long
>  get_ex_fixup(const struct exception_table_entry *ex)
> @@ -84,3 +85,19 @@ bool fixup_exception(struct pt_regs *regs)
>  
>   BUG();
>  }
> +
> +bool fixup_exception_mc(struct pt_regs *regs)
> +{
> + const struct exception_table_entry *ex;
> +
> + ex = search_exception_tables(instruction_pointer(regs));
> + if (!ex)
> + return false;
> +
> + /*
> +  * This is not complete, More Machine check safe extable type can
> +  * be processed here.
> +  */
> +
> + return false;
> +}

This is at best misnamed; it doesn't actually apply the fixup, it just
searches for one.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 77341b160aca..a9e6fb1999d1 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -695,6 +695,29 @@ static int do_bad(unsigned long far, unsigned int esr, 
> struct pt_regs *regs)
>   return 1; /* "fault" */
>  }
>  
> +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> +  struct pt_regs *regs, int sig, int code)
> +{
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> + return false;
> +
> + if (user_mode(regs) || !current->mm)
> + return false;
> +
> + if (apei_claim_sea(regs) < 0)
> + return false;
> +
> + if (!fixup_exception_mc(regs))
> + return false;
> +
> + set_thread_esr(0, esr);
> +
> + arm64_force_sig_fault(sig, code, addr,
> + "Uncorrected hardware memory error in kernel-access\n");
> +
> + return true;
> +}
> +
>  static int do_sea(unsigned long far, unsigned int esr, struct pt_regs *regs)
>  {
>   const struct fault_info *inf;
> @@ -720,7 +743,9 @@ static int do_sea(unsigned long far, unsigned int esr, 
> struct pt_regs *regs)
>*/
>   siaddr  = untagged_addr(far);
>   }
> - arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
> +
> + if (!arm64_do_kernel_sea(siaddr, esr, regs, inf->sig, inf->code))
> + arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, 
> esr);
>  
>   return 0;
>  }
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 

Re: [PATCH] bug: Use normal relative pointers in 'struct bug_entry'

2022-05-06 Thread Mark Rutland
On Thu, May 05, 2022 at 06:09:45PM -0700, Josh Poimboeuf wrote:
> With CONFIG_GENERIC_BUG_RELATIVE_POINTERS, the addr/file relative
> pointers are calculated weirdly: based on the beginning of the bug_entry
> struct address, rather than their respective pointer addresses.
> 
> Make the relative pointers less surprising to both humans and tools by
> calculating them the normal way.
> 
> Signed-off-by: Josh Poimboeuf 

This looks good to me.

Just in case, I gave this a spin on arm64 defconfig atop v5.18-rc4. This builds
cleanly with both GCC 11.1.0 and LLVM 14.0.0, and works correctly in testing
on both with the LKDTM BUG/WARNING/WARNING_MESSAGE tests, i.e.

  echo WARNING > /sys/kernel/debug/provoke-crash/DIRECT
  echo WARNING_MESSAGE > /sys/kernel/debug/provoke-crash/DIRECT
  echo BUG > /sys/kernel/debug/provoke-crash/DIRECT

FWIW:

Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland  [arm64]

As an aside (and for anyone else trying to duplicate my results), on arm64
there's a latent issue (prior to this patch) where BUG() will always result in
a WARN_ON_ONCE() in rcu_eqs_enter(). Since BUG() uses a BRK, and we treat the
BRK exception as an NMI, when we kill the task we do that in NMI context, but
schedule another task in regular task context, and RCU doesn't like that:

# echo BUG > /sys/kernel/debug/provoke-crash/DIRECT
[   28.284180] lkdtm: Performing direct entry BUG
[   28.285052] [ cut here ]
[   28.285940] kernel BUG at drivers/misc/lkdtm/bugs.c:78!
[   28.287008] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   28.288143] Modules linked in:
[   28.288798] CPU: 0 PID: 151 Comm: bash Not tainted 5.18.0-rc4 #1
[   28.290040] Hardware name: linux,dummy-virt (DT)
[   28.290979] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   28.292380] pc : lkdtm_BUG+0x4/0xc
[   28.293084] lr : lkdtm_do_action+0x24/0x30
[   28.293923] sp : 883bbce0
[   28.294624] x29: 883bbce0 x28: 3c574344 x27: 
[   28.296057] x26:  x25: a4bc19c480b0 x24: 883bbdf0
[   28.297493] x23: 0004 x22: 3c57440f3000 x21: a4bc1a0bfba0
[   28.298933] x20: a4bc19c480c0 x19: 0001 x18: 
[   28.300369] x17:  x16:  x15: 0720072007200720
[   28.301823] x14: 0720072007200747 x13: a4bc1a8d2520 x12: 03b1
[   28.303257] x11: 013b x10: a4bc1a92a520 x9 : a4bc1a8d2520
[   28.304689] x8 : efff x7 : a4bc1a92a520 x6 : 
[   28.306120] x5 :  x4 : 3c57bfbcc9e8 x3 : 
[   28.307550] x2 :  x1 : 3c574344 x0 : a4bc19279284
[   28.308981] Call trace:
[   28.309496]  lkdtm_BUG+0x4/0xc
[   28.310134]  direct_entry+0x11c/0x1cc
[   28.310888]  full_proxy_write+0x60/0xbc
[   28.311690]  vfs_write+0xc4/0x2a4
[   28.312383]  ksys_write+0x68/0xf4
[   28.313056]  __arm64_sys_write+0x20/0x2c
[   28.313851]  invoke_syscall+0x48/0x114
[   28.314623]  el0_svc_common.constprop.0+0xd4/0xfc
[   28.315584]  do_el0_svc+0x28/0x90
[   28.316276]  el0_svc+0x34/0xb0
[   28.316917]  el0t_64_sync_handler+0xa4/0x130
[   28.317786]  el0t_64_sync+0x18c/0x190
[   28.318560] Code: b90027e0 17ea 941b4d4c d503245f (d421) 
[   28.319796] ---[ end trace  ]---
[   28.320736] note: bash[151] exited with preempt_count 1
[   28.329377] [ cut here ]
[   28.330327] WARNING: CPU: 0 PID: 0 at kernel/rcu/tree.c:624 
rcu_eqs_enter.constprop.0+0x7c/0x84
[   28.332103] Modules linked in:
[   28.332757] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  D   
5.18.0-rc4 #1
[   28.334355] Hardware name: linux,dummy-virt (DT)
[   28.335318] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   28.336745] pc : rcu_eqs_enter.constprop.0+0x7c/0x84
[   28.337766] lr : rcu_idle_enter+0x10/0x1c
[   28.338609] sp : a4bc1a8b3d40
[   28.339309] x29: a4bc1a8b3d40 x28: 41168458 x27: 
[   28.340788] x26: a4bc1a8c3340 x25:  x24: 
[   28.342255] x23: a4bc1a8b9b4c x22: a4bc1a37a6f8 x21: a4bc1a8b9a38
[   28.343705] x20: a4bc1a8b9b40 x19: 3c57bfbd4800 x18: 
[   28.345159] x17:  x16:  x15: 06a1d2912376
[   28.346632] x14: 018a x13: 018a x12: 
[   28.348089] x11: 0001 x10: 0a50 x9 : a4bc1a8b3ce0
[   28.349551] x8 : a4bc1a8c3df0 x7 : 3c57bfbd3b80 x6 : 000154de2486
[   28.351040] x5 : 03ff x4 : 0a5c x3 : a4bc1a8b79c0
[   28.352505] x2 : 0a5c x1 : 4002 x0 : 4000
[   28.353966] Call trace:
[   28.354496]  rcu_eqs_enter.constprop.0+0x7c/0x84
[   28.355467]  rcu_idle_enter+0x10/0x1c
[   28.356230]  default_idle_call+0x20/0x6c
[   28.357061

Re: [PATCH RFC 2/8] arm64: stacktrace: Add arch_within_stack_frames

2022-04-19 Thread Mark Rutland
Hi,

On Mon, Apr 18, 2022 at 09:22:11PM +0800, He Zhe wrote:
> This function checks if the given address range crosses frame boundary.

I don't think that's quite true, because arm64's procedure call standard
(AAPCS64) doesn't give us enough information to determine this without
additional metadata from the compiler, which we simply don't have today.

Since there's a lot of confusion in this area, I've made a bit of an info dump
below, before reviewing the patch itself, but TBH I'm struggling to see that
this is all that useful.

On arm64, we use a calling convention called AAPCS64, (in full: "Procedure Call
Standard for the Arm® 64-bit Architecture (AArch64)"). That's maintained at:

  https://github.com/ARM-software/abi-aa

... with the latest release (as of today) at:

  
https://github.com/ARM-software/abi-aa/blob/60a8eb8c55e999d74dac5e368fc9d7e36e38dda4/aapcs64/aapcs64.rst
  https://github.com/ARM-software/abi-aa/releases/download/2022Q1/aapcs64.pdf

In AAPCS64, there are two related but distinct things to be aware of:

* The "stack frame" of a function, which is the entire contiguous region of
  stack memory used by a function.

* The "frame record", which is the saved FP and LR placed *somewhere* within
  the function's stack frame. The FP points at the most recent frame record on
  the stack, and at function call boundaries points at the caller's frame
  record.

AAPCS64 doesn't say *where* a frame record is placed within a stack frame, and
there are reasons for compilers to place data both above and below it. So in
general, a function's stack frame looks like:
  
+=+
|  above  |
|-|
| FP | LR |
|-|
|  below  |
+=+

... where the "above" or "below" portions might be any size (even 0 bytes).

Typical code generation today means for most functions that the "below" portion
is 0 bytes in size, but this is not guaranteed, and even today there are cases
where this is not true.

When one function calls another without a stack transition, that looks like:

+=+ ___
|  above  |\
|-||
 ,->| FP | LR |+-- Caller's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+ ___ 
 |  |  above  |\
 |  |-||
 '--| FP | LR |+-- Callee's stack frame
|-||
|  below  | ___/
+=+

Where there's a stack transition, and the new stack is at a *lower* VA than the
old stack, that looks like:

+=+ ___
|  above  |\
|-||
 ,->| FP | LR |+-- Caller's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+
 | 
 |  ~~~
 |  Arbitrarily 
 |  large gap,
 |  potentially
 |  including
 |  other data
 |  ~~~
 |
 |  +=+ ___ 
 |  |  above  |\
 |  |-||
 '--| FP | LR |+-- Callee's stack frame
|-||
|  below  | ___/
+=+

Where there's a stack transition, and the new stack is at a *higher* VA than
the old stack, that looks like:

+=+ ___ 
|  above  |\
|-||
 ,--| FP | LR |+-- Callee's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+
 |
 |  ~~~
 |  Arbitrarily 
 |  large gap,
 |  potentially
 |  including
 |  other data
 |  ~~~
 | 
 |  +=+ ___
 |  |  above  |\
 |  |-||
 '->| FP | LR |+-- Caller's stack frame
|-||
|  below  | ___/
+=+
 
In all of these cases, we *cannot* identify the boundary between the two stack
frames; we can *only* identify where something overlaps a frame record. That
might itself be a good thing, but it's not the same thing as what you describe
in the commit message.

> It is based on the existing x86 algorithm, but implemented via stacktrace.
> This can be tested by USERCOPY_STACK_FRAME_FROM and
> USERCOPY_STACK_FRAME_TO in lkdtm.

Can you please explain *why* we'd want this?

Who do we expect to use this?

What's the overhead in practice?

Has this passed a more realistic stress test (e.g. running some userspace
applications which make intensive use of copies to/from the kernel)?

> 
> Signed-off-by: He Zhe 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/thread_info.h | 12 +
>  arch/arm64/kernel/stacktrace.c   | 76 ++--
>  3 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 57c4c995965f..0f52a83d7771 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -165,6 +165,7 @@ config ARM64
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select HAVE_ARCH_VMAP_STACK
> + select 

Re: [PATCH RFC 1/8] stacktrace: Change callback prototype to pass more information

2022-04-19 Thread Mark Rutland
On Mon, Apr 18, 2022 at 09:22:10PM +0800, He Zhe wrote:
> Currently stack_trace_consume_fn can only have pc of each frame of the
> stack. Copying-beyond-the-frame-detection also needs fp of current and
> previous frame. Other detection algorithm in the future may need more
> information of the frame.
> 
> We define a frame_info to include them all.
> 
> 
> Signed-off-by: He Zhe 
> ---
>  include/linux/stacktrace.h |  9 -
>  kernel/stacktrace.c| 10 +-
>  2 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 97455880ac41..5a61bfafe6f0 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -10,15 +10,22 @@ struct pt_regs;
>  
>  #ifdef CONFIG_ARCH_STACKWALK
>  
> +struct frame_info {
> + unsigned long pc;
> + unsigned long fp;
> + unsigned long prev_fp;
> +};

I don't think this should be exposed through a generic interface; the `fp` and
`prev_fp` values are only meaningful with arch-specific knowledge, and they're
*very* easy to misuse (e.g. when transitioning from one stack to another).
There's also a bunch of other information one may or may not want, depending on
what you're trying to achieve.

I am happy to have an arch-specific internal unwinder where we can access this
information, and *maybe* it makes sense to have a generic API that passes some
opaque token, but I don't think we should make the structure generic.
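e.g. (names and shape entirely made up, just to illustrate the "opaque token"
idea):

	struct arch_stack_frame;	/* contents visible to arch code only */

	typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr,
					       const struct arch_stack_frame *frame);

Generic consumers would ignore the token entirely, and anything arch-specific
would go through arch-provided accessors rather than poking at raw fp values.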

Thanks,
Mark.

> +
>  /**
>   * stack_trace_consume_fn - Callback for arch_stack_walk()
>   * @cookie:  Caller supplied pointer handed back by arch_stack_walk()
>   * @addr:The stack entry address to consume
> + * @fi:  The frame information to consume
>   *
>   * Return:   True, if the entry was consumed or skipped
>   *   False, if there is no space left to store
>   */
> -typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
> +typedef bool (*stack_trace_consume_fn)(void *cookie, struct frame_info *fi);
>  /**
>   * arch_stack_walk - Architecture specific function to walk the stack
>   * @consume_entry:   Callback which is invoked by the architecture code for
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 9ed5ce989415..2d0a2812e92b 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -79,7 +79,7 @@ struct stacktrace_cookie {
>   unsigned intlen;
>  };
>  
> -static bool stack_trace_consume_entry(void *cookie, unsigned long addr)
> +static bool stack_trace_consume_entry(void *cookie, struct frame_info *fi)
>  {
>   struct stacktrace_cookie *c = cookie;
>  
> @@ -90,15 +90,15 @@ static bool stack_trace_consume_entry(void *cookie, 
> unsigned long addr)
>   c->skip--;
>   return true;
>   }
> - c->store[c->len++] = addr;
> + c->store[c->len++] = fi->pc;
>   return c->len < c->size;
>  }
>  
> -static bool stack_trace_consume_entry_nosched(void *cookie, unsigned long 
> addr)
> +static bool stack_trace_consume_entry_nosched(void *cookie, struct 
> frame_info *fi)
>  {
> - if (in_sched_functions(addr))
> + if (in_sched_functions(fi->pc))
>   return true;
> - return stack_trace_consume_entry(cookie, addr);
> + return stack_trace_consume_entry(cookie, fi);
>  }
>  
>  /**
> -- 
> 2.25.1
> 


Re: [PATCH 08/14] arm64: simplify access_ok()

2022-02-15 Thread Mark Rutland
On Tue, Feb 15, 2022 at 09:30:41AM +, David Laight wrote:
> From: Ard Biesheuvel
> > Sent: 15 February 2022 08:18
> > 
> > On Mon, 14 Feb 2022 at 17:37, Arnd Bergmann  wrote:
> > >
> > > From: Arnd Bergmann 
> > >
> > > arm64 has an inline asm implementation of access_ok() that is derived from
> > > the 32-bit arm version and optimized for the case that both the limit and
> > > the size are variable. With set_fs() gone, the limit is always constant,
> > > and the size usually is as well, so just using the default implementation
> > > reduces the check into a comparison against a constant that can be
> > > scheduled by the compiler.
> > >
> > > On a defconfig build, this saves over 28KB of .text.
> > >
> > > Signed-off-by: Arnd Bergmann 
> > > ---
> > >  arch/arm64/include/asm/uaccess.h | 28 +---
> > >  1 file changed, 5 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/uaccess.h 
> > > b/arch/arm64/include/asm/uaccess.h
> > > index 357f7bd9c981..e8dce0cc5eaa 100644
> > > --- a/arch/arm64/include/asm/uaccess.h
> > > +++ b/arch/arm64/include/asm/uaccess.h
> > > @@ -26,6 +26,8 @@
> > >  #include 
> > >  #include 
> > >
> > > +static inline int __access_ok(const void __user *ptr, unsigned long 
> > > size);
> > > +
> > >  /*
> > >   * Test whether a block of memory is a valid user space address.
> > >   * Returns 1 if the range is valid, 0 otherwise.
> > > @@ -33,10 +35,8 @@
> > >   * This is equivalent to the following test:
> > >   * (u65)addr + (u65)size <= (u65)TASK_SIZE_MAX
> > >   */
> > > -static inline unsigned long __access_ok(const void __user *addr, 
> > > unsigned long size)
> > > +static inline int access_ok(const void __user *addr, unsigned long size)
> > >  {
> > > -   unsigned long ret, limit = TASK_SIZE_MAX - 1;
> > > -
> > > /*
> > >  * Asynchronous I/O running in a kernel thread does not have the
> > >  * TIF_TAGGED_ADDR flag of the process owning the mm, so always 
> > > untag
> > > @@ -46,27 +46,9 @@ static inline unsigned long __access_ok(const void 
> > > __user *addr, unsigned long s
> > > (current->flags & PF_KTHREAD || 
> > > test_thread_flag(TIF_TAGGED_ADDR)))
> > > addr = untagged_addr(addr);
> > >
> > > -   __chk_user_ptr(addr);
> > > -   asm volatile(
> > > -   // A + B <= C + 1 for all A,B,C, in four easy steps:
> > > -   // 1: X = A + B; X' = X % 2^64
> > > -   "   adds%0, %3, %2\n"
> > > -   // 2: Set C = 0 if X > 2^64, to guarantee X' > C in step 4
> > > -   "   csel%1, xzr, %1, hi\n"
> > > -   // 3: Set X' = ~0 if X >= 2^64. For X == 2^64, this decrements X'
> > > -   //to compensate for the carry flag being set in step 4. For
> > > -   //X > 2^64, X' merely has to remain nonzero, which it does.
> > > -   "   csinv   %0, %0, xzr, cc\n"
> > > -   // 4: For X < 2^64, this gives us X' - C - 1 <= 0, where the -1
> > > -   //comes from the carry in being clear. Otherwise, we are
> > > -   //testing X' - C == 0, subject to the previous adjustments.
> > > -   "   sbcsxzr, %0, %1\n"
> > > -   "   cset%0, ls\n"
> > > -   : "=" (ret), "+r" (limit) : "Ir" (size), "0" (addr) : "cc");
> > > -
> > > -   return ret;
> > > +   return likely(__access_ok(addr, size));
> > >  }
> > > -#define __access_ok __access_ok
> > > +#define access_ok access_ok
> > >
> > >  #include 
> > >
> > > --
> > > 2.29.2
> > >
> > 
> > With set_fs() out of the picture, wouldn't it be sufficient to check
> > that bit #55 is clear? (the bit that selects between TTBR0 and TTBR1)
> > That would also remove the need to strip the tag from the address.
> > 
> > Something like
> > 
> > asm goto("tbnz  %0, #55, %2 \n"
> >  "tbnz  %1, #55, %2 \n"
> >  :: "r"(addr), "r"(addr + size - 1) :: notok);
> > return 1;
> > notok:
> > return 0;
> > 
> > with an additional sanity check on the size which the compiler could
> > eliminate for compile-time constant values.
> 
> Is there are reason not to just use:
>   size < 1u << 48 && !((addr | (addr + size - 1)) & 1u << 55)

That has a few problems, including being an ABI change for tasks not using the
relaxed tag ABI and not working for 52-bit VAs.

If we really want to relax the tag checking aspect, there are simpler options,
including variations on Ard's approach above.

> Ugg, is arm64 addressing as horrid as it looks - with the 'kernel'
> bit in the middle of the virtual address space?

It's just sign-extension/canonical addressing, except bits [63:56] are
configurable between a few uses, so the architecture says bit 55 is the one to
look at in all configurations to figure out if an address is high/low (in
addition to checking the remaining bits are canonical).

Thanks,
Mark.


Re: [PATCH 08/14] arm64: simplify access_ok()

2022-02-15 Thread Mark Rutland
On Mon, Feb 14, 2022 at 05:34:46PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> arm64 has an inline asm implementation of access_ok() that is derived from
> the 32-bit arm version and optimized for the case that both the limit and
> the size are variable. With set_fs() gone, the limit is always constant,
> and the size usually is as well, so just using the default implementation
> reduces the check into a comparison against a constant that can be
> scheduled by the compiler.
> 
> On a defconfig build, this saves over 28KB of .text.
> 
> Signed-off-by: Arnd Bergmann 

I had a play around with this and a number of alternative options that had
previously been discussed (e.g. using uint128_t for the check to allow the
compiler to use the carry flag), and:

* Any sequences which were significantly simpler involved an ABI change (e.g. not
  checking tags for tasks not using the relaxed tag ABI), or didn't interact
  well with the uaccess pointer masking we do for speculation hardening.

* For all constant-size cases, this was joint-best for codegen.

* For variable-size cases the difference between options (which did not change
  ABI or break pointer masking) fell in the noise and really depended on what
  you were optimizing for.

This patch itself is clear; I believe the logic is sound and does not result in
a behavioural change, so for this as-is:

Acked-by: Mark Rutland 

As on other replies, I think that if we want to make further changes to this,
we should do that as follow-ups, since there are a number of subtleties in this
area w.r.t. tag management and speculation with potential ABI implications.

Thanks,
Mark.

> ---
>  arch/arm64/include/asm/uaccess.h | 28 +---
>  1 file changed, 5 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 357f7bd9c981..e8dce0cc5eaa 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -26,6 +26,8 @@
>  #include 
>  #include 
>  
> +static inline int __access_ok(const void __user *ptr, unsigned long size);
> +
>  /*
>   * Test whether a block of memory is a valid user space address.
>   * Returns 1 if the range is valid, 0 otherwise.
> @@ -33,10 +35,8 @@
>   * This is equivalent to the following test:
>   * (u65)addr + (u65)size <= (u65)TASK_SIZE_MAX
>   */
> -static inline unsigned long __access_ok(const void __user *addr, unsigned 
> long size)
> +static inline int access_ok(const void __user *addr, unsigned long size)
>  {
> - unsigned long ret, limit = TASK_SIZE_MAX - 1;
> -
>   /*
>* Asynchronous I/O running in a kernel thread does not have the
>* TIF_TAGGED_ADDR flag of the process owning the mm, so always untag
> @@ -46,27 +46,9 @@ static inline unsigned long __access_ok(const void __user 
> *addr, unsigned long s
>   (current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
>   addr = untagged_addr(addr);
>  
> - __chk_user_ptr(addr);
> - asm volatile(
> - // A + B <= C + 1 for all A,B,C, in four easy steps:
> - // 1: X = A + B; X' = X % 2^64
> - "   adds%0, %3, %2\n"
> - // 2: Set C = 0 if X > 2^64, to guarantee X' > C in step 4
> - "   csel%1, xzr, %1, hi\n"
> - // 3: Set X' = ~0 if X >= 2^64. For X == 2^64, this decrements X'
> - //to compensate for the carry flag being set in step 4. For
> - //X > 2^64, X' merely has to remain nonzero, which it does.
> - "   csinv   %0, %0, xzr, cc\n"
> - // 4: For X < 2^64, this gives us X' - C - 1 <= 0, where the -1
> - //comes from the carry in being clear. Otherwise, we are
> - //testing X' - C == 0, subject to the previous adjustments.
> - "   sbcsxzr, %0, %1\n"
> - "   cset%0, ls\n"
> - : "=" (ret), "+r" (limit) : "Ir" (size), "0" (addr) : "cc");
> -
> - return ret;
> + return likely(__access_ok(addr, size));
>  }
> -#define __access_ok __access_ok
> +#define access_ok access_ok
>  
>  #include 
>  
> -- 
> 2.29.2
> 


Re: [PATCH 07/14] uaccess: generalize access_ok()

2022-02-15 Thread Mark Rutland
On Mon, Feb 14, 2022 at 05:34:45PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> There are many different ways that access_ok() is defined across
> architectures, but in the end, they all just compare against the
> user_addr_max() value or they accept anything.
> 
> Provide one definition that works for most architectures, checking
> against TASK_SIZE_MAX for user processes or skipping the check inside
> of uaccess_kernel() sections.
> 
> For architectures without CONFIG_SET_FS(), this should be the fastest
> check, as it comes down to a single comparison of a pointer against a
> compile-time constant, while the architecture specific versions tend to
> do something more complex for historic reasons or get something wrong.
> 
> Type checking for __user annotations is handled inconsistently across
> architectures, but this is easily simplified as well by using an inline
> function that takes a 'const void __user *' argument. A handful of
> callers need an extra __user annotation for this.
> 
> Some architectures had trick to use 33-bit or 65-bit arithmetic on the
> addresses to calculate the overflow, however this simpler version uses
> fewer registers, which means it can produce better object code in the
> end despite needing a second (statically predicted) branch.
> 
> Signed-off-by: Arnd Bergmann 

As discussed over IRC, the generic sequence looks good to me, and likewise for
the arm64 change, so:

Acked-by: Mark Rutland  [arm64, asm-generic]

Thanks,
Mark.

> ---
>  arch/alpha/include/asm/uaccess.h  | 34 +++
>  arch/arc/include/asm/uaccess.h| 29 -
>  arch/arm/include/asm/uaccess.h| 20 +
>  arch/arm/kernel/swp_emulate.c |  2 +-
>  arch/arm/kernel/traps.c   |  2 +-
>  arch/arm64/include/asm/uaccess.h  |  5 ++-
>  arch/csky/include/asm/uaccess.h   |  8 
>  arch/csky/kernel/signal.c |  2 +-
>  arch/hexagon/include/asm/uaccess.h| 25 
>  arch/ia64/include/asm/uaccess.h   |  5 +--
>  arch/m68k/include/asm/uaccess.h   |  5 ++-
>  arch/microblaze/include/asm/uaccess.h |  8 +---
>  arch/mips/include/asm/uaccess.h   | 29 +
>  arch/nds32/include/asm/uaccess.h  |  7 +---
>  arch/nios2/include/asm/uaccess.h  | 11 +
>  arch/nios2/kernel/signal.c| 20 +
>  arch/openrisc/include/asm/uaccess.h   | 19 +
>  arch/parisc/include/asm/uaccess.h | 10 +++--
>  arch/powerpc/include/asm/uaccess.h| 11 +
>  arch/powerpc/lib/sstep.c  |  4 +-
>  arch/riscv/include/asm/uaccess.h  | 31 +-
>  arch/riscv/kernel/perf_callchain.c|  2 +-
>  arch/s390/include/asm/uaccess.h   | 11 ++---
>  arch/sh/include/asm/uaccess.h | 22 +-
>  arch/sparc/include/asm/uaccess.h  |  3 --
>  arch/sparc/include/asm/uaccess_32.h   | 18 ++--
>  arch/sparc/include/asm/uaccess_64.h   | 35 
>  arch/sparc/kernel/signal_32.c |  2 +-
>  arch/um/include/asm/uaccess.h |  5 ++-
>  arch/x86/include/asm/uaccess.h| 14 +--
>  arch/xtensa/include/asm/uaccess.h | 10 +
>  include/asm-generic/access_ok.h   | 59 +++
>  include/asm-generic/uaccess.h | 21 +-
>  include/linux/uaccess.h   |  7 
>  34 files changed, 130 insertions(+), 366 deletions(-)
>  create mode 100644 include/asm-generic/access_ok.h
> 
> diff --git a/arch/alpha/include/asm/uaccess.h 
> b/arch/alpha/include/asm/uaccess.h
> index 1b6f25efa247..82c5743fc9cd 100644
> --- a/arch/alpha/include/asm/uaccess.h
> +++ b/arch/alpha/include/asm/uaccess.h
> @@ -20,28 +20,7 @@
>  #define get_fs()  (current_thread_info()->addr_limit)
>  #define set_fs(x) (current_thread_info()->addr_limit = (x))
>  
> -#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
> -
> -/*
> - * Is a address valid? This does a straightforward calculation rather
> - * than tests.
> - *
> - * Address valid if:
> - *  - "addr" doesn't have any high-bits set
> - *  - AND "size" doesn't have any high-bits set
> - *  - AND "addr+size-(size != 0)" doesn't have any high-bits set
> - *  - OR we are in kernel mode.
> - */
> -#define __access_ok(addr, size) ({   \
> - unsigned long __ao_a = (addr), __ao_b = (size); \
> - unsigned long __ao_end = __ao_a + __ao_b - !!__ao_b;\
> - (get_fs().seg & (__ao_a | __ao_b | __ao_end)) == 0; })
> -
> -#define access_ok(addr, size)\
> -({   \
> - __chk_user_ptr(addr);   \
>

Re: [PATCH 08/14] arm64: simplify access_ok()

2022-02-15 Thread Mark Rutland
On Tue, Feb 15, 2022 at 10:39:46AM +0100, Arnd Bergmann wrote:
> On Tue, Feb 15, 2022 at 10:21 AM Ard Biesheuvel  wrote:
> > On Tue, 15 Feb 2022 at 10:13, Arnd Bergmann  wrote:
> >
> > arm64 also has this leading up to the range check, and I think we'd no
> > longer need it:
> >
> > if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
> > (current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
> > addr = untagged_addr(addr);
> 
> I suspect the expensive part here is checking the two flags, as 
> untagged_addr()
> seems to always just add a sbfx instruction. Would this work?
> 
> #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> #define access_ok(ptr, size) __access_ok(untagged_addr(ptr), (size))
> #else // the else path is the default, this can be left out.
> #define access_ok(ptr, size) __access_ok((ptr), (size))
> #endif

This would be an ABI change, e.g. for tasks without TIF_TAGGED_ADDR.

I don't think we should change this as part of this series.

Thanks,
Mark.


Re: [PATCH 08/14] arm64: simplify access_ok()

2022-02-15 Thread Mark Rutland
On Tue, Feb 15, 2022 at 10:21:16AM +0100, Ard Biesheuvel wrote:
> On Tue, 15 Feb 2022 at 10:13, Arnd Bergmann  wrote:
> >
> > On Tue, Feb 15, 2022 at 9:17 AM Ard Biesheuvel  wrote:
> > > On Mon, 14 Feb 2022 at 17:37, Arnd Bergmann  wrote:
> > > > From: Arnd Bergmann 
> > > >
> > >
> > > With set_fs() out of the picture, wouldn't it be sufficient to check
> > > that bit #55 is clear? (the bit that selects between TTBR0 and TTBR1)
> > > That would also remove the need to strip the tag from the address.
> > >
> > > Something like
> > >
> > > asm goto("tbnz  %0, #55, %2 \n"
> > >  "tbnz  %1, #55, %2 \n"
> > >  :: "r"(addr), "r"(addr + size - 1) :: notok);
> > > return 1;
> > > notok:
> > > return 0;
> > >
> > > with an additional sanity check on the size which the compiler could
> > > eliminate for compile-time constant values.
> >
> > That should work, but I don't see it as a clear enough advantage to
> > have a custom implementation. For the constant-size case, it probably
> > isn't better than a compiler-scheduled comparison against a
> > constant limit, but it does hurt maintainability when the next person
> > wants to change the behavior of access_ok() globally.
> >
> 
> arm64 also has this leading up to the range check, and I think we'd no
> longer need it:
> 
> if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
> (current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
> addr = untagged_addr(addr);
> 

ABI-wise, we aim to *reject* tagged pointers unless the task is using the
tagged addr ABI, so we need to retain both the untagging logic and the full
pointer check (to actually check the tag bits) unless we relax that ABI
decision generally (or go context-switch the TCR_EL1.TBI* bits).

Since that has subtle ABI implications, I don't think we should change that
within this series.

If we *did* relax things, we could just check bit 55 here and unconditionally
clear it in uaccess_mask_ptr(), since LDTR/STTR should fault on kernel memory.
On parts affected by Meltdown those might not fault until architecturally
committed, so we need the masking to avoid speculative access to a kernel
pointer, and that requires the prior explicit check.
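
For illustration only, the relaxed scheme would boil down to something like
this (a sketch; relaxed_access_ok() is a made-up name, and uaccess_mask_ptr()
would still be responsible for the masking):

    #include <linux/bits.h>
    #include <linux/types.h>

    /*
     * Hypothetical sketch of the relaxed check: accept any TTBR0 address
     * (bit 55 clear) and rely on pointer masking plus LDTR/STTR faulting
     * to catch everything else. This is NOT what arm64 does today.
     */
    static inline bool relaxed_access_ok(const void __user *ptr, unsigned long size)
    {
            unsigned long addr = (unsigned long)ptr;
            unsigned long end  = addr + size - !!size;

            return !((addr | end) & BIT(55));
    }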

Thanks,
Mark.


Re: ftrace hangs waiting for rcu

2022-01-28 Thread Mark Rutland
On Fri, Jan 28, 2022 at 05:08:48PM +0100, Sven Schnelle wrote:
> Hi Mark,
> 
> Mark Rutland  writes:
> 
> > On arm64 I bisected this down to:
> >
> >   7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for 
> > dynamic queue selection")
> >
> > Which was going wrong because ilog2() rounds down, and so the shift was 
> > wrong
> > for any nr_cpus that was not a power-of-two. Paul had already fixed that in
> > rcu-next, and just sent a pull request to Linus:
> >
> >   
> > https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/
> >
> > With that applied, I no longer see these hangs.
> >
> > Does your s390 test machine have a non-power-of-two nr_cpus, and does that 
> > fix
> > the issue for you?
> 
> We noticed the PR from Paul and are currently testing the fix. So far
> it's looking good. The configuration where we have seen the hang is a
> bit unusual:
> 
> - 16 physical CPUs on the kvm host
> - 248 logical CPUs inside kvm

Aha! 248 is notably *NOT* a power of two, and in this case the shift would be
wrong (ilog2() would give 7, when we need a shift of 8).

So I suspect you're hitting the same issue as I was.
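
To make the rounding concrete (an illustrative snippet only; the helpers are
from linux/log2.h, not the rcu-tasks code itself):

    #include <linux/kernel.h>
    #include <linux/log2.h>

    /*
     * With 248 CPUs, ilog2() rounds down and the resulting shift is one
     * too small to cover all CPUs; a rounded-up helper such as
     * order_base_2() gives a shift that does.
     */
    static void shift_example(void)
    {
            unsigned int nr = 248;

            pr_info("ilog2(%u) = %d\n", nr, ilog2(nr));                /* 7 */
            pr_info("order_base_2(%u) = %d\n", nr, order_base_2(nr));  /* 8 */
    }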

Thanks,
Mark.

> - debug kernel both on the host and kvm guest
> 
> So things are likely a bit slow in the kvm guest. Interesting is that
> the number of CPUs is even. But maybe RCU sees an odd number of CPUs
> and gets confused before all cpus are brought up. Have to read code/test
> to see whether that could be possible.
> 
> Thanks for investigating!
> Sven


Re: ftrace hangs waiting for rcu (was: Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting)

2022-01-28 Thread Mark Rutland
Hi Sven,

On Thu, Jan 27, 2022 at 07:42:35PM +0100, Sven Schnelle wrote:
> Mark Rutland  writes:
> 
> > * I intermittently see a hang when running the tests. I previously hit that
> >   when originally trying to bisect this issue (and IIRC that bisected down 
> > to
> >   some RCU changes, but I need to re-run that). When the tests hang,
> >   magic-sysrq + L tells me:
> >
> >   [  271.938438] sysrq: Show Blocked State
> >   [  271.939245] task:ftracetest  state:D stack:0 pid: 5687 ppid:  
> > 5627 flags:0x0200
> >   [  271.940961] Call trace:
> >   [  271.941472]  __switch_to+0x104/0x160
> >   [  271.942213]  __schedule+0x2b0/0x6e0
> >   [  271.942933]  schedule+0x5c/0xf0
> >   [  271.943586]  schedule_timeout+0x184/0x1c4
> >   [  271.944410]  wait_for_completion+0x8c/0x12c
> >   [  271.945274]  __wait_rcu_gp+0x184/0x190
> >   [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
> >   [  271.947007]  update_ftrace_function+0xa4/0xec
> >   [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
> >   [  271.948898]  unregister_ftrace_function+0x34/0x70
> >   [  271.949857]  wakeup_tracer_reset+0x4c/0x100
> >   [  271.950713]  tracing_set_tracer+0xd0/0x2b0
> >   [  271.951552]  tracing_set_trace_write+0xe8/0x150
> >   [  271.952477]  vfs_write+0xfc/0x284
> >   [  271.953171]  ksys_write+0x7c/0x110
> >   [  271.953874]  __arm64_sys_write+0x2c/0x40
> >   [  271.954678]  invoke_syscall+0x5c/0x130
> >   [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
> >   [  271.956435]  do_el0_svc+0x74/0x90
> >   [  271.957124]  el0_svc+0x2c/0x90
> >   [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
> >   [  271.958629]  el0t_64_sync+0x1a0/0x1a4

On arm64 I bisected this down to:

  7a30871b6a27de1a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic 
queue selection")

Which was going wrong because ilog2() rounds down, and so the shift was wrong
for any nr_cpus that was not a power-of-two. Paul had already fixed that in
rcu-next, and just sent a pull request to Linus:

  
https://lore.kernel.org/lkml/20220128143251.GA2398275@paulmck-ThinkPad-P17-Gen-1/

With that applied, I no longer see these hangs.

Does your s390 test machine have a non-power-of-two nr_cpus, and does that fix
the issue for you?

On arm64 the startup tests didn't seem to trigger the hang, but I was able to
trigger the hang fairly reliably with the ftrace selftests, e.g.

  $ for N in $(seq 1 10); do ./ftracetest test.d/00basic/basic2.tc; done

... which prior to the fix, would hang between runs 2 to 5.

Thanks,
Mark.

> That's interesting. On s390 I'm seeing the same problem in CI, but with
> the startup ftrace tests. So that's likely not arm64 specific.
> 
> On s390, the last messages from ftrace are [5.663568] clocksource: 
> jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 
> 1911260446275 ns
> [5.667099] futex hash table entries: 65536 (order: 12, 16777216 bytes, 
> vmalloc)
> [5.739549] Running postponed tracer tests:
> [5.740662] Testing tracer function: PASSED
> [6.194635] Testing dynamic ftrace: PASSED
> [6.471213] Testing dynamic ftrace ops #1: 
> [6.558445] (1 0 1 0 0) 
> [6.558458] (1 1 2 0 0) 
> [6.699135] (2 1 3 0 764347) 
> [6.699252] (2 2 4 0 766466) 
> [6.759857] (3 2 4 0 1159604)
> [..] hangs here
> 
> The backtrace looks like this, which is very similar to the one above:
> 
> crash> bt 1
> PID: 1  TASK: 80e68100  CPU: 133  COMMAND: "swapper/0"
>  #0 [380004df808] __schedule at cda39f0e
>  #1 [380004df880] schedule at cda3a488
>  #2 [380004df8b0] schedule_timeout at cda41ef6
>  #3 [380004df978] wait_for_completion at cda3bd0a
>  #4 [380004df9d8] __wait_rcu_gp at cc92
>  #5 [380004dfa30] synchronize_rcu_tasks_generic at ccdde0aa
>  #6 [380004dfad8] ftrace_shutdown at cce7b050
>  #7 [380004dfb18] unregister_ftrace_function at cce7b192
>  #8 [380004dfb50] trace_selftest_ops at cda1e0fa
>  #9 [380004dfba0] run_tracer_selftest at cda1e4f2
> #10 [380004dfc00] trace_selftest_startup_function at ce74355c
> #11 [380004dfc58] run_tracer_selftest at cda1e2fc
> #12 [380004dfc98] init_trace_selftests at ce742d30
> #13 [380004dfcd0] do_one_initcall at cccdca16
> #14 [380004dfd68] do_initcalls at ce72e776
> #15 [380004dfde0] kernel_init_freeable at ce72ea60
> #16 [380004dfe50] kernel_init at cda333fe
> #17 [380004dfe68] __ret_from_fork at cccdf920
> #18 [380004dfe98] ret_from_fork at cda444ca
> 
> I didn't had success reproducing it so far, but it is good to know that
> this also happens when running the ftrace testsuite.
> 
> I have several crashdumps, so i could try to pull out some information
> if someone tells me what to look for.
> 
> Thanks,
> Sven


Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 11:42:49AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> First S390 complained that the sorting of the mcount sections at build
> time caused the kernel to crash on their architecture. Now PowerPC is
> complaining about it too. And also ARM64 appears to be having issues.
> 
> It may be necessary to also update the relocation table for the values
> in the mcount table. Not only do we have to sort the table, but also
> update the relocations that may be applied to the items in the table.
> 
> If the system is not relocatable, then it is fine to sort, but if it is,
> some architectures may have issues (although x86 does not as it shifts all
> addresses the same).
> 
> Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
> safe to do the sorting at build time.
> 
> Also update the config to compile in build time sorting in the sorttable
> code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
> 
> Link: 
> https://lore.kernel.org/all/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/
> 
> Cc: Mark Rutland 
> Cc: Yinan Liu 
> Cc: Ard Biesheuvel 
> Cc: Kees Cook 
> Cc: linuxppc-dev@lists.ozlabs.org
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in 
> ftrace_init")
> Signed-off-by: Steven Rostedt (Google) 

Thanks for this; the rationale and patch makes sense to me. As previously, I
understand that:

* For arch/arm, the build-time sort should be safe as the 32-bit kernel isn't
  virtually relocatable, and so the sort affects the final values, and will not
  be clobbered later.

* For arch/x86, the build time sort should be safe as the mcount_loc will be
  initialized with values consistent with the default load address, and the
  boot-time relocation will add the same offset to all values, so there's no
  need to sort the relocs.

So enabling this for arch/arm and arch/x86 makes sense to me.

Similarly, I understand that for arm64 the build-time sort isn't sound due to
not adjusting the relocations, and so it needs to be disabled there (for now at
least).

I gave this patch a spin on arm64 atop v5.17-rc1, using GCC 11.1.0 from the
kernel.org crosstool page:

  https://mirrors.edge.kernel.org/pub/tools/crosstool/

... with this applied, I'm no longer seeing a number of ftrace selftest
failures and ftrace internal bugs I previously reported at:

  https://lore.kernel.org/all/YfKGKWW5UfZ15kCW@FVFF77S0Q05N/

It looks like there's a trivial bit of whitespace damage in the patch (noted
below), but regardless:

  Reviewed-by: Mark Rutland 
  Tested-by: Mark Rutland  [arm64]



As a heads-up, with this issue out of the way I'm hitting some unrelated issues
when running the ftrace selftests on arm64, which I'll look into. Quick summary
on those below, but I'll start new threads once I've got more detail.

* The duplicate events test seems to fail consistently:

  [15] Generic dynamic event - check if duplicate events are caught   [FAIL]

* I intermittently see a hang when running the tests. I previously hit that
  when originally trying to bisect this issue (and IIRC that bisected down to
  some RCU changes, but I need to re-run that). When the tests hang,
  magic-sysrq + L tells me:

  [  271.938438] sysrq: Show Blocked State
  [  271.939245] task:ftracetest  state:D stack:0 pid: 5687 ppid:  5627 
flags:0x0200
  [  271.940961] Call trace:
  [  271.941472]  __switch_to+0x104/0x160
  [  271.942213]  __schedule+0x2b0/0x6e0
  [  271.942933]  schedule+0x5c/0xf0
  [  271.943586]  schedule_timeout+0x184/0x1c4
  [  271.944410]  wait_for_completion+0x8c/0x12c
  [  271.945274]  __wait_rcu_gp+0x184/0x190
  [  271.946047]  synchronize_rcu_tasks_rude+0x48/0x70
  [  271.947007]  update_ftrace_function+0xa4/0xec
  [  271.947897]  __unregister_ftrace_function+0xa4/0xf0
  [  271.948898]  unregister_ftrace_function+0x34/0x70
  [  271.949857]  wakeup_tracer_reset+0x4c/0x100
  [  271.950713]  tracing_set_tracer+0xd0/0x2b0
  [  271.951552]  tracing_set_trace_write+0xe8/0x150
  [  271.952477]  vfs_write+0xfc/0x284
  [  271.953171]  ksys_write+0x7c/0x110
  [  271.953874]  __arm64_sys_write+0x2c/0x40
  [  271.954678]  invoke_syscall+0x5c/0x130
  [  271.955442]  el0_svc_common.constprop.0+0x108/0x130
  [  271.956435]  do_el0_svc+0x74/0x90
  [  271.957124]  el0_svc+0x2c/0x90
  [  271.957757]  el0t_64_sync_handler+0xa8/0x12c
  [  271.958629]  el0t_64_sync+0x1a0/0x1a4

> ---
>  arch/arm/Kconfig | 1 +
>  arch/x86/Kconfig | 1 +
>  kernel/trace/Kconfig | 8 +++-
>  scripts/Makefile | 2 +-
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index c2724d986fa0..5256ebe57451 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -82,6 

Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 08:55:43AM -0500, Steven Rostedt wrote:
> On Thu, 27 Jan 2022 13:33:02 +
> Mark Rutland  wrote:
> 
> > I want to get the regression fixed ASAP, so can we take a simple patch for 
> > -rc2
> > which disables the build-time sort where it's currently broken (by limiting 
> > the
> > opt-in to arm and x86), then follow-up per-architecture to re-enable it if
> > desired/safe?
> 
> I'm going to retest my patch that makes it an opt in for just x86 and arm
> (32bit). I'll be pushing that hopefully later today. I have some other
> patches to test as well.

Great; thanks!

Let me know if you'd like me to give that a spin on arm or arm64.

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 02:59:31PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 14:24, Mark Rutland  wrote:
> >
> > On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> > > I suppose that on arm64, we can work around this by passing
> > > --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> > > targets are prepopulated with the link time value of the respective
> > > addresses. It does cause some bloat, which is why we disable that
> > > today, but we could make that dependent on ftrace being enabled.
> >
> > We'd also need to teach the build-time sort to update the relocations, 
> > unless
> > you mean to also change the boot-time reloc code to RMW with the offset?
> 
> Why would that be necessary? Every RELA entry has the same effect on
> its target address, as it just adds a fixed offset.

Currently in relocate_kernel() we generate the absolute address from the
relocation alone, with the core of the relocation logic being as follows, with
x9 being the pointer to a RELA entry, and x23 being the offset relative to the
default load address:

ldp x12, x13, [x9], #24
ldr x14, [x9, #-8]

add x14, x14, x23   // relocate
str x14, [x12, x23]

... and (as per another reply), a sample RELA entry currently contains:

0x890b1ab0  // default load VA of pointer to update
0x0403  // R_AARCH64_RELATIVE
0x890b6000  // default load VA of addr to write

So either:

* That code stays as-is, and we must update the relocs to correspond to their
  new sorted locations, or we'll blat the sorted values with the original
  relocs as we do today.

* The code needs to change to RMW: read the existing value, add the offset
  (ignoring the content of the RELA entry's addend field), and write it back.
  This is what I meant when I said "change the boot-time reloc code to RMW with
  the offset".

Does that make sense, or have I misunderstood?

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 02:16:31PM +0100, Sven Schnelle wrote:
> Mark Rutland  writes:
> 
> > On Thu, Jan 27, 2022 at 07:46:01AM -0500, Steven Rostedt wrote:
> >> On Thu, 27 Jan 2022 12:27:04 +0000
> >> Mark Rutland  wrote:
> >> 
> >> > Ah, so those non-ELF relocations for the mcount_loc table just mean 
> >> > "apply the
> >> > KASLR offset here", which is equivalent for all entries.
> >> > 
> >> > That makes sense, thanks!
> >> 
> >> And this is why we were having such a hard time understanding each other 
> >> ;-)
> >
> > ;)
> >
> > With that in mind, I think that we understand that the build-time sort works
> > for:
> >
> > * arch/x86, because the non-ELF relocations for mcount_loc happen to be
> >   equivalent.
> >  
> > * arch/arm, because there's no dynamic relocation and the mcount_loc entries
> >   have been finalized prior to sorting.
> >
> > ... but doesn't work for anyone else (including arm64) because the ELF
> > relocations are not equivalent, and need special care that is not yet
> > implemented.
> 
> For s390 my idea is to just skip the addresses between __start_mcount_loc
> and __stop_mcount_loc, because for these addresses we know that they are
> 64 bits wide, so we just need to add the KASLR offset.
> 
> I'm thinking about something like this:
> 
> diff --git a/arch/s390/boot/compressed/decompressor.h 
> b/arch/s390/boot/compressed/decompressor.h
> index f75cc31a77dd..015d7e2e94ef 100644
> --- a/arch/s390/boot/compressed/decompressor.h
> +++ b/arch/s390/boot/compressed/decompressor.h
> @@ -25,6 +25,8 @@ struct vmlinux_info {
>   unsigned long rela_dyn_start;
>   unsigned long rela_dyn_end;
>   unsigned long amode31_size;
> + unsigned long start_mcount_loc;
> + unsigned long stop_mcount_loc;
>  };
>  
>  /* Symbols defined by linker scripts */
> diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
> index 1aa11a8f57dd..7bb0d88db5c6 100644
> --- a/arch/s390/boot/startup.c
> +++ b/arch/s390/boot/startup.c
> @@ -88,6 +88,11 @@ static void handle_relocs(unsigned long offset)
>   dynsym = (Elf64_Sym *) vmlinux.dynsym_start;
>   for (rela = rela_start; rela < rela_end; rela++) {
>   loc = rela->r_offset + offset;
> + if ((loc >= vmlinux.start_mcount_loc) &&
> + (loc < vmlinux.stop_mcount_loc)) {
> + (*(unsigned long *)loc) += offset;
> + continue;
> + }
>   val = rela->r_addend;
>   r_sym = ELF64_R_SYM(rela->r_info);
>   if (r_sym) {
> @@ -232,6 +237,8 @@ static void offset_vmlinux_info(unsigned long offset)
>   vmlinux.rela_dyn_start += offset;
>   vmlinux.rela_dyn_end += offset;
>   vmlinux.dynsym_start += offset;
> + vmlinux.start_mcount_loc += offset;
> + vmlinux.stop_mcount_loc += offset;
>  }
>  
>  static unsigned long reserve_amode31(unsigned long safe_addr)
> diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
> index 42c43521878f..51c773405608 100644
> --- a/arch/s390/kernel/vmlinux.lds.S
> +++ b/arch/s390/kernel/vmlinux.lds.S
> @@ -213,6 +213,8 @@ SECTIONS
>   QUAD(__rela_dyn_start)  /* rela_dyn_start */
>   QUAD(__rela_dyn_end)/* rela_dyn_end */
>   QUAD(_eamode31 - _samode31) /* amode31_size */
> + QUAD(__start_mcount_loc)
> + QUAD(__stop_mcount_loc)
>   } :NONE
>  
>   /* Debugging sections.  */
> 
> Not sure whether that would also work on power, and also not sure
> whether i missed something thinking about that. Maybe it doesn't even
> work. ;-)

I don't know enough about s390 or powerpc relocs to say whether that works, but
I can say that approach isn't going to work for arm64 without other significant
changes.

I want to get the regression fixed ASAP, so can we take a simple patch for -rc2
which disables the build-time sort where it's currently broken (by limiting the
opt-in to arm and x86), then follow-up per-architecture to re-enable it if
desired/safe?

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 13:59, Mark Rutland  wrote:
> >
> > On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 27 Jan 2022 at 13:20, Mark Rutland  wrote:
> > > > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> > > >
> > > > > These architectures use place-relative extables for the same reason:
> > > > > place relative references are resolved at build time rather than at
> > > > > runtime during relocation, making a build time sort feasible.
> > > > >
> > > > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > >
> > > > > Note that the swap routine becomes something like the below, given
> > > > > that the relative references need to be fixed up after the entry
> > > > > changes place in the sorted list.
> > > > >
> > > > > static void swap_ex(void *a, void *b, int size)
> > > > > {
> > > > > struct exception_table_entry *x = a, *y = b, tmp;
> > > > > int delta = b - a;
> > > > >
> > > > > tmp = *x;
> > > > > x->insn = y->insn + delta;
> > > > > y->insn = tmp.insn - delta;
> > > > > ...
> > > > > }
> > > > >
> > > > > As a bonus, the resulting footprint of the table in the image is
> > > > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > > __gnu_mcount_nc.
> > > >
> > > > Absolutely -- it'd be great if we could do that for the callsite 
> > > > locations; the
> > > > difficulty is that the entries are generated by the compiler itself, so 
> > > > we'd
> > > > either need some build/link time processing to convert each absolute 
> > > > 64-bit
> > > > value to a relative 32-bit offset, or new compiler options to generate 
> > > > those as
> > > > relative offsets from the outset.
> > >
> > > Don't we use scripts/recordmcount.pl for that?
> >
> > Not quite -- we could adjust it to do so, but today it doesn't consider
> > existing mcount_loc entries, and just generates new ones where the compiler 
> > has
> > generated calls to mcount, which it finds by scanning the instructions in 
> > the
> > binary. Today it is not used:
> >
> > * On arm64 when we default to using `-fpatchable-function-entry=N`.  That 
> > makes
> >   the compiler insert 2 NOPs in the function prologue, and log the location 
> > of
> >   that NOP sled to a section called `__patchable_function_entries`.
> >
> >   We need the compiler to do that since we can't reliably identify 2 NOPs 
> > in a
> >   function prologue as being intended to be a patch site, as e.g. there 
> > could
> >   be notrace functions where the compiler had to insert NOPs for alignment 
> > of a
> >   subsequent branch or similar.
> >
> > * On architectures with `-mnop-mcount`. On these, it's necessary to use
> >   `-mrecord-mcount` to have the compiler log the patch-site, for the same
> >   reason as with `-fpatchable-function-entry`.
> >
> > * On architectures with `-mrecord-mcount` generally, since this avoids the
> >   overhead of scanning each object.
> >
> > * On x86 when objtool is used.
> >
> 
> Right.
> 
> I suppose that on arm64, we can work around this by passing
> --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> targets are prepopulated with the link time value of the respective
> addresses. It does cause some bloat, which is why we disable that
> today, but we could make that dependent on ftrace being enabled.

We'd also need to teach the build-time sort to update the relocations, unless
you mean to also change the boot-time reloc code to RMW with the offset?

I think for right now the best thing is to disable the build-time sort for
arm64, but maybe something like that is the right thing to do longer term.

> I do wonder how much over head we accumulate, though, by having all
> these relocations, but I suppose that is the situation today in any
> case.

Yeah; I suspect if we want to do something about that we want to do it more
generally, and would probably need to do something like the x86 approach and
rewrite the relocs at build-time to something more compressed. If we applied
the dynamic relocs with the link-time address we'd only need 4 bytes to
identify each pointer to apply an offset to.

I'm not exactly sure how we could do that, nor what the trade-offs look like in
practice.

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 07:46:01AM -0500, Steven Rostedt wrote:
> On Thu, 27 Jan 2022 12:27:04 +
> Mark Rutland  wrote:
> 
> > Ah, so those non-ELF relocations for the mcount_loc table just mean "apply 
> > the
> > KASLR offset here", which is equivalent for all entries.
> > 
> > That makes sense, thanks!
> 
> And this is why we were having such a hard time understanding each other ;-)

;)

With that in mind, I think that we understand that the build-time sort works
for:

* arch/x86, because the non-ELF relocations for mcount_loc happen to be
  equivalent.
 
* arch/arm, because there's no dynamic relocation and the mcount_loc entries
  have been finalized prior to sorting.

... but doesn't work for anyone else (including arm64) because the ELF
relocations are not equivalent, and need special care that is not yet
implemented.

So we should have arm and x86 opt-in, but for now everyone else (including
arm64, powerpc, s390) should be left with the prior behaviour with the runtime
sort only (in case the build-time sort breaks anything, as I mentioned in my
other mail).

Does that sound about right?

In future we might be able to do something much smarter (e.g. adding some
preprocessing to use relative entries).

I'll take a look at shelf. :)

Thanks,
Mark.

> I started a new project called "shelf", which is a shell interface to
> read ELF files (Shelf on a ELF!).
> 
> It uses my ccli library:
> 
>https://github.com/rostedt/libccli
> 
> and can be found here:
> 
>https://github.com/rostedt/shelf
> 
> Build and install the latest libccli and then build this with just
> "make".
> 
>   $ shelf vmlinux
> 
> and then you can see what is stored in the mcount location:
> 
>   shelf> dump symbol __start_mcount_loc - __stop_mcount_loc
> 
> I plan on adding more to include the REL and RELA sections and show how
> they affect symbols and such.
> 
> Feel free to contribute too ;-)
> 
> -- Steve


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 13:20, Mark Rutland  wrote:
> > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> >
> > > These architectures use place-relative extables for the same reason:
> > > place relative references are resolved at build time rather than at
> > > runtime during relocation, making a build time sort feasible.
> > >
> > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > >
> > > Note that the swap routine becomes something like the below, given
> > > that the relative references need to be fixed up after the entry
> > > changes place in the sorted list.
> > >
> > > static void swap_ex(void *a, void *b, int size)
> > > {
> > > struct exception_table_entry *x = a, *y = b, tmp;
> > > int delta = b - a;
> > >
> > > tmp = *x;
> > > x->insn = y->insn + delta;
> > > y->insn = tmp.insn - delta;
> > > ...
> > > }
> > >
> > > As a bonus, the resulting footprint of the table in the image is
> > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > __gnu_mcount_nc.
> >
> > Absolutely -- it'd be great if we could do that for the callsite locations; 
> > the
> > difficulty is that the entries are generated by the compiler itself, so we'd
> > either need some build/link time processing to convert each absolute 64-bit
> > value to a relative 32-bit offset, or new compiler options to generate 
> > those as
> > relative offsets from the outset.
> 
> Don't we use scripts/recordmcount.pl for that?

Not quite -- we could adjust it to do so, but today it doesn't consider
existing mcount_loc entries, and just generates new ones where the compiler has
generated calls to mcount, which it finds by scanning the instructions in the
binary. Today it is not used:

* On arm64 when we default to using `-fpatchable-function-entry=N`.  That makes
  the compiler insert 2 NOPs in the function prologue, and log the location of
  that NOP sled to a section called.  `__patchable_function_entries`. 

  We need the compiler to do that since we can't reliably identify 2 NOPs in a
  function prologue as being intended to be a patch site, as e.g. there could
  be notrace functions where the compiler had to insert NOPs for alignment of a
  subsequent branch or similar.

* On architectures with `-mnop-mcount`. On these, it's necessary to use
  `-mrecord-mcount` to have the compiler log the patch-site, for the same
  reason as with `-fpatchable-function-entry`.

* On architectures with `-mrecord-mcount` generally, since this avoids the
  overhead of scanning each object.

* On x86 when objtool is used.

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 01:04:41PM +0100, Sven Schnelle wrote:
> Mark Rutland  writes:
> 
> >> Isn't x86 relocatable in some configurations (e.g. for KASLR)?
> >> 
> >> I can't see how the sort works for those cases, because the mcount_loc 
> >> entries
> >> are absolute, and either:
> >> 
> >> * The sorted entries will get overwritten by the unsorted relocation 
> >> entries,
> >>   and won't be sorted.
> >> 
> >> * The sorted entries won't get overwritten, but then the absolute address 
> >> will
> >>   be wrong since they hadn't been relocated.
> >> 
> >> How does that work?
> 
> From what i've seen when looking into this ftrace sort problem x86 has a
> a relocation tool, which is run before final linking: arch/x86/tools/relocs.c
> This tools converts all the required relocations to three types:
> 
> - 32 bit relocations
> - 64 bit relocations
> - inverse 32 bit relocations
> 
> These are added to the end of the image.
> 
> The decompressor then iterates over that array, and just adds/subtracts
> the KASLR offset - see arch/x86/boot/compressed/misc.c, handle_relocations()
> 
> So IMHO x86 never uses 'real' relocations during boot, and just
> adds/subtracts. That's why the order stays the same, and the compile
> time sort works.

Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
KASLR offset here", which is equivalent for all entries.

That makes sense, thanks!

Mark.
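
As a rough C sketch of the scheme Sven describes (the real code lives in
arch/x86/boot/compressed/misc.c, handle_relocations(), and also deals with
32-bit and inverse 32-bit entries; the function and parameter names below are
made up for illustration):

    #include <linux/types.h>

    /*
     * Each recorded relocation site just has the load delta added to it;
     * since every entry gets the same delta, the relative order of the
     * values in mcount_loc is unchanged and a build-time sort survives.
     */
    static void apply_delta_relocs(unsigned long *sites, int nr, long delta)
    {
            int i;

            for (i = 0; i < nr; i++) {
                    unsigned long *place = (unsigned long *)(sites[i] + delta);

                    *place += delta;
            }
    }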


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 12:47, Mark Rutland  wrote:
> >
> > [adding LKML so this is easier for others to find]
> >
> > If anyone wants to follow the thread from the start, it's at:
> >
> >   
> > https://lore.kernel.org/linuxppc-dev/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/
> >
> > Ard, I was under the impression that the 32-bit arm kernel was (virtually)
> > relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you 
> > know
> > whether it currently does any boot-time dynamic relocation?
> 
> No, it does not.

Thanks for confirming!

So 32-bit arm should be able to opt into the build-time sort as-is.

> > Steve asked for a bit more detail on IRC, so the below is an attempt to 
> > explain
> > what's actually going on here.
> >
> > The short answer is that relocatable kernels (e.g. those with KASLR support)
> > need to handle the kernel being loaded at (somewhat) arbitrary virtual
> > addresses. Even where code can be position-independent, any pointers in the
> > kernel image (e.g. the contents of the mcount_loc table) need to be updated 
> > to
> > account for the specific VA the kernel was loaded at -- arch code does this
> > early at boot time by applying dynamic (ELF) relocations.
> 
> These architectures use place-relative extables for the same reason:
> place relative references are resolved at build time rather than at
> runtime during relocation, making a build time sort feasible.
> 
> arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> 
> Note that the swap routine becomes something like the below, given
> that the relative references need to be fixed up after the entry
> changes place in the sorted list.
> 
> static void swap_ex(void *a, void *b, int size)
> {
> struct exception_table_entry *x = a, *y = b, tmp;
> int delta = b - a;
> 
> tmp = *x;
> x->insn = y->insn + delta;
> y->insn = tmp.insn - delta;
> ...
> }
> 
> As a bonus, the resulting footprint of the table in the image is
> reduced by 8x, given that every 8 byte pointer has an accompanying 24
> byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> __gnu_mcount_nc.

Absolutely -- it'd be great if we could do that for the callsite locations; the
difficulty is that the entries are generated by the compiler itself, so we'd
either need some build/link time processing to convert each absolute 64-bit
value to a relative 32-bit offset, or new compiler options to generate those as
relative offsets from the outset.
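
For illustration, a place-relative callsite record could look something like
this (hypothetical; today's mcount_loc entries are plain absolute 64-bit
addresses, and these names are made up):

    #include <linux/types.h>

    /* Hypothetical 4-byte entry, along the lines of the relative extables
     * mentioned above. */
    struct mcount_rel_loc {
            s32 offset;         /* callsite address relative to this entry */
    };

    static inline unsigned long mcount_rel_addr(const struct mcount_rel_loc *rec)
    {
            return (unsigned long)rec + rec->offset;
    }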

Thanks,
Mark.


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Mark Rutland
[adding LKML so this is easier for others to find]

If anyone wants to follow the thread from the start, it's at:

  
https://lore.kernel.org/linuxppc-dev/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/

Ard, I was under the impression that the 32-bit arm kernel was (virtually)
relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you know
whether it currently does any boot-time dynamic relocation?

Kees, there's an x86_64 relocation question for you at the end.

On Wed, Jan 26, 2022 at 02:37:16PM +, Mark Rutland wrote:
> Hi,
> 
> Steve pointed me at this thread over IRC -- I'm not subscribed to this list so
> grabbed a copy of the thread thus far via b4.
> 
> On Tue, Jan 25, 2022 at 11:20:27AM +0800, Yinan Liu wrote:
> > > Yeah, I think it's time to opt in, instead of opting out.
> 
> I agree this must be opt-in rather than opt-out.
> 
> However, I think most architectures were broken (in at least some
> configurations) by commit:
> 
>   72b3942a173c387b ("scripts: ftrace - move the sort-processing in 
> ftrace_init")
> 
> ... and so I don't think this fix is correct as-is, and we might want to 
> revert
> that or at least mark it as BROKEN for now.

Steve asked for a bit more detail on IRC, so the below is an attempt to explain
what's actually going on here.

The short answer is that relocatable kernels (e.g. those with KASLR support)
need to handle the kernel being loaded at (somewhat) arbitrary virtual
addresses. Even where code can be position-independent, any pointers in the
kernel image (e.g. the contents of the mcount_loc table) need to be updated to
account for the specific VA the kernel was loaded at -- arch code does this
early at boot time by applying dynamic (ELF) relocations.

Walking through how we get there, considering arm64 specifically:

1) When an object is created with traceable functions:

   The compiler records the addresses of the callsites into a section. Those
   are absolute virtual addresses, but the final virtual addresses are not yet
   known, so the compiler generates ELF relocations to tell the linker how to
   fill these in later.

   On arm64, since the compiler doesn't know the final value yet, it fills the
   actual values with zero for now. Other architectures might do differently.

   For example, per `objdump -r init/main.o`:

   | RELOCATION RECORDS FOR [__patchable_function_entries]:
   | OFFSET   TYPE  VALUE 
   |  R_AARCH64_ABS64   .text+0x0028
   | 0008 R_AARCH64_ABS64   .text+0x0088
   | 0010 R_AARCH64_ABS64   .text+0x00e8

2) When vmlinux is linked:

   The linker script accumulates the callsite pointers from all the object
   files into the mcount_loc table. Since the kernel is relocatable, the
   runtime absolute addresses are still not yet known, but the offset relative
   to the kernel base is known, and so the linker consumes the absolute
   relocations created by the compiler and generates new relocations relative
   to the kernel's default load address so that these can be adjusted at boot
   time.

   On arm64, those are RELA and/or RELR relocations, which our vmlinux.lds.S
   accumulates those into a location in the initdata section that the kernel
   can find at boot time:

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/vmlinux.lds.S?h=v5.17-rc1#n252
 
   For example, per `objdump -s vmlinux -j .rela.dyn`:

   | vmlinux: file format elf64-littleaarch64
   | 
   | Contents of section .rela.dyn:
   |  89beb0d0 b01a0b09 0080 0304   
   |  89beb0e0 00600b09 0080 b81a0b09 0080  .`..
   |  89beb0f0 0304  80710b09 0080  .q..
   |  89beb100 e8670b09 0080 0304   .g..
   |  89beb110 48e60809 0080 f0670b09 0080  Hg..
   |  89beb120 0304  ec190b09 0080  

   Each of the relocations in .rela.dyn consists of 3 64-bit little-endian
   values:

   * The first (e.g. 0x890b1ab0) is the VA of the pointer to write to,
 assuming the kernel's default load address (e.g. 0x8800). An
 offset must be applied to this depending on where the kernel was actually
 loaded relative to that default load address.

   * The second (e.g. 0x0403) is the ELF relocation type (1027, AKA
 R_AARCH64_RELATIVE).

   * The third, (e.g. 0x890b6000) is the VA to write to the pointer,
 assuming the kernel's default load address (e.g. 0x8800). An
 offset must be applied to this depending on where the kernel was actually
 loaded relative to that default load address.
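
   Putting those three fields together, applying one such entry amounts to
   roughly the following (a C sketch of the boot-time behaviour described
   above; Elf64_Rela field names are per the standard ELF layout):

       #include <linux/elf.h>
       #include <linux/types.h>

       /* offset is the difference between the actual and default load
        * addresses, i.e. the KASLR offset. */
       static void apply_relative(const Elf64_Rela *rela, u64 offset)
       {
               u64 *place = (u64 *)(rela->r_offset + offset);

               if (ELF64_R_TYPE(rela->r_info) == R_AARCH64_RELATIVE)
                       *place = rela->r_addend + offset;
       }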

   The AArch64 ELF spec defines our relocations, e.g.

 
https://github.com/ARM-software/abi-aa/blob/mai

Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-26 Thread Mark Rutland
Hi,

Steve pointed me at this thread over IRC -- I'm not subscribed to this list so
grabbed a copy of the thread thus far via b4.

On Tue, Jan 25, 2022 at 11:20:27AM +0800, Yinan Liu wrote:
> > Yeah, I think it's time to opt in, instead of opting out.

I agree this must be opt-in rather than opt-out.

However, I think most architectures were broken (in at least some
configurations) by commit:

  72b3942a173c387b ("scripts: ftrace - move the sort-processing in ftrace_init")

... and so I don't think this fix is correct as-is, and we might want to revert
that or at least mark it as BROKEN for now.

More on that below.

> > 
> > Something like this:
> > 
> > -- Steve
> > 
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index c2724d986fa0..5256ebe57451 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -82,6 +82,7 @@ config ARM
> > select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
> > select HAVE_CONTEXT_TRACKING
> > select HAVE_C_RECORDMCOUNT
> > +   select HAVE_BUILDTIME_MCOUNT_SORT
> > select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
> > select HAVE_DMA_CONTIGUOUS if MMU
> > select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU

IIUC the 32-bit arm kernel can be relocated at boot time, so I don't believe
this is correct, and I believe any relocatable arm kernel has been broken since
that was introduced.

> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index c4207cf9bb17..7996548b2b27 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -166,6 +166,7 @@ config ARM64
> > select HAVE_ASM_MODVERSIONS
> > select HAVE_EBPF_JIT
> > select HAVE_C_RECORDMCOUNT
> > +   select HAVE_BUILDTIME_MCOUNT_SORT
> > select HAVE_CMPXCHG_DOUBLE
> > select HAVE_CMPXCHG_LOCAL
> > select HAVE_CONTEXT_TRACKING

The arm64 kernel is relocatable by default, and has been broken since the
build-time sort was introduced -- I see ftrace test failures, and the
CONFIG_FTRACE_SORT_STARTUP_TEST screams at boot time.

> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7399327d1eff..46080dea5dba 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -186,6 +186,7 @@ config X86
> > select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
> > select HAVE_C_RECORDMCOUNT
> > select HAVE_OBJTOOL_MCOUNT  if STACK_VALIDATION
> > +   select HAVE_BUILDTIME_MCOUNT_SORT

Isn't x86 relocatable in some configurations (e.g. for KASLR)?

I can't see how the sort works for those cases, because the mcount_loc entries
are absolute, and either:

* The sorted entries will get overwritten by the unsorted relocation entries,
  and won't be sorted.

* The sorted entries won't get overwritten, but then the absolute address will
  be wrong since they hadn't been relocated.

How does that work?

Thanks,
Mark.

> > select HAVE_DEBUG_KMEMLEAK
> > select HAVE_DMA_CONTIGUOUS
> > select HAVE_DYNAMIC_FTRACE
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 752ed89a293b..7e5b92090faa 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
> > help
> >   C version of recordmcount available?
> > +config HAVE_BUILDTIME_MCOUNT_SORT
> > +   bool
> > +   help
> > + An architecture selects this if it sorts the mcount_loc section
> > +at build time.
> > +
> >   config BUILDTIME_MCOUNT_SORT
> >  bool
> >  default y
> > -   depends on BUILDTIME_TABLE_SORT && !S390
> > +   depends on HAVE_BUILDTIME_MCOUNT_SORT
> >  help
> >Sort the mcount_loc section at build time.
> 
> LGTM. This will no longer destroy ftrace on other architectures.
> Those arches that we are not sure about can test and enable this function by
> themselves.
> 
> 
> Best regards
> --yinan
> 


Re: [PATCH] of: unmap memory regions in /memreserve node

2021-12-02 Thread Mark Rutland
On Tue, Nov 30, 2021 at 04:43:31PM -0600, Rob Herring wrote:
> +linuxppc-dev
> 
> On Wed, Nov 24, 2021 at 09:33:47PM +0800, Calvin Zhang wrote:
> > Reserved memory regions in /memreserve node aren't and shouldn't
> > be referenced elsewhere. So mark them no-map to skip direct mapping
> > for them.
> 
> I suspect this has a high chance of breaking some platform. There's no 
> rule a region can't be accessed.

The subtlety is that the region shouldn't be explicitly accessed (e.g.
modified), but the OS is permitted to have the region mapped. In ePAPR this is
described as:

   This requirement is necessary because the client program is permitted to map
   memory with storage attributes specified as not Write Through Required, not
   Caching Inhibited, and Memory Coherence Required (i.e., WIMG = 0b001x), and
   VLE=0 where supported. The client program may use large virtual pages that
   contain reserved memory. However, the client program may not modify reserved
   memory, so the boot program may perform accesses to reserved memory as Write
   Through Required where conflicting values for this storage attribute are
   architecturally permissible.

Historically arm64 relied upon this for spin-table to work, and I *think* we
might not need that any more. I agree that there's a high chance this will break
something (especially on 16K or 64K page size kernels), so I'd prefer to leave
it as-is.

If someone requires no-map behaviour, they should use a /reserved-memory entry
with a no-map property, which will work today and document their requirement
explicitly.
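
For reference, such an entry looks roughly like this in the DT (node name,
address and size here are made up for illustration):

    reserved-memory {
            #address-cells = <2>;
            #size-cells = <2>;
            ranges;

            firmware@80000000 {
                    reg = <0x0 0x80000000 0x0 0x100000>;
                    no-map;
            };
    };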

Thanks,
Mark.

> > Signed-off-by: Calvin Zhang 
> > ---
> >  drivers/of/fdt.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> > index bdca35284ceb..9e88cc8445f6 100644
> > --- a/drivers/of/fdt.c
> > +++ b/drivers/of/fdt.c
> > @@ -638,7 +638,7 @@ void __init early_init_fdt_scan_reserved_mem(void)
> > fdt_get_mem_rsv(initial_boot_params, n, &base, &size);
> > if (!size)
> > break;
> > -   early_init_dt_reserve_memory_arch(base, size, false);
> > +   early_init_dt_reserve_memory_arch(base, size, true);
> > }
> >  
> > fdt_scan_reserved_mem();
> > -- 
> > 2.30.2
> > 
> > 


  1   2   3   >