Re: [PATCH v2 3/4] PCI/AER: Fetch information for FTrace

2024-02-02 Thread Wang, Qingshun
On Fri, Feb 02, 2024 at 10:01:40AM -0800, Dan Williams wrote:
> Wang, Qingshun wrote:
> > Fetch and store the data of 3 more registers: "Link Status", "Device
> > Control 2", and "Advanced Error Capabilities and Control". This data is
> > needed for external observation to better understand ANFE.
> > 
> > Signed-off-by: "Wang, Qingshun" 
> > ---
> >  drivers/acpi/apei/ghes.c |  8 +++-
> >  drivers/cxl/core/pci.c   | 11 ++-
> >  drivers/pci/pci.h|  4 
> >  drivers/pci/pcie/aer.c   | 26 --
> >  include/linux/aer.h  |  6 --
> >  5 files changed, 45 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 6034039d5cff..047cc01be68c 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -594,7 +594,9 @@ static void ghes_handle_aer(struct 
> > acpi_hest_generic_data *gdata)
> > if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> > pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> > struct pcie_capability_regs *pcie_caps;
> > +   u16 device_control_2 = 0;
> > u16 device_status = 0;
> > +   u16 link_status = 0;
> > unsigned int devfn;
> > int aer_severity;
> > u8 *aer_info;
> > @@ -619,7 +621,9 @@ static void ghes_handle_aer(struct 
> > acpi_hest_generic_data *gdata)
> >  
> > if (pcie_err->validation_bits & CPER_PCIE_VALID_CAPABILITY) {
> > pcie_caps = (struct pcie_capability_regs 
> > *)pcie_err->capability;
> > +   device_control_2 = pcie_caps->device_control_2;
> > device_status = pcie_caps->device_status;
> > +   link_status = pcie_caps->link_status;
> > }
> >  
> > aer_recover_queue(pcie_err->device_id.segment,
> > @@ -627,7 +631,9 @@ static void ghes_handle_aer(struct 
> > acpi_hest_generic_data *gdata)
> >   devfn, aer_severity,
> >   (struct aer_capability_regs *)
> >   aer_info,
> > - device_status);
> > + device_status,
> > + link_status,
> > + device_control_2);
> > }
> >  #endif
> >  }
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 9111a4415a63..3aa57fe8db42 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -903,7 +903,9 @@ static void cxl_handle_rdport_errors(struct 
> > cxl_dev_state *cxlds)
> > struct aer_capability_regs aer_regs;
> > struct cxl_dport *dport;
> > struct cxl_port *port;
> > +   u16 device_control_2;
> > u16 device_status;
> > +   u16 link_status;
> > int severity;
> >  
> > port = cxl_pci_find_port(pdev, &dport);
> > @@ -918,10 +920,17 @@ static void cxl_handle_rdport_errors(struct 
> > cxl_dev_state *cxlds)
> > if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
> > return;
> >  
> > +   if (pcie_capability_read_word(pdev, PCI_EXP_DEVCTL2, &device_control_2))
> > +   return;
> > +
> > if (pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &device_status))
> > return;
> >  
> > -   pci_print_aer(pdev, severity, &aer_regs, device_status);
> > +   if (pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &link_status))
> > +   return;
> > +
> > +   pci_print_aer(pdev, severity, &aer_regs, device_status,
> > + link_status, device_control_2);
> 
> Rather than complicate the calling convention of pci_print_aer(), update
> the internals of pci_print_aer() to get these extra registers, or
> provide a new wrapper interface that satisfies the dependencies and
> switch users over to that.  Otherwise multiple touches of the same code
> path in one patch set are indicative of the need for a higher level
> helper.

Thanks for the advice; it does make sense. I will reconsider the
implementation.

--
Best regards,
Wang, Qingshun


[PATCH] powerpc/pseries/papr-sysparm: use u8 arrays for payloads

2024-02-02 Thread Nathan Lynch via B4 Relay
From: Nathan Lynch 

Some PAPR system parameter values are formatted by firmware as
nul-terminated strings (e.g. LPAR name, shared processor attributes).
But the values returned for other parameters, such as processor module
info and TLB block invalidate characteristics, are binary data with
parameter-specific layouts. So char[] isn't the appropriate type for
the general case. Use u8/__u8.
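
For illustration only, a minimal userspace sketch of treating the payload
as raw bytes rather than a nul-terminated string; the helper below is
hypothetical and not part of this patch:

	#include <stdint.h>
	#include <stdio.h>

	/* Dump a parameter payload byte by byte; binary-safe, no nul assumption. */
	static void dump_payload(const uint8_t *data, uint16_t len)
	{
		for (uint16_t i = 0; i < len; i++)
			printf("%02x ", data[i]);
		printf("\n");
	}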

Signed-off-by: Nathan Lynch 
Fixes: 905b9e48786e ("powerpc/pseries/papr-sysparm: Expose character device to 
user space")
---
I'd like to get this in for v6.8 so the uapi header has the change for
its first point release.
---
 arch/powerpc/include/asm/papr-sysparm.h  | 2 +-
 arch/powerpc/include/uapi/asm/papr-sysparm.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/papr-sysparm.h 
b/arch/powerpc/include/asm/papr-sysparm.h
index 0dbbff59101d..c3cd5b131033 100644
--- a/arch/powerpc/include/asm/papr-sysparm.h
+++ b/arch/powerpc/include/asm/papr-sysparm.h
@@ -32,7 +32,7 @@ typedef struct {
  */
 struct papr_sysparm_buf {
__be16 len;
-   char val[PAPR_SYSPARM_MAX_OUTPUT];
+   u8 val[PAPR_SYSPARM_MAX_OUTPUT];
 };
 
 struct papr_sysparm_buf *papr_sysparm_buf_alloc(void);
diff --git a/arch/powerpc/include/uapi/asm/papr-sysparm.h 
b/arch/powerpc/include/uapi/asm/papr-sysparm.h
index 9f9a0f267ea5..f733467b1534 100644
--- a/arch/powerpc/include/uapi/asm/papr-sysparm.h
+++ b/arch/powerpc/include/uapi/asm/papr-sysparm.h
@@ -14,7 +14,7 @@ enum {
 struct papr_sysparm_io_block {
__u32 parameter;
__u16 length;
-   char data[PAPR_SYSPARM_MAX_OUTPUT];
+   __u8 data[PAPR_SYSPARM_MAX_OUTPUT];
 };
 
 /**

---
base-commit: 44a1aad2fe6c10bfe0589d8047057b10a4c18a19
change-id: 20240201-papr-sysparm-ioblock-data-use-u8-10d283cb6f1c

Best regards,
-- 
Nathan Lynch 



Re: [PATCH linux-next 1/3] x86, crash: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope

2024-02-02 Thread Nathan Chancellor
This series resolves the build issues I was seeing. Please feel free to
carry

  Tested-by: Nathan Chancellor  # build

forward if there are any more revisions without drastic changes.

On Mon, Jan 29, 2024 at 09:50:31PM +0800, Baoquan He wrote:
> Michael pointed out that the #ifdef CONFIG_CRASH_DUMP is nested inside
> the CONFIG_KEXEC_CORE ifdef scope in arch/x86/xen/enlighten_hvm.c.
> 
> Although the nesting works too, since CONFIG_CRASH_DUMP depends on
> CONFIG_KEXEC_CORE, it may cause confusion because there are places
> where it's not nested, and people may think it needs to be nested
> even though it doesn't have to be.
> 
> Fix that by moving the CONFIG_CRASH_DUMP ifdeffery out of the
> CONFIG_KEXEC_CORE ifdeffery scope.
> 
> Also fix a build error Nathan reported (shown below) by replacing the
> CONFIG_KEXEC_CORE ifdef with a CONFIG_VMCORE_INFO ifdef.
> 
> 
> $ curl -LSso .config 
> https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.x86_64
> $ make -skj"$(nproc)" ARCH=x86_64 CROSS_COMPILE=x86_64-linux- olddefconfig all
> ...
> x86_64-linux-ld: arch/x86/xen/mmu_pv.o: in function `paddr_vmcoreinfo_note':
> mmu_pv.c:(.text+0x3af3): undefined reference to `vmcoreinfo_note'
> 
> 
> Link: 
> https://lore.kernel.org/all/sn6pr02mb4157931105fa68d72e3d3db8d4...@sn6pr02mb4157.namprd02.prod.outlook.com/T/#u
> Link: 
> https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
> Signed-off-by: Baoquan He 
> ---
>  arch/x86/kernel/cpu/mshyperv.c | 10 ++
>  arch/x86/kernel/reboot.c   |  2 +-
>  arch/x86/xen/enlighten_hvm.c   |  4 ++--
>  arch/x86/xen/mmu_pv.c  |  2 +-
>  4 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index f8163a59026b..2e8cd5a4ae85 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -209,6 +209,7 @@ static void hv_machine_shutdown(void)
>   if (kexec_in_progress)
>   hyperv_cleanup();
>  }
> +#endif /* CONFIG_KEXEC_CORE */
>  
>  #ifdef CONFIG_CRASH_DUMP
>  static void hv_machine_crash_shutdown(struct pt_regs *regs)
> @@ -222,8 +223,7 @@ static void hv_machine_crash_shutdown(struct pt_regs 
> *regs)
>   /* Disable the hypercall page when there is only 1 active CPU. */
>   hyperv_cleanup();
>  }
> -#endif
> -#endif /* CONFIG_KEXEC_CORE */
> +#endif /* CONFIG_CRASH_DUMP */
>  #endif /* CONFIG_HYPERV */
>  
>  static uint32_t  __init ms_hyperv_platform(void)
> @@ -497,9 +497,11 @@ static void __init ms_hyperv_init_platform(void)
>   no_timer_check = 1;
>  #endif
>  
> -#if IS_ENABLED(CONFIG_HYPERV) && defined(CONFIG_KEXEC_CORE)
> +#if IS_ENABLED(CONFIG_HYPERV)
> +#if defined(CONFIG_KEXEC_CORE)
>   machine_ops.shutdown = hv_machine_shutdown;
> -#ifdef CONFIG_CRASH_DUMP
> +#endif
> +#if defined(CONFIG_CRASH_DUMP)
>   machine_ops.crash_shutdown = hv_machine_crash_shutdown;
>  #endif
>  #endif
> diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
> index 1287b0d5962f..f3130f762784 100644
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -826,7 +826,7 @@ void machine_halt(void)
>   machine_ops.halt();
>  }
>  
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_CRASH_DUMP
>  void machine_crash_shutdown(struct pt_regs *regs)
>  {
>   machine_ops.crash_shutdown(regs);
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index 09e3db7ff990..0b367c1e086d 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -148,6 +148,7 @@ static void xen_hvm_shutdown(void)
>   if (kexec_in_progress)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
> +#endif
>  
>  #ifdef CONFIG_CRASH_DUMP
>  static void xen_hvm_crash_shutdown(struct pt_regs *regs)
> @@ -156,7 +157,6 @@ static void xen_hvm_crash_shutdown(struct pt_regs *regs)
>   xen_reboot(SHUTDOWN_soft_reset);
>  }
>  #endif
> -#endif
>  
>  static int xen_cpu_up_prepare_hvm(unsigned int cpu)
>  {
> @@ -238,10 +238,10 @@ static void __init xen_hvm_guest_init(void)
>  
>  #ifdef CONFIG_KEXEC_CORE
>   machine_ops.shutdown = xen_hvm_shutdown;
> +#endif
>  #ifdef CONFIG_CRASH_DUMP
>   machine_ops.crash_shutdown = xen_hvm_crash_shutdown;
>  #endif
> -#endif
>  }
>  
>  static __init int xen_parse_nopv(char *arg)
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 218773cfb009..e21974f2cf2d 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2520,7 +2520,7 @@ int xen_remap_pfn(struct vm_area_struct *vma, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL_GPL(xen_remap_pfn);
>  
> -#ifdef CONFIG_KEXEC_CORE
> +#ifdef CONFIG_VMCORE_INFO
>  phys_addr_t paddr_vmcoreinfo_note(void)
>  {
>   if (xen_pv_domain())
> -- 
> 2.41.0
> 


Re: [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc

2024-02-02 Thread Alexandre Ghiti
On Fri, Feb 2, 2024 at 4:42 PM Alexandre Ghiti  wrote:
>
> Hi Andrea,
>
> On Thu, Feb 1, 2024 at 4:03 PM Andrea Parri  wrote:
> >
> > On Wed, Jan 31, 2024 at 04:59:29PM +0100, Alexandre Ghiti wrote:
> > > The preventive sfence.vma were emitted because new mappings must be made
> > > visible to the page table walker, but Svvptc guarantees that xRET acts as
> > > a fence, so there is no need to sfence.vma on the uarchs that implement this
> > > extension.
> >
> > AFAIU, your first submission shows that you don't need that xRET property.
> > Similarly for other archs.  What was the rationale behind this Svvptc change?
>
> Actually, the ARC has just changed its mind and removed this new

The wording was incorrect here: the ARC did not state anything; the
author of Svvptc proposed an amended version of the spec that removes
this behaviour, and that amendment is under discussion.

> behaviour from the Svvptc extension, so we will take some gratuitous
> page faults (but those should be outliers), which makes riscv similar
> to x86 and arm64.
>
> >
> >
> > > This allows to drastically reduce the number of sfence.vma emitted:
> > >
> > > * Ubuntu boot to login:
> > > Before: ~630k sfence.vma
> > > After:  ~200k sfence.vma
> > >
> > > * ltp - mmapstress01
> > > Before: ~45k
> > > After:  ~6.3k
> > >
> > > * lmbench - lat_pagefault
> > > Before: ~665k
> > > After:   832 (!)
> > >
> > > * lmbench - lat_mmap
> > > Before: ~546k
> > > After:   718 (!)
> >
> > This Svvptc seems to move/add the "burden" of the synchronization to xRET:
> > Perhaps integrate the above counts w/ the perf gains in the cover letter?
>
> Yes, I'll copy that to the cover letter.
>
> Thanks for your interest!
>
> Alex
>
> >
> >   Andrea


Re: [PATCH v2 3/4] PCI/AER: Fetch information for FTrace

2024-02-02 Thread Dan Williams
Wang, Qingshun wrote:
> Fetch and store the data of 3 more registers: "Link Status", "Device
> Control 2", and "Advanced Error Capabilities and Control". This data is
> needed for external observation to better understand ANFE.
> 
> Signed-off-by: "Wang, Qingshun" 
> ---
>  drivers/acpi/apei/ghes.c |  8 +++-
>  drivers/cxl/core/pci.c   | 11 ++-
>  drivers/pci/pci.h|  4 
>  drivers/pci/pcie/aer.c   | 26 --
>  include/linux/aer.h  |  6 --
>  5 files changed, 45 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 6034039d5cff..047cc01be68c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -594,7 +594,9 @@ static void ghes_handle_aer(struct acpi_hest_generic_data 
> *gdata)
>   if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>   pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
>   struct pcie_capability_regs *pcie_caps;
> + u16 device_control_2 = 0;
>   u16 device_status = 0;
> + u16 link_status = 0;
>   unsigned int devfn;
>   int aer_severity;
>   u8 *aer_info;
> @@ -619,7 +621,9 @@ static void ghes_handle_aer(struct acpi_hest_generic_data 
> *gdata)
>  
>   if (pcie_err->validation_bits & CPER_PCIE_VALID_CAPABILITY) {
>   pcie_caps = (struct pcie_capability_regs 
> *)pcie_err->capability;
> + device_control_2 = pcie_caps->device_control_2;
>   device_status = pcie_caps->device_status;
> + link_status = pcie_caps->link_status;
>   }
>  
>   aer_recover_queue(pcie_err->device_id.segment,
> @@ -627,7 +631,9 @@ static void ghes_handle_aer(struct acpi_hest_generic_data 
> *gdata)
> devfn, aer_severity,
> (struct aer_capability_regs *)
> aer_info,
> -   device_status);
> +   device_status,
> +   link_status,
> +   device_control_2);
>   }
>  #endif
>  }
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9111a4415a63..3aa57fe8db42 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -903,7 +903,9 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state 
> *cxlds)
>   struct aer_capability_regs aer_regs;
>   struct cxl_dport *dport;
>   struct cxl_port *port;
> + u16 device_control_2;
>   u16 device_status;
> + u16 link_status;
>   int severity;
>  
> port = cxl_pci_find_port(pdev, &dport);
> @@ -918,10 +920,17 @@ static void cxl_handle_rdport_errors(struct 
> cxl_dev_state *cxlds)
> if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
>   return;
>  
> + if (pcie_capability_read_word(pdev, PCI_EXP_DEVCTL2, &device_control_2))
> + return;
> +
> if (pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &device_status))
>   return;
>  
> - pci_print_aer(pdev, severity, &aer_regs, device_status);
> + if (pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &link_status))
> + return;
> +
> + pci_print_aer(pdev, severity, &aer_regs, device_status,
> +   link_status, device_control_2);

Rather than complicate the calling convention of pci_print_aer(), update
the internals of pci_print_aer() to get these extra registers, or
provide a new wrapper interface that satisfies the dependencies and
switch users over to that.  Otherwise multiple touches of the same code
path in one patch set are indicative of the need for a higher level
helper.
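
For illustration only, a minimal sketch of the kind of wrapper being
suggested here; the name pci_print_aer_regs() and its exact behaviour are
assumptions for this example, not something proposed in the patch:

	#include <linux/aer.h>
	#include <linux/pci.h>

	/*
	 * Hypothetical wrapper: gather the extra registers internally so that
	 * callers keep a simple calling convention and only one place needs to
	 * know about the extended pci_print_aer() signature.
	 */
	static void pci_print_aer_regs(struct pci_dev *pdev, int severity,
				       struct aer_capability_regs *aer_regs)
	{
		u16 device_status = 0, link_status = 0, device_control_2 = 0;

		/* Best effort: a failed read leaves the value as 0. */
		pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &device_status);
		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &link_status);
		pcie_capability_read_word(pdev, PCI_EXP_DEVCTL2, &device_control_2);

		pci_print_aer(pdev, severity, aer_regs, device_status,
			      link_status, device_control_2);
	}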


[PATCH v2] powerpc/64: Set task pt_regs->link to the LR value on scv entry

2024-02-02 Thread Naveen N Rao
Nysal reported that userspace backtraces are missing in the offcputime bcc
tool. As an example:
$ sudo ./bcc/tools/offcputime.py -uU
Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to 
end.

^C
write
-python (9107)
8

write
-sudo (9105)
9

mmap
-python (9107)
16

clock_nanosleep
-multipathd (697)
3001604

The offcputime bcc tool attaches a bpf program to a kprobe on
finish_task_switch(), which is usually hit on a syscall from userspace.
With the switch to system call vectored, we started setting
pt_regs->link to zero. This is because system call vectored behaves like
a function call with LR pointing to the system call return address, and
with no modification to SRR0/SRR1. The LR value does indicate our next
instruction, so it is being saved as pt_regs->nip, and pt_regs->link is
being set to zero. This is not a problem by itself, but BPF uses perf
callchain infrastructure for capturing stack traces, and that stores LR
as the second entry in the stack trace. perf has code to cope with the
second entry being zero, and skips over it. However, generic userspace
unwinders assume that a zero entry indicates end of the stack trace,
resulting in a truncated userspace stack trace.

Rather than fixing all userspace unwinders to ignore/skip past the
second entry, store the real LR value in pt_regs->link so that there
continues to be a valid, though duplicate entry in the stack trace.
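
For illustration, a minimal sketch (not the actual perf/BPF code) of why a
zero entry truncates the trace for a generic consumer of the callchain:

	#include <stdio.h>

	/*
	 * Hypothetical callchain consumer: it stops at the first zero entry,
	 * so a zeroed LR slot in entry [1] hides everything after it.
	 */
	static void print_user_stack(const unsigned long *ips, unsigned int nr)
	{
		for (unsigned int i = 0; i < nr; i++) {
			if (!ips[i])		/* zero treated as end of stack */
				break;
			printf("%#lx\n", ips[i]);
		}
	}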

With this change:
$ sudo ./bcc/tools/offcputime.py -uU
Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to 
end.

^C
write
write
[unknown]
[unknown]
[unknown]
[unknown]
[unknown]
PyObject_VectorcallMethod
[unknown]
[unknown]
PyObject_CallOneArg
PyFile_WriteObject
PyFile_WriteString
[unknown]
[unknown]
PyObject_Vectorcall
_PyEval_EvalFrameDefault
PyEval_EvalCode
[unknown]
[unknown]
[unknown]
_PyRun_SimpleFileObject
_PyRun_AnyFileObject
Py_RunMain
[unknown]
Py_BytesMain
[unknown]
__libc_start_main
-python (1293)
7

write
write
[unknown]
sudo_ev_loop_v1
sudo_ev_dispatch_v1
[unknown]
[unknown]
[unknown]
[unknown]
__libc_start_main
-sudo (1291)
7

syscall
syscall
bpf_open_perf_buffer_opts
[unknown]
[unknown]
[unknown]
[unknown]
_PyObject_MakeTpCall
PyObject_Vectorcall
_PyEval_EvalFrameDefault
PyEval_EvalCode
[unknown]
[unknown]
[unknown]
_PyRun_SimpleFileObject
_PyRun_AnyFileObject
Py_RunMain
[unknown]
Py_BytesMain
[unknown]
__libc_start_main
-python (1293)
11

clock_nanosleep
clock_nanosleep
nanosleep
sleep
[unknown]
[unknown]
__clone
-multipathd (698)
3001661

Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv 
instructions")
Cc: sta...@vger.kernel.org
Reported-by: Nysal Jan K.A 
Signed-off-by: Naveen N Rao 
---
v2: Update change log, re-order instructions storing into pt_regs->nip 
and pt_regs->link and add a comment to better describe the change. Also 
added a Fixes: tag.


 arch/powerpc/kernel/interrupt_64.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt_64.S 
b/arch/powerpc/kernel/interrupt_64.S
index bd863702d812..1ad059a9e2fe 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -52,7 +52,8 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
mr  r10,r1
ld  r1,PACAKSAVE(r13)
std r10,0(r1)
-   std r11,_NIP(r1)
+   std r11,_LINK(r1)
+   std r11,_NIP(r1)/* Saved LR is also the next instruction */
std r12,_MSR(r1)
std r0,GPR0(r1)
std r10,GPR1(r1)
@@ -70,7 +71,6 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
std r9,GPR13(r1)
SAVE_NVGPRS(r1)
std r11,_XER(r1)
-   std r11,_LINK(r1)
std r11,_CTR(r1)
 
li  r11,\trapnr

base-commit: 414e92af226ede4935509b0b5e041810c92e003f
-- 
2.43.0



Re: [PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc

2024-02-02 Thread Alexandre Ghiti
Hi Andrea,

On Thu, Feb 1, 2024 at 4:03 PM Andrea Parri  wrote:
>
> On Wed, Jan 31, 2024 at 04:59:29PM +0100, Alexandre Ghiti wrote:
> > The preventive sfence.vma were emitted because new mappings must be made
> > visible to the page table walker, but Svvptc guarantees that xRET acts as
> > a fence, so there is no need to sfence.vma on the uarchs that implement this
> > extension.
>
> AFAIU, your first submission shows that you don't need that xRET property.
> Similarly for other archs.  What was the rationale behind this Svvptc change?

Actually, the ARC has just changed its mind and removed this new
behaviour from the Svvptc extension, so we will take some gratuitous
page faults (but those should be outliers), which makes riscv similar
to x86 and arm64.

>
>
> > This allows to drastically reduce the number of sfence.vma emitted:
> >
> > * Ubuntu boot to login:
> > Before: ~630k sfence.vma
> > After:  ~200k sfence.vma
> >
> > * ltp - mmapstress01
> > Before: ~45k
> > After:  ~6.3k
> >
> > * lmbench - lat_pagefault
> > Before: ~665k
> > After:   832 (!)
> >
> > * lmbench - lat_mmap
> > Before: ~546k
> > After:   718 (!)
>
> This Svvptc seems to move/add the "burden" of the synchronization to xRET:
> Perhaps integrate the above counts w/ the perf gains in the cover letter?

Yes, I'll copy that to the cover letter.

Thanks for your interest!

Alex

>
>   Andrea


Re: Re: [PATCH] powerpc/64: Set LR to a non-NULL value in task pt_regs on scv entry

2024-02-02 Thread Naveen N Rao
On Fri, Feb 02, 2024 at 01:02:39PM +1100, Michael Ellerman wrote:
> Segher Boessenkool  writes:
> > Hi!
> >
> > On Thu, Jan 25, 2024 at 05:12:28PM +0530, Naveen N Rao wrote:
> >> diff --git a/arch/powerpc/kernel/interrupt_64.S 
> >> b/arch/powerpc/kernel/interrupt_64.S
> >> index bd863702d812..5cf3758a19d3 100644
> >> --- a/arch/powerpc/kernel/interrupt_64.S
> >> +++ b/arch/powerpc/kernel/interrupt_64.S
> >> @@ -53,6 +53,7 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
> >>ld  r1,PACAKSAVE(r13)
> >>std r10,0(r1)
> >>std r11,_NIP(r1)
> >> +  std r11,_LINK(r1)
> >
> > Please add a comment here then, saying what the store is for?
> 
> Yeah a comment would be good. 
> 
> Also, the r11 value comes from LR, so it's not that we're storing the NIP
> value into the LR slot; rather, the value we store in NIP is from LR, see:
> 
> EXC_VIRT_BEGIN(system_call_vectored, 0x3000, 0x1000)
>   /* SCV 0 */
>   mr  r9,r13
>   GET_PACA(r13)
>   mflr  r11
> ...
>   b   system_call_vectored_common
> 
> That's slightly pedantic, but I think it answers the question of why
> it's OK to use the same value for NIP & LR, or why we don't have to do
> mflr in system_call_vectored_common to get the actual LR value.

Thanks for clarifying that. I should have done a better job describing 
that in the commit log. I'll update that, add a comment here and send a 
v2.


- Naveen



Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] OOps kernel crash after performing dlpar remove test

2024-02-02 Thread Robin Murphy

On 02/02/2024 7:11 am, Tasmiya Nalatwad wrote:

Greetings,

I have tried reverting some of the latest commits and tested the issue. I see
that reverting the commit below hits another problem which was reported
earlier; the patch fixing that issue is under review.

1. Reverted commit :

  commit 17de3f5fdd35676b0e3d41c7c9bf4e3032eb3673
  iommu: Retire bus ops

2. Below are the traces of the other issue that was seen after reverting
the above commit, and below is the patch, currently under review, which
fixes this issue


Patch :
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg225210.html


Yes, it's the same fundamental issue (failing to manage the IOMMU state 
for dynamic addition/removal) that's been present since the commit cited 
in the fix patch; the bus ops change just makes us more sensitive to the 
lack of unregistration on remove, vs. the lack of registration on add. 
The fix should solve both aspects (although I'd be inclined to agree with
factoring out the registration between both paths).


Thanks,
Robin.


--- Traces ---

[  981.124047] Kernel attempted to read user page (30) - exploit
attempt? (uid: 0)
[  981.124053] BUG: Kernel NULL pointer dereference on read at 0x0030
[  981.124056] Faulting instruction address: 0xc0689864
[  981.124060] Oops: Kernel access of bad area, sig: 11 [#1]
[  981.124063] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
[  981.124067] Modules linked in: sit tunnel4 ip_tunnel rpadlpar_io
rpaphp xsk_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding
tls ip_set rfkill nf_tables libcrc32c nfnetlink pseries_rng vmx_crypto
binfmt_misc ext4 mbcache jbd2 dm_service_time sd_mod t10_pi
crc64_rocksoft crc64 sg ibmvfc scsi_transport_fc ibmveth mlx5_core mlxfw
psample dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
[  981.124111] CPU: 24 PID: 78294 Comm: drmgr Kdump: loaded Not tainted
6.5.0-rc6-next-20230817-auto #1
[  981.124115] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200
0xf06 of:IBM,FW1030.30 (NH1030_062) hv:phyp pSeries
[  981.124118] NIP:  c0689864 LR: c09bd05c CTR:
c005fb90
[  981.124121] REGS: c000a878b1e0 TRAP: 0300   Not tainted
(6.5.0-rc6-next-20230817-auto)
[  981.124125] MSR:  80009033   CR:
44822422  XER: 20040006
[  981.124132] CFAR: c09bd058 DAR: 0030 DSISR:
4000 IRQMASK: 0
[  981.124132] GPR00: c09bd05c c000a878b480 c1451400

[  981.124132] GPR04: c128d510  ceeccf50
c000a878b420
[  981.124132] GPR08: 0001 ceed76e0 c2c24c28
0220
[  981.124132] GPR12: c005fb90 c01837969300 

[  981.124132] GPR16:   

[  981.124132] GPR20: c125cef0  c125cf08
c2bce500
[  981.124132] GPR24: c000573e90c0 f000 c000573e93c0
c000a877d2a0
[  981.124132] GPR28: c128d510 ceeccf50 c000a877d2a0
c000573e90c0
[  981.124171] NIP [c0689864] sysfs_add_link_to_group+0x34/0x90
[  981.124178] LR [c09bd05c] iommu_device_link+0x5c/0x110
[  981.124184] Call Trace:
[  981.124186] [c000a878b480] [c048d630]
kmalloc_trace+0x50/0x140 (unreliable)
[  981.124193] [c000a878b4c0] [c09bd05c]
iommu_device_link+0x5c/0x110
[  981.124198] [c000a878b500] [c09ba050]
__iommu_probe_device+0x250/0x5c0
[  981.124203] [c000a878b570] [c09ba9e0]
iommu_probe_device_locked+0x30/0x90
[  981.124207] [c000a878b5a0] [c09baa80]
iommu_probe_device+0x40/0x70
[  981.124212] [c000a878b5d0] [c09baaf0]
iommu_bus_notifier+0x40/0x80
[  981.124217] [c000a878b5f0] [c019aad0]
notifier_call_chain+0xc0/0x1b0
[  981.124221] [c000a878b650] [c019b604]
blocking_notifier_call_chain+0x64/0xa0
[  981.124226] [c000a878b690] [c09cd870] bus_notify+0x50/0x80
[  981.124230] [c000a878b6d0] [c09c8f04] device_add+0x744/0x9b0
[  981.124235] [c000a878b790] [c089f2ec]
pci_device_add+0x2fc/0x880
[  981.124240] [c000a878b840] [c007ef90]
of_create_pci_dev+0x390/0xa10
[  981.124245] [c000a878b920] [c007f858]
__of_scan_bus+0x248/0x320
[  981.124249] [c000a878ba00] [c007c1f0]
pcibios_scan_phb+0x2d0/0x3c0
[  981.124254] [c000a878bad0] [c0107f08]
init_phb_dynamic+0xb8/0x110
[  981.124259] [c000a878bb40] [c00802cc03b4]
dlpar_add_slot+0x18c/0x380 [rpadlpar_io]
[  981.124265] [c000a878bbe0] [c00802cc0bec]
add_slot_store+0xa4/0x150 [rpadlpar_io]
[  981.124270] [c000a878bc70] [c0f2f800]
kobj_attr_store+0x30/0x50
[  981.124274] [c000a878bc90] [c0687368]
sysfs_kf_write+0x68/0x80
[  981.124278] 

Re: [kvm-unit-tests PATCH v2 1/9] (arm|powerpc|s390x): Makefile: Fix .aux.o generation

2024-02-02 Thread Andrew Jones
On Fri, Feb 02, 2024 at 04:57:32PM +1000, Nicholas Piggin wrote:
> Using all prerequisites for the source file results in the build
> dying on the second time around with:
> 
> gcc: fatal error: cannot specify ‘-o’ with ‘-c’, ‘-S’ or ‘-E’ with multiple 
> files
> 
> This is due to auxinfo.h becoming a prerequisite after the first
> build recorded the dependency.
> 
> Use the first prerequisite for this recipe.
> 
> Fixes: f2372f2d49135 ("(arm|powerpc|s390x): Makefile: add `%.aux.o` target")
> Signed-off-by: Nicholas Piggin 
> ---
>  arm/Makefile.common | 2 +-
>  powerpc/Makefile.common | 2 +-
>  s390x/Makefile  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arm/Makefile.common b/arm/Makefile.common
> index 54cb4a63..c2ee568c 100644
> --- a/arm/Makefile.common
> +++ b/arm/Makefile.common
> @@ -71,7 +71,7 @@ FLATLIBS = $(libcflat) $(LIBFDT_archive) $(libeabi)
>  
>  ifeq ($(CONFIG_EFI),y)
>  %.aux.o: $(SRCDIR)/lib/auxinfo.c
> - $(CC) $(CFLAGS) -c -o $@ $^ \
> + $(CC) $(CFLAGS) -c -o $@ $< \
>   -DPROGNAME=\"$(@:.aux.o=.efi)\" -DAUXFLAGS=$(AUXFLAGS)

There are two instances of the %.aux.o target in arm/Makefile.common. We
need to fix both. We can actually pull the target out of the two arms of
the CONFIG_EFI if-else, though, by changing the .efi/.flat to .$(exe).

Thanks,
drew

>  
>  %.so: EFI_LDFLAGS += -defsym=EFI_SUBSYSTEM=0xa --no-undefined
> diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
> index 483ff648..eb88398d 100644
> --- a/powerpc/Makefile.common
> +++ b/powerpc/Makefile.common
> @@ -48,7 +48,7 @@ cflatobjs += lib/powerpc/smp.o
>  OBJDIRS += lib/powerpc
>  
>  %.aux.o: $(SRCDIR)/lib/auxinfo.c
> - $(CC) $(CFLAGS) -c -o $@ $^ -DPROGNAME=\"$(@:.aux.o=.elf)\"
> + $(CC) $(CFLAGS) -c -o $@ $< -DPROGNAME=\"$(@:.aux.o=.elf)\"
>  
>  FLATLIBS = $(libcflat) $(LIBFDT_archive)
>  %.elf: CFLAGS += $(arch_CFLAGS)
> diff --git a/s390x/Makefile b/s390x/Makefile
> index e64521e0..b72f7578 100644
> --- a/s390x/Makefile
> +++ b/s390x/Makefile
> @@ -177,7 +177,7 @@ lds-autodepend-flags = -MMD -MF $(dir $*).$(notdir $*).d 
> -MT $@
>   $(CPP) $(lds-autodepend-flags) $(CPPFLAGS) -P -C -o $@ $<
>  
>  %.aux.o: $(SRCDIR)/lib/auxinfo.c
> - $(CC) $(CFLAGS) -c -o $@ $^ -DPROGNAME=\"$(@:.aux.o=.elf)\"
> + $(CC) $(CFLAGS) -c -o $@ $< -DPROGNAME=\"$(@:.aux.o=.elf)\"
>  
>  .SECONDEXPANSION:
>  %.elf: $(FLATLIBS) $(asmlib) $(SRCDIR)/s390x/flat.lds $$(snippets-obj) 
> $$(snippet-hdr-obj) %.o %.aux.o
> -- 
> 2.42.0
> 


Re: [PATCH v2] powerpc: iommu: Bring back table group release_ownership() call

2024-02-02 Thread Joerg Roedel
On Fri, Jan 26, 2024 at 09:09:18AM -0600, Shivaprasad G Bhat wrote:
> The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and
> remove set_platform_dma_ops") refactored the code removing the
> set_platform_dma_ops(). It missed out the table group
> release_ownership() call which would have got called otherwise
> during the guest shutdown via vfio_group_detach_container(). On
> PPC64, this particular call actually sets up the 32-bit TCE table,
> and enables the 64-bit DMA bypass etc. Now after guest shutdown,
> the subsequent host driver (e.g. megaraid-sas) probe post unbind
> from vfio-pci fails like,
> 
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 
> 0x7fff, table unavailable
> megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x, 
> table unavailable
> megaraid_sas 0031:01:00.0: Failed to set DMA mask
> megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539
> 
> The patch brings back the table_group release_ownership() call when
> switching back to the PLATFORM domain from BLOCKED, and also
> separates the domain_ops for the two.
> 
> Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove 
> set_platform_dma_ops")
> Signed-off-by: Shivaprasad G Bhat 
> ---
> Changelog:
> v1: 
> https://lore.kernel.org/linux-iommu/170618451433.3805.9015493852395837391.st...@ltcd48-lp2.aus.stglab.ibm.com/
>  - Split the common attach_dev call to platform and blocked attach_dev
>calls as suggested.
> 
>  arch/powerpc/kernel/iommu.c |   37 -
>  1 file changed, 28 insertions(+), 9 deletions(-)

Applied, thanks.

-- 
Jörg Rödel
jroe...@suse.de

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany
https://www.suse.com/

Geschäftsführer: Ivo Totev, Andrew McDonald, Werner Knoblich
(HRB 36809, AG Nürnberg)


[PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings

2024-02-02 Thread Ryan Roberts
There are situations where a change to a single PTE could cause the
contpte block in which it resides to become foldable (i.e. could be
repainted with the contiguous bit). Such situations arise, for example,
when user space temporarily changes protections, via mprotect, for
individual pages, as can be the case for certain garbage collectors.

We would like to detect when such a PTE change occurs. However this can
be expensive due to the amount of checking required. Therefore only
perform the checks when an indiviual PTE is modified via mprotect
(ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
when we are setting the final PTE in a contpte-aligned block.

Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 26 +
 arch/arm64/mm/contpte.c  | 64 
 2 files changed, 90 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cdc310880a3b..d3357fe4eb89 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1192,6 +1192,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
  * where it is possible and makes sense to do so. The PTE_CONT bit is 
considered
  * a private implementation detail of the public ptep API (see below).
  */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte);
 extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte);
 extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
@@ -1213,6 +1215,29 @@ extern int contpte_ptep_set_access_flags(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
 
+static __always_inline void contpte_try_fold(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, pte_t pte)
+{
+   /*
+* Only bother trying if both the virtual and physical addresses are
+* aligned and correspond to the last entry in a contig range. The core
+* code mostly modifies ranges from low to high, so this is likely
+* the last modification in the contig range, so a good time to fold.
+* We can't fold special mappings, because there is no associated folio.
+*/
+
+   const unsigned long contmask = CONT_PTES - 1;
+   bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
+
+   if (unlikely(valign)) {
+   bool palign = (pte_pfn(pte) & contmask) == contmask;
+
+   if (unlikely(palign &&
+   pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
+   __contpte_try_fold(mm, addr, ptep, pte);
+   }
+}
+
 static __always_inline void contpte_try_unfold(struct mm_struct *mm,
unsigned long addr, pte_t *ptep, pte_t pte)
 {
@@ -1287,6 +1312,7 @@ static __always_inline void set_ptes(struct mm_struct 
*mm, unsigned long addr,
if (likely(nr == 1)) {
contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
__set_ptes(mm, addr, ptep, pte, 1);
+   contpte_try_fold(mm, addr, ptep, pte);
} else {
contpte_set_ptes(mm, addr, ptep, pte, nr);
}
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 80346108450b..2c7dafd0552a 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -67,6 +67,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned 
long addr,
__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
 }
 
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte)
+{
+   /*
+* We have already checked that the virtual and physical addresses are
+* correctly aligned for a contpte mapping in contpte_try_fold() so the
+* remaining checks are to ensure that the contpte range is fully
+* covered by a single folio, and ensure that all the ptes are valid
+* with contiguous PFNs and matching prots. We ignore the state of the
+* access and dirty bits for the purpose of deciding if its a contiguous
+* range; the folding process will generate a single contpte entry which
+* has a single access and dirty bit. Those 2 bits are the logical OR of
+* their respective bits in the constituent pte entries. In order to
+* ensure the contpte range is covered by a single folio, we must
+* recover the folio from the pfn, but special mappings don't have a
+* folio backing them. Fortunately contpte_try_fold() already checked
+* that the pte is not special - we never try to fold special mappings.
+* Note we can't use vm_normal_page() for this since we don't have the
+* vma.
+*/
+
+   

[PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf

2024-02-02 Thread Ryan Roberts
As set_ptes() and wrprotect_ptes() become a bit more complex, the
compiler may choose not to inline them. But this is critical for fork()
performance. So mark the functions, along with contpte_try_unfold()
which is called by them, as __always_inline. This is worth ~1% on the
fork() microbenchmark with order-0 folios (the common case).

Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 353ea67b5d75..cdc310880a3b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1213,8 +1213,8 @@ extern int contpte_ptep_set_access_flags(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
 
-static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
-   pte_t *ptep, pte_t pte)
+static __always_inline void contpte_try_unfold(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, pte_t pte)
 {
if (unlikely(pte_valid_cont(pte)))
__contpte_try_unfold(mm, addr, ptep, pte);
@@ -1279,7 +1279,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 }
 
 #define set_ptes set_ptes
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
pte = pte_mknoncont(pte);
@@ -1361,8 +1361,8 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
 }
 
 #define wrprotect_ptes wrprotect_ptes
-static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
-   pte_t *ptep, unsigned int nr)
+static __always_inline void wrprotect_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int 
nr)
 {
if (likely(nr == 1)) {
/*
-- 
2.25.1



[PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()

2024-02-02 Thread Ryan Roberts
When core code iterates over a range of ptes and calls ptep_get() for
each of them, if the range happens to cover contpte mappings, the number
of pte reads becomes amplified by a factor of the number of PTEs in a
contpte block. This is because for each call to ptep_get(), the
implementation must read all of the ptes in the contpte block to which
it belongs to gather the access and dirty bits.

This causes a hotspot for fork(), as well as operations that unmap
memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
can fix this by implementing pte_batch_hint() which allows their
iterators to skip getting the contpte tail ptes when gathering the batch
of ptes to operate on. This results in the number of PTE reads returning
to 1 per pte.
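
For illustration, a rough sketch (not the core-mm code itself) of how an
iterator can use the hint to skip over contpte tail entries; the helper
name count_present() is made up for this example:

	/*
	 * Advance by the architecture's batch hint instead of one pte at a
	 * time, so a whole contpte block costs a single ptep_get().
	 */
	static unsigned int count_present(pte_t *ptep, unsigned int max_nr)
	{
		unsigned int i = 0;

		while (i < max_nr) {
			pte_t pte = ptep_get(ptep + i);

			if (!pte_present(pte))
				break;
			i += pte_batch_hint(ptep + i, pte);	/* always >= 1 */
		}

		return i < max_nr ? i : max_nr;
	}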

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ad04adb7b87f..353ea67b5d75 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct 
*mm, unsigned long addr,
__contpte_try_unfold(mm, addr, ptep, pte);
 }
 
+#define pte_batch_hint pte_batch_hint
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+   if (!pte_valid_cont(pte))
+   return 1;
+
+   return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within their page tables (or at least this
-- 
2.25.1



[PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-02 Thread Ryan Roberts
Some architectures (e.g. arm64) can tell from looking at a pte if some
follow-on ptes also map contiguous physical memory with the same pgprot
(for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
the changes are optimized out and the behaviour is as before.

arm64 will opt-in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 18 ++
 mm/memory.c | 20 +---
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 50f32cccbd92..cba31f177d27 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
+#ifndef pte_batch_hint
+/**
+ * pte_batch_hint - Number of pages that can be added to batch without 
scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.
+ *
+ * May be overridden by the architecture, else pte_batch_hint is always 1.
+ */
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+   return 1;
+}
+#endif
+
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 65fbe4f886c1..902665b27702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), 
flags);
-   pte_t *ptep = start_ptep + 1;
+   pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
+   pte_t *ptep = start_ptep;
bool writable;
+   int nr;
 
if (any_writable)
*any_writable = false;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
-   while (ptep != end_ptep) {
+   nr = pte_batch_hint(ptep, pte);
+   expected_pte = pte_advance_pfn(expected_pte, nr);
+   ptep += nr;
+
+   while (ptep < end_ptep) {
pte = ptep_get(ptep);
if (any_writable)
writable = !!pte_write(pte);
@@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 * corner cases the next PFN might fall into a different
 * folio.
 */
-   if (pte_pfn(pte) == folio_end_pfn)
+   if (pte_pfn(pte) >= folio_end_pfn)
break;
 
if (any_writable)
*any_writable |= writable;
 
-   expected_pte = pte_advance_pfn(expected_pte, 1);
-   ptep++;
+   nr = pte_batch_hint(ptep, pte);
+   expected_pte = pte_advance_pfn(expected_pte, nr);
+   ptep += nr;
}
 
-   return ptep - start_ptep;
+   return min(ptep - start_ptep, max_nr);
 }
 
 /*
-- 
2.25.1



[PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-02 Thread Ryan Roberts
Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit. Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared. Previously this was done 1 PTE at a time. But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte. This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 67 
 arch/arm64/mm/contpte.c  | 17 
 2 files changed, 84 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c07f0d563733..ad04adb7b87f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct 
*mm,
return pte;
 }
 
+static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   for (;;) {
+   __ptep_get_and_clear(mm, addr, ptep);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+
+static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = __ptep_get_and_clear(mm, addr, ptep);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1167,6 +1198,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t 
orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);
+extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full);
+extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1254,6 +1290,35 @@ static inline void pte_clear(struct mm_struct *mm,
__pte_clear(mm, addr, ptep);
 }
 
+#define clear_full_ptes clear_full_ptes
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   if (likely(nr == 1)) {
+   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+   __clear_full_ptes(mm, addr, ptep, nr, full);
+   } else {
+   contpte_clear_full_ptes(mm, addr, ptep, nr, full);
+   }
+}
+
+#define get_and_clear_full_ptes get_and_clear_full_ptes
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full)
+{
+   pte_t pte;
+
+   if (likely(nr == 1)) {
+   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+   pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+   } else {
+   pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+   }
+
+   return pte;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
@@ -1338,6 +1403,8 @@ static inline int ptep_set_access_flags(struct 
vm_area_struct *vma,
 #define set_pte__set_pte
 #define set_ptes   __set_ptes
 #define pte_clear  __pte_clear
+#define clear_full_ptes__clear_full_ptes
+#define get_and_clear_full_ptes
__get_and_clear_full_ptes
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear __ptep_get_and_clear
 #define 

[PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-02 Thread Ryan Roberts
Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit. Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time. But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API. So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte. This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
continues to benefit from the more efficient use of the TLB after
the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/latest/

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 61 ++--
 arch/arm64/mm/contpte.c  | 35 ++
 2 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 34892a95403d..c07f0d563733 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-   unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+   unsigned long address, pte_t *ptep,
+   pte_t pte)
 {
-   pte_t old_pte, pte;
+   pte_t old_pte;
 
-   pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
*mm,
} while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+   unsigned long address, pte_t *ptep)
+{
+   ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long 
address,
+   pte_t *ptep, unsigned int nr)
+{
+   unsigned int i;
+
+   for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+   __ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1156,6 +1171,8 @@ extern int contpte_ptep_test_and_clear_young(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
+extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
@@ -1269,12 +1286,35 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define wrprotect_ptes wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr)
+{
+   if (likely(nr == 1)) {
+   /*
+* Optimization: wrprotect_ptes() can only be called for present
+* ptes so we only need to check contig bit as condition for
+* unfold, and we can remove the contig bit from the pte we read
+* to avoid re-reading. This speeds up fork() which is sensitive
+* for order-0 folios. Equivalent to contpte_try_unfold().
+*/
+   pte_t orig_pte = __ptep_get(ptep);
+
+   if (unlikely(pte_cont(orig_pte))) {
+   __contpte_try_unfold(mm, addr, ptep, orig_pte);
+   orig_pte = pte_mknoncont(orig_pte);
+   }
+   ___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+   } else {
+   

[PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-02 Thread Ryan Roberts
Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/tlbflush.h | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h 
b/arch/arm64/include/asm/tlbflush.h
index 79e932a1bdf8..50a765917327 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -422,7 +422,7 @@ do {
\
 #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, 
kvm_lpa2_is_enabled());
 
-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 unsigned long start, unsigned long end,
 unsigned long stride, bool last_level,
 int tlb_level)
@@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct 
vm_area_struct *vma,
__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 tlb_level, true, lpa2_is_enabled());
 
-   dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+unsigned long start, unsigned long end,
+unsigned long stride, bool last_level,
+int tlb_level)
+{
+   __flush_tlb_range_nosync(vma, start, end, stride,
+last_level, tlb_level);
+   dsb(ish);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
 {
-- 
2.25.1



[PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-02 Thread Ryan Roberts
With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
modification of individual PTEs will cause an "unfold" operation to
repaint the contpte block as individual PTEs before performing the
requested operation. While a modification of a single PTE could cause
the block of PTEs to which it belongs to become eligible for "folding"
into a contpte entry, "folding" is not performed in this initial
implementation due to the costs of checking the requirements are met.
Due to this, contpte mappings will degrade back to normal pte mappings
over time if/when protections are changed. This will be solved in a
future patch.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/Kconfig   |   9 +
 arch/arm64/include/asm/pgtable.h | 161 ++
 arch/arm64/mm/Makefile   |   1 +
 arch/arm64/mm/contpte.c  | 283 +++
 4 files changed, 454 insertions(+)
 create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d86d7f4758b5..1442e8ed95b6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
select UNWIND_TABLES
select DYNAMIC_SCS
 
+config ARM64_CONTPTE
+   bool "Contiguous PTE mappings for user memory" if EXPERT
+   depends on TRANSPARENT_HUGEPAGE
+   default y
+   help
+ When enabled, user mappings are configured using the PTE contiguous
+ bit, for any mappings that meet the size and alignment requirements.
+ This reduces TLB pressure and improves performance.
+
 endmenu # "Kernel Features"
 
 menu "Boot options"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7dc6b68ee516..34892a95403d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  */
 #define pte_valid_not_user(pte) \
((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | 
PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte)(pte_valid(pte) && pte_cont(pte))
 /*
  * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
  * so that we don't erroneously return false for pages that have been
@@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
+   unsigned long 

[PATCH v5 14/25] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 77a8b100e1cd..2870bc12f288 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -138,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  * so that we don't erroneously return false for pages that have been
  * remapped as PROT_NONE but are yet to be flushed from the TLB.
  * Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
  * TLB.
  */
 #define pte_accessible(mm, pte)\
@@ -916,8 +916,7 @@ static inline int __ptep_test_and_clear_young(struct 
vm_area_struct *vma,
return pte_young(pte);
 }
 
-#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
+static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
 unsigned long address, pte_t *ptep)
 {
int young = __ptep_test_and_clear_young(vma, address, ptep);
@@ -1138,6 +1137,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define ptep_get_and_clear __ptep_get_and_clear
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define ptep_test_and_clear_young  __ptep_test_and_clear_young
+#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+#define ptep_clear_flush_young __ptep_clear_flush_young
 
 #endif /* !__ASSEMBLY__ */
 
-- 
2.25.1



[PATCH v5 13/25] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5f560326116e..77a8b100e1cd 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -899,8 +899,9 @@ static inline bool pud_user_accessible_page(pud_t pud)
 /*
  * Atomic pte/pmd modifications.
  */
-#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-static inline int __ptep_test_and_clear_young(pte_t *ptep)
+static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pte_t *ptep)
 {
pte_t old_pte, pte;
 
@@ -915,18 +916,11 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
return pte_young(pte);
 }
 
-static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
-   unsigned long address,
-   pte_t *ptep)
-{
-   return __ptep_test_and_clear_young(ptep);
-}
-
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 unsigned long address, pte_t *ptep)
 {
-   int young = ptep_test_and_clear_young(vma, address, ptep);
+   int young = __ptep_test_and_clear_young(vma, address, ptep);
 
if (young) {
/*
@@ -949,7 +943,7 @@ static inline int pmdp_test_and_clear_young(struct 
vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
 {
-   return ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
+   return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -1142,6 +1136,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define pte_clear  __pte_clear
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear __ptep_get_and_clear
+#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
+#define ptep_test_and_clear_young  __ptep_test_and_clear_young
 
 #endif /* !__ASSEMBLY__ */
 
-- 
2.25.1



[PATCH v5 12/25] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 5 +++--
 arch/arm64/mm/hugetlbpage.c  | 6 +++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3b0ff58109c5..5f560326116e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -953,8 +953,7 @@ static inline int pmdp_test_and_clear_young(struct 
vm_area_struct *vma,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
   unsigned long address, pte_t *ptep)
 {
	pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
@@ -1141,6 +1140,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define set_pte__set_pte
 #define set_ptes   __set_ptes
 #define pte_clear  __pte_clear
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
+#define ptep_get_and_clear __ptep_get_and_clear
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 3d73b83cf97f..7e74e7b67107 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -188,7 +188,7 @@ static pte_t get_clear_contig(struct mm_struct *mm,
unsigned long i;
 
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
-   pte_t pte = ptep_get_and_clear(mm, addr, ptep);
+   pte_t pte = __ptep_get_and_clear(mm, addr, ptep);
 
/*
 * If HW_AFDBM is enabled, then the HW could turn on
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;
 
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-   ptep_clear(mm, addr, ptep);
+   __ptep_get_and_clear(mm, addr, ptep);
 
	flush_tlb_range(&vma, saddr, addr);
 }
@@ -411,7 +411,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
pte_t orig_pte = ptep_get(ptep);
 
if (!pte_cont(orig_pte))
-   return ptep_get_and_clear(mm, addr, ptep);
+   return __ptep_get_and_clear(mm, addr, ptep);
 
ncontig = find_num_contig(mm, addr, ptep, );
 
-- 
2.25.1



[PATCH v5 11/25] arm64/mm: pte_clear(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 3 ++-
 arch/arm64/mm/fixmap.c   | 2 +-
 arch/arm64/mm/hugetlbpage.c  | 2 +-
 arch/arm64/mm/mmu.c  | 2 +-
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f1fd6c5e3eca..3b0ff58109c5 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)  (!pte_val(pte))
-#define pte_clear(mm, addr, ptep) \
+#define __pte_clear(mm, addr, ptep) \
__set_pte(ptep, __pte(0))
 #define pte_page(pte)  (pfn_to_page(pte_pfn(pte)))
 
@@ -1140,6 +1140,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 
 #define set_pte__set_pte
 #define set_ptes   __set_ptes
+#define pte_clear  __pte_clear
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index 51cd4501816d..bfc02568805a 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -123,7 +123,7 @@ void __set_fixmap(enum fixed_addresses idx,
if (pgprot_val(flags)) {
__set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
-   pte_clear(&init_mm, addr, ptep);
+   __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
}
 }
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9d7e7315eaa3..3d73b83cf97f 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -400,7 +400,7 @@ void huge_pte_clear(struct mm_struct *mm, unsigned long 
addr,
ncontig = num_contig_ptes(sz, );
 
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-   pte_clear(mm, addr, ptep);
+   __pte_clear(mm, addr, ptep);
 }
 
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7cc1930f0e10..bcaa5a5d86f8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -859,7 +859,7 @@ static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned 
long addr,
continue;
 
WARN_ON(!pte_present(pte));
-   pte_clear(&init_mm, addr, ptep);
+   __pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
if (free_mapped)
free_hotplug_page_range(pte_page(pte),
-- 
2.25.1



[PATCH v5 10/25] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

set_pte_at() is a core macro that forwards to set_ptes() (with nr=1).
Instead of creating a __set_pte_at() internal macro, convert all arch
users to use set_ptes()/__set_ptes() directly, as appropriate. Callers
in hugetlb may benefit from calling __set_ptes() once for their whole
range rather than managing their own loop. This is left for future
improvement.
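
For reference, the core-mm forwarding mentioned above is roughly the first line
below, and the hinted hugetlb improvement would replace a per-entry loop with a
single ranged call (illustrative sketch, not part of this patch; the helper
name is made up):

	/* Core-mm forwarding (include/linux/pgtable.h), roughly: */
	#define set_pte_at(mm, addr, ptep, pte)	set_ptes(mm, addr, ptep, pte, 1)

	/* Hypothetical hugetlb-style helper: one ranged call instead of a loop. */
	static void set_contig_ptes(struct mm_struct *mm, unsigned long addr,
				    pte_t *ptep, pte_t pte, unsigned int ncontig)
	{
		__set_ptes(mm, addr, ptep, pte, ncontig);
	}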

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 10 +-
 arch/arm64/kernel/mte.c  |  2 +-
 arch/arm64/kvm/guest.c   |  2 +-
 arch/arm64/mm/fault.c|  2 +-
 arch/arm64/mm/hugetlbpage.c  | 10 +-
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3cb45e8dbb52..f1fd6c5e3eca 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -358,9 +358,9 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
long nr)
return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
-static inline void set_ptes(struct mm_struct *mm,
-   unsigned long __always_unused addr,
-   pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
 {
page_table_check_ptes_set(mm, ptep, pte, nr);
__sync_cache_and_tags(pte, nr);
@@ -374,7 +374,6 @@ static inline void set_ptes(struct mm_struct *mm,
pte = pte_advance_pfn(pte, 1);
}
 }
-#define set_ptes set_ptes
 
 /*
  * Huge pte definitions.
@@ -1079,7 +1078,7 @@ static inline void arch_swap_restore(swp_entry_t entry, 
struct folio *folio)
 #endif /* CONFIG_ARM64_MTE */
 
 /*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the __set_ptes() function.
  */
 static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
@@ -1140,6 +1139,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #endif
 
 #define set_pte__set_pte
+#define set_ptes   __set_ptes
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..dcdcccd40891 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
/*
 * If the page content is identical but at least one of the pages is
 * tagged, return non-zero to avoid KSM merging. If only one of the
-* pages is tagged, set_pte_at() may zero or change the tags of the
+* pages is tagged, __set_ptes() may zero or change the tags of the
 * other page via mte_sync_tags().
 */
if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..629145fd3161 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
/*
 * Only locking to serialise with a concurrent
-* set_pte_at() in the VMM but still overriding the
+* __set_ptes() in the VMM but still overriding the
 * tags, hence ignoring the return value.
 */
try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 13189322a38f..23d0dfc16686 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
  *
  * It needs to cope with hardware update of the accessed/dirty state by other
  * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
  *
  * Returns whether or not the PTE actually changed.
  */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 

[PATCH v5 09/25] arm64/mm: set_pte(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 11 +++
 arch/arm64/kernel/efi.c  |  2 +-
 arch/arm64/mm/fixmap.c   |  2 +-
 arch/arm64/mm/kasan_init.c   |  4 ++--
 arch/arm64/mm/mmu.c  |  2 +-
 arch/arm64/mm/pageattr.c |  2 +-
 arch/arm64/mm/trans_pgd.c|  4 ++--
 7 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6a6cc78cf879..3cb45e8dbb52 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 
 #define pte_none(pte)  (!pte_val(pte))
-#define pte_clear(mm,addr,ptep)set_pte(ptep, __pte(0))
+#define pte_clear(mm, addr, ptep) \
+   __set_pte(ptep, __pte(0))
 #define pte_page(pte)  (pfn_to_page(pte_pfn(pte)))
 
 /*
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
WRITE_ONCE(*ptep, pte);
 
@@ -366,7 +367,7 @@ static inline void set_ptes(struct mm_struct *mm,
 
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
-   set_pte(ptep, pte);
+   __set_pte(ptep, pte);
if (--nr == 0)
break;
ptep++;
@@ -540,7 +541,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
 {
__sync_cache_and_tags(pte, nr);
__check_safe_pte_update(mm, ptep, pte);
-   set_pte(ptep, pte);
+   __set_pte(ptep, pte);
 }
 
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
@@ -1138,6 +1139,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define set_pte__set_pte
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* __ASM_PGTABLE_H */
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..44288a12fc6c 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -111,7 +111,7 @@ static int __init set_permissions(pte_t *ptep, unsigned 
long addr, void *data)
pte = set_pte_bit(pte, __pgprot(PTE_PXN));
else if (system_supports_bti_kernel() && spd->has_bti)
pte = set_pte_bit(pte, __pgprot(PTE_GP));
-   set_pte(ptep, pte);
+   __set_pte(ptep, pte);
return 0;
 }
 
diff --git a/arch/arm64/mm/fixmap.c b/arch/arm64/mm/fixmap.c
index c0a3301203bd..51cd4501816d 100644
--- a/arch/arm64/mm/fixmap.c
+++ b/arch/arm64/mm/fixmap.c
@@ -121,7 +121,7 @@ void __set_fixmap(enum fixed_addresses idx,
ptep = fixmap_pte(addr);
 
if (pgprot_val(flags)) {
-   set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
+   __set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
} else {
	pte_clear(&init_mm, addr, ptep);
flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 4c7ad574b946..f659bd98c63f 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -112,7 +112,7 @@ static void __init kasan_pte_populate(pmd_t *pmdp, unsigned 
long addr,
if (!early)
memset(__va(page_phys), KASAN_SHADOW_INIT, PAGE_SIZE);
next = addr + PAGE_SIZE;
-   set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
+   __set_pte(ptep, pfn_pte(__phys_to_pfn(page_phys), PAGE_KERNEL));
} while (ptep++, addr = next, addr != end && 
pte_none(READ_ONCE(*ptep)));
 }
 
@@ -271,7 +271,7 @@ static void __init kasan_init_shadow(void)
 * so we should make sure that it maps the zero page read-only.
 */
for (i = 0; i < PTRS_PER_PTE; i++)
-   set_pte(&kasan_early_shadow_pte[i],
+   __set_pte(&kasan_early_shadow_pte[i],
pfn_pte(sym_to_pfn(kasan_early_shadow_page),
PAGE_KERNEL_RO));
 
diff --git a/arch/arm64/mm/mmu.c 

[PATCH v5 16/25] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 10 ++
 arch/arm64/mm/fault.c|  6 +++---
 arch/arm64/mm/hugetlbpage.c  |  2 +-
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4c2d6c483390..fe27a3175618 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -312,7 +312,7 @@ static inline void __check_safe_pte_update(struct mm_struct 
*mm, pte_t *ptep,
 
/*
 * Check for potential race with hardware updates of the pte
-* (ptep_set_access_flags safely changes valid ptes without going
+* (__ptep_set_access_flags safely changes valid ptes without going
 * through an invalid entry).
 */
VM_WARN_ONCE(!pte_young(pte),
@@ -854,8 +854,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }
 
-#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-extern int ptep_set_access_flags(struct vm_area_struct *vma,
+extern int __ptep_set_access_flags(struct vm_area_struct *vma,
 unsigned long address, pte_t *ptep,
 pte_t entry, int dirty);
 
@@ -865,7 +864,8 @@ static inline int pmdp_set_access_flags(struct 
vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
 {
-   return ptep_set_access_flags(vma, address, (pte_t *)pmdp, 
pmd_pte(entry), dirty);
+   return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
+   pmd_pte(entry), dirty);
 }
 
 static inline int pud_devmap(pud_t pud)
@@ -1141,6 +1141,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define ptep_clear_flush_young __ptep_clear_flush_young
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 #define ptep_set_wrprotect __ptep_set_wrprotect
+#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+#define ptep_set_access_flags  __ptep_set_access_flags
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 23d0dfc16686..dbbc06cfb848 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -209,9 +209,9 @@ static void show_pte(unsigned long addr)
  *
  * Returns whether or not the PTE actually changed.
  */
-int ptep_set_access_flags(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep,
- pte_t entry, int dirty)
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+   unsigned long address, pte_t *ptep,
+   pte_t entry, int dirty)
 {
pteval_t old_pteval, pteval;
pte_t pte = READ_ONCE(*ptep);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f6612f3e1c07..9949b80baac8 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -459,7 +459,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pte_t orig_pte;
 
if (!pte_cont(pte))
-   return ptep_set_access_flags(vma, addr, ptep, pte, dirty);
+   return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
 
ncontig = find_num_contig(mm, addr, ptep, );
dpfn = pgsize >> PAGE_SHIFT;
-- 
2.25.1



[PATCH v5 15/25] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 10 ++
 arch/arm64/mm/hugetlbpage.c  |  2 +-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2870bc12f288..4c2d6c483390 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -970,11 +970,11 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
- * ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
  * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
  */
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long 
address, pte_t *ptep)
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+   unsigned long address, pte_t *ptep)
 {
pte_t old_pte, pte;
 
@@ -992,7 +992,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, 
unsigned long addres
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pmd_t *pmdp)
 {
-   ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
+   __ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
 
 #define pmdp_establish pmdp_establish
@@ -1139,6 +1139,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define ptep_test_and_clear_young  __ptep_test_and_clear_young
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young __ptep_clear_flush_young
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT
+#define ptep_set_wrprotect __ptep_set_wrprotect
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 7e74e7b67107..f6612f3e1c07 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -493,7 +493,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
pte_t pte;
 
if (!pte_cont(READ_ONCE(*ptep))) {
-   ptep_set_wrprotect(mm, addr, ptep);
+   __ptep_set_wrprotect(mm, addr, ptep);
return;
}
 
-- 
2.25.1



[PATCH v5 17/25] arm64/mm: ptep_get(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
Create a new layer for the in-table PTE manipulation APIs. For now, the
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

arm64 did not previously define an arch-specific ptep_get(), so override
the default version in the arch code, and also define the private
__ptep_get() version. Currently they both do the same thing that the
default version does (READ_ONCE()). Some arch users (hugetlb) were
already using ptep_get(), so convert those to the private API. Other
callsites were doing a direct READ_ONCE(), so convert those to use the
appropriate (public/private) API too.
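
For context, the generic default being overridden is roughly the following
(from include/linux/pgtable.h), which is why defining ptep_get in the arch
header is enough to take precedence:

	/* Generic fallback, roughly as defined by core-mm: */
	#ifndef ptep_get
	static inline pte_t ptep_get(pte_t *ptep)
	{
		return READ_ONCE(*ptep);
	}
	#endif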

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 12 +---
 arch/arm64/kernel/efi.c  |  2 +-
 arch/arm64/mm/fault.c|  4 ++--
 arch/arm64/mm/hugetlbpage.c  | 18 +-
 arch/arm64/mm/kasan_init.c   |  2 +-
 arch/arm64/mm/mmu.c  | 12 ++--
 arch/arm64/mm/pageattr.c |  4 ++--
 arch/arm64/mm/trans_pgd.c|  2 +-
 8 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index fe27a3175618..7dc6b68ee516 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -276,6 +276,11 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
 }
 
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+   return READ_ONCE(*ptep);
+}
+
 extern void __sync_icache_dcache(pte_t pteval);
 bool pgattr_change_is_safe(u64 old, u64 new);
 
@@ -303,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct 
*mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;
 
-   old_pte = READ_ONCE(*ptep);
+   old_pte = __ptep_get(ptep);
 
if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -905,7 +910,7 @@ static inline int __ptep_test_and_clear_young(struct 
vm_area_struct *vma,
 {
pte_t old_pte, pte;
 
-   pte = READ_ONCE(*ptep);
+   pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_mkold(pte);
@@ -978,7 +983,7 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
*mm,
 {
pte_t old_pte, pte;
 
-   pte = READ_ONCE(*ptep);
+   pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -1130,6 +1135,7 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, 
pte_t pte);
 #define vmemmap_update_pte vmemmap_update_pte
 #endif
 
+#define ptep_get   __ptep_get
 #define set_pte__set_pte
 #define set_ptes   __set_ptes
 #define pte_clear  __pte_clear
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 44288a12fc6c..9afcc690fe73 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned 
long addr, void *data)
 {
struct set_perm_data *spd = data;
const efi_memory_desc_t *md = spd->md;
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte = __ptep_get(ptep);
 
if (md->attribute & EFI_MEMORY_RO)
pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index dbbc06cfb848..892e8cc8983f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
if (!ptep)
break;
 
-   pte = READ_ONCE(*ptep);
+   pte = __ptep_get(ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
} while(0);
@@ -214,7 +214,7 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
pte_t entry, int dirty)
 {
pteval_t old_pteval, pteval;
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte = __ptep_get(ptep);
 
if (pte_same(pte, entry))
return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 9949b80baac8..c3db949560f9 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -152,14 +152,14 @@ pte_t huge_ptep_get(pte_t *ptep)
 {
int ncontig, i;
size_t pgsize;
-   pte_t orig_pte = ptep_get(ptep);
+   pte_t orig_pte = __ptep_get(ptep);
 
if (!pte_present(orig_pte) || 

[PATCH v5 07/25] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts 
---
 arch/x86/include/asm/pgtable.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9d077bca6a10..b60b0c897b4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -956,13 +956,13 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
 }
 
-static inline pte_t pte_next_pfn(pte_t pte)
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
if (__pte_needs_invert(pte_val(pte)))
-   return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) - (nr << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
-#define pte_next_pfn   pte_next_pfn
+#define pte_advance_pfnpte_advance_pfn
 
 static inline int pte_present(pte_t a)
 {
-- 
2.25.1



[PATCH v5 05/25] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9428801c1040..6a6cc78cf879 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
 }
 
-#define pte_next_pfn pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#define pte_advance_pfn pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
-   return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
 static inline void set_ptes(struct mm_struct *mm,
@@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);
}
 }
 #define set_ptes set_ptes
-- 
2.25.1



[PATCH v5 06/25] powerpc/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts 
---
 arch/powerpc/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 549a440ed7f6..6853cdb1290d 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -220,7 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep,
break;
ptep++;
addr += PAGE_SIZE;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);
}
 }
 
-- 
2.25.1



[PATCH v5 08/25] mm: Remove pte_next_pfn() and replace with pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Now that the architectures are converted over to pte_advance_pfn(), we
can remove the pte_next_pfn() wrapper and convert the callers to call
pte_advance_pfn().

Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 9 +
 mm/memory.c | 4 ++--
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 815d92dcb96b..50f32cccbd92 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,19 +212,12 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
-
-#ifndef pte_next_pfn
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
 #endif
-static inline pte_t pte_next_pfn(pte_t pte)
-{
-   return pte_advance_pfn(pte, 1);
-}
-#endif
 
 #ifndef set_ptes
 /**
@@ -256,7 +249,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned 
long addr,
if (--nr == 0)
break;
ptep++;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);
}
arch_leave_lazy_mmu_mode();
 }
diff --git a/mm/memory.c b/mm/memory.c
index 38a010c4d04d..65fbe4f886c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,7 +988,7 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), 
flags);
+   pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), 
flags);
pte_t *ptep = start_ptep + 1;
bool writable;
 
@@ -1017,7 +1017,7 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
if (any_writable)
*any_writable |= writable;
 
-   expected_pte = pte_next_pfn(expected_pte);
+   expected_pte = pte_advance_pfn(expected_pte, 1);
ptep++;
}
 
-- 
2.25.1



[PATCH v5 04/25] arm/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so
improve the API to do so and change the name.

Signed-off-by: Ryan Roberts 
---
 arch/arm/mm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index c24e29c0b9a4..137711c68f2f 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1814,6 +1814,6 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
if (--nr == 0)
break;
ptep++;
-   pteval = pte_next_pfn(pteval);
+   pteval = pte_advance_pfn(pteval, 1);
}
 }
-- 
2.25.1



[PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param.

We are going to remove pte_next_pfn() and replace it with
pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
wrapper around pte_advance_pfn() so that we can incrementally switch the
architectures over. Once all arches are moved over, we will change all
the core-mm callers to call pte_advance_pfn() directly and remove the
wrapper.
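
An illustrative use of the new nr parameter (not taken from this series): a
caller that knows a batch of nr entries is physically contiguous can compute
the pte expected at the end of the batch in one call instead of stepping one
pfn at a time.

	static inline pte_t batch_expected_last_pte(pte_t first_pte,
						    unsigned int nr)
	{
		/* The pte value expected nr - 1 entries after first_pte. */
		return pte_advance_pfn(first_pte, nr - 1);
	}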

Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e7eaf8f2b97..815d92dcb96b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
 
 
 #ifndef pte_next_pfn
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+}
+#endif
 static inline pte_t pte_next_pfn(pte_t pte)
 {
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return pte_advance_pfn(pte, 1);
 }
 #endif
 
-- 
2.25.1



[PATCH v5 02/25] mm: thp: Batch-collapse PMD with set_ptes()

2024-02-02 Thread Ryan Roberts
Refactor __split_huge_pmd_locked() so that a present PMD can be
collapsed to PTEs in a single batch using set_ptes().

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to fold the contpte entries.
Instead, since the ptes are set as a batch, the contpte blocks
can be initially set up pre-folded (once the arm64 contpte support is
added in the next few patches). This leads to noticeable performance
improvement during split.

Acked-by: David Hildenbrand 
Signed-off-by: Ryan Roberts 
---
 mm/huge_memory.c | 58 +++-
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 016e20bd813e..14888b15121e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2579,15 +2579,16 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
 
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte);
-   for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
-   pte_t entry;
-   /*
-* Note that NUMA hinting access restrictions are not
-* transferred to avoid any possibility of altering
-* permissions across VMAs.
-*/
-   if (freeze || pmd_migration) {
+
+   /*
+* Note that NUMA hinting access restrictions are not transferred to
+* avoid any possibility of altering permissions across VMAs.
+*/
+   if (freeze || pmd_migration) {
+   for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += 
PAGE_SIZE) {
+   pte_t entry;
swp_entry_t swp_entry;
+
if (write)
swp_entry = make_writable_migration_entry(
page_to_pfn(page + i));
@@ -2606,25 +2607,32 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_swp_mkuffd_wp(entry);
-   } else {
-   entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-   if (write)
-   entry = pte_mkwrite(entry, vma);
-   if (!young)
-   entry = pte_mkold(entry);
-   /* NOTE: this may set soft-dirty too on some archs */
-   if (dirty)
-   entry = pte_mkdirty(entry);
-   if (soft_dirty)
-   entry = pte_mksoft_dirty(entry);
-   if (uffd_wp)
-   entry = pte_mkuffd_wp(entry);
+
+   VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+   set_pte_at(mm, addr, pte + i, entry);
}
-   VM_BUG_ON(!pte_none(ptep_get(pte)));
-   set_pte_at(mm, addr, pte, entry);
-   pte++;
+   } else {
+   pte_t entry;
+
+   entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+   if (write)
+   entry = pte_mkwrite(entry, vma);
+   if (!young)
+   entry = pte_mkold(entry);
+   /* NOTE: this may set soft-dirty too on some archs */
+   if (dirty)
+   entry = pte_mkdirty(entry);
+   if (soft_dirty)
+   entry = pte_mksoft_dirty(entry);
+   if (uffd_wp)
+   entry = pte_mkuffd_wp(entry);
+
+   for (i = 0; i < HPAGE_PMD_NR; i++)
+   VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+
+   set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
}
-   pte_unmap(pte - 1);
+   pte_unmap(pte);
 
if (!pmd_migration)
folio_remove_rmap_pmd(folio, page, vma);
-- 
2.25.1



[PATCH v5 01/25] mm: Clarify the spec for set_ptes()

2024-02-02 Thread Ryan Roberts
set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it. However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state. So clarify the spec to state that when nr==1, new state of pte
may be present or not present. When nr>1, new state of all ptes must be
present.

While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or
not-present. But when nr>1 all ptes must initially be not-present. All
set_ptes() callsites already conform to this requirement. Stating it
explicitly is useful because it allows for a simplification to the
upcoming arm64 contpte implementation.
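
Two illustrative call sites that satisfy the clarified spec (contexts and
variable names are examples only, not taken from this patch):

	/* nr == 1: the new state may be not-present, e.g. a swap entry. */
	set_ptes(mm, addr, ptep, swp_entry_to_pte(entry), 1);

	/* nr > 1: all ptes start not-present and the new state is present. */
	set_ptes(mm, haddr, ptep, mk_pte(page, vma->vm_page_prot), HPAGE_PMD_NR);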

Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f0feae7f89fb..5e7eaf8f2b97 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -229,6 +229,10 @@ static inline pte_t pte_next_pfn(pte_t pte)
  * @pte: Page table entry for the first page.
  * @nr: Number of pages to map.
  *
+ * When nr==1, initial state of pte may be present or not present, and new state
+ * may be present or not present. When nr>1, initial state of all ptes must be
+ * not present, and new state must be present.
+ *
  * May be overridden by the architecture, or the architecture can define
  * set_pte() and PFN_PTE_SHIFT.
  *
-- 
2.25.1



[PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-02 Thread Ryan Roberts
Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. The change benefits arm64, but there is some minor refactoring for
x86 and powerpc to enable its integration with core-mm.

It is part of a wider effort to improve performance by allocating and mapping
variable-sized blocks of memory (folios). One aim is for the 4K kernel to
approach the performance of the 16K kernel, but without breaking compatibility
and without the associated increase in memory. Another aim is to benefit the 16K
and 64K kernels by enabling 2M THP, since this is the contpte size for those
kernels. We have good performance data that demonstrates both aims are being met
(see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And for anonymous memory, "multi-size THP" is now upstream.


Patch Layout


In this version, I've split the patches to better show each optimization:

  - 1-2:mm prep: misc code and docs cleanups
  - 3-8:mm,arm,arm64,powerpc,x86 prep: Replace pte_next_pfn() with more
general pte_advance_pfn()
  - 9-18:   arm64 prep: Refactor ptep helpers into new layer
  - 19: functional contpte implementation
  - 20-25:  various optimizations on top of the contpte implementation


Testing
===

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
  - mm selftests (inc new tests written for multi-size THP); no regressions
  - Speedometer Java script benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===

High Level Use Cases


First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline:  mm-unstable (mTHP switched off)
mTHP:  + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:+ this series
mTHP + contpte + exefolio: + patch at [5], which this series supports

Kernel Compilation with -j8 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-39.1% | -0.7% |
| mTHP + contpte| -6.0% |-41.4% | -1.5% |
| mTHP + contpte + exefolio | -7.8% |-43.1% | -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-36.6% | -0.6% |
| mTHP + contpte| -6.1% |-38.2% | -1.6% |
| mTHP + contpte + exefolio | -7.4% |-39.2% | -3.2% |

Speedometer (positive is faster):

| kernel| runs_per_min |
|:--|--|
| baseline  | 0.0% |
| mTHP  | 1.5% |
| mTHP + contpte| 3.2% |
| mTHP + contpte + exefolio | 4.5% |


Micro Benchmarks


The following microbenchmarks are intended to demonstrate that the performance
of fork() and munmap() does not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline:  mm-unstable + batch fork [6] and zap [7] series
contpte-basic: + patches 0-19; functional contpte implementation
contpte-batch: + patches 20-23; implement new batched APIs
contpte-inline:+ patch 24; __always_inline to help compiler
contpte-fold:  + patch 25; fold contpte mapping when sensible

Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:

| FORK   | order-0| order-9|
| Ampere Altra   |||
| (pte-map)  |   mean | stdev |   mean | stdev |
|||---||---|
| baseline   |   0.0% |  2.7% |   0.0% |  0.2% |
| contpte-basic  |   6.3% |  1.4% |1948.7% |