date:20240215

[PATCH] arch/powerpc: Remove duplicate ifdefs

2024-02-15 Thread Shrikanth Hegde

When a ifdef is used in the below manner, second one could be considered as
duplicate.

ifdef DEFINE_A
...code block...
ifdef DEFINE_A   <-- This is a duplicate.
...code block...
endif
else
ifndef DEFINE_A <-- This is also duplicate.
...code block...
endif
endif
More details about the script and methods used to find these code
patterns are in cover letter of [1]

Few places in arch/powerpc where this pattern was seen.
Hunk1: Code is under check of CONFIG_PPC64 from  line 13 in
arch/powerpc/include/asm/paca.h. Hence the second CONFIG_PPC64 at line 166
is a duplicate.
Hunk2: CONFIG_PPC_BOOK3S_64 was defined back to back. Merged the two
ifdefs.
Hunk3: Code is under check of CONFIG_PPC64 from line 176 in
arch/powerpc/kernel/asm-offsets.c. Hence second CONFIG_PPC64 at line 249
is a duplicate.
Hunk4: #ifndef CONFIG_PPC64 is used at line 2066 in
arch/powerpc/platforms/powermac/feature.c. And then in #else
again #ifdef CONFIG_PPC64 is used. Which is a duplicate since in #else
means CONFIG_PPC64 is defined.
Hunk5: Code is under the check of CONFIG_SMP from line 521 in
arch/powerpc/xmon/xmon.c. Hence the same check of CONFIG_SMP at line 646
is a duplicate.

No functional change is intended here. It only aims to improve code
readability.
[1] https://lore.kernel.org/all/20240118080326.13137-1-sshe...@linux.ibm.com/

Signed-off-by: Shrikanth Hegde 
---
Changes from v2:
- Converted from series to individual patches.
- Dropped RFC tag.
- Added more context on each hunk for review.

 arch/powerpc/include/asm/paca.h   | 4 
 arch/powerpc/kernel/asm-offsets.c | 2 --
 arch/powerpc/platforms/powermac/feature.c | 2 --
 arch/powerpc/xmon/xmon.c  | 2 --
 4 files changed, 10 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e667d455ecb4..1d58da946739 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -163,9 +163,7 @@ struct paca_struct {
u64 kstack; /* Saved Kernel stack addr */
u64 saved_r1;   /* r1 save for RTAS calls or PM or EE=0 
*/
u64 saved_msr;  /* MSR saved here by enter_rtas */
-#ifdef CONFIG_PPC64
u64 exit_save_r1;   /* Syscall/interrupt R1 save */
-#endif
 #ifdef CONFIG_PPC_BOOK3E_64
u16 trap_save;  /* Used when bad stack is encountered */
 #endif
@@ -214,8 +212,6 @@ struct paca_struct {
/* Non-maskable exceptions that are not performance critical */
u64 exnmi[EX_SIZE]; /* used for system reset (nmi) */
u64 exmc[EX_SIZE];  /* used for machine checks */
-#endif
-#ifdef CONFIG_PPC_BOOK3S_64
/* Exclusive stacks for system reset and machine check exception. */
void *nmi_emergency_sp;
void *mc_emergency_sp;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 9f14d95b8b32..f029755f9e69 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -246,9 +246,7 @@ int main(void)
OFFSET(PACAHWCPUID, paca_struct, hw_cpu_id);
OFFSET(PACAKEXECSTATE, paca_struct, kexec_state);
OFFSET(PACA_DSCR_DEFAULT, paca_struct, dscr_default);
-#ifdef CONFIG_PPC64
OFFSET(PACA_EXIT_SAVE_R1, paca_struct, exit_save_r1);
-#endif
 #ifdef CONFIG_PPC_BOOK3E_64
OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save);
 #endif
diff --git a/arch/powerpc/platforms/powermac/feature.c 
b/arch/powerpc/platforms/powermac/feature.c
index 81c9fbae88b1..2cc257f75c50 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -2333,7 +2333,6 @@ static struct pmac_mb_def pmac_mb_defs[] = {
PMAC_TYPE_POWERMAC_G5,  g5_features,
0,
},
-#ifdef CONFIG_PPC64
{   "PowerMac7,3",  "PowerMac G5",
PMAC_TYPE_POWERMAC_G5,  g5_features,
0,
@@ -2359,7 +2358,6 @@ static struct pmac_mb_def pmac_mb_defs[] = {
0,
},
 #endif /* CONFIG_PPC64 */
-#endif /* CONFIG_PPC64 */
 };

 /*
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index b3b94cd37713..f413c220165c 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -643,10 +643,8 @@ static int xmon_core(struct pt_regs *regs, volatile int 
fromipi)
touch_nmi_watchdog();
} else {
cmd = 1;
-#ifdef CONFIG_SMP
if (xmon_batch)
cmd = batch_cmds(regs);
-#endif
if (!locked_down && cmd)
cmd = cmds(regs);
if (locked_down || cmd != 0) {
--
2.39.3

Re: [PATCH v2] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

2024-02-15 Thread Peter Bergner

On 2/15/24 7:49 PM, Michael Ellerman wrote:
> Peter Bergner  writes:
>> On 2/15/24 2:16 AM, Arnd Bergmann wrote:
>>> On Wed, Feb 14, 2024, at 23:34, Peter Bergner wrote:
 Arnd, we seem to have consensus on the patch below.  Is this something
 you could take and apply to your tree? 

>>>
>>> I don't mind taking it, but it may be better to use the
>>> powerpc tree if that is where it's actually being used.
>>
>> So this is not a powerpc only patch, but we may be the first arch
>> to use it.  Szabolcs mentioned that aarch64 was pretty quickly filling
>> up their AT_HWCAP2 and that they will eventually require using AT_HWCAP3
>> as well.  If you still think this should go through the powerpc tree,
>> I can check on that.
> 
> I'm happy to take it with Arnd's ack.
> 
> I trimmed up the commit message a bit, see below.

Perfect.  Thanks everyone!

Peter

Re: [RFC PATCH 1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core

2024-02-15 Thread Pingfan Liu

On Thu, Feb 15, 2024 at 9:09 PM Michael Ellerman
 wrote:
>
> On Fri, 29 Dec 2023 23:01:03 +1100, Michael Ellerman wrote:
> > If nr_cpu_ids is too low to include at least all the threads of a single
> > core adjust nr_cpu_ids upwards. This avoids triggering odd bugs in code
> > that assumes all threads of a core are available.
> >
> >
>
> Applied to powerpc/next.
>

Great! After all these years, finally we are close to the conclusion
of this feature.

Thanks,

Pingfan

> [1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core
>   
> https://git.kernel.org/powerpc/c/5580e96dad5a439d561d9648ffcbccb739c2a120
> [2/5] powerpc/smp: Increase nr_cpu_ids to include the boot CPU
>   
> https://git.kernel.org/powerpc/c/777f81f0a9c780a6443bcf2c7785f0cc2e87c1ef
> [3/5] powerpc/smp: Lookup avail once per device tree node
>   
> https://git.kernel.org/powerpc/c/dca79603fbc592ec7ea8bd7ba274052d3984e882
> [4/5] powerpc/smp: Factor out assign_threads()
>   
> https://git.kernel.org/powerpc/c/9832de654499f0bf797a3719c4d4c5bd401f18f5
> [5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids
>   
> https://git.kernel.org/powerpc/c/0875f1ceba974042069f04946aa8f1d4d1e688da
>
> cheers
>

Re: [PATCH v2] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

2024-02-15 Thread Michael Ellerman

Peter Bergner  writes:
> On 2/15/24 2:16 AM, Arnd Bergmann wrote:
>> On Wed, Feb 14, 2024, at 23:34, Peter Bergner wrote:
>>> The powerpc toolchain keeps a copy of the HWCAP bit masks in our TCB for 
>>> fast
>>> access by the __builtin_cpu_supports built-in function.  The TCB space for
>>> the HWCAP entries - which are created in pairs - is an ABI extension, so
>>> waiting to create the space for HWCAP3 and HWCAP4 until we need them is
>>> problematical.  Define AT_HWCAP3 and AT_HWCAP4 in the generic uapi header
>>> so they can be used in glibc to reserve space in the powerpc TCB for their
>>> future use.
>>>
>>> I scanned through the Linux and GLIBC source codes looking for unused AT_*
>>> values and 29 and 30 did not seem to be used, so they are what I went
>>> with.  This has received Acked-by's from both GLIBC and Linux kernel
>>> developers and no reservations or Nacks from anyone.
>>>
>>> Arnd, we seem to have consensus on the patch below.  Is this something
>>> you could take and apply to your tree? 
>>>
>> 
>> I don't mind taking it, but it may be better to use the
>> powerpc tree if that is where it's actually being used.
>
> So this is not a powerpc only patch, but we may be the first arch
> to use it.  Szabolcs mentioned that aarch64 was pretty quickly filling
> up their AT_HWCAP2 and that they will eventually require using AT_HWCAP3
> as well.  If you still think this should go through the powerpc tree,
> I can check on that.

I'm happy to take it with Arnd's ack.

I trimmed up the commit message a bit, see below.

cheers


Author: Peter Bergner 
AuthorDate: Wed Feb 14 16:34:06 2024 -0600
Commit: Michael Ellerman 
CommitDate: Fri Feb 16 12:42:59 2024 +1100

uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

The powerpc toolchain keeps a copy of the HWCAP bit masks in the TCB
for fast access by the __builtin_cpu_supports() built-in function. The
TCB space for the HWCAP entries - which are created in pairs - is an ABI
extension, so waiting to create the space for HWCAP3 and HWCAP4 until
they are needed is problematic. Define AT_HWCAP3 and AT_HWCAP4 in the
generic uapi header so they can be used in glibc to reserve space in the
powerpc TCB for their future use.

I scanned through the Linux and GLIBC source codes looking for unused
AT_* values and 29 and 30 did not seem to be used, so they are what I
went with.

Signed-off-by: Peter Bergner 
Acked-by: Adhemerval Zanella 
Acked-by: Nicholas Piggin 
Acked-by: Szabolcs Nagy 
Acked-by: Arnd Bergmann 
Signed-off-by: Michael Ellerman 
Link: https://msgid.link/a406b535-dc55-4856-8ae9-5a063644a...@linux.ibm.com

Re: [PATCH 0/7] macintosh: Convert to platform remove callback returning void

2024-02-15 Thread Michael Ellerman

Uwe Kleine-König  writes:
> Hello,
>
> On Wed, Jan 10, 2024 at 04:42:47PM +0100, Uwe Kleine-König wrote:
>> Hello,
>> 
>> this series converts all drivers below drivers/macintosh to use
>> .remove_new(). See commit 5c5a7680e67b ("platform: Provide a remove
>> callback that returns no value") for an extended explanation and the
>> eventual goal. The TL;DR; is to make it harder for driver authors to
>> leak resources without noticing.
>> 
>> This is merge window material. All patches are pairwise independent of
>> each other so they can be applied individually. There isn't a maintainer
>> for drivers/macintosh, I'm still sending this as a series in the hope
>> Michael feels repsonsible and applies it completely.
>
> this didn't happen yet. Michael, is this still on your radar?

Yes, just behind as always. Thanks for the reminder.

cheers

[PATCH v3] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-02-15 Thread Gaurav Batra

When a PCI device is Dynamically added, LPAR OOPS with NULL pointer
exception.

Complete stack is as below

[  211.239206] BUG: Kernel NULL pointer dereference on read at 0x0030
[  211.239210] Faulting instruction address: 0xc06bbe5c
[  211.239214] Oops: Kernel access of bad area, sig: 11 [#1]
[  211.239218] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  211.239223] Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 
auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding 
nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc 
rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser 
libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs 
ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core 
mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto 
pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse
[  211.239280] CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66
[  211.239284] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries
[  211.239289] NIP:  c06bbe5c LR: c0a13e68 CTR: c00579f8
[  211.239293] REGS: c0009924f240 TRAP: 0300   Not tainted  (6.7.0-203405+)
[  211.239298] MSR:  80009033   CR: 24002220  
XER: 20040006
[  211.239306] CFAR: c0a13e64 DAR: 0030 DSISR: 4000 
IRQMASK: 0
[  211.239306] GPR00: c0a13e68 c0009924f4e0 c15a2b00 

[  211.239306] GPR04: c13c5590  c6d07970 
c000d8f8f180
[  211.239306] GPR08: 06ec c000d8f8f180 c2c35d58 
24002228
[  211.239306] GPR12: c00579f8 c003ffeb3880  

[  211.239306] GPR16:    

[  211.239306] GPR20:    

[  211.239306] GPR24: c000919460c0  f000 
c10088e8
[  211.239306] GPR28: c13c5590 c6d07970 c000919460c0 
c000919460c0
[  211.239354] NIP [c06bbe5c] sysfs_add_link_to_group+0x34/0x94
[  211.239361] LR [c0a13e68] iommu_device_link+0x5c/0x118
[  211.239367] Call Trace:
[  211.239369] [c0009924f4e0] [c0a109b8] 
iommu_init_device+0x26c/0x318 (unreliable)
[  211.239376] [c0009924f520] [c0a13e68] 
iommu_device_link+0x5c/0x118
[  211.239382] [c0009924f560] [c0a107f4] 
iommu_init_device+0xa8/0x318
[  211.239387] [c0009924f5c0] [c0a11a08] 
iommu_probe_device+0xc0/0x134
[  211.239393] [c0009924f600] [c0a11ac0] 
iommu_bus_notifier+0x44/0x104
[  211.239398] [c0009924f640] [c018dcc0] 
notifier_call_chain+0xb8/0x19c
[  211.239405] [c0009924f6a0] [c018df88] 
blocking_notifier_call_chain+0x64/0x98
[  211.239411] [c0009924f6e0] [c0a250fc] bus_notify+0x50/0x7c
[  211.239416] [c0009924f720] [c0a20838] device_add+0x640/0x918
[  211.239421] [c0009924f7f0] [c08f1a34] pci_device_add+0x23c/0x298
[  211.239427] [c0009924f840] [c0077460] 
of_create_pci_dev+0x400/0x884
[  211.239432] [c0009924f8e0] [c0077a08] of_scan_pci_dev+0x124/0x1b0
[  211.239437] [c0009924f980] [c0077b0c] __of_scan_bus+0x78/0x18c
[  211.239442] [c0009924fa10] [c0073f90] 
pcibios_scan_phb+0x2a4/0x3b0
[  211.239447] [c0009924fad0] [c01007a8] init_phb_dynamic+0xb8/0x110
[  211.239453] [c0009924fb40] [c00806920620] dlpar_add_slot+0x170/0x3b8 
[rpadlpar_io]
[  211.239461] [c0009924fbe0] [c00806920d64] 
add_slot_store.part.0+0xb4/0x130 [rpadlpar_io]
[  211.239468] [c0009924fc70] [c0fb4144] kobj_attr_store+0x2c/0x48
[  211.239473] [c0009924fc90] [c06b90e4] sysfs_kf_write+0x64/0x78
[  211.239479] [c0009924fcb0] [c06b7b78] 
kernfs_fop_write_iter+0x1b0/0x290
[  211.239485] [c0009924fd00] [c05b6fdc] vfs_write+0x350/0x4a0
[  211.239491] [c0009924fdc0] [c05b7450] ksys_write+0x84/0x140
[  211.239496] [c0009924fe10] [c0030a04] 
system_call_exception+0x124/0x330
[  211.239502] [c0009924fe50] [c000cedc] 
system_call_vectored_common+0x15c/0x2ec

Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities
and allow blocking domains") broke DLPAR ADD of pci devices.

The above added iommu_device structure to pci_controller. During
system boot, pci devices are discovered and this newly added iommu_device
structure initialized by a call to iommu_device_register().

During DLPAR ADD of a PCI device, a new pci_controller structure is
allocated but there are no calls made to iommu_device_register()
interface.

Fix would be to register iommu device during DLPAR ADD as well.

Fixes: a940904443e4 ("powerpc/iommu: Add

Re: [PATCH] selftests: powerpc: Add header symlinks for building papr character device tests

2024-02-15 Thread Michal Suchánek

On Thu, Feb 15, 2024 at 01:39:27PM -0600, Nathan Lynch wrote:
> Michal Suchánek  writes:
> > On Thu, Feb 15, 2024 at 01:13:34PM -0600, Nathan Lynch wrote:
> >> Michal Suchanek  writes:
> >> >
> >> > Without the headers the tests don't build.
> >> >
> >> > Fixes: 9118c5d32bdd ("powerpc/selftests: Add test for papr-vpd")
> >> > Fixes: 76b2ec3faeaa ("powerpc/selftests: Add test for papr-sysparm")
> >> > Signed-off-by: Michal Suchanek 
> >> > ---
> >> >  tools/testing/selftests/powerpc/include/asm/papr-miscdev.h | 1 +
> >> >  tools/testing/selftests/powerpc/include/asm/papr-sysparm.h | 1 +
> >> >  tools/testing/selftests/powerpc/include/asm/papr-vpd.h | 1 +
> >> >  3 files changed, 3 insertions(+)
> >> >  create mode 12 
> >> > tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
> >> >  create mode 12 
> >> > tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
> >> >  create mode 12
> >> > tools/testing/selftests/powerpc/include/asm/papr-vpd.h
> >> 
> >> I really hope making symlinks into the kernel source isn't necessary. I
> >> haven't experienced build failures with these tests. How are you
> >> building them?
> >> 
> >> I usually do something like (on a x86 build host):
> >> 
> >> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- ppc64le_defconfig
> >> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- headers
> >> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- -C 
> >> tools/testing/selftests/powerpc/
> >> 
> >> without issue.
> >
> > I am not configuring the kernel, only building the tests, and certainly
> > not installing headers on the system.
> 
> OK, but again: how do you provoke the build errors, exactly? Don't make
> us guess please.

cd tools/testing/selftests/powerpc/

make -k

> > Apparently this is what people aim to do, and report bugs when it does
> > not work: build the kselftests as self-contained testsuite that relies
> > only on standard libc, and whatever it brought in the sources.
> >
> > That said, the target to install headers is headers_install, not
> > headers. The headers target is not documented, it's probably meant to be
> > internal to the build system. Yet it is not enforced that it is built
> > before building the selftests.
> 
>  the headers target is used in Documentation/dev-tools/kselftest.rst:
> 
> """
> To build the tests::
> 
>   $ make headers
>   $ make -C tools/testing/selftests
> """

Indeed so it's not supposed to work otherwise. It would be nice if it
did but might be difficult to achieve with plain makefiles.

'headers' is not in 'make help' output but whatever.

Thanks

Michal

Re: [PATCH 0/7] macintosh: Convert to platform remove callback returning void

2024-02-15 Thread Uwe Kleine-König

Hello,

On Wed, Jan 10, 2024 at 04:42:47PM +0100, Uwe Kleine-König wrote:
> Hello,
> 
> this series converts all drivers below drivers/macintosh to use
> .remove_new(). See commit 5c5a7680e67b ("platform: Provide a remove
> callback that returns no value") for an extended explanation and the
> eventual goal. The TL;DR; is to make it harder for driver authors to
> leak resources without noticing.
> 
> This is merge window material. All patches are pairwise independent of
> each other so they can be applied individually. There isn't a maintainer
> for drivers/macintosh, I'm still sending this as a series in the hope
> Michael feels repsonsible and applies it completely.

this didn't happen yet. Michael, is this still on your radar? Or is
there someone more suiteable to take these patches?

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature

Re: [PATCH] selftests: powerpc: Add header symlinks for building papr character device tests

2024-02-15 Thread Nathan Lynch

Michal Suchánek  writes:
> On Thu, Feb 15, 2024 at 01:13:34PM -0600, Nathan Lynch wrote:
>> Michal Suchanek  writes:
>> >
>> > Without the headers the tests don't build.
>> >
>> > Fixes: 9118c5d32bdd ("powerpc/selftests: Add test for papr-vpd")
>> > Fixes: 76b2ec3faeaa ("powerpc/selftests: Add test for papr-sysparm")
>> > Signed-off-by: Michal Suchanek 
>> > ---
>> >  tools/testing/selftests/powerpc/include/asm/papr-miscdev.h | 1 +
>> >  tools/testing/selftests/powerpc/include/asm/papr-sysparm.h | 1 +
>> >  tools/testing/selftests/powerpc/include/asm/papr-vpd.h | 1 +
>> >  3 files changed, 3 insertions(+)
>> >  create mode 12 
>> > tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
>> >  create mode 12 
>> > tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
>> >  create mode 12
>> > tools/testing/selftests/powerpc/include/asm/papr-vpd.h
>> 
>> I really hope making symlinks into the kernel source isn't necessary. I
>> haven't experienced build failures with these tests. How are you
>> building them?
>> 
>> I usually do something like (on a x86 build host):
>> 
>> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- ppc64le_defconfig
>> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- headers
>> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- -C 
>> tools/testing/selftests/powerpc/
>> 
>> without issue.
>
> I am not configuring the kernel, only building the tests, and certainly
> not installing headers on the system.

OK, but again: how do you provoke the build errors, exactly? Don't make
us guess please.

> Apparently this is what people aim to do, and report bugs when it does
> not work: build the kselftests as self-contained testsuite that relies
> only on standard libc, and whatever it brought in the sources.
>
> That said, the target to install headers is headers_install, not
> headers. The headers target is not documented, it's probably meant to be
> internal to the build system. Yet it is not enforced that it is built
> before building the selftests.

 the headers target is used in Documentation/dev-tools/kselftest.rst:

"""
To build the tests::

  $ make headers
  $ make -C tools/testing/selftests
"""

This is what I've been following.

Re: [PATCH] selftests: powerpc: Add header symlinks for building papr character device tests

2024-02-15 Thread Michal Suchánek

On Thu, Feb 15, 2024 at 01:13:34PM -0600, Nathan Lynch wrote:
> Michal Suchanek  writes:
> >
> > Without the headers the tests don't build.
> >
> > Fixes: 9118c5d32bdd ("powerpc/selftests: Add test for papr-vpd")
> > Fixes: 76b2ec3faeaa ("powerpc/selftests: Add test for papr-sysparm")
> > Signed-off-by: Michal Suchanek 
> > ---
> >  tools/testing/selftests/powerpc/include/asm/papr-miscdev.h | 1 +
> >  tools/testing/selftests/powerpc/include/asm/papr-sysparm.h | 1 +
> >  tools/testing/selftests/powerpc/include/asm/papr-vpd.h | 1 +
> >  3 files changed, 3 insertions(+)
> >  create mode 12 
> > tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
> >  create mode 12 
> > tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
> >  create mode 12
> > tools/testing/selftests/powerpc/include/asm/papr-vpd.h
> 
> I really hope making symlinks into the kernel source isn't necessary. I
> haven't experienced build failures with these tests. How are you
> building them?
> 
> I usually do something like (on a x86 build host):
> 
> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- ppc64le_defconfig
> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- headers
> $ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- -C 
> tools/testing/selftests/powerpc/
> 
> without issue.

I am not configuring the kernel, only building the tests, and certainly
not installing headers on the system.

Apparently this is what people aim to do, and report bugs when it does
not work: build the kselftests as self-contained testsuite that relies
only on standard libc, and whatever it brought in the sources.

That said, the target to install headers is headers_install, not
headers. The headers target is not documented, it's probably meant to be
internal to the build system. Yet it is not enforced that it is built
before building the selftests.

Thanks

Michal

Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, The
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v3 RESEND 3/6] bitmap: Make bitmap_onto() available to users

2024-02-15 Thread Andy Shevchenko

On Thu, Feb 15, 2024 at 06:46:12PM +0100, Herve Codina wrote:
> On Mon, 12 Feb 2024 11:13:13 -0800
> Yury Norov  wrote:

...

> > That's I agree. Scatter/gather from your last approach sound better.
> > Do you plan to send a v2?

See below.

...

> > I think your scatter/gather is better then this onto/off by naming and
> > implementation. If you'll send a v2, and it would work for Herve, I'd
> > prefer scatter/gather. But we can live with onto/off as well.
> 
> Andy, I tested your bitmap_{scatter,gather}() in my code.
> I simply replaced my bitmap_{onto,off}() calls by calls to your helpers and
> it works perfectly for my use case.
> 
> I didn't use your whole patch
>   "[PATCH v1 2/5] lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() 
> helpers"
> because it didn't apply on a v6.8-rc1 based branch.
> I just manually extracted the needed functions for my tests and I didn't look
> at the lib/test_bitmap.c part.
> 
> Now what's the plan ?
> Andy, do you want to send a v2 of this patch or may I get the patch, modify it
> according to reviews already present in v1 and integrate it in my current
> series ?

I would like to do that, but under pile of different things.
I would try my best but if you have enough time and motivation feel free
to take over, address the comments and integrate in your series.

I dunno what to do with bitmap_onto(), perhaps in a separate patch we can
replace it with bitmap_scatter() (IIUC) with explanation that the former
1) uses atomic ops while being non-atomic as a whole, and b) having quite
hard to get documentation. At least that's how I see it, I mean that I would
like to leave bitmap_onto() alone and address it separately.

> Yury, any preferences ?

-- 
With Best Regards,
Andy Shevchenko

Re: [kvm-unit-tests PATCH v1 01/18] Makefile: Define ASSEMBLY for assembly files

2024-02-15 Thread Andrew Jones

On Thu, Feb 15, 2024 at 05:16:01PM +, Alexandru Elisei wrote:
> Hi Drew,
> 
> On Thu, Feb 15, 2024 at 05:32:22PM +0100, Andrew Jones wrote:
> > On Thu, Feb 15, 2024 at 04:05:56PM +, Alexandru Elisei wrote:
> > > Hi Drew,
> > > 
> > > On Mon, Jan 15, 2024 at 01:44:17PM +0100, Andrew Jones wrote:
> > > > On Thu, Nov 30, 2023 at 04:07:03AM -0500, Shaoqin Huang wrote:
> > > > > From: Alexandru Elisei 
> > > > > 
> > > > > There are 25 header files today (found with grep -r "#ifndef 
> > > > > __ASSEMBLY__)
> > > > > with functionality relies on the __ASSEMBLY__ prepocessor constant 
> > > > > being
> > > > > correctly defined to work correctly. So far, kvm-unit-tests has 
> > > > > relied on
> > > > > the assembly files to define the constant before including any header
> > > > > files which depend on it.
> > > > > 
> > > > > Let's make sure that nobody gets this wrong and define it as a 
> > > > > compiler
> > > > > constant when compiling assembly files. __ASSEMBLY__ is now defined 
> > > > > for all
> > > > > .S files, even those that didn't set it explicitely before.
> > > > > 
> > > > > Reviewed-by: Nikos Nikoleris 
> > > > > Reviewed-by: Andrew Jones 
> > > > > Signed-off-by: Alexandru Elisei 
> > > > > Signed-off-by: Shaoqin Huang 
> > > > > ---
> > > > >  Makefile   | 5 -
> > > > >  arm/cstart.S   | 1 -
> > > > >  arm/cstart64.S | 1 -
> > > > >  powerpc/cstart64.S | 1 -
> > > > >  4 files changed, 4 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/Makefile b/Makefile
> > > > > index 602910dd..27ed14e6 100644
> > > > > --- a/Makefile
> > > > > +++ b/Makefile
> > > > > @@ -92,6 +92,9 @@ CFLAGS += -Woverride-init -Wmissing-prototypes 
> > > > > -Wstrict-prototypes
> > > > >  
> > > > >  autodepend-flags = -MMD -MP -MF $(dir $*).$(notdir $*).d
> > > > >  
> > > > > +AFLAGS  = $(CFLAGS)
> > > > > +AFLAGS += -D__ASSEMBLY__
> > > > > +
> > > > >  LDFLAGS += -nostdlib $(no_pie) -z noexecstack
> > > > >  
> > > > >  $(libcflat): $(cflatobjs)
> > > > > @@ -113,7 +116,7 @@ directories:
> > > > >   @mkdir -p $(OBJDIRS)
> > > > >  
> > > > >  %.o: %.S
> > > > > - $(CC) $(CFLAGS) -c -nostdlib -o $@ $<
> > > > > + $(CC) $(AFLAGS) -c -nostdlib -o $@ $<
> > > > 
> > > > I think we can drop the two hunks above from this patch and just rely on
> > > > the compiler to add __ASSEMBLY__ for us when compiling assembly files.
> > > 
> > > I think the precompiler adds __ASSEMBLER__, not __ASSEMBLY__ [1]. Am I
> > > missing something?
> > > 
> > > [1] 
> > > https://gcc.gnu.org/onlinedocs/cpp/macros/predefined-macros.html#c.__ASSEMBLER__
> > 
> > You're right. I'm not opposed to changing all the __ASSEMBLY__ references
> > to __ASSEMBLER__. I'll try to do that at some point unless you beat me to
> > it.
> 
> Actually, I quite prefer the Linux style of using __ASSEMBLY__ instead of
> __ASSEMBLER__, because it makes reusing Linux files easier. That, and the
> habit formed by staring at Linux assembly files.

Those are good arguments and also saves the churn. OK, let's keep this
patch and __ASSEMBLY__

Thanks,
drew

Re: [PATCH] selftests: powerpc: Add header symlinks for building papr character device tests

2024-02-15 Thread Nathan Lynch

Michal Suchanek  writes:
>
> Without the headers the tests don't build.
>
> Fixes: 9118c5d32bdd ("powerpc/selftests: Add test for papr-vpd")
> Fixes: 76b2ec3faeaa ("powerpc/selftests: Add test for papr-sysparm")
> Signed-off-by: Michal Suchanek 
> ---
>  tools/testing/selftests/powerpc/include/asm/papr-miscdev.h | 1 +
>  tools/testing/selftests/powerpc/include/asm/papr-sysparm.h | 1 +
>  tools/testing/selftests/powerpc/include/asm/papr-vpd.h | 1 +
>  3 files changed, 3 insertions(+)
>  create mode 12 tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
>  create mode 12 tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
>  create mode 12
> tools/testing/selftests/powerpc/include/asm/papr-vpd.h

I really hope making symlinks into the kernel source isn't necessary. I
haven't experienced build failures with these tests. How are you
building them?

I usually do something like (on a x86 build host):

$ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- ppc64le_defconfig
$ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- headers
$ make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux- -C 
tools/testing/selftests/powerpc/

without issue.

Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Catalin Marinas

On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Catalin Marinas

Re: [PATCH v3 RESEND 3/6] bitmap: Make bitmap_onto() available to users

2024-02-15 Thread Herve Codina

Hi Andy, Yury,

On Mon, 12 Feb 2024 11:13:13 -0800
Yury Norov  wrote:

...

> 
> That's I agree. Scatter/gather from your last approach sound better.
> Do you plan to send a v2?
> 
...
> 
> I think your scatter/gather is better then this onto/off by naming and
> implementation. If you'll send a v2, and it would work for Herve, I'd
> prefer scatter/gather. But we can live with onto/off as well.
> 

Andy, I tested your bitmap_{scatter,gather}() in my code.
I simply replaced my bitmap_{onto,off}() calls by calls to your helpers and
it works perfectly for my use case.

I didn't use your whole patch
  "[PATCH v1 2/5] lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() 
helpers"
because it didn't apply on a v6.8-rc1 based branch.
I just manually extracted the needed functions for my tests and I didn't look
at the lib/test_bitmap.c part.

Now what's the plan ?
Andy, do you want to send a v2 of this patch or may I get the patch, modify it
according to reviews already present in v1 and integrate it in my current
series ?

Yury, any preferences ?

Best regards,
Hervé
-- 
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [kvm-unit-tests PATCH v1 01/18] Makefile: Define ASSEMBLY for assembly files

2024-02-15 Thread Alexandru Elisei

Hi Drew,

On Thu, Feb 15, 2024 at 05:32:22PM +0100, Andrew Jones wrote:
> On Thu, Feb 15, 2024 at 04:05:56PM +, Alexandru Elisei wrote:
> > Hi Drew,
> > 
> > On Mon, Jan 15, 2024 at 01:44:17PM +0100, Andrew Jones wrote:
> > > On Thu, Nov 30, 2023 at 04:07:03AM -0500, Shaoqin Huang wrote:
> > > > From: Alexandru Elisei 
> > > > 
> > > > There are 25 header files today (found with grep -r "#ifndef 
> > > > __ASSEMBLY__)
> > > > with functionality relies on the __ASSEMBLY__ prepocessor constant being
> > > > correctly defined to work correctly. So far, kvm-unit-tests has relied 
> > > > on
> > > > the assembly files to define the constant before including any header
> > > > files which depend on it.
> > > > 
> > > > Let's make sure that nobody gets this wrong and define it as a compiler
> > > > constant when compiling assembly files. __ASSEMBLY__ is now defined for 
> > > > all
> > > > .S files, even those that didn't set it explicitely before.
> > > > 
> > > > Reviewed-by: Nikos Nikoleris 
> > > > Reviewed-by: Andrew Jones 
> > > > Signed-off-by: Alexandru Elisei 
> > > > Signed-off-by: Shaoqin Huang 
> > > > ---
> > > >  Makefile   | 5 -
> > > >  arm/cstart.S   | 1 -
> > > >  arm/cstart64.S | 1 -
> > > >  powerpc/cstart64.S | 1 -
> > > >  4 files changed, 4 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/Makefile b/Makefile
> > > > index 602910dd..27ed14e6 100644
> > > > --- a/Makefile
> > > > +++ b/Makefile
> > > > @@ -92,6 +92,9 @@ CFLAGS += -Woverride-init -Wmissing-prototypes 
> > > > -Wstrict-prototypes
> > > >  
> > > >  autodepend-flags = -MMD -MP -MF $(dir $*).$(notdir $*).d
> > > >  
> > > > +AFLAGS  = $(CFLAGS)
> > > > +AFLAGS += -D__ASSEMBLY__
> > > > +
> > > >  LDFLAGS += -nostdlib $(no_pie) -z noexecstack
> > > >  
> > > >  $(libcflat): $(cflatobjs)
> > > > @@ -113,7 +116,7 @@ directories:
> > > > @mkdir -p $(OBJDIRS)
> > > >  
> > > >  %.o: %.S
> > > > -   $(CC) $(CFLAGS) -c -nostdlib -o $@ $<
> > > > +   $(CC) $(AFLAGS) -c -nostdlib -o $@ $<
> > > 
> > > I think we can drop the two hunks above from this patch and just rely on
> > > the compiler to add __ASSEMBLY__ for us when compiling assembly files.
> > 
> > I think the precompiler adds __ASSEMBLER__, not __ASSEMBLY__ [1]. Am I
> > missing something?
> > 
> > [1] 
> > https://gcc.gnu.org/onlinedocs/cpp/macros/predefined-macros.html#c.__ASSEMBLER__
> 
> You're right. I'm not opposed to changing all the __ASSEMBLY__ references
> to __ASSEMBLER__. I'll try to do that at some point unless you beat me to
> it.

Actually, I quite prefer the Linux style of using __ASSEMBLY__ instead of
__ASSEMBLER__, because it makes reusing Linux files easier. That, and the
habit formed by staring at Linux assembly files.

Thanks,
Alex

[PATCH] selftests: powerpc: Add header symlinks for building papr character device tests

2024-02-15 Thread Michal Suchanek

From: root 

Without the headers the tests don't build.

Fixes: 9118c5d32bdd ("powerpc/selftests: Add test for papr-vpd")
Fixes: 76b2ec3faeaa ("powerpc/selftests: Add test for papr-sysparm")
Signed-off-by: Michal Suchanek 
---
 tools/testing/selftests/powerpc/include/asm/papr-miscdev.h | 1 +
 tools/testing/selftests/powerpc/include/asm/papr-sysparm.h | 1 +
 tools/testing/selftests/powerpc/include/asm/papr-vpd.h | 1 +
 3 files changed, 3 insertions(+)
 create mode 12 tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
 create mode 12 tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
 create mode 12 tools/testing/selftests/powerpc/include/asm/papr-vpd.h

diff --git a/tools/testing/selftests/powerpc/include/asm/papr-miscdev.h 
b/tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
new file mode 12
index ..0f811020354d
--- /dev/null
+++ b/tools/testing/selftests/powerpc/include/asm/papr-miscdev.h
@@ -0,0 +1 @@
+../../../../../../arch/powerpc/include/uapi/asm/papr-miscdev.h
\ No newline at end of file
diff --git a/tools/testing/selftests/powerpc/include/asm/papr-sysparm.h 
b/tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
new file mode 12
index ..6355e122245e
--- /dev/null
+++ b/tools/testing/selftests/powerpc/include/asm/papr-sysparm.h
@@ -0,0 +1 @@
+../../../../../../arch/powerpc/include/uapi/asm/papr-sysparm.h
\ No newline at end of file
diff --git a/tools/testing/selftests/powerpc/include/asm/papr-vpd.h 
b/tools/testing/selftests/powerpc/include/asm/papr-vpd.h
new file mode 12
index ..403ddec6b422
--- /dev/null
+++ b/tools/testing/selftests/powerpc/include/asm/papr-vpd.h
@@ -0,0 +1 @@
+../../../../../../arch/powerpc/include/uapi/asm/papr-vpd.h
\ No newline at end of file
-- 
2.43.0

Re: [kvm-unit-tests PATCH v1 01/18] Makefile: Define ASSEMBLY for assembly files

2024-02-15 Thread Andrew Jones

On Thu, Feb 15, 2024 at 04:05:56PM +, Alexandru Elisei wrote:
> Hi Drew,
> 
> On Mon, Jan 15, 2024 at 01:44:17PM +0100, Andrew Jones wrote:
> > On Thu, Nov 30, 2023 at 04:07:03AM -0500, Shaoqin Huang wrote:
> > > From: Alexandru Elisei 
> > > 
> > > There are 25 header files today (found with grep -r "#ifndef __ASSEMBLY__)
> > > with functionality relies on the __ASSEMBLY__ prepocessor constant being
> > > correctly defined to work correctly. So far, kvm-unit-tests has relied on
> > > the assembly files to define the constant before including any header
> > > files which depend on it.
> > > 
> > > Let's make sure that nobody gets this wrong and define it as a compiler
> > > constant when compiling assembly files. __ASSEMBLY__ is now defined for 
> > > all
> > > .S files, even those that didn't set it explicitely before.
> > > 
> > > Reviewed-by: Nikos Nikoleris 
> > > Reviewed-by: Andrew Jones 
> > > Signed-off-by: Alexandru Elisei 
> > > Signed-off-by: Shaoqin Huang 
> > > ---
> > >  Makefile   | 5 -
> > >  arm/cstart.S   | 1 -
> > >  arm/cstart64.S | 1 -
> > >  powerpc/cstart64.S | 1 -
> > >  4 files changed, 4 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/Makefile b/Makefile
> > > index 602910dd..27ed14e6 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -92,6 +92,9 @@ CFLAGS += -Woverride-init -Wmissing-prototypes 
> > > -Wstrict-prototypes
> > >  
> > >  autodepend-flags = -MMD -MP -MF $(dir $*).$(notdir $*).d
> > >  
> > > +AFLAGS  = $(CFLAGS)
> > > +AFLAGS += -D__ASSEMBLY__
> > > +
> > >  LDFLAGS += -nostdlib $(no_pie) -z noexecstack
> > >  
> > >  $(libcflat): $(cflatobjs)
> > > @@ -113,7 +116,7 @@ directories:
> > >   @mkdir -p $(OBJDIRS)
> > >  
> > >  %.o: %.S
> > > - $(CC) $(CFLAGS) -c -nostdlib -o $@ $<
> > > + $(CC) $(AFLAGS) -c -nostdlib -o $@ $<
> > 
> > I think we can drop the two hunks above from this patch and just rely on
> > the compiler to add __ASSEMBLY__ for us when compiling assembly files.
> 
> I think the precompiler adds __ASSEMBLER__, not __ASSEMBLY__ [1]. Am I
> missing something?
> 
> [1] 
> https://gcc.gnu.org/onlinedocs/cpp/macros/predefined-macros.html#c.__ASSEMBLER__

You're right. I'm not opposed to changing all the __ASSEMBLY__ references
to __ASSEMBLER__. I'll try to do that at some point unless you beat me to
it.

Thanks,
drew

[RESEND PATCH net v4 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2024-02-15 Thread Sean Anderson

smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.

Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
CC: sta...@vger.kernel.org
Signed-off-by: Sean Anderson 
Reviewed-by: Camelia Groza 
Tested-by: Vladimir Oltean 
---
I got no response the first time I sent this, so I am resending to net.
This issue was introduced in a series which went through net, so I hope
it makes sense to take it via net.

[1] 
https://lore.kernel.org/linux-arm-kernel/20240108161904.2865093-1-sean.ander...@seco.com/

(no changes since v3)

Changes in v3:
- Change blamed commit to something more appropriate

Changes in v2:
- Fix one additional call to spin_unlock

 drivers/soc/fsl/qbman/qman.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 739e4eee6b75..1bf1f1ea67f0 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;

-   spin_lock(>cgr_lock);
+   spin_lock_irq(>cgr_lock);
qm_mc_start(>p);
qm_mc_commit(>p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(>p, )) {
-   spin_unlock(>cgr_lock);
+   spin_unlock_irq(>cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, >cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(, cgr->cgrid));
-   spin_unlock(>cgr_lock);
+   spin_unlock_irq(>cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }

@@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();

cgr->chan = p->config->channel;
-   spin_lock(>cgr_lock);
+   spin_lock_irq(>cgr_lock);

if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(>cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock(>cgr_lock);
+   spin_unlock_irq(>cgr_lock);
put_affine_portal();
return ret;
 }
--
2.35.1.1320.gc452695387.dirty


[Embedded World 2024, SECO 
SpA]

[RESEND PATCH net v4 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2024-02-15 Thread Sean Anderson

cgr_lock may be locked with interrupts already disabled by
smp_call_function_single. As such, we must use a raw spinlock to avoid
problems on PREEMPT_RT kernels. Although this bug has existed for a
while, it was not apparent until commit ef2a8d5478b9 ("net: dpaa: Adjust
queue depth on rate change") which invokes smp_call_function_single via
qman_update_cgr_safe every time a link goes up or down.

Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
CC: sta...@vger.kernel.org
Reported-by: Vladimir Oltean 
Closes: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/
Reported-by: Steffen Trumtrar 
Closes: https://lore.kernel.org/linux-arm-kernel/87wmsyvclu@pengutronix.de/
Signed-off-by: Sean Anderson 
Reviewed-by: Camelia Groza 
Tested-by: Vladimir Oltean 

---

Changes in v4:
- Add a note about how raw spinlocks aren't quite right

Changes in v3:
- Change blamed commit to something more appropriate

 drivers/soc/fsl/qbman/qman.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 1bf1f1ea67f0..7e9074519ad2 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -991,7 +991,7 @@ struct qman_portal {
/* linked-list of CSCN handlers. */
struct list_head cgr_cbs;
/* list lock */
-   spinlock_t cgr_lock;
+   raw_spinlock_t cgr_lock;
struct work_struct congestion_work;
struct work_struct mr_work;
char irqname[MAX_IRQNAME];
@@ -1281,7 +1281,7 @@ static int qman_create_portal(struct qman_portal *portal,
/* if the given mask is NULL, assume all CGRs can be seen */
qman_cgrs_fill(>cgrs[0]);
INIT_LIST_HEAD(>cgr_cbs);
-   spin_lock_init(>cgr_lock);
+   raw_spin_lock_init(>cgr_lock);
INIT_WORK(>congestion_work, qm_congestion_task);
INIT_WORK(>mr_work, qm_mr_process_task);
portal->bits = 0;
@@ -1456,11 +1456,14 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;

-   spin_lock_irq(>cgr_lock);
+   /*
+* FIXME: QM_MCR_TIMEOUT is 10ms, which is too long for a raw spinlock!
+*/
+   raw_spin_lock_irq(>cgr_lock);
qm_mc_start(>p);
qm_mc_commit(>p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(>p, )) {
-   spin_unlock_irq(>cgr_lock);
+   raw_spin_unlock_irq(>cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1479,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, >cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(, cgr->cgrid));
-   spin_unlock_irq(>cgr_lock);
+   raw_spin_unlock_irq(>cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }

@@ -2440,7 +2443,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();

cgr->chan = p->config->channel;
-   spin_lock_irq(>cgr_lock);
+   raw_spin_lock_irq(>cgr_lock);

if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2480,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(>cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock_irq(>cgr_lock);
+   raw_spin_unlock_irq(>cgr_lock);
put_affine_portal();
return ret;
 }
@@ -2512,7 +2515,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
return -EINVAL;

memset(_opts, 0, sizeof(struct qm_mcc_initcgr));
-   spin_lock_irqsave(>cgr_lock, irqflags);
+   raw_spin_lock_irqsave(>cgr_lock, irqflags);
list_del(>node);
/*
 * If there are no other CGR objects for this CGRID in the list,
@@ -2537,7 +2540,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
/* add back to the list */
list_add(>node, >cgr_cbs);
 release_lock:
-   spin_unlock_irqrestore(>cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(>cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
@@ -2577,9 +2580,9 @@ static int qman_update_cgr(struct qman_cgr *cgr, struct 
qm_mcc_initcgr *opts)
if (!p)
return -EINVAL;

-   spin_lock_irqsave(>cgr_lock, irqflags);
+   raw_spin_lock_irqsave(>cgr_lock, irqflags);
ret = qm_modify_cgr(cgr, 0, opts);
-   spin_unlock_irqrestore(>cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(>cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
--
2.35.1.1320.gc452695387.dirty


[Embedded World 2024, SECO 
SpA]

Re: [kvm-unit-tests PATCH v1 01/18] Makefile: Define ASSEMBLY for assembly files

2024-02-15 Thread Alexandru Elisei

Hi Drew,

On Mon, Jan 15, 2024 at 01:44:17PM +0100, Andrew Jones wrote:
> On Thu, Nov 30, 2023 at 04:07:03AM -0500, Shaoqin Huang wrote:
> > From: Alexandru Elisei 
> > 
> > There are 25 header files today (found with grep -r "#ifndef __ASSEMBLY__)
> > with functionality relies on the __ASSEMBLY__ prepocessor constant being
> > correctly defined to work correctly. So far, kvm-unit-tests has relied on
> > the assembly files to define the constant before including any header
> > files which depend on it.
> > 
> > Let's make sure that nobody gets this wrong and define it as a compiler
> > constant when compiling assembly files. __ASSEMBLY__ is now defined for all
> > .S files, even those that didn't set it explicitely before.
> > 
> > Reviewed-by: Nikos Nikoleris 
> > Reviewed-by: Andrew Jones 
> > Signed-off-by: Alexandru Elisei 
> > Signed-off-by: Shaoqin Huang 
> > ---
> >  Makefile   | 5 -
> >  arm/cstart.S   | 1 -
> >  arm/cstart64.S | 1 -
> >  powerpc/cstart64.S | 1 -
> >  4 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Makefile b/Makefile
> > index 602910dd..27ed14e6 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -92,6 +92,9 @@ CFLAGS += -Woverride-init -Wmissing-prototypes 
> > -Wstrict-prototypes
> >  
> >  autodepend-flags = -MMD -MP -MF $(dir $*).$(notdir $*).d
> >  
> > +AFLAGS  = $(CFLAGS)
> > +AFLAGS += -D__ASSEMBLY__
> > +
> >  LDFLAGS += -nostdlib $(no_pie) -z noexecstack
> >  
> >  $(libcflat): $(cflatobjs)
> > @@ -113,7 +116,7 @@ directories:
> > @mkdir -p $(OBJDIRS)
> >  
> >  %.o: %.S
> > -   $(CC) $(CFLAGS) -c -nostdlib -o $@ $<
> > +   $(CC) $(AFLAGS) -c -nostdlib -o $@ $<
> 
> I think we can drop the two hunks above from this patch and just rely on
> the compiler to add __ASSEMBLY__ for us when compiling assembly files.

I think the precompiler adds __ASSEMBLER__, not __ASSEMBLY__ [1]. Am I
missing something?

[1] 
https://gcc.gnu.org/onlinedocs/cpp/macros/predefined-macros.html#c.__ASSEMBLER__

Thanks,
Alex

> 
> Thanks,
> drew
> 
> >  
> >  -include */.*.d */*/.*.d
> >  
> > diff --git a/arm/cstart.S b/arm/cstart.S
> > index 3dd71ed9..b24ecabc 100644
> > --- a/arm/cstart.S
> > +++ b/arm/cstart.S
> > @@ -5,7 +5,6 @@
> >   *
> >   * This work is licensed under the terms of the GNU LGPL, version 2.
> >   */
> > -#define __ASSEMBLY__
> >  #include 
> >  #include 
> >  #include 
> > diff --git a/arm/cstart64.S b/arm/cstart64.S
> > index bc2be45a..a8ad6dc8 100644
> > --- a/arm/cstart64.S
> > +++ b/arm/cstart64.S
> > @@ -5,7 +5,6 @@
> >   *
> >   * This work is licensed under the terms of the GNU GPL, version 2.
> >   */
> > -#define __ASSEMBLY__
> >  #include 
> >  #include 
> >  #include 
> > diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
> > index 34e39341..fa32ef24 100644
> > --- a/powerpc/cstart64.S
> > +++ b/powerpc/cstart64.S
> > @@ -5,7 +5,6 @@
> >   *
> >   * This work is licensed under the terms of the GNU LGPL, version 2.
> >   */
> > -#define __ASSEMBLY__
> >  #include 
> >  #include 
> >  #include 
> > -- 
> > 2.40.1
> >

Re: [PATCH v2] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

2024-02-15 Thread Peter Bergner

On 2/15/24 2:16 AM, Arnd Bergmann wrote:
> On Wed, Feb 14, 2024, at 23:34, Peter Bergner wrote:
>> The powerpc toolchain keeps a copy of the HWCAP bit masks in our TCB for fast
>> access by the __builtin_cpu_supports built-in function.  The TCB space for
>> the HWCAP entries - which are created in pairs - is an ABI extension, so
>> waiting to create the space for HWCAP3 and HWCAP4 until we need them is
>> problematical.  Define AT_HWCAP3 and AT_HWCAP4 in the generic uapi header
>> so they can be used in glibc to reserve space in the powerpc TCB for their
>> future use.
>>
>> I scanned through the Linux and GLIBC source codes looking for unused AT_*
>> values and 29 and 30 did not seem to be used, so they are what I went
>> with.  This has received Acked-by's from both GLIBC and Linux kernel
>> developers and no reservations or Nacks from anyone.
>>
>> Arnd, we seem to have consensus on the patch below.  Is this something
>> you could take and apply to your tree? 
>>
> 
> I don't mind taking it, but it may be better to use the
> powerpc tree if that is where it's actually being used.

So this is not a powerpc only patch, but we may be the first arch
to use it.  Szabolcs mentioned that aarch64 was pretty quickly filling
up their AT_HWCAP2 and that they will eventually require using AT_HWCAP3
as well.  If you still think this should go through the powerpc tree,
I can check on that.

Peter

[powerpc:fixes-test] BUILD SUCCESS 0846dd77c8349ec92ca0079c9c71d130f34cb192

2024-02-15 Thread kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: 0846dd77c8349ec92ca0079c9c71d130f34cb192  powerpc/iommu: Fix the 
missing iommu_group_put() during platform domain attach

elapsed time: 1459m

configs tested: 109
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha allnoconfig   gcc  
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
arc  allmodconfig   gcc  
arc   allnoconfig   gcc  
arc  allyesconfig   gcc  
arc defconfig   gcc  
arm  allmodconfig   gcc  
arm   allnoconfig   clang
arm  allyesconfig   gcc  
arm defconfig   clang
arm  ep93xx_defconfig   clang
arm   imxrt_defconfig   clang
arm64allmodconfig   clang
arm64 allnoconfig   gcc  
arm64   defconfig   gcc  
csky allmodconfig   gcc  
csky  allnoconfig   gcc  
csky allyesconfig   gcc  
cskydefconfig   gcc  
hexagon  allmodconfig   clang
hexagon   allnoconfig   clang
hexagon  allyesconfig   clang
hexagon defconfig   clang
i386  allnoconfig   gcc  
i386 buildonly-randconfig-001-20240215   clang
i386 buildonly-randconfig-002-20240215   clang
i386 buildonly-randconfig-003-20240215   clang
i386 buildonly-randconfig-004-20240215   clang
i386 buildonly-randconfig-005-20240215   clang
i386 buildonly-randconfig-006-20240215   clang
i386defconfig   clang
i386  randconfig-001-20240215   gcc  
i386  randconfig-002-20240215   gcc  
i386  randconfig-003-20240215   clang
i386  randconfig-004-20240215   gcc  
i386  randconfig-005-20240215   gcc  
i386  randconfig-006-20240215   gcc  
i386  randconfig-011-20240215   clang
i386  randconfig-012-20240215   clang
i386  randconfig-013-20240215   gcc  
i386  randconfig-014-20240215   gcc  
i386  randconfig-015-20240215   clang
i386  randconfig-016-20240215   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarchallyesconfig   gcc  
loongarch   defconfig   gcc  
m68k allmodconfig   gcc  
m68k  allnoconfig   gcc  
m68k allyesconfig   gcc  
m68kdefconfig   gcc  
microblaze   allmodconfig   gcc  
microblazeallnoconfig   gcc  
microblaze   allyesconfig   gcc  
microblaze  defconfig   gcc  
mips allmodconfig   gcc  
mips  allnoconfig   gcc  
mips allyesconfig   gcc  
mipsmaltaup_defconfig   clang
nios2allmodconfig   gcc  
nios2 allnoconfig   gcc  
nios2allyesconfig   gcc  
nios2   defconfig   gcc  
openrisc allmodconfig   gcc  
openrisc  allnoconfig   gcc  
openrisc allyesconfig   gcc  
openriscdefconfig   gcc  
parisc   allmodconfig   gcc  
pariscallnoconfig   gcc  
parisc   allyesconfig   gcc  
parisc  defconfig   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
powerpc  allyesconfig   clang
riscvallmodconfig   clang
riscv allnoconfig   gcc  
riscvallyesconfig   clang
riscv   defconfig   clang
riscvnommu_k210_defconfig   clang
s390 allmodconfig   clang
s390  allnoconfig   clang
s390 allyesconfig   gcc  
s390defconfig

[powerpc:next] BUILD SUCCESS 14ce0dbb562713bc058ad16d281db355757e6ec0

2024-02-15 Thread kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next
branch HEAD: 14ce0dbb562713bc058ad16d281db355757e6ec0  powerpc: ibmebus: make 
ibmebus_bus_type const

elapsed time: 1445m

configs tested: 120
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha allnoconfig   gcc  
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
arc  allmodconfig   gcc  
arc   allnoconfig   gcc  
arc  allyesconfig   gcc  
arc defconfig   gcc  
arm  allmodconfig   gcc  
arm   allnoconfig   clang
arm  allyesconfig   gcc  
arm   aspeed_g5_defconfig   gcc  
arm defconfig   clang
arm  ep93xx_defconfig   clang
arm   imxrt_defconfig   clang
arm   sama5_defconfig   gcc  
arm64allmodconfig   clang
arm64 allnoconfig   gcc  
arm64   defconfig   gcc  
csky allmodconfig   gcc  
csky  allnoconfig   gcc  
csky allyesconfig   gcc  
cskydefconfig   gcc  
hexagon  allmodconfig   clang
hexagon   allnoconfig   clang
hexagon  allyesconfig   clang
hexagon defconfig   clang
i386 allmodconfig   gcc  
i386  allnoconfig   gcc  
i386 allyesconfig   gcc  
i386 buildonly-randconfig-001-20240215   clang
i386 buildonly-randconfig-002-20240215   clang
i386 buildonly-randconfig-003-20240215   clang
i386 buildonly-randconfig-004-20240215   clang
i386 buildonly-randconfig-005-20240215   clang
i386 buildonly-randconfig-006-20240215   clang
i386defconfig   clang
i386  randconfig-001-20240215   gcc  
i386  randconfig-002-20240215   gcc  
i386  randconfig-003-20240215   clang
i386  randconfig-004-20240215   gcc  
i386  randconfig-005-20240215   gcc  
i386  randconfig-006-20240215   gcc  
i386  randconfig-011-20240215   clang
i386  randconfig-012-20240215   clang
i386  randconfig-013-20240215   gcc  
i386  randconfig-014-20240215   gcc  
i386  randconfig-015-20240215   clang
i386  randconfig-016-20240215   gcc  
loongarchallmodconfig   gcc  
loongarch allnoconfig   gcc  
loongarchallyesconfig   gcc  
loongarch   defconfig   gcc  
m68k allmodconfig   gcc  
m68k  allnoconfig   gcc  
m68k allyesconfig   gcc  
m68kdefconfig   gcc  
m68k  sun3x_defconfig   gcc  
microblaze   allmodconfig   gcc  
microblazeallnoconfig   gcc  
microblaze   allyesconfig   gcc  
microblaze  defconfig   gcc  
mips allmodconfig   gcc  
mips  allnoconfig   gcc  
mips allyesconfig   gcc  
mips  maltasmvp_eva_defconfig   gcc  
mipsmaltaup_defconfig   clang
nios2allmodconfig   gcc  
nios2 allnoconfig   gcc  
nios2allyesconfig   gcc  
nios2   defconfig   gcc  
openrisc allmodconfig   gcc  
openrisc  allnoconfig   gcc  
openrisc allyesconfig   gcc  
openriscdefconfig   gcc  
parisc   allmodconfig   gcc  
pariscallnoconfig   gcc  
parisc   allyesconfig   gcc  
parisc  defconfig   gcc  
parisc64defconfig   gcc  
powerpc  allmodconfig   gcc  
powerpc   allnoconfig   gcc  
powerpc  allyesconfig   clang
riscvallmodconfig   clang
riscv allnoconfig   gcc  
riscvallyesconfig   clang
riscv

Re: [PATCH v2] powerpc/iommu: Fix the iommu group reference leak during platform domain attach

2024-02-15 Thread Shivaprasad G Bhat


On 2/15/24 08:01, Michael Ellerman wrote:

Shivaprasad G Bhat  writes:

The function spapr_tce_platform_iommu_attach_dev() is missing to call
iommu_group_put() when the domain is already set. This refcount leak
shows up with BUG_ON() during DLPAR remove operation as,



   [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
   [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
   [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
   [c013aed5fe50] [c000d05c] system_call_vectored_common+0x15c/0x2ec
   --- interrupt: 3000 at 0x2433acb4
   
   ---[ end trace  ]---

The patch makes the iommu_group_get() call only when using it there by
avoiding the leak.

Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions after 
probe")
Reported-by: Venkat Rao Bagalkote 
Closes: 
https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e2...@linux.vnet.ibm.com
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v1: 
https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
  - Minor refactor to call the iommu_group_get() only if required.
  - Updated the title, description and signature(Closes/Reported-by).

Sorry I already applied v1.

If you send this as a patch on top of v1 with a new change log I can
merge it as a cleanup/rework.


I have posted the cleanup patch at 
https://lore.kernel.org/linux-iommu/170800513841.2411.13524607664262048895.st...@linux.ibm.com/


Thank you!

Shivaprasad


cheers

[PATCH] powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev()

2024-02-15 Thread Shivaprasad G Bhat

The patch makes the iommu_group_get() call only when using it
thereby avoiding the unnecessary get & put for domain already
being set case.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: Shivaprasad G Bhat 
---
Changelog:
v2: 
https://lore.kernel.org/linux-iommu/170793401503.7491.9431631474642074097.st...@linux.ibm.com/
 - As the v1 itself was merged, the patch was suggested to be reposted as
   cleanup/refactoring to be applied on top of v1.
 - Removed the versioning as this is actually new cleanup/refactoring.
 - Retaining the Reviewed-by as the effective new code was actually reviewed.

v1: 
https://lore.kernel.org/all/170784021983.6249.10039296655906636112.st...@linux.ibm.com/
 - Minor refactor to call the iommu_get_group only if required.
 - Updated the title, description and signature(Closes/Reported-by).

 arch/powerpc/kernel/iommu.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a9bebfd56b3b..37fae3bd89c6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1285,15 +1285,14 @@ spapr_tce_platform_iommu_attach_dev(struct iommu_domain 
*platform_domain,
struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;
+   struct iommu_group *grp;

/* At first attach the ownership is already set */
-   if (!domain) {
-   iommu_group_put(grp);
+   if (!domain)
return 0;
-   }

+   grp = iommu_group_get(dev);
table_group = iommu_group_get_iommudata(grp);
/*
 * The domain being set to PLATFORM from earlier

Re: [PATCH v2 0/5] powerpc: struct bus_type cleanup

2024-02-15 Thread Michael Ellerman

On Mon, 12 Feb 2024 17:04:58 -0300, Ricardo B. Marliere wrote:
> This series is part of an effort to cleanup the users of the driver
> core, as can be seen in many recent patches authored by Greg across the
> tree (e.g. [1]). Patch 1/5 is a prerequisite to 2/5, but the others have
> no dependency. They were built using bootlin's without warnings using
> powerpc64le-power8--glibc--stable-2023.11-1 toolchain.
> 

Applied to powerpc/next.

[1/5] powerpc: vio: move device attributes into a new ifdef
  https://git.kernel.org/powerpc/c/e15d01277a8bdacf8ac485049d21d450153fa47e
[2/5] powerpc: vio: make vio_bus_type const
  https://git.kernel.org/powerpc/c/565206aaa6528b30df9294e9aafac429e4bc94eb
[3/5] powerpc: mpic: make mpic_subsys const
  https://git.kernel.org/powerpc/c/8e3d0b8d99d708e8262e76313e0436339add80ec
[4/5] powerpc: pmac: make macio_bus_type const
  https://git.kernel.org/powerpc/c/112202f34e56cd475e26b2a461dd856ca7570ef9
[5/5] powerpc: ibmebus: make ibmebus_bus_type const
  https://git.kernel.org/powerpc/c/14ce0dbb562713bc058ad16d281db355757e6ec0

cheers

Re: [RFC PATCH 1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core

2024-02-15 Thread Michael Ellerman

On Fri, 29 Dec 2023 23:01:03 +1100, Michael Ellerman wrote:
> If nr_cpu_ids is too low to include at least all the threads of a single
> core adjust nr_cpu_ids upwards. This avoids triggering odd bugs in code
> that assumes all threads of a core are available.
> 
> 

Applied to powerpc/next.

[1/5] powerpc/smp: Adjust nr_cpu_ids to cover all threads of a core
  https://git.kernel.org/powerpc/c/5580e96dad5a439d561d9648ffcbccb739c2a120
[2/5] powerpc/smp: Increase nr_cpu_ids to include the boot CPU
  https://git.kernel.org/powerpc/c/777f81f0a9c780a6443bcf2c7785f0cc2e87c1ef
[3/5] powerpc/smp: Lookup avail once per device tree node
  https://git.kernel.org/powerpc/c/dca79603fbc592ec7ea8bd7ba274052d3984e882
[4/5] powerpc/smp: Factor out assign_threads()
  https://git.kernel.org/powerpc/c/9832de654499f0bf797a3719c4d4c5bd401f18f5
[5/5] powerpc/smp: Remap boot CPU onto core 0 if >= nr_cpu_ids
  https://git.kernel.org/powerpc/c/0875f1ceba974042069f04946aa8f1d4d1e688da

cheers

Re: [PATCH] powerpc: Force inlining of arch_vmap_p{u/m}d_supported()

2024-02-15 Thread Michael Ellerman

On Tue, 13 Feb 2024 14:58:37 +0100, Christophe Leroy wrote:
> arch_vmap_pud_supported() and arch_vmap_pmd_supported() are
> expected to constant-fold to false when RADIX is not enabled.
> 
> Force inlining in order to avoid following failure which
> leads to unexpected call of non-existing pud_set_huge() and
> pmd_set_huge() on powerpc 8xx.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc: Force inlining of arch_vmap_p{u/m}d_supported()
  https://git.kernel.org/powerpc/c/c5aebb53b32460bc52680dd4e2a2f6b84d5ea521

cheers

Re: [PING PATCH] powerpc/kasan: Fix addr error caused by page alignment

2024-02-15 Thread Michael Ellerman

On Tue, 23 Jan 2024 09:45:59 +0800, Jiangfeng Xiao wrote:
> In kasan_init_region, when k_start is not page aligned,
> at the begin of for loop, k_cur = k_start & PAGE_MASK
> is less than k_start, and then va = block + k_cur - k_start
> is less than block, the addr va is invalid, because the
> memory address space from va to block is not alloced by
> memblock_alloc, which will not be reserved
> by memblock_reserve later, it will be used by other places.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/kasan: Fix addr error caused by page alignment
  https://git.kernel.org/powerpc/c/4a7aee96200ad281a5cc4cf5c7a2e2a49d2b97b0

cheers

Re: [PATCH] powerpc/pseries: fix accuracy of stolen time

2024-02-15 Thread Michael Ellerman

On Tue, 13 Feb 2024 10:56:35 +0530, Shrikanth Hegde wrote:
> powerVM hypervisor updates the VPA fields with stolen time data.
> It currently reports enqueue_dispatch_tb and ready_enqueue_tb for
> this purpose. In linux these two fields are used to report the stolen time.
> 
> The VPA fields are updated at the TB frequency. On powerPC its mostly
> set at 512Mhz. Hence this needs a conversion to ns when reporting it
> back as rest of the kernel timings are in ns. This conversion is already
> handled in tb_to_ns function. So use that function to report accurate
> stolen time.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/pseries: fix accuracy of stolen time
  https://git.kernel.org/powerpc/c/cbecc9fcbbec60136b0180ba0609c829afed5c81

cheers

Re: [PATCH] powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach

2024-02-15 Thread Michael Ellerman

On Tue, 13 Feb 2024 10:05:22 -0600, Shivaprasad G Bhat wrote:
> The function spapr_tce_platform_iommu_attach_dev() is missing to call
> iommu_group_put() when the domain is already set. This refcount leak
> shows up with BUG_ON() during DLPAR remove operation as,
> 
>   KernelBug: Kernel bug in state 'None': kernel BUG at 
> arch/powerpc/platforms/pseries/iommu.c:100!
>   Oops: Exception in kernel mode, sig: 5 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
>   
>   Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
> of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries
>   NIP:  c00ff4d4 LR: c00ff4cc CTR: 
>   REGS: c013aed5f840 TRAP: 0700   Tainted: G  I 
> (6.8.0-rc3-autotest-g99bd3cb0d12e)
>   MSR:  80029033   CR: 44002402  XER: 2004
>   CFAR: c0a0d170 IRQMASK: 0
>   GPR00: c00ff4cc c013aed5fae0 c1512700 c013aa362138
>   GPR04:    000119c8afd0
>   GPR08:  c01284442b00 0001 1003
>   GPR12: 0003 c0182f00  
>   GPR16:    
>   GPR20:    
>   GPR24: c013aed5fc40 0002  c2757d90
>   GPR28: c00ff440 c2757cb8 c0183799c1a0 c013aa362b00
>   NIP [c00ff4d4] iommu_reconfig_notifier+0x94/0x200
>   LR [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200
>   Call Trace:
>   [c013aed5fae0] [c00ff4cc] iommu_reconfig_notifier+0x8c/0x200 
> (unreliable)
>   [c013aed5fb10] [c01a27b0] notifier_call_chain+0xb8/0x19c
>   [c013aed5fb70] [c01a2a78] blocking_notifier_call_chain+0x64/0x98
>   [c013aed5fbb0] [c0c4a898] of_reconfig_notify+0x44/0xdc
>   [c013aed5fc20] [c0c4add4] of_detach_node+0x78/0xb0
>   [c013aed5fc70] [c00f96a8] ofdt_write.part.0+0x86c/0xbb8
>   [c013aed5fce0] [c069b4bc] proc_reg_write+0xf4/0x150
>   [c013aed5fd10] [c05bfeb4] vfs_write+0xf8/0x488
>   [c013aed5fdc0] [c05c0570] ksys_write+0x84/0x140
>   [c013aed5fe10] [c0033358] system_call_exception+0x138/0x330
>   [c013aed5fe50] [c000d05c] 
> system_call_vectored_common+0x15c/0x2ec
>   --- interrupt: 3000 at 0x2433acb4
>   
>   ---[ end trace  ]---
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/iommu: Fix the missing iommu_group_put() during platform domain 
attach
  https://git.kernel.org/powerpc/c/0846dd77c8349ec92ca0079c9c71d130f34cb192

cheers

Re: [PATCH] papr_vpd.c: calling devfd before get_system_loc_code

2024-02-15 Thread Michael Ellerman

On Wed, 31 Jan 2024 18:38:59 +0530, R Nageswara Sastry wrote:
> Calling get_system_loc_code before checking devfd and errno - fails the test
> when the device is not available, expected a SKIP.
> Change the order of 'SKIP_IF_MSG' correctly SKIP when the /dev/papr-vpd device
> is not available.
> 
> with out patch: Test FAILED on line 271
> with patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present
> 
> [...]

Applied to powerpc/fixes.

[1/1] papr_vpd.c: calling devfd before get_system_loc_code
  https://git.kernel.org/powerpc/c/f09696279b5dd1770a3de2e062f1c5d1449213ff

cheers

Re: [PATCH v2] powerpc/ftrace: Ignore ftrace locations in exit text sections

2024-02-15 Thread Michael Ellerman

On Tue, 13 Feb 2024 23:24:10 +0530, Naveen N Rao wrote:
> Michael reported that we are seeing ftrace bug on bootup when KASAN is
> enabled, and if we are using -fpatchable-function-entry:
> 
> ftrace: allocating 47780 entries in 18 pages
> ftrace-powerpc: 0xc20b3d5c: No module provided for non-kernel 
> address
> [ ftrace bug ]
> ftrace faulted on modifying
> [] 0xc20b3d5c
> Initializing ftrace call sites
> ftrace record flags: 0
>  (0)
>  expected tramp: c008cef4
> [ cut here ]
> WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:2180 
> ftrace_bug+0x3c0/0x424
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0-rc3-00120-g0f71dcfb4aef #860
> Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 
> 0xf05 of:SLOF,HEAD hv:linux,kvm pSeries
> NIP:  c03aa81c LR: c03aa818 CTR: 
> REGS: c33cfab0 TRAP: 0700   Not tainted  
> (6.5.0-rc3-00120-g0f71dcfb4aef)
> MSR:  82021033   CR: 28028240  XER: 
> 
> CFAR: c02781a8 IRQMASK: 3
> ...
> NIP [c03aa81c] ftrace_bug+0x3c0/0x424
> LR [c03aa818] ftrace_bug+0x3bc/0x424
> Call Trace:
>  ftrace_bug+0x3bc/0x424 (unreliable)
>  ftrace_process_locs+0x5f4/0x8a0
>  ftrace_init+0xc0/0x1d0
>  start_kernel+0x1d8/0x484
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/ftrace: Ignore ftrace locations in exit text sections
  https://git.kernel.org/powerpc/c/ea73179e64131bcd29ba6defd33732abdf8ca14b

cheers

Re: [PATCH v2] powerpc/64: Set task pt_regs->link to the LR value on scv entry

2024-02-15 Thread Michael Ellerman

On Fri, 02 Feb 2024 21:13:16 +0530, Naveen N Rao wrote:
> Nysal reported that userspace backtraces are missing in offcputime bcc
> tool. As an example:
> $ sudo ./bcc/tools/offcputime.py -uU
> Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to 
> end.
> 
> ^C
>   write
>   -python (9107)
>   8
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/64: Set task pt_regs->link to the LR value on scv entry
  https://git.kernel.org/powerpc/c/aad98efd0b121f63a2e1c221dcb4d4850128c697

cheers

Re: [PATCH] powerpc/pseries/papr-sysparm: use u8 arrays for payloads

2024-02-15 Thread Michael Ellerman

On Fri, 02 Feb 2024 18:26:46 -0600, Nathan Lynch wrote:
> Some PAPR system parameter values are formatted by firmware as
> nul-terminated strings (e.g. LPAR name, shared processor attributes).
> But the values returned for other parameters, such as processor module
> info and TLB block invalidate characteristics, are binary data with
> parameter-specific layouts. So char[] isn't the appropriate type for
> the general case. Use u8/__u8.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/pseries/papr-sysparm: use u8 arrays for payloads
  https://git.kernel.org/powerpc/c/8ded03ae48b3657e0fbca99d8a9d8fa1bfb8c9be

cheers

Re: [PATCH 1/2] powerpc: udbg_memcons: mark functions static

2024-02-15 Thread Michael Ellerman

On Tue, 23 Jan 2024 13:51:41 +0100, Arnd Bergmann wrote:
> ppc64_book3e_allmodconfig has one more driver that triggeres a
> few missing-prototypes warnings:
> 
> arch/powerpc/sysdev/udbg_memcons.c:44:6: error: no previous prototype for 
> 'memcons_putc' [-Werror=missing-prototypes]
> arch/powerpc/sysdev/udbg_memcons.c:57:5: error: no previous prototype for 
> 'memcons_getc_poll' [-Werror=missing-prototypes]
> arch/powerpc/sysdev/udbg_memcons.c:80:5: error: no previous prototype for 
> 'memcons_getc' [-Werror=missing-prototypes]
> 
> [...]

Applied to powerpc/fixes.

[1/2] powerpc: udbg_memcons: mark functions static
  https://git.kernel.org/powerpc/c/5c84bc8b617bf90e722cc57d447abd9a468d3a52
[2/2] powerpc: 85xx: mark local functions static
  https://git.kernel.org/powerpc/c/1c57b9f63ab34f01b8c73731cc0efacb5a9a2f16

cheers

Re: [PATCH] powerpc/kasan: Limit KASAN thread size increase to 32KB

2024-02-15 Thread Michael Ellerman

On Mon, 12 Feb 2024 17:42:44 +1100, Michael Ellerman wrote:
> KASAN is seen to increase stack usage, to the point that it was reported
> to lead to stack overflow on some 32-bit machines (see link).
> 
> To avoid overflows the stack size was doubled for KASAN builds in
> commit 3e8635fb2e07 ("powerpc/kasan: Force thread size increase with
> KASAN").
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/kasan: Limit KASAN thread size increase to 32KB
  https://git.kernel.org/powerpc/c/f1acb109505d983779bbb7e20a1ee6244d2b5736

cheers

Re: [PATCH v2] powerpc/6xx: set High BAT Enable flag on G2_LE cores

2024-02-15 Thread Michael Ellerman

On Wed, 24 Jan 2024 11:38:38 +0100, Matthias Schiffer wrote:
> MMU_FTR_USE_HIGH_BATS is set for G2_LE cores and derivatives like e300cX,
> but the high BATs need to be enabled in HID2 to work. Add register
> definitions and add the needed setup to __setup_cpu_603.
> 
> This fixes boot on CPUs like the MPC5200B with STRICT_KERNEL_RWX enabled
> on systems where the flag has not been set by the bootloader already.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/6xx: set High BAT Enable flag on G2_LE cores
  https://git.kernel.org/powerpc/c/a038a3ff8c6582404834852c043dadc73a5b68b4

cheers

Re: [PATCH] powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-E

2024-02-15 Thread Michael Ellerman

On Wed, 07 Feb 2024 10:27:58 +0100, David Engraf wrote:
> Commit e320a76db4b0 ("powerpc/cputable: Split cpu_specs[] out of cputable.h")
> moved the cpu_specs to separate header files. Previously PPC_FEATURE_BOOKE
> was enabled by CONFIG_PPC_BOOK3E_64. The definition in cpu_specs_e500mc.h
> for PPC64 no longer enables PPC_FEATURE_BOOKE.
> 
> This breaks user space reading the ELF hwcaps and expect PPC_FEATURE_BOOKE.
> Debugging an application with gdb is no longer working on e5500/e6500
> because the 64-bit detection relies on PPC_FEATURE_BOOKE for Book-E.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-E
  https://git.kernel.org/powerpc/c/eb6d871f4ba49ac8d0537e051fe983a3a4027f61

cheers

Re: [PATCH 0/2] ALSA: struct bus_type cleanup

2024-02-15 Thread Takashi Iwai

On Wed, 14 Feb 2024 20:28:27 +0100,
Ricardo B. Marliere wrote:
> 
> This series is part of an effort to cleanup the users of the driver
> core, as can be seen in many recent patches authored by Greg across the
> tree (e.g. [1]).
> 
> ---
> [1]: 
> https://lore.kernel.org/lkml/?q=f%3Agregkh%40linuxfoundation.org+s%3A%22make%22+and+s%3A%22const%22
> 
> To: Johannes Berg 
> To: Jaroslav Kysela 
> To: Takashi Iwai 
> Cc:  
> Cc:  
> Cc:  
> Cc:  
> Cc: Greg Kroah-Hartman 
> Signed-off-by: Ricardo B. Marliere 
> 
> ---
> Ricardo B. Marliere (2):
>   ALSA: aoa: make soundbus_bus_type const
>   ALSA: seq: make snd_seq_bus_type const

Applied both patches now.  Thanks.


Takashi

Re: [PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:47AM +, Ryan Roberts wrote:
> Hi All,
> 
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. The change benefits arm64, but there is some (very) minor
> refactoring for x86 to enable its integration with core-mm.

I've looked over each of the arm64-specific patches, and those all seem good to
me. I've thrown my local Syzkaller instance at the series, and I'll shout if
that hits anything that's not clearly a latent issue prior to this series.

The other bits also look good to me, so FWIW, for the series as a whole:

Acked-by: Mark Rutland 

Mark.

Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:32:05AM +, Ryan Roberts wrote:
> There are situations where a change to a single PTE could cause the
> contpte block in which it resides to become foldable (i.e. could be
> repainted with the contiguous bit). Such situations arise, for example,
> when user space temporarily changes protections, via mprotect, for
> individual pages, such can be the case for certain garbage collectors.
> 
> We would like to detect when such a PTE change occurs. However this can
> be expensive due to the amount of checking required. Therefore only
> perform the checks when an indiviual PTE is modified via mprotect
> (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
> when we are setting the final PTE in a contpte-aligned block.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 26 +
>  arch/arm64/mm/contpte.c  | 64 
>  2 files changed, 90 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 8310875133ff..401087e8a43d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1185,6 +1185,8 @@ extern void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>   * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
>   * a private implementation detail of the public ptep API (see below).
>   */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
>  extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte);
>  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> @@ -1206,6 +1208,29 @@ extern int contpte_ptep_set_access_flags(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
>  
> +static __always_inline void contpte_try_fold(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * Only bother trying if both the virtual and physical addresses are
> +  * aligned and correspond to the last entry in a contig range. The core
> +  * code mostly modifies ranges from low to high, so this is the likely
> +  * the last modification in the contig range, so a good time to fold.
> +  * We can't fold special mappings, because there is no associated folio.
> +  */
> +
> + const unsigned long contmask = CONT_PTES - 1;
> + bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
> +
> + if (unlikely(valign)) {
> + bool palign = (pte_pfn(pte) & contmask) == contmask;
> +
> + if (unlikely(palign &&
> + pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
> + __contpte_try_fold(mm, addr, ptep, pte);
> + }
> +}
> +
>  static __always_inline void contpte_try_unfold(struct mm_struct *mm,
>   unsigned long addr, pte_t *ptep, pte_t pte)
>  {
> @@ -1286,6 +1311,7 @@ static __always_inline void set_ptes(struct mm_struct 
> *mm, unsigned long addr,
>   if (likely(nr == 1)) {
>   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>   __set_ptes(mm, addr, ptep, pte, 1);
> + contpte_try_fold(mm, addr, ptep, pte);
>   } else {
>   contpte_set_ptes(mm, addr, ptep, pte, nr);
>   }
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 50e0173dc5ee..16788f07716d 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -73,6 +73,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned 
> long addr,
>   __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>  }
>  
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> +  * We have already checked that the virtual and pysical addresses are
> +  * correctly aligned for a contpte mapping in contpte_try_fold() so the
> +  * remaining checks are to ensure that the contpte range is fully
> +  * covered by a single folio, and ensure that all the ptes are valid
> +  * with contiguous PFNs and matching prots. We ignore the state of the
> +  * access and dirty bits for the purpose of deciding if its a contiguous
> +  * range; the folding process will generate a single contpte entry which
> +  * has a single access and dirty bit. Those 2 bits are the logical OR of
> +  * their respective bits in the constituent pte entries. In order to
> +  * ensure the contpte range is covered by a single folio, we must
> +  * recover the folio from the pfn, but special mappings don't have a
> +  * folio backing them. Fortunately contpte_try_fold() already

Re: [PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:32:01AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the
> exit/munmap/dontneed performance regression introduced by the initial
> contpte commit. Subsequent patches will solve it entirely.
> 
> During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
> cleared. Previously this was done 1 PTE at a time. But the core-mm
> supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
> let's implement those APIs and for fully covered contpte mappings, we no
> longer need to unfold the contpte. This significantly reduces unfolding
> operations, reducing the number of tlbis that must be issued.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 67 
>  arch/arm64/mm/contpte.c  | 17 
>  2 files changed, 84 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 8643227c318b..a8f1a35e3086 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct 
> mm_struct *mm,
>   return pte;
>  }
>  
> +static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long 
> addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + __ptep_get_and_clear(mm, addr, ptep);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +
> +static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = __ptep_get_and_clear(mm, addr, ptep);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
>  static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
> @@ -1160,6 +1191,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t 
> orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte, unsigned int nr);
> +extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full);
> +extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1253,6 +1289,35 @@ static inline void pte_clear(struct mm_struct *mm,
>   __pte_clear(mm, addr, ptep);
>  }
>  
> +#define clear_full_ptes clear_full_ptes
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + contpte_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +}
> +
> +#define get_and_clear_full_ptes get_and_clear_full_ptes
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr, int full)
> +{
> + pte_t pte;
> +
> + if (likely(nr == 1)) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + } else {
> + pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
> + }
> +
> + return pte;
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>   unsigned long addr, pte_t *ptep)
> @@ -1337,6 +1402,8 @@ static inline int ptep_set_access_flags(struct 
> vm_area_struct *vma,
>  #define set_pte  __set_pte
>  #define set_ptes __set_ptes
>  #define pte_clear__pte_clear
> +#define clear_full_ptes

Re: [PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:32:00AM +, Ryan Roberts wrote:
> Optimize the contpte implementation to fix some of the fork performance
> regression introduced by the initial contpte commit. Subsequent patches
> will solve it entirely.
> 
> During fork(), any private memory in the parent must be write-protected.
> Previously this was done 1 PTE at a time. But the core-mm supports
> batched wrprotect via the new wrprotect_ptes() API. So let's implement
> that API and for fully covered contpte mappings, we no longer need to
> unfold the contpte. This has 2 benefits:
> 
>   - reduced unfolding, reduces the number of tlbis that must be issued.
>   - The memory remains contpte-mapped ("folded") in the parent, so it
> continues to benefit from the more efficient use of the TLB after
> the fork.
> 
> The optimization to wrprotect a whole contpte block without unfolding is
> possible thanks to the tightening of the Arm ARM in respect to the
> definition and behaviour when 'Misprogramming the Contiguous bit'. See
> section D21194 at https://developer.arm.com/documentation/102105/ja-07/
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 61 ++--
>  arch/arm64/mm/contpte.c  | 38 
>  2 files changed, 89 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 831099cfc96b..8643227c318b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
> mm_struct *mm,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -/*
> - * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> - * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> - */
> -static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> - unsigned long address, pte_t *ptep)
> +static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep,
> + pte_t pte)
>  {
> - pte_t old_pte, pte;
> + pte_t old_pte;
>  
> - pte = __ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_wrprotect(pte);
> @@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
> *mm,
>   } while (pte_val(pte) != pte_val(old_pte));
>  }
>  
> +/*
> + * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
> + * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
> + */
> +static inline void __ptep_set_wrprotect(struct mm_struct *mm,
> + unsigned long address, pte_t *ptep)
> +{
> + ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
> +}
> +
> +static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long 
> address,
> + pte_t *ptep, unsigned int nr)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> + __ptep_set_wrprotect(mm, address, ptep);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_PMDP_SET_WRPROTECT
>  static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> @@ -1149,6 +1164,8 @@ extern int contpte_ptep_test_and_clear_young(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep);
> +extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr);
>  extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t entry, int dirty);
> @@ -1268,12 +1285,35 @@ static inline int ptep_clear_flush_young(struct 
> vm_area_struct *vma,
>   return contpte_ptep_clear_flush_young(vma, addr, ptep);
>  }
>  
> +#define wrprotect_ptes wrprotect_ptes
> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr)
> +{
> + if (likely(nr == 1)) {
> + /*
> +  * Optimization: wrprotect_ptes() can only be called for present
> +  * ptes so we only need to check contig bit as condition for
> +  * unfold, and we can remove the contig bit from the pte we read
> +  * to avoid re-reading. This speeds up fork() which is sensitive
> +  * for order-0 folios. Equivalent to contpte_try_unfold().
> +  */
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (unlikely(pte_cont(orig_pte))) {

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While, a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This is works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Acked-by: Ard Biesheuvel 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/Kconfig   |   9 +
>  arch/arm64/include/asm/pgtable.h | 167 ++
>  arch/arm64/mm/Makefile   |   1 +
>  arch/arm64/mm/contpte.c  | 285 +++
>  include/linux/efi.h  |   5 +
>  5 files changed, 467 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e8275a40afbd..5a7ac1f37bdc 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2229,6 +2229,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>   select UNWIND_TABLES
>   select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> + bool "Contiguous PTE mappings for user memory" if EXPERT
> + depends on TRANSPARENT_HUGEPAGE
> + default y
> + help
> +   When enabled, user mappings are configured using the PTE contiguous
> +   bit, for any mappings that meet the size and alignment requirements.
> +   This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 7336d40a893a..831099cfc96b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t 
> phys)
>   */
>  #define pte_valid_not_user(pte) \
>   ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | 
> PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)  (pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1128,6 +1132,167 @@ extern void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>   unsigned long addr, pte_t *ptep,
>   pte_t old_pte, pte_t new_pte);
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in 
> ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is 
> considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t

Re: [PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:58AM +, Ryan Roberts wrote:
> Split __flush_tlb_range() into __flush_tlb_range_nosync() +
> __flush_tlb_range(), in the same way as the existing flush_tlb_page()
> arrangement. This allows calling __flush_tlb_range_nosync() to elide the
> trailing DSB. Forthcoming "contpte" code will take advantage of this
> when clearing the young bit from a contiguous range of ptes.
> 
> Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
> has changed, but now aligns with the ordering of __flush_tlb_page(). It
> has been discussed that __flush_tlb_page() may be wrong though.
> Regardless, both will be resolved separately if needed.
> 
> Reviewed-by: David Hildenbrand 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/tlbflush.h | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h 
> b/arch/arm64/include/asm/tlbflush.h
> index 1deb5d789c2e..3b0e8248e1a4 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -422,7 +422,7 @@ do {  
> \
>  #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
>   __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, 
> kvm_lpa2_is_enabled());
>  
> -static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
>unsigned long start, unsigned long end,
>unsigned long stride, bool last_level,
>int tlb_level)
> @@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct 
> vm_area_struct *vma,
>   __flush_tlb_range_op(vae1is, start, pages, stride, asid,
>tlb_level, true, lpa2_is_enabled());
>  
> - dsb(ish);
>   mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
>  }
>  
> +static inline void __flush_tlb_range(struct vm_area_struct *vma,
> +  unsigned long start, unsigned long end,
> +  unsigned long stride, bool last_level,
> +  int tlb_level)
> +{
> + __flush_tlb_range_nosync(vma, start, end, stride,
> +  last_level, tlb_level);
> + dsb(ish);
> +}
> +
>  static inline void flush_tlb_range(struct vm_area_struct *vma,
>  unsigned long start, unsigned long end)
>  {
> -- 
> 2.25.1
>

Re: [PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:57AM +, Ryan Roberts wrote:
> Create a new layer for the in-table PTE manipulation APIs. For now, The
> existing API is prefixed with double underscore to become the
> arch-private API and the public API is just a simple wrapper that calls
> the private API.
> 
> The public API implementation will subsequently be used to transparently
> manipulate the contiguous bit where appropriate. But since there are
> already some contig-aware users (e.g. hugetlb, kernel mapper), we must
> first ensure those users use the private API directly so that the future
> contig-bit manipulations in the public API do not interfere with those
> existing uses.
> 
> The following APIs are treated this way:
> 
>  - ptep_get
>  - set_pte
>  - set_ptes
>  - pte_clear
>  - ptep_get_and_clear
>  - ptep_test_and_clear_young
>  - ptep_clear_flush_young
>  - ptep_set_wrprotect
>  - ptep_set_access_flags
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 83 +---
>  arch/arm64/kernel/efi.c  |  4 +-
>  arch/arm64/kernel/mte.c  |  2 +-
>  arch/arm64/kvm/guest.c   |  2 +-
>  arch/arm64/mm/fault.c| 12 ++---
>  arch/arm64/mm/fixmap.c   |  4 +-
>  arch/arm64/mm/hugetlbpage.c  | 40 +++
>  arch/arm64/mm/kasan_init.c   |  6 +--
>  arch/arm64/mm/mmu.c  | 14 +++---
>  arch/arm64/mm/pageattr.c |  6 +--
>  arch/arm64/mm/trans_pgd.c|  6 +--
>  11 files changed, 93 insertions(+), 86 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 9a2df85eb493..7336d40a893a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | 
> pgprot_val(prot))
>  
>  #define pte_none(pte)(!pte_val(pte))
> -#define pte_clear(mm,addr,ptep)  set_pte(ptep, __pte(0))
> +#define __pte_clear(mm, addr, ptep) \
> + __set_pte(ptep, __pte(0))
>  #define pte_page(pte)(pfn_to_page(pte_pfn(pte)))
>  
>  /*
> @@ -137,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   * so that we don't erroneously return false for pages that have been
>   * remapped as PROT_NONE but are yet to be flushed from the TLB.
>   * Note that we can't make any assumptions based on the state of the access
> - * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
> + * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
>   * TLB.
>   */
>  #define pte_accessible(mm, pte)  \
> @@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>   return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
>  }
>  
> -static inline void set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
>  {
>   WRITE_ONCE(*ptep, pte);
>  
> @@ -275,8 +276,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>   }
>  }
>  
> -#define ptep_get ptep_get
> -static inline pte_t ptep_get(pte_t *ptep)
> +static inline pte_t __ptep_get(pte_t *ptep)
>  {
>   return READ_ONCE(*ptep);
>  }
> @@ -308,7 +308,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>   if (!IS_ENABLED(CONFIG_DEBUG_VM))
>   return;
>  
> - old_pte = ptep_get(ptep);
> + old_pte = __ptep_get(ptep);
>  
>   if (!pte_valid(old_pte) || !pte_valid(pte))
>   return;
> @@ -317,7 +317,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>  
>   /*
>* Check for potential race with hardware updates of the pte
> -  * (ptep_set_access_flags safely changes valid ptes without going
> +  * (__ptep_set_access_flags safely changes valid ptes without going
>* through an invalid entry).
>*/
>   VM_WARN_ONCE(!pte_young(pte),
> @@ -363,23 +363,22 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
> long nr)
>   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>  }
>  
> -static inline void set_ptes(struct mm_struct *mm,
> - unsigned long __always_unused addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void __set_ptes(struct mm_struct *mm,
> +   unsigned long __always_unused addr,
> +   pte_t *ptep, pte_t pte, unsigned int nr)
>  {
>   page_table_check_ptes_set(mm, ptep, pte, nr);
>   __sync_cache_and_tags(pte, nr);
>  
>   for (;;) {
>   __check_safe_pte_update(mm, ptep, pte);
> - set_pte(ptep, pte);
> + __set_pte(ptep, pte);
>   if (--nr == 0)
>   break;
>   ptep++;
>

Re: [PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:56AM +, Ryan Roberts wrote:
> ptep_clear() is a generic wrapper around the arch-implemented
> ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
> a public version and private version (__ptep_get_and_clear()) to support
> the transparent contpte work. We won't have a private version of
> ptep_clear() so let's convert it to directly call ptep_get_and_clear().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/mm/hugetlbpage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 27f6160890d1..48e8b429879d 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -229,7 +229,7 @@ static void clear_flush(struct mm_struct *mm,
>   unsigned long i, saddr = addr;
>  
>   for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
> - ptep_clear(mm, addr, ptep);
> + ptep_get_and_clear(mm, addr, ptep);
>  
>   flush_tlb_range(, saddr, addr);
>  }
> -- 
> 2.25.1
>

Re: [PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:55AM +, Ryan Roberts wrote:
> Since set_ptes() was introduced, set_pte_at() has been implemented as a
> generic macro around set_ptes(..., 1). So this change should continue to
> generate the same code. However, making this change prepares us for the
> transparent contpte support. It means we can reroute set_ptes() to
> __set_ptes(). Since set_pte_at() is a generic macro, there will be no
> equivalent __set_pte_at() to reroute to.
> 
> Note that a couple of calls to set_pte_at() remain in the arch code.
> This is intentional, since those call sites are acting on behalf of
> core-mm and should continue to call into the public set_ptes() rather
> than the arch-private __set_ptes().
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h |  2 +-
>  arch/arm64/kernel/mte.c  |  2 +-
>  arch/arm64/kvm/guest.c   |  2 +-
>  arch/arm64/mm/fault.c|  2 +-
>  arch/arm64/mm/hugetlbpage.c  | 10 +-
>  5 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index de034ca40bad..9a2df85eb493 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1084,7 +1084,7 @@ static inline void arch_swap_restore(swp_entry_t entry, 
> struct folio *folio)
>  #endif /* CONFIG_ARM64_MTE */
>  
>  /*
> - * On AArch64, the cache coherency is handled via the set_pte_at() function.
> + * On AArch64, the cache coherency is handled via the set_ptes() function.
>   */
>  static inline void update_mmu_cache_range(struct vm_fault *vmf,
>   struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index a41ef3213e1e..59bfe2e96f8f 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
>   /*
>* If the page content is identical but at least one of the pages is
>* tagged, return non-zero to avoid KSM merging. If only one of the
> -  * pages is tagged, set_pte_at() may zero or change the tags of the
> +  * pages is tagged, set_ptes() may zero or change the tags of the
>* other page via mte_sync_tags().
>*/
>   if (page_mte_tagged(page1) || page_mte_tagged(page2))
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index aaf1d4939739..6e0df623c8e9 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
>   } else {
>   /*
>* Only locking to serialise with a concurrent
> -  * set_pte_at() in the VMM but still overriding the
> +  * set_ptes() in the VMM but still overriding the
>* tags, hence ignoring the return value.
>*/
>   try_page_mte_tagging(page);
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index a254761fa1bd..3235e23309ec 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
>   *
>   * It needs to cope with hardware update of the accessed/dirty state by other
>   * agents in the system and can safely skip the __sync_icache_dcache() call 
> as,
> - * like set_pte_at(), the PTE is never changed from no-exec to exec here.
> + * like set_ptes(), the PTE is never changed from no-exec to exec here.
>   *
>   * Returns whether or not the PTE actually changed.
>   */
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 2892f925ed66..27f6160890d1 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -247,12 +247,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned 
> long addr,
>  
>   if (!pte_present(pte)) {
>   for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
> - set_pte_at(mm, addr, ptep, pte);
> + set_ptes(mm, addr, ptep, pte, 1);
>   return;
>   }
>  
>   if (!pte_cont(pte)) {
> - set_pte_at(mm, addr, ptep, pte);
> + set_ptes(mm, addr, ptep, pte, 1);
>   return;
>   }
>  
> @@ -263,7 +263,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
> addr,
>   clear_flush(mm, addr, ptep, pgsize, ncontig);
>  
>   for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> - set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
> + set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
>  }
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -471,7 +471,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
>  
>   hugeprot = pte_pgprot(pte);
>

Re: [PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:54AM +, Ryan Roberts wrote:
> There are a number of places in the arch code that read a pte by using
> the READ_ONCE() macro. Refactor these call sites to instead use the
> ptep_get() helper, which itself is a READ_ONCE(). Generated code should
> be the same.
> 
> This will benefit us when we shortly introduce the transparent contpte
> support. In this case, ptep_get() will become more complex so we now
> have all the code abstracted through it.
> 
> Tested-by: John Hubbard 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 12 +---
>  arch/arm64/kernel/efi.c  |  2 +-
>  arch/arm64/mm/fault.c|  4 ++--
>  arch/arm64/mm/hugetlbpage.c  |  6 +++---
>  arch/arm64/mm/kasan_init.c   |  2 +-
>  arch/arm64/mm/mmu.c  | 12 ++--
>  arch/arm64/mm/pageattr.c |  4 ++--
>  arch/arm64/mm/trans_pgd.c|  2 +-
>  8 files changed, 25 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index b6d3e9e0a946..de034ca40bad 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -275,6 +275,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
>   }
>  }
>  
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> + return READ_ONCE(*ptep);
> +}
> +
>  extern void __sync_icache_dcache(pte_t pteval);
>  bool pgattr_change_is_safe(u64 old, u64 new);
>  
> @@ -302,7 +308,7 @@ static inline void __check_safe_pte_update(struct 
> mm_struct *mm, pte_t *ptep,
>   if (!IS_ENABLED(CONFIG_DEBUG_VM))
>   return;
>  
> - old_pte = READ_ONCE(*ptep);
> + old_pte = ptep_get(ptep);
>  
>   if (!pte_valid(old_pte) || !pte_valid(pte))
>   return;
> @@ -904,7 +910,7 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
>  {
>   pte_t old_pte, pte;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_mkold(pte);
> @@ -986,7 +992,7 @@ static inline void ptep_set_wrprotect(struct mm_struct 
> *mm, unsigned long addres
>  {
>   pte_t old_pte, pte;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   do {
>   old_pte = pte;
>   pte = pte_wrprotect(pte);
> diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
> index 0228001347be..d0e08e93b246 100644
> --- a/arch/arm64/kernel/efi.c
> +++ b/arch/arm64/kernel/efi.c
> @@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned 
> long addr, void *data)
>  {
>   struct set_perm_data *spd = data;
>   const efi_memory_desc_t *md = spd->md;
> - pte_t pte = READ_ONCE(*ptep);
> + pte_t pte = ptep_get(ptep);
>  
>   if (md->attribute & EFI_MEMORY_RO)
>   pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..a254761fa1bd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
>   if (!ptep)
>   break;
>  
> - pte = READ_ONCE(*ptep);
> + pte = ptep_get(ptep);
>   pr_cont(", pte=%016llx", pte_val(pte));
>   pte_unmap(ptep);
>   } while(0);
> @@ -214,7 +214,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
> pte_t entry, int dirty)
>  {
>   pteval_t old_pteval, pteval;
> - pte_t pte = READ_ONCE(*ptep);
> + pte_t pte = ptep_get(ptep);
>  
>   if (pte_same(pte, entry))
>   return 0;
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 6720ec8d50e7..2892f925ed66 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -485,7 +485,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
>   size_t pgsize;
>   pte_t pte;
>  
> - if (!pte_cont(READ_ONCE(*ptep))) {
> + if (!pte_cont(ptep_get(ptep))) {
>   ptep_set_wrprotect(mm, addr, ptep);
>   return;
>   }
> @@ -510,7 +510,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
>   size_t pgsize;
>   int ncontig;
>  
> - if (!pte_cont(READ_ONCE(*ptep)))
> + if (!pte_cont(ptep_get(ptep)))
>   return ptep_clear_flush(vma, addr, ptep);
>  
>   ncontig = find_num_contig(mm, addr, ptep, );
> @@ -543,7 +543,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct 
> *vma, unsigned long addr
>* when the permission changes from executable to non-executable
>* in cases where cpu is affected with errata #2645198.
>*/
> - if (pte_user_exec(READ_ONCE(*ptep)))
> + if (pte_user_exec(ptep_get(ptep)))
>   return huge_ptep_clear_flush(vma,

Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Mark Rutland

On Thu, Feb 15, 2024 at 10:31:51AM +, Ryan Roberts wrote:
> Core-mm needs to be able to advance the pfn by an arbitrary amount, so
> override the new pte_advance_pfn() API to do so.
> 
> Signed-off-by: Ryan Roberts 

Acked-by: Mark Rutland 

Mark.

> ---
>  arch/arm64/include/asm/pgtable.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 52d0b0a763f1..b6d3e9e0a946 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
>   return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
>  }
>  
> -#define pte_next_pfn pte_next_pfn
> -static inline pte_t pte_next_pfn(pte_t pte)
> +#define pte_advance_pfn pte_advance_pfn
> +static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>  {
> - return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
> + return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
>  }
>  
>  static inline void set_ptes(struct mm_struct *mm,
> @@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
>   if (--nr == 0)
>   break;
>   ptep++;
> - pte = pte_next_pfn(pte);
> + pte = pte_advance_pfn(pte, 1);
>   }
>  }
>  #define set_ptes set_ptes
> -- 
> 2.25.1
>

Re: [PATCH v2 2/2] powerpc/bpf: enable kfunc call

2024-02-15 Thread Naveen N Rao

On Tue, Feb 13, 2024 at 07:54:27AM +, Christophe Leroy wrote:
> 
> 
> Le 01/02/2024 à 18:12, Hari Bathini a écrit :
> > With module addresses supported, override bpf_jit_supports_kfunc_call()
> > to enable kfunc support. Module address offsets can be more than 32-bit
> > long, so override bpf_jit_supports_far_kfunc_call() to enable 64-bit
> > pointers.
> 
> What's the impact on PPC32 ? There are no 64-bit pointers on PPC32.

Looking at commit 1cf3bfc60f98 ("bpf: Support 64-bit pointers to 
kfuncs"), which added bpf_jit_supports_far_kfunc_call(), that does look 
to be very specific to platforms where module addresses are farther than 
s32. This is true for powerpc 64-bit, but shouldn't be needed for 
32-bit.

> 
> > 
> > Signed-off-by: Hari Bathini 
> > ---
> > 
> > * No changes since v1.
> > 
> > 
> >   arch/powerpc/net/bpf_jit_comp.c | 10 ++
> >   1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/powerpc/net/bpf_jit_comp.c 
> > b/arch/powerpc/net/bpf_jit_comp.c
> > index 7b4103b4c929..f896a4213696 100644
> > --- a/arch/powerpc/net/bpf_jit_comp.c
> > +++ b/arch/powerpc/net/bpf_jit_comp.c
> > @@ -359,3 +359,13 @@ void bpf_jit_free(struct bpf_prog *fp)
> >   
> > bpf_prog_unlock_free(fp);
> >   }
> > +
> > +bool bpf_jit_supports_kfunc_call(void)
> > +{
> > +   return true;
> > +}
> > +
> > +bool bpf_jit_supports_far_kfunc_call(void)
> > +{
> > +   return true;
> > +}

I am not sure there is value in keeping this as a separate patch since 
all support code for kfunc calls is introduced in an earlier patch.  
Consider folding this into the previous patch.

- Naveen

Re: [PATCH v6 06/18] mm: Tidy up pte_next_pfn() definition

2024-02-15 Thread David Hildenbrand


On 15.02.24 11:31, Ryan Roberts wrote:

Now that the all architecture overrides of pte_next_pfn() have been
replaced with pte_advance_pfn(), we can simplify the definition of the
generic pte_next_pfn() macro so that it is unconditionally defined.

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 2 --
  1 file changed, 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b7ac8358f2aa..bc005d84f764 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
  #define arch_flush_lazy_mmu_mode()do {} while (0)
  #endif
  
-#ifndef pte_next_pfn

  #ifndef pte_advance_pfn
  static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
@@ -221,7 +220,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
long nr)
  #endif
  
  #define pte_next_pfn(pte) pte_advance_pfn(pte, 1)

-#endif
  
  #ifndef set_ptes

  /**


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v6 05/18] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread David Hildenbrand


On 15.02.24 11:31, Ryan Roberts wrote:

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
  arch/x86/include/asm/pgtable.h | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b50b2ef63672..69ed0ea0641b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -955,13 +955,13 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
  }
  
-static inline pte_t pte_next_pfn(pte_t pte)

+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
if (__pte_needs_invert(pte_val(pte)))
-   return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) - (nr << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
  }
-#define pte_next_pfn   pte_next_pfn
+#define pte_advance_pfnpte_advance_pfn
  
  static inline int pte_present(pte_t a)

  {


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread David Hildenbrand


On 15.02.24 11:31, Ryan Roberts wrote:

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
  arch/arm64/include/asm/pgtable.h | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52d0b0a763f1..b6d3e9e0a946 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
  }
  
-#define pte_next_pfn pte_next_pfn

-static inline pte_t pte_next_pfn(pte_t pte)
+#define pte_advance_pfn pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
-   return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
  }
  
  static inline void set_ptes(struct mm_struct *mm,

@@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);



Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v6 03/18] mm: Introduce pte_advance_pfn() and use for pte_next_pfn()

2024-02-15 Thread David Hildenbrand


On 15.02.24 11:31, Ryan Roberts wrote:

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param. Define the default
implementation here and allow for architectures to override.
pte_next_pfn() becomes a wrapper around pte_advance_pfn().

Follow up commits will convert each overriding architecture's
pte_next_pfn() to pte_advance_pfn().

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 231370e1b80f..b7ac8358f2aa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,14 +212,17 @@ static inline int pmd_dirty(pmd_t pmd)
  #define arch_flush_lazy_mmu_mode()do {} while (0)
  #endif
  
-

  #ifndef pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
  }
  #endif
  
+#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)

+#endif
+
  #ifndef set_ptes
  /**
   * set_ptes - Map consecutive pages to a contiguous range of addresses.


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v2 1/2] powerpc/bpf: ensure module addresses are supported

2024-02-15 Thread Naveen N Rao

On Thu, Feb 01, 2024 at 10:42:48PM +0530, Hari Bathini wrote:
> Currently, bpf jit code on powerpc assumes all the bpf functions and
> helpers to be kernel text. This is false for kfunc case, as function
> addresses are mostly module addresses in that case. Ensure module
> addresses are supported to enable kfunc support.

I don't think that statement is entirely accurate. Our current 
assumptions are still correct in the sense that:
1. all BPF helpers are part of core kernel text, calls to which are 
generated in bpf_jit_emit_func_call_hlp().
2. bpf-to-bpf calls go out to module area where the JIT output of BPF 
subprogs are written to, handled by bpf_jit_emit_func_call_rel().

kfunc calls add another variant where BPF progs can call directly into a 
kernel function (similar to a BPF helper call), except for the fact that 
the said function could be in a kernel module.

> 
> Assume kernel text address for programs with no kfunc call to optimize
> instruction sequence in that case. Add a check to error out if this
> assumption ever changes in the future.
> 
> Signed-off-by: Hari Bathini 
> ---
> 
> Changes in v2:
> * Using bpf_prog_has_kfunc_call() to decide whether to use optimized
>   instruction sequence or not as suggested by Naveen.
> 
> 
>  arch/powerpc/net/bpf_jit.h|   5 +-
>  arch/powerpc/net/bpf_jit_comp.c   |   4 +-
>  arch/powerpc/net/bpf_jit_comp32.c |   8 ++-
>  arch/powerpc/net/bpf_jit_comp64.c | 109 --
>  4 files changed, 97 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
> index cdea5dccaefe..fc56ee0ee9c5 100644
> --- a/arch/powerpc/net/bpf_jit.h
> +++ b/arch/powerpc/net/bpf_jit.h
> @@ -160,10 +160,11 @@ static inline void bpf_clear_seen_register(struct 
> codegen_context *ctx, int i)
>  }
>  
>  void bpf_jit_init_reg_mapping(struct codegen_context *ctx);
> -int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct 
> codegen_context *ctx, u64 func);
> +int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct 
> codegen_context *ctx, u64 func,
> +bool has_kfunc_call);
>  int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, u32 *fimage, struct 
> codegen_context *ctx,
>  u32 *addrs, int pass, bool extra_pass);
> -void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
> +void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx, bool 
> has_kfunc_call);
>  void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
>  void bpf_jit_realloc_regs(struct codegen_context *ctx);
>  int bpf_jit_emit_exit_insn(u32 *image, struct codegen_context *ctx, int 
> tmp_reg, long exit_addr);
> diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
> index 0f9a21783329..7b4103b4c929 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -163,7 +163,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>* update ctgtx.idx as it pretends to output instructions, then we can
>* calculate total size from idx.
>*/
> - bpf_jit_build_prologue(NULL, );
> + bpf_jit_build_prologue(NULL, , bpf_prog_has_kfunc_call(fp));
>   addrs[fp->len] = cgctx.idx * 4;
>   bpf_jit_build_epilogue(NULL, );
>  
> @@ -192,7 +192,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
>   /* Now build the prologue, body code & epilogue for real. */
>   cgctx.idx = 0;
>   cgctx.alt_exit_addr = 0;
> - bpf_jit_build_prologue(code_base, );
> + bpf_jit_build_prologue(code_base, , 
> bpf_prog_has_kfunc_call(fp));
>   if (bpf_jit_build_body(fp, code_base, fcode_base, , 
> addrs, pass,
>  extra_pass)) {
>   bpf_arch_text_copy(>size, >size, 
> sizeof(hdr->size));
> diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
> b/arch/powerpc/net/bpf_jit_comp32.c
> index 2f39c50ca729..447747e51a58 100644
> --- a/arch/powerpc/net/bpf_jit_comp32.c
> +++ b/arch/powerpc/net/bpf_jit_comp32.c
> @@ -123,7 +123,7 @@ void bpf_jit_realloc_regs(struct codegen_context *ctx)
>   }
>  }
>  
> -void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx)
> +void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx, bool 
> has_kfunc_call)
>  {
>   int i;
>  
> @@ -201,7 +201,8 @@ void bpf_jit_build_epilogue(u32 *image, struct 
> codegen_context *ctx)
>  }
>  
>  /* Relative offset needs to be calculated based on final image location */
> -int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct 
> codegen_context *ctx, u64 func)
> +int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct 
> codegen_context *ctx, u64 func,
> +bool has_kfunc_call)
>  {
>   s32 rel = (s32)func - (s32)(fimage + ctx->idx);
>  
> @@ -1054,7 +1055,8 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,

Re: [PATCH v2 1/2] powerpc/bpf: ensure module addresses are supported

2024-02-15 Thread Hari Bathini





On 13/02/24 1:23 pm, Christophe Leroy wrote:



Le 01/02/2024 à 18:12, Hari Bathini a écrit :

Currently, bpf jit code on powerpc assumes all the bpf functions and
helpers to be kernel text. This is false for kfunc case, as function
addresses are mostly module addresses in that case. Ensure module
addresses are supported to enable kfunc support.

Assume kernel text address for programs with no kfunc call to optimize
instruction sequence in that case. Add a check to error out if this
assumption ever changes in the future.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Using bpf_prog_has_kfunc_call() to decide whether to use optimized
instruction sequence or not as suggested by Naveen.


   arch/powerpc/net/bpf_jit.h|   5 +-
   arch/powerpc/net/bpf_jit_comp.c   |   4 +-
   arch/powerpc/net/bpf_jit_comp32.c |   8 ++-
   arch/powerpc/net/bpf_jit_comp64.c | 109 --
   4 files changed, 97 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index cdea5dccaefe..fc56ee0ee9c5 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -160,10 +160,11 @@ static inline void bpf_clear_seen_register(struct 
codegen_context *ctx, int i)
   }
   
   void bpf_jit_init_reg_mapping(struct codegen_context *ctx);

-int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct codegen_context 
*ctx, u64 func);
+int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct codegen_context 
*ctx, u64 func,
+  bool has_kfunc_call);
   int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, u32 *fimage, struct 
codegen_context *ctx,
   u32 *addrs, int pass, bool extra_pass);
-void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx);
+void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx, bool 
has_kfunc_call);
   void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx);
   void bpf_jit_realloc_regs(struct codegen_context *ctx);
   int bpf_jit_emit_exit_insn(u32 *image, struct codegen_context *ctx, int 
tmp_reg, long exit_addr);
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 0f9a21783329..7b4103b4c929 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -163,7 +163,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 * update ctgtx.idx as it pretends to output instructions, then we can
 * calculate total size from idx.
 */
-   bpf_jit_build_prologue(NULL, );
+   bpf_jit_build_prologue(NULL, , bpf_prog_has_kfunc_call(fp));
addrs[fp->len] = cgctx.idx * 4;
bpf_jit_build_epilogue(NULL, );
   
@@ -192,7 +192,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)

/* Now build the prologue, body code & epilogue for real. */
cgctx.idx = 0;
cgctx.alt_exit_addr = 0;
-   bpf_jit_build_prologue(code_base, );
+   bpf_jit_build_prologue(code_base, , 
bpf_prog_has_kfunc_call(fp));
if (bpf_jit_build_body(fp, code_base, fcode_base, , 
addrs, pass,
   extra_pass)) {
bpf_arch_text_copy(>size, >size, 
sizeof(hdr->size));
diff --git a/arch/powerpc/net/bpf_jit_comp32.c 
b/arch/powerpc/net/bpf_jit_comp32.c
index 2f39c50ca729..447747e51a58 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -123,7 +123,7 @@ void bpf_jit_realloc_regs(struct codegen_context *ctx)
}
   }
   
-void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx)

+void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx, bool 
has_kfunc_call)
   {
int i;
   
@@ -201,7 +201,8 @@ void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)

   }
   
   /* Relative offset needs to be calculated based on final image location */

-int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct codegen_context 
*ctx, u64 func)
+int bpf_jit_emit_func_call_rel(u32 *image, u32 *fimage, struct codegen_context 
*ctx, u64 func,
+  bool has_kfunc_call)
   {
s32 rel = (s32)func - (s32)(fimage + ctx->idx);
   
@@ -1054,7 +1055,8 @@ int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, u32 *fimage, struct code

EMIT(PPC_RAW_STW(bpf_to_ppc(BPF_REG_5), _R1, 
12));
}
   
-			ret = bpf_jit_emit_func_call_rel(image, fimage, ctx, func_addr);

+   ret = bpf_jit_emit_func_call_rel(image, fimage, ctx, 
func_addr,
+
bpf_prog_has_kfunc_call(fp));
if (ret)
return ret;
   
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c

index 79f23974a320..385a5df1670c 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c

Re: [PATCH v2 2/2] powerpc/bpf: enable kfunc call

2024-02-15 Thread Hari Bathini





On 13/02/24 1:24 pm, Christophe Leroy wrote:



Le 01/02/2024 à 18:12, Hari Bathini a écrit :

With module addresses supported, override bpf_jit_supports_kfunc_call()
to enable kfunc support. Module address offsets can be more than 32-bit
long, so override bpf_jit_supports_far_kfunc_call() to enable 64-bit
pointers.


What's the impact on PPC32 ? There are no 64-bit pointers on PPC32.


Yeah. Not required to return true for PPC32 case and probably not a
good thing to claim support for far kfunc calls for PPC32. Changing to:

+bool bpf_jit_supports_far_kfunc_call(void)
+{
+   return IS_ENABLED(CONFIG_PPC64);
+}



Signed-off-by: Hari Bathini 
---

* No changes since v1.


   arch/powerpc/net/bpf_jit_comp.c | 10 ++
   1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 7b4103b4c929..f896a4213696 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -359,3 +359,13 @@ void bpf_jit_free(struct bpf_prog *fp)
   
   	bpf_prog_unlock_free(fp);

   }
+
+bool bpf_jit_supports_kfunc_call(void)
+{
+   return true;
+}
+
+bool bpf_jit_supports_far_kfunc_call(void)
+{
+   return true;
+}

[PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-15 Thread Ryan Roberts

Optimize the contpte implementation to fix some of the
exit/munmap/dontneed performance regression introduced by the initial
contpte commit. Subsequent patches will solve it entirely.

During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
cleared. Previously this was done 1 PTE at a time. But the core-mm
supports batched clear via the new [get_and_]clear_full_ptes() APIs. So
let's implement those APIs and for fully covered contpte mappings, we no
longer need to unfold the contpte. This significantly reduces unfolding
operations, reducing the number of tlbis that must be issued.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 67 
 arch/arm64/mm/contpte.c  | 17 
 2 files changed, 84 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8643227c318b..a8f1a35e3086 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -965,6 +965,37 @@ static inline pte_t __ptep_get_and_clear(struct mm_struct 
*mm,
return pte;
 }
 
+static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   for (;;) {
+   __ptep_get_and_clear(mm, addr, ptep);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+
+static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = __ptep_get_and_clear(mm, addr, ptep);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
@@ -1160,6 +1191,11 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t 
orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);
+extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full);
+extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1253,6 +1289,35 @@ static inline void pte_clear(struct mm_struct *mm,
__pte_clear(mm, addr, ptep);
 }
 
+#define clear_full_ptes clear_full_ptes
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   if (likely(nr == 1)) {
+   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+   __clear_full_ptes(mm, addr, ptep, nr, full);
+   } else {
+   contpte_clear_full_ptes(mm, addr, ptep, nr, full);
+   }
+}
+
+#define get_and_clear_full_ptes get_and_clear_full_ptes
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep,
+   unsigned int nr, int full)
+{
+   pte_t pte;
+
+   if (likely(nr == 1)) {
+   contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
+   pte = __get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+   } else {
+   pte = contpte_get_and_clear_full_ptes(mm, addr, ptep, nr, full);
+   }
+
+   return pte;
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
@@ -1337,6 +1402,8 @@ static inline int ptep_set_access_flags(struct 
vm_area_struct *vma,
 #define set_pte__set_pte
 #define set_ptes   __set_ptes
 #define pte_clear  __pte_clear
+#define clear_full_ptes__clear_full_ptes
+#define get_and_clear_full_ptes
__get_and_clear_full_ptes
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear __ptep_get_and_clear
 #define

[PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-15 Thread Ryan Roberts

Optimize the contpte implementation to fix some of the fork performance
regression introduced by the initial contpte commit. Subsequent patches
will solve it entirely.

During fork(), any private memory in the parent must be write-protected.
Previously this was done 1 PTE at a time. But the core-mm supports
batched wrprotect via the new wrprotect_ptes() API. So let's implement
that API and for fully covered contpte mappings, we no longer need to
unfold the contpte. This has 2 benefits:

  - reduced unfolding, reduces the number of tlbis that must be issued.
  - The memory remains contpte-mapped ("folded") in the parent, so it
continues to benefit from the more efficient use of the TLB after
the fork.

The optimization to wrprotect a whole contpte block without unfolding is
possible thanks to the tightening of the Arm ARM in respect to the
definition and behaviour when 'Misprogramming the Contiguous bit'. See
section D21194 at https://developer.arm.com/documentation/102105/ja-07/

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 61 ++--
 arch/arm64/mm/contpte.c  | 38 
 2 files changed, 89 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 831099cfc96b..8643227c318b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -978,16 +978,12 @@ static inline pmd_t pmdp_huge_get_and_clear(struct 
mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-/*
- * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
- * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
- */
-static inline void __ptep_set_wrprotect(struct mm_struct *mm,
-   unsigned long address, pte_t *ptep)
+static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
+   unsigned long address, pte_t *ptep,
+   pte_t pte)
 {
-   pte_t old_pte, pte;
+   pte_t old_pte;
 
-   pte = __ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
@@ -996,6 +992,25 @@ static inline void __ptep_set_wrprotect(struct mm_struct 
*mm,
} while (pte_val(pte) != pte_val(old_pte));
 }
 
+/*
+ * __ptep_set_wrprotect - mark read-only while trasferring potential hardware
+ * dirty status (PTE_DBM && !PTE_RDONLY) to the software PTE_DIRTY bit.
+ */
+static inline void __ptep_set_wrprotect(struct mm_struct *mm,
+   unsigned long address, pte_t *ptep)
+{
+   ___ptep_set_wrprotect(mm, address, ptep, __ptep_get(ptep));
+}
+
+static inline void __wrprotect_ptes(struct mm_struct *mm, unsigned long 
address,
+   pte_t *ptep, unsigned int nr)
+{
+   unsigned int i;
+
+   for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+   __ptep_set_wrprotect(mm, address, ptep);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
@@ -1149,6 +1164,8 @@ extern int contpte_ptep_test_and_clear_young(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
+extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr);
 extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
@@ -1268,12 +1285,35 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
return contpte_ptep_clear_flush_young(vma, addr, ptep);
 }
 
+#define wrprotect_ptes wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr)
+{
+   if (likely(nr == 1)) {
+   /*
+* Optimization: wrprotect_ptes() can only be called for present
+* ptes so we only need to check contig bit as condition for
+* unfold, and we can remove the contig bit from the pte we read
+* to avoid re-reading. This speeds up fork() which is sensitive
+* for order-0 folios. Equivalent to contpte_try_unfold().
+*/
+   pte_t orig_pte = __ptep_get(ptep);
+
+   if (unlikely(pte_cont(orig_pte))) {
+   __contpte_try_unfold(mm, addr, ptep, orig_pte);
+   orig_pte = pte_mknoncont(orig_pte);
+   }
+   ___ptep_set_wrprotect(mm, addr, ptep, orig_pte);
+   } else {
+

[PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-15 Thread Ryan Roberts

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
modification of individual PTEs will cause an "unfold" operation to
repaint the contpte block as individual PTEs before performing the
requested operation. While, a modification of a single PTE could cause
the block of PTEs to which it belongs to become eligible for "folding"
into a contpte entry, "folding" is not performed in this initial
implementation due to the costs of checking the requirements are met.
Due to this, contpte mappings will degrade back to normal pte mappings
over time if/when protections are changed. This will be solved in a
future patch.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This is works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Acked-by: Ard Biesheuvel 
Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/Kconfig   |   9 +
 arch/arm64/include/asm/pgtable.h | 167 ++
 arch/arm64/mm/Makefile   |   1 +
 arch/arm64/mm/contpte.c  | 285 +++
 include/linux/efi.h  |   5 +
 5 files changed, 467 insertions(+)
 create mode 100644 arch/arm64/mm/contpte.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e8275a40afbd..5a7ac1f37bdc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2229,6 +2229,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
select UNWIND_TABLES
select DYNAMIC_SCS
 
+config ARM64_CONTPTE
+   bool "Contiguous PTE mappings for user memory" if EXPERT
+   depends on TRANSPARENT_HUGEPAGE
+   default y
+   help
+ When enabled, user mappings are configured using the PTE contiguous
+ bit, for any mappings that meet the size and alignment requirements.
+ This reduces TLB pressure and improves performance.
+
 endmenu # "Kernel Features"
 
 menu "Boot options"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7336d40a893a..831099cfc96b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  */
 #define pte_valid_not_user(pte) \
((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | 
PTE_UXN))
+/*
+ * Returns true if the pte is valid and has the contiguous bit set.
+ */
+#define pte_valid_cont(pte)(pte_valid(pte) && pte_cont(pte))
 /*
  * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
  * so that we don't erroneously return false for pages that have been
@@ -1128,6 +1132,167 @@ extern void ptep_modify_prot_commit(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t old_pte, pte_t new_pte);
 
+#ifdef CONFIG_ARM64_CONTPTE
+
+/*
+ * The contpte APIs are used to transparently manage the contiguous bit in ptes
+ * where it is possible and makes sense to do so. The PTE_CONT bit is 
considered
+ * a private implementation detail of the public ptep API (see below).
+ */
+extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte);
+extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
+extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
+extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte, unsigned int nr);
+extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep);
+extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
+

[PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Ryan Roberts

ptep_clear() is a generic wrapper around the arch-implemented
ptep_get_and_clear(). We are about to convert ptep_get_and_clear() into
a public version and private version (__ptep_get_and_clear()) to support
the transparent contpte work. We won't have a private version of
ptep_clear() so let's convert it to directly call ptep_get_and_clear().

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 27f6160890d1..48e8b429879d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -229,7 +229,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;
 
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-   ptep_clear(mm, addr, ptep);
+   ptep_get_and_clear(mm, addr, ptep);
 
flush_tlb_range(, saddr, addr);
 }
-- 
2.25.1

[PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Ryan Roberts

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.

Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs()
has changed, but now aligns with the ordering of __flush_tlb_page(). It
has been discussed that __flush_tlb_page() may be wrong though.
Regardless, both will be resolved separately if needed.

Reviewed-by: David Hildenbrand 
Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/tlbflush.h | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h 
b/arch/arm64/include/asm/tlbflush.h
index 1deb5d789c2e..3b0e8248e1a4 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -422,7 +422,7 @@ do {
\
 #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, 
kvm_lpa2_is_enabled());
 
-static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 unsigned long start, unsigned long end,
 unsigned long stride, bool last_level,
 int tlb_level)
@@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct 
vm_area_struct *vma,
__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 tlb_level, true, lpa2_is_enabled());
 
-   dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+unsigned long start, unsigned long end,
+unsigned long stride, bool last_level,
+int tlb_level)
+{
+   __flush_tlb_range_nosync(vma, start, end, stride,
+last_level, tlb_level);
+   dsb(ish);
+}
+
 static inline void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
 {
-- 
2.25.1

[PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Ryan Roberts

Create a new layer for the in-table PTE manipulation APIs. For now, The
existing API is prefixed with double underscore to become the
arch-private API and the public API is just a simple wrapper that calls
the private API.

The public API implementation will subsequently be used to transparently
manipulate the contiguous bit where appropriate. But since there are
already some contig-aware users (e.g. hugetlb, kernel mapper), we must
first ensure those users use the private API directly so that the future
contig-bit manipulations in the public API do not interfere with those
existing uses.

The following APIs are treated this way:

 - ptep_get
 - set_pte
 - set_ptes
 - pte_clear
 - ptep_get_and_clear
 - ptep_test_and_clear_young
 - ptep_clear_flush_young
 - ptep_set_wrprotect
 - ptep_set_access_flags

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 83 +---
 arch/arm64/kernel/efi.c  |  4 +-
 arch/arm64/kernel/mte.c  |  2 +-
 arch/arm64/kvm/guest.c   |  2 +-
 arch/arm64/mm/fault.c| 12 ++---
 arch/arm64/mm/fixmap.c   |  4 +-
 arch/arm64/mm/hugetlbpage.c  | 40 +++
 arch/arm64/mm/kasan_init.c   |  6 +--
 arch/arm64/mm/mmu.c  | 14 +++---
 arch/arm64/mm/pageattr.c |  6 +--
 arch/arm64/mm/trans_pgd.c|  6 +--
 11 files changed, 93 insertions(+), 86 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9a2df85eb493..7336d40a893a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -93,7 +93,8 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | 
pgprot_val(prot))
 
 #define pte_none(pte)  (!pte_val(pte))
-#define pte_clear(mm,addr,ptep)set_pte(ptep, __pte(0))
+#define __pte_clear(mm, addr, ptep) \
+   __set_pte(ptep, __pte(0))
 #define pte_page(pte)  (pfn_to_page(pte_pfn(pte)))
 
 /*
@@ -137,7 +138,7 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
  * so that we don't erroneously return false for pages that have been
  * remapped as PROT_NONE but are yet to be flushed from the TLB.
  * Note that we can't make any assumptions based on the state of the access
- * flag, since ptep_clear_flush_young() elides a DSB when invalidating the
+ * flag, since __ptep_clear_flush_young() elides a DSB when invalidating the
  * TLB.
  */
 #define pte_accessible(mm, pte)\
@@ -261,7 +262,7 @@ static inline pte_t pte_mkdevmap(pte_t pte)
return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
 {
WRITE_ONCE(*ptep, pte);
 
@@ -275,8 +276,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
}
 }
 
-#define ptep_get ptep_get
-static inline pte_t ptep_get(pte_t *ptep)
+static inline pte_t __ptep_get(pte_t *ptep)
 {
return READ_ONCE(*ptep);
 }
@@ -308,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct 
*mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;
 
-   old_pte = ptep_get(ptep);
+   old_pte = __ptep_get(ptep);
 
if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -317,7 +317,7 @@ static inline void __check_safe_pte_update(struct mm_struct 
*mm, pte_t *ptep,
 
/*
 * Check for potential race with hardware updates of the pte
-* (ptep_set_access_flags safely changes valid ptes without going
+* (__ptep_set_access_flags safely changes valid ptes without going
 * through an invalid entry).
 */
VM_WARN_ONCE(!pte_young(pte),
@@ -363,23 +363,22 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
long nr)
return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
-static inline void set_ptes(struct mm_struct *mm,
-   unsigned long __always_unused addr,
-   pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
 {
page_table_check_ptes_set(mm, ptep, pte, nr);
__sync_cache_and_tags(pte, nr);
 
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
-   set_pte(ptep, pte);
+   __set_pte(ptep, pte);
if (--nr == 0)
break;
ptep++;
pte = pte_advance_pfn(pte, 1);
}
 }
-#define set_ptes set_ptes
 
 /*
  * Huge pte definitions.
@@ -546,7 +545,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
 {
__sync_cache_and_tags(pte, nr);
__check_safe_pte_update(mm, ptep, pte);
-

[PATCH v6 15/18] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-15 Thread Ryan Roberts

Some architectures (e.g. arm64) can tell from looking at a pte, if some
follow-on ptes also map contiguous physical memory with the same pgprot.
(for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
the changes are optimized out and the behaviour is as before.

arm64 will opt-in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Acked-by: David Hildenbrand 
Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 21 +
 mm/memory.c | 19 ---
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bc005d84f764..a36cf4e124b0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,27 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
+#ifndef pte_batch_hint
+/**
+ * pte_batch_hint - Number of pages that can be added to batch without 
scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.
+ *
+ * An architecture implementation may ignore the PTE accessed state. Further,
+ * the dirty state must apply atomically to all the PTEs described by the hint.
+ *
+ * May be overridden by the architecture, else pte_batch_hint is always 1.
+ */
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+   return 1;
+}
+#endif
+
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 3b8e56eb08a3..4dd8e35b593a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,16 +988,20 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), 
flags);
-   pte_t *ptep = start_ptep + 1;
+   pte_t expected_pte, *ptep;
bool writable;
+   int nr;
 
if (any_writable)
*any_writable = false;
 
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
-   while (ptep != end_ptep) {
+   nr = pte_batch_hint(start_ptep, pte);
+   expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), 
flags);
+   ptep = start_ptep + nr;
+
+   while (ptep < end_ptep) {
pte = ptep_get(ptep);
if (any_writable)
writable = !!pte_write(pte);
@@ -1011,17 +1015,18 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 * corner cases the next PFN might fall into a different
 * folio.
 */
-   if (pte_pfn(pte) == folio_end_pfn)
+   if (pte_pfn(pte) >= folio_end_pfn)
break;
 
if (any_writable)
*any_writable |= writable;
 
-   expected_pte = pte_next_pfn(expected_pte);
-   ptep++;
+   nr = pte_batch_hint(ptep, pte);
+   expected_pte = pte_advance_pfn(expected_pte, nr);
+   ptep += nr;
}
 
-   return ptep - start_ptep;
+   return min(ptep - start_ptep, max_nr);
 }
 
 /*
-- 
2.25.1

[PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf

2024-02-15 Thread Ryan Roberts

As set_ptes() and wrprotect_ptes() become a bit more complex, the
compiler may choose not to inline them. But this is critical for fork()
performance. So mark the functions, along with contpte_try_unfold()
which is called by them, as __always_inline. This is worth ~1% on the
fork() microbenchmark with order-0 folios (the common case).

Acked-by: Mark Rutland 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d759a20d2929..8310875133ff 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1206,8 +1206,8 @@ extern int contpte_ptep_set_access_flags(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
 
-static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
-   pte_t *ptep, pte_t pte)
+static __always_inline void contpte_try_unfold(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, pte_t pte)
 {
if (unlikely(pte_valid_cont(pte)))
__contpte_try_unfold(mm, addr, ptep, pte);
@@ -1278,7 +1278,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
 }
 
 #define set_ptes set_ptes
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr)
 {
pte = pte_mknoncont(pte);
@@ -1360,8 +1360,8 @@ static inline int ptep_clear_flush_young(struct 
vm_area_struct *vma,
 }
 
 #define wrprotect_ptes wrprotect_ptes
-static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
-   pte_t *ptep, unsigned int nr)
+static __always_inline void wrprotect_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int 
nr)
 {
if (likely(nr == 1)) {
/*
-- 
2.25.1

[PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-15 Thread Ryan Roberts

There are situations where a change to a single PTE could cause the
contpte block in which it resides to become foldable (i.e. could be
repainted with the contiguous bit). Such situations arise, for example,
when user space temporarily changes protections, via mprotect, for
individual pages, such can be the case for certain garbage collectors.

We would like to detect when such a PTE change occurs. However this can
be expensive due to the amount of checking required. Therefore only
perform the checks when an indiviual PTE is modified via mprotect
(ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
when we are setting the final PTE in a contpte-aligned block.

Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 26 +
 arch/arm64/mm/contpte.c  | 64 
 2 files changed, 90 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 8310875133ff..401087e8a43d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1185,6 +1185,8 @@ extern void ptep_modify_prot_commit(struct vm_area_struct 
*vma,
  * where it is possible and makes sense to do so. The PTE_CONT bit is 
considered
  * a private implementation detail of the public ptep API (see below).
  */
+extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte);
 extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte);
 extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
@@ -1206,6 +1208,29 @@ extern int contpte_ptep_set_access_flags(struct 
vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t entry, int dirty);
 
+static __always_inline void contpte_try_fold(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, pte_t pte)
+{
+   /*
+* Only bother trying if both the virtual and physical addresses are
+* aligned and correspond to the last entry in a contig range. The core
+* code mostly modifies ranges from low to high, so this is the likely
+* the last modification in the contig range, so a good time to fold.
+* We can't fold special mappings, because there is no associated folio.
+*/
+
+   const unsigned long contmask = CONT_PTES - 1;
+   bool valign = ((addr >> PAGE_SHIFT) & contmask) == contmask;
+
+   if (unlikely(valign)) {
+   bool palign = (pte_pfn(pte) & contmask) == contmask;
+
+   if (unlikely(palign &&
+   pte_valid(pte) && !pte_cont(pte) && !pte_special(pte)))
+   __contpte_try_fold(mm, addr, ptep, pte);
+   }
+}
+
 static __always_inline void contpte_try_unfold(struct mm_struct *mm,
unsigned long addr, pte_t *ptep, pte_t pte)
 {
@@ -1286,6 +1311,7 @@ static __always_inline void set_ptes(struct mm_struct 
*mm, unsigned long addr,
if (likely(nr == 1)) {
contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
__set_ptes(mm, addr, ptep, pte, 1);
+   contpte_try_fold(mm, addr, ptep, pte);
} else {
contpte_set_ptes(mm, addr, ptep, pte, nr);
}
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 50e0173dc5ee..16788f07716d 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -73,6 +73,70 @@ static void contpte_convert(struct mm_struct *mm, unsigned 
long addr,
__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
 }
 
+void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, pte_t pte)
+{
+   /*
+* We have already checked that the virtual and pysical addresses are
+* correctly aligned for a contpte mapping in contpte_try_fold() so the
+* remaining checks are to ensure that the contpte range is fully
+* covered by a single folio, and ensure that all the ptes are valid
+* with contiguous PFNs and matching prots. We ignore the state of the
+* access and dirty bits for the purpose of deciding if its a contiguous
+* range; the folding process will generate a single contpte entry which
+* has a single access and dirty bit. Those 2 bits are the logical OR of
+* their respective bits in the constituent pte entries. In order to
+* ensure the contpte range is covered by a single folio, we must
+* recover the folio from the pfn, but special mappings don't have a
+* folio backing them. Fortunately contpte_try_fold() already checked
+* that the pte is not special - we never try to fold special mappings.
+* Note we can't use vm_normal_page() for this since we don't have the
+* vma.
+*/
+
+   unsigned

[PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()

2024-02-15 Thread Ryan Roberts

When core code iterates over a range of ptes and calls ptep_get() for
each of them, if the range happens to cover contpte mappings, the number
of pte reads becomes amplified by a factor of the number of PTEs in a
contpte block. This is because for each call to ptep_get(), the
implementation must read all of the ptes in the contpte block to which
it belongs to gather the access and dirty bits.

This causes a hotspot for fork(), as well as operations that unmap
memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
can fix this by implementing pte_batch_hint() which allows their
iterators to skip getting the contpte tail ptes when gathering the batch
of ptes to operate on. This results in the number of PTE reads returning
to 1 per pte.

Acked-by: Mark Rutland 
Reviewed-by: David Hildenbrand 
Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a8f1a35e3086..d759a20d2929 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1213,6 +1213,15 @@ static inline void contpte_try_unfold(struct mm_struct 
*mm, unsigned long addr,
__contpte_try_unfold(mm, addr, ptep, pte);
 }
 
+#define pte_batch_hint pte_batch_hint
+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+   if (!pte_valid_cont(pte))
+   return 1;
+
+   return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
+}
+
 /*
  * The below functions constitute the public API that arm64 presents to the
  * core-mm to manipulate PTE entries within their page tables (or at least this
-- 
2.25.1

[PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Ryan Roberts

Since set_ptes() was introduced, set_pte_at() has been implemented as a
generic macro around set_ptes(..., 1). So this change should continue to
generate the same code. However, making this change prepares us for the
transparent contpte support. It means we can reroute set_ptes() to
__set_ptes(). Since set_pte_at() is a generic macro, there will be no
equivalent __set_pte_at() to reroute to.

Note that a couple of calls to set_pte_at() remain in the arch code.
This is intentional, since those call sites are acting on behalf of
core-mm and should continue to call into the public set_ptes() rather
than the arch-private __set_ptes().

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/kernel/mte.c  |  2 +-
 arch/arm64/kvm/guest.c   |  2 +-
 arch/arm64/mm/fault.c|  2 +-
 arch/arm64/mm/hugetlbpage.c  | 10 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de034ca40bad..9a2df85eb493 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1084,7 +1084,7 @@ static inline void arch_swap_restore(swp_entry_t entry, 
struct folio *folio)
 #endif /* CONFIG_ARM64_MTE */
 
 /*
- * On AArch64, the cache coherency is handled via the set_pte_at() function.
+ * On AArch64, the cache coherency is handled via the set_ptes() function.
  */
 static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..59bfe2e96f8f 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -67,7 +67,7 @@ int memcmp_pages(struct page *page1, struct page *page2)
/*
 * If the page content is identical but at least one of the pages is
 * tagged, return non-zero to avoid KSM merging. If only one of the
-* pages is tagged, set_pte_at() may zero or change the tags of the
+* pages is tagged, set_ptes() may zero or change the tags of the
 * other page via mte_sync_tags().
 */
if (page_mte_tagged(page1) || page_mte_tagged(page2))
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index aaf1d4939739..6e0df623c8e9 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -1072,7 +1072,7 @@ int kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm,
} else {
/*
 * Only locking to serialise with a concurrent
-* set_pte_at() in the VMM but still overriding the
+* set_ptes() in the VMM but still overriding the
 * tags, hence ignoring the return value.
 */
try_page_mte_tagging(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a254761fa1bd..3235e23309ec 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -205,7 +205,7 @@ static void show_pte(unsigned long addr)
  *
  * It needs to cope with hardware update of the accessed/dirty state by other
  * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like set_pte_at(), the PTE is never changed from no-exec to exec here.
+ * like set_ptes(), the PTE is never changed from no-exec to exec here.
  *
  * Returns whether or not the PTE actually changed.
  */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 2892f925ed66..27f6160890d1 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -247,12 +247,12 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr,
 
if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
-   set_pte_at(mm, addr, ptep, pte);
+   set_ptes(mm, addr, ptep, pte, 1);
return;
}
 
if (!pte_cont(pte)) {
-   set_pte_at(mm, addr, ptep, pte);
+   set_ptes(mm, addr, ptep, pte, 1);
return;
}
 
@@ -263,7 +263,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr,
clear_flush(mm, addr, ptep, pgsize, ncontig);
 
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-   set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+   set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -471,7 +471,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 
hugeprot = pte_pgprot(pte);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
-   set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
+   set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
 
return 1;
 }
@@ -500,7 +500,7 @@

[PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Ryan Roberts

There are a number of places in the arch code that read a pte by using
the READ_ONCE() macro. Refactor these call sites to instead use the
ptep_get() helper, which itself is a READ_ONCE(). Generated code should
be the same.

This will benefit us when we shortly introduce the transparent contpte
support. In this case, ptep_get() will become more complex so we now
have all the code abstracted through it.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 12 +---
 arch/arm64/kernel/efi.c  |  2 +-
 arch/arm64/mm/fault.c|  4 ++--
 arch/arm64/mm/hugetlbpage.c  |  6 +++---
 arch/arm64/mm/kasan_init.c   |  2 +-
 arch/arm64/mm/mmu.c  | 12 ++--
 arch/arm64/mm/pageattr.c |  4 ++--
 arch/arm64/mm/trans_pgd.c|  2 +-
 8 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b6d3e9e0a946..de034ca40bad 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -275,6 +275,12 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
}
 }
 
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
+{
+   return READ_ONCE(*ptep);
+}
+
 extern void __sync_icache_dcache(pte_t pteval);
 bool pgattr_change_is_safe(u64 old, u64 new);
 
@@ -302,7 +308,7 @@ static inline void __check_safe_pte_update(struct mm_struct 
*mm, pte_t *ptep,
if (!IS_ENABLED(CONFIG_DEBUG_VM))
return;
 
-   old_pte = READ_ONCE(*ptep);
+   old_pte = ptep_get(ptep);
 
if (!pte_valid(old_pte) || !pte_valid(pte))
return;
@@ -904,7 +910,7 @@ static inline int __ptep_test_and_clear_young(pte_t *ptep)
 {
pte_t old_pte, pte;
 
-   pte = READ_ONCE(*ptep);
+   pte = ptep_get(ptep);
do {
old_pte = pte;
pte = pte_mkold(pte);
@@ -986,7 +992,7 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, 
unsigned long addres
 {
pte_t old_pte, pte;
 
-   pte = READ_ONCE(*ptep);
+   pte = ptep_get(ptep);
do {
old_pte = pte;
pte = pte_wrprotect(pte);
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..d0e08e93b246 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -103,7 +103,7 @@ static int __init set_permissions(pte_t *ptep, unsigned 
long addr, void *data)
 {
struct set_perm_data *spd = data;
const efi_memory_desc_t *md = spd->md;
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte = ptep_get(ptep);
 
if (md->attribute & EFI_MEMORY_RO)
pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..a254761fa1bd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -191,7 +191,7 @@ static void show_pte(unsigned long addr)
if (!ptep)
break;
 
-   pte = READ_ONCE(*ptep);
+   pte = ptep_get(ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
} while(0);
@@ -214,7 +214,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
  pte_t entry, int dirty)
 {
pteval_t old_pteval, pteval;
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte = ptep_get(ptep);
 
if (pte_same(pte, entry))
return 0;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6720ec8d50e7..2892f925ed66 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -485,7 +485,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;
 
-   if (!pte_cont(READ_ONCE(*ptep))) {
+   if (!pte_cont(ptep_get(ptep))) {
ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -510,7 +510,7 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
size_t pgsize;
int ncontig;
 
-   if (!pte_cont(READ_ONCE(*ptep)))
+   if (!pte_cont(ptep_get(ptep)))
return ptep_clear_flush(vma, addr, ptep);
 
ncontig = find_num_contig(mm, addr, ptep, );
@@ -543,7 +543,7 @@ pte_t huge_ptep_modify_prot_start(struct vm_area_struct 
*vma, unsigned long addr
 * when the permission changes from executable to non-executable
 * in cases where cpu is affected with errata #2645198.
 */
-   if (pte_user_exec(READ_ONCE(*ptep)))
+   if (pte_user_exec(ptep_get(ptep)))
return huge_ptep_clear_flush(vma, addr, ptep);
}
return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 4c7ad574b946..c2a9f4f6c7dd 100644
--- a/arch/arm64/mm/kasan_init.c
+++

[PATCH v6 05/18] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Ryan Roberts

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
 arch/x86/include/asm/pgtable.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b50b2ef63672..69ed0ea0641b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -955,13 +955,13 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
 }
 
-static inline pte_t pte_next_pfn(pte_t pte)
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
if (__pte_needs_invert(pte_val(pte)))
-   return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) - (nr << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
-#define pte_next_pfn   pte_next_pfn
+#define pte_advance_pfnpte_advance_pfn
 
 static inline int pte_present(pte_t a)
 {
-- 
2.25.1

[PATCH v6 06/18] mm: Tidy up pte_next_pfn() definition

2024-02-15 Thread Ryan Roberts

Now that the all architecture overrides of pte_next_pfn() have been
replaced with pte_advance_pfn(), we can simplify the definition of the
generic pte_next_pfn() macro so that it is unconditionally defined.

Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b7ac8358f2aa..bc005d84f764 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
-#ifndef pte_next_pfn
 #ifndef pte_advance_pfn
 static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
@@ -221,7 +220,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
long nr)
 #endif
 
 #define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
-#endif
 
 #ifndef set_ptes
 /**
-- 
2.25.1

[PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Ryan Roberts

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
 arch/arm64/include/asm/pgtable.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52d0b0a763f1..b6d3e9e0a946 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
 }
 
-#define pte_next_pfn pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#define pte_advance_pfn pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
-   return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
 }
 
 static inline void set_ptes(struct mm_struct *mm,
@@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);
}
 }
 #define set_ptes set_ptes
-- 
2.25.1

[PATCH v6 03/18] mm: Introduce pte_advance_pfn() and use for pte_next_pfn()

2024-02-15 Thread Ryan Roberts

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param. Define the default
implementation here and allow for architectures to override.
pte_next_pfn() becomes a wrapper around pte_advance_pfn().

Follow up commits will convert each overriding architecture's
pte_next_pfn() to pte_advance_pfn().

Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 231370e1b80f..b7ac8358f2aa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,14 +212,17 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode() do {} while (0)
 #endif
 
-
 #ifndef pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
 {
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
 }
 #endif
 
+#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
+#endif
+
 #ifndef set_ptes
 /**
  * set_ptes - Map consecutive pages to a contiguous range of addresses.
-- 
2.25.1

[PATCH v6 02/18] mm: thp: Batch-collapse PMD with set_ptes()

2024-02-15 Thread Ryan Roberts

Refactor __split_huge_pmd_locked() so that a present PMD can be
collapsed to PTEs in a single batch using set_ptes().

This should improve performance a little bit, but the real motivation is
to remove the need for the arm64 backend to have to fold the contpte
entries. Instead, since the ptes are set as a batch, the contpte blocks
can be initially set up pre-folded (once the arm64 contpte support is
added in the next few patches). This leads to noticeable performance
improvement during split.

Acked-by: David Hildenbrand 
Signed-off-by: Ryan Roberts 
---
 mm/huge_memory.c | 58 +++-
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 016e20bd813e..14888b15121e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2579,15 +2579,16 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
 
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte);
-   for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
-   pte_t entry;
-   /*
-* Note that NUMA hinting access restrictions are not
-* transferred to avoid any possibility of altering
-* permissions across VMAs.
-*/
-   if (freeze || pmd_migration) {
+
+   /*
+* Note that NUMA hinting access restrictions are not transferred to
+* avoid any possibility of altering permissions across VMAs.
+*/
+   if (freeze || pmd_migration) {
+   for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += 
PAGE_SIZE) {
+   pte_t entry;
swp_entry_t swp_entry;
+
if (write)
swp_entry = make_writable_migration_entry(
page_to_pfn(page + i));
@@ -2606,25 +2607,32 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_swp_mkuffd_wp(entry);
-   } else {
-   entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-   if (write)
-   entry = pte_mkwrite(entry, vma);
-   if (!young)
-   entry = pte_mkold(entry);
-   /* NOTE: this may set soft-dirty too on some archs */
-   if (dirty)
-   entry = pte_mkdirty(entry);
-   if (soft_dirty)
-   entry = pte_mksoft_dirty(entry);
-   if (uffd_wp)
-   entry = pte_mkuffd_wp(entry);
+
+   VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+   set_pte_at(mm, addr, pte + i, entry);
}
-   VM_BUG_ON(!pte_none(ptep_get(pte)));
-   set_pte_at(mm, addr, pte, entry);
-   pte++;
+   } else {
+   pte_t entry;
+
+   entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+   if (write)
+   entry = pte_mkwrite(entry, vma);
+   if (!young)
+   entry = pte_mkold(entry);
+   /* NOTE: this may set soft-dirty too on some archs */
+   if (dirty)
+   entry = pte_mkdirty(entry);
+   if (soft_dirty)
+   entry = pte_mksoft_dirty(entry);
+   if (uffd_wp)
+   entry = pte_mkuffd_wp(entry);
+
+   for (i = 0; i < HPAGE_PMD_NR; i++)
+   VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+
+   set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
}
-   pte_unmap(pte - 1);
+   pte_unmap(pte);
 
if (!pmd_migration)
folio_remove_rmap_pmd(folio, page, vma);
-- 
2.25.1

[PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15 Thread Ryan Roberts

Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. The change benefits arm64, but there is some (very) minor
refactoring for x86 to enable its integration with core-mm.

It is part of a wider effort to improve performance by allocating and mapping
variable-sized blocks of memory (folios). One aim is for the 4K kernel to
approach the performance of the 16K kernel, but without breaking compatibility
and without the associated increase in memory. Another aim is to benefit the 16K
and 64K kernels by enabling 2M THP, since this is the contpte size for those
kernels. We have good performance data that demonstrates both aims are being met
(see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And for anonymous memory, "multi-size THP" is now upstream.


Patch Layout


In this version, I've split the patches to better show each optimization:

  - 1-2:mm prep: misc code and docs cleanups
  - 3-6:mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
generic wrapper around it
  - 7-11:   arm64 prep: Refactor ptep helpers into new layer
  - 12: functional contpte implementation
  - 23-18:  various optimizations on top of the contpte implementation


Testing
===

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
  - mm selftests (inc new tests written for multi-size THP); no regressions
  - Speedometer Java script benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===

High Level Use Cases


First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).

baseline:  mm-unstable (mTHP switched off)
mTHP:  + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:+ this series
mTHP + contpte + exefolio: + patch at [6], which series supports

Kernel Compilation with -j8 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-39.1% | -0.7% |
| mTHP + contpte| -6.0% |-41.4% | -1.5% |
| mTHP + contpte + exefolio | -7.8% |-43.1% | -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-36.6% | -0.6% |
| mTHP + contpte| -6.1% |-38.2% | -1.6% |
| mTHP + contpte + exefolio | -7.4% |-39.2% | -3.2% |

Speedometer (positive is faster):

| kernel| runs_per_min |
|:--|--|
| baseline  | 0.0% |
| mTHP  | 1.5% |
| mTHP + contpte| 3.2% |
| mTHP + contpte + exefolio | 4.5% |


Micro Benchmarks


The following microbenchmarks are intended to demonstrate the performance of
fork() and munmap() do not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline:  mm-unstable + batch zap [7] series
contpte-basic: + patches 0-19; functional contpte implementation
contpte-batch: + patches 20-23; implement new batched APIs
contpte-inline:+ patch 24; __always_inline to help compiler
contpte-fold:  + patch 25; fold contpte mapping when sensible

Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:

| FORK   | order-0| order-9|
| Ampere Altra   |||
| (pte-map)  |   mean | stdev |   mean | stdev |
|||---||---|
| baseline   |   0.0% |  2.7% |   0.0% |  0.2% |
| contpte-basic  |   6.3% |  1.4% |1948.7% |  0.2% |
|

[PATCH v6 01/18] mm: Clarify the spec for set_ptes()

2024-02-15 Thread Ryan Roberts

set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it. However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state. So clarify the spec to state that when nr==1, new state of pte
may be present or not present. When nr>1, new state of all ptes must be
present.

While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or
not-present. But when nr>1 all ptes must initially be not-present. All
set_ptes() callsites already conform to this requirement. Stating it
explicitly is useful because it allows for a simplification to the
upcoming arm64 contpte implementation.

Acked-by: David Hildenbrand 
Signed-off-by: Ryan Roberts 
---
 include/linux/pgtable.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 49ab1f73b5c2..231370e1b80f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -229,6 +229,10 @@ static inline pte_t pte_next_pfn(pte_t pte)
  * @pte: Page table entry for the first page.
  * @nr: Number of pages to map.
  *
+ * When nr==1, initial state of pte may be present or not present, and new 
state
+ * may be present or not present. When nr>1, initial state of all ptes must be
+ * not present, and new state must be present.
+ *
  * May be overridden by the architecture, or the architecture can define
  * set_pte() and PFN_PTE_SHIFT.
  *
-- 
2.25.1

Re: [PATCH v2] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

2024-02-15 Thread Arnd Bergmann

On Wed, Feb 14, 2024, at 23:34, Peter Bergner wrote:
> The powerpc toolchain keeps a copy of the HWCAP bit masks in our TCB for fast
> access by the __builtin_cpu_supports built-in function.  The TCB space for
> the HWCAP entries - which are created in pairs - is an ABI extension, so
> waiting to create the space for HWCAP3 and HWCAP4 until we need them is
> problematical.  Define AT_HWCAP3 and AT_HWCAP4 in the generic uapi header
> so they can be used in glibc to reserve space in the powerpc TCB for their
> future use.
>
> I scanned through the Linux and GLIBC source codes looking for unused AT_*
> values and 29 and 30 did not seem to be used, so they are what I went
> with.  This has received Acked-by's from both GLIBC and Linux kernel
> developers and no reservations or Nacks from anyone.
>
> Arnd, we seem to have consensus on the patch below.  Is this something
> you could take and apply to your tree? 
>

I don't mind taking it, but it may be better to use the
powerpc tree if that is where it's actually being used.

If Michael takes it, please add

Acked-by: Arnd Bergmann 

  Arnd

85 matches

Mail list logo