Re: [PATCH] MAINTAINERS: adjust file entries after crypto vmx file movement

2024-01-31 Thread Herbert Xu
On Mon, Jan 29, 2024 at 02:17:29PM +0100, Lukas Bulwahn wrote:
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb944964be5..15bc79e80e28 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10307,12 +10307,12 @@ M:  Nayna Jain 
>  M:   Paulo Flabiano Smorigo 
>  L:   linux-cry...@vger.kernel.org
>  S:   Supported
> -F:   drivers/crypto/vmx/Kconfig
> -F:   drivers/crypto/vmx/Makefile
> -F:   drivers/crypto/vmx/aes*
> -F:   drivers/crypto/vmx/ghash*
> -F:   drivers/crypto/vmx/ppc-xlate.pl
> -F:   drivers/crypto/vmx/vmx.c
> +F:   arch/powerpc/crypto/Kconfig
> +F:   arch/powerpc/crypto/Makefile
> +F:   arch/powerpc/crypto/aes*

Are you sure about this? There are non-vmx aes* files in that
directory.  Perhaps something more specific is needed here?
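For illustration, assuming the file names were kept by the move (they would need
checking against the tree), "more specific" could mean listing the former vmx
files explicitly rather than a glob that also matches the other, non-vmx AES
implementations in that directory:

F:	arch/powerpc/crypto/aes.c
F:	arch/powerpc/crypto/aes_cbc.c
F:	arch/powerpc/crypto/aes_ctr.c
F:	arch/powerpc/crypto/aes_xts.c
F:	arch/powerpc/crypto/aesp8-ppc.pl
F:	arch/powerpc/crypto/ghash*
F:	arch/powerpc/crypto/ppc-xlate.pl
F:	arch/powerpc/crypto/vmx.c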

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Luis Chamberlain
On Wed, Jan 31, 2024 at 06:53:13AM +, Christophe Leroy wrote:
> The problem being identified in commit 677bfb9db8a3 ("module: Don't 
> ignore errors from set_memory_XX()"), you can keep/re-apply the series 
> [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time.

Sure, queued that up into modules-testing before I push to modules-next.

  Luis


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Marek Szyprowski
Hi Christophe,

On 31.01.2024 21:07, Christophe Leroy wrote:
> On 31/01/2024 at 16:17, Marek Szyprowski wrote:
>>
>> On 31.01.2024 12:58, Christophe Leroy wrote:
>>> On 30/01/2024 at 18:48, Marek Szyprowski wrote:

 On 30.01.2024 12:03, Christophe Leroy wrote:
> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>
>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
 Declaring rodata_enabled and mark_rodata_ro() at all time
 helps removing related #ifdefery in C files.

 Signed-off-by: Christophe Leroy 
>>> Very nice cleanup, thanks!, applied and pushed
>>>
>>> Luis
>> On next-20240130, which has your modules-next branch, and thus this
>> series and the other "module: Use set_memory_rox()" series applied,
>> my kernel crashes in some very weird way. Reverting your branch
>> makes the crash go away.
>>
>> I thought I'd report it right away. Maybe you folks would know what's
>> happening here? This is on arm64.
> That's strange, it seems to bug in module_bug_finalize() which is
> _before_ calls to module_enable_ro() and such.
>
> Can you try to revert the 6 patches one by one to see which one
> introduces the problem ?
>
> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
> more or less only cleanup.
 I've also run into this issue with today's (20240130) linux-next on my
 test farm. The issue is not fully reproducible, so it was a bit hard to
 bisect it automatically. I've spent some time on manual testing and it
 looks that reverting the following 2 commits on top of linux-next fixes
 the problem:

 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
 rodata_enabled")
 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")

 This in fact means that commit 677bfb9db8a3 is responsible for this
 regression, as 65929884f868 has to be reverted only because the latter
 depends on it. Let me know what I can do to help debugging this issue.

>>> Thanks for the bisect. I suspect you hit one of the errors and something
>>> goes wrong in the error path.
>>>
>>> To confirm this assumption, could you try with the following change on
>>> top of everything ?
>>
>> Yes, this is the problem. I've added printing a mod->name to the log.
>> Here is a log from kernel build from next-20240130 (sometimes it even
>> boots to shell):
>>
>> # dmesg | grep module_set_memory
>> [    8.061525] module_set_memory(6, , 0) name ipv6 returned -22
>> [    8.067543] WARNING: CPU: 3 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
> Would be good if you could show the backtrace too so that we know who is
> the caller. I guess what you show here is what you get on the screen?
> The backtrace should be available through 'dmesg'.

Here are relevant parts of the boot log:

[    8.096850] [ cut here ]
[    8.096939] module_set_memory(6, , 0) name ipv6 returned -22
[    8.102947] WARNING: CPU: 4 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
[    8.111561] Modules linked in:
[    8.114596] CPU: 4 PID: 1 Comm: systemd Not tainted 6.8.0-rc2-next-20240130-dirty #14429
[    8.122654] Hardware name: Khadas VIM3 (DT)
[    8.126815] pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    8.133747] pc : module_set_memory+0x9c/0xb8
[    8.137994] lr : module_set_memory+0x9c/0xb8
[    8.142240] sp : 800083fcba80
[    8.145534] x29: 800083fcba80 x28: 0001 x27: 80007c024448
[    8.152640] x26: 800083fcbc10 x25: 80007c007958 x24: 80007c024450
[    8.159747] x23: 800083f2a090 x22: 80007c007940 x21: 0006
[    8.166854] x20: ffea x19: 80007c007af0 x18: 0030
[    8.173960] x17:  x16: 5932 x15: 
[    8.181067] x14: 800082ea5658 x13: 03d5 x12: 0147
[    8.188174] x11: 6920656d616e2029 x10: 800082efd658 x9 : f000
[    8.195280] x8 : 800082ea5658 x7 : 800082efd658 x6 : 
[    8.202387] x5 : bff4 x4 :  x3 : 
[    8.209494] x2 :  x1 :  x0 : 

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Christophe Leroy


On 31/01/2024 at 16:17, Marek Szyprowski wrote:
> 
> Hi Christophe,
> 
> On 31.01.2024 12:58, Christophe Leroy wrote:
>> On 30/01/2024 at 18:48, Marek Szyprowski wrote:
>>>
>>> On 30.01.2024 12:03, Christophe Leroy wrote:
 On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>
> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>> helps removing related #ifdefery in C files.
>>>
>>> Signed-off-by: Christophe Leroy 
>> Very nice cleanup, thanks!, applied and pushed
>>
>>Luis
> On next-20240130, which has your modules-next branch, and thus this
> series and the other "module: Use set_memory_rox()" series applied,
> my kernel crashes in some very weird way. Reverting your branch
> makes the crash go away.
>
> I thought I'd report it right away. Maybe you folks would know what's
> happening here? This is on arm64.
 That's strange, it seems to bug in module_bug_finalize() which is
 _before_ calls to module_enable_ro() and such.

 Can you try to revert the 6 patches one by one to see which one
 introduces the problem ?

 In reality, only patch 677bfb9db8a3 really changes things. The other ones are
 more or less only cleanup.
>>> I've also run into this issue with today's (20240130) linux-next on my
>>> test farm. The issue is not fully reproducible, so it was a bit hard to
>>> bisect it automatically. I've spent some time on manual testing and it
>>> looks that reverting the following 2 commits on top of linux-next fixes
>>> the problem:
>>>
>>> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
>>> rodata_enabled")
>>> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
>>>
>>> This in fact means that commit 677bfb9db8a3 is responsible for this
>>> regression, as 65929884f868 has to be reverted only because the latter
>>> depends on it. Let me know what I can do to help debugging this issue.
>>>
>> Thanks for the bisect. I suspect you hit one of the errors and something
>> goes wrong in the error path.
>>
>> To confirm this assumption, could you try with the following change on
>> top of everything ?
> 
> 
> Yes, this is the problem. I've added printing a mod->name to the log.
> Here is a log from kernel build from next-20240130 (sometimes it even
> boots to shell):
> 
> # dmesg | grep module_set_memory
> [    8.061525] module_set_memory(6, , 0) name ipv6 returned -22
> [    8.067543] WARNING: CPU: 3 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8

Would be good if you could show the backtrace too so that we know who is
the caller. I guess what you show here is what you get on the screen?
The backtrace should be available through 'dmesg'.

I guess we will now seek help from ARM64 people to understand why 
module_set_memory_something() fails with -EINVAL when loading modules.


> [    8.097821] pc : module_set_memory+0x9c/0xb8
> [    8.102068] lr : module_set_memory+0x9c/0xb8
> [    8.183101]  module_set_memory+0x9c/0xb8
> [    8.472862] module_set_memory(6, , 0) name x_tables returned -22
> [    8.479215] WARNING: CPU: 2 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
> [    8.510978] pc : module_set_memory+0x9c/0xb8
> [    8.515225] lr : module_set_memory+0x9c/0xb8
> [    8.596259]  module_set_memory+0x9c/0xb8
> [   10.529879] module_set_memory(6, , 0) name dm_mod returned -22
> [   10.536087] WARNING: CPU: 3 PID: 127 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
> [   10.568254] pc : module_set_memory+0x9c/0xb8
> [   10.572501] lr : module_set_memory+0x9c/0xb8
> [   10.653535]  module_set_memory+0x9c/0xb8
> [   10.853177] module_set_memory(6, , 0) name fuse returned -22
> [   10.859196] WARNING: CPU: 5 PID: 130 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
> [   10.891382] pc : module_set_memory+0x9c/0xb8
> [   10.895629] lr : module_set_memory+0x9c/0xb8
> [   10.976663]  module_set_memory+0x9c/0xb8
> [   10.891382] pc : module_set_memory+0x9c/0xb8
> [   10.895629] lr : module_set_memory+0x9c/0xb8
> [   10.976663]  module_set_memory+0x9c/0xb8
> 
> 
> 
>> diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
>> index a14df9655dbe..fdf8484154dd 100644
>> --- a/kernel/module/strict_rwx.c
>> +++ b/kernel/module/strict_rwx.c
>> @@ -15,9 +15,12 @@ static int module_set_memory(const struct module
>> *mod, 

[PATCH RFC/RFT v2 4/4] riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc

2024-01-31 Thread Alexandre Ghiti
The preventive sfence.vma instructions were emitted because new mappings must
be made visible to the page table walker, but Svvptc guarantees that xRET acts
as a fence, so there is no need for an sfence.vma on uarchs that implement this
extension.

This drastically reduces the number of sfence.vma instructions emitted:

* Ubuntu boot to login:
Before: ~630k sfence.vma
After:  ~200k sfence.vma

* ltp - mmapstress01
Before: ~45k
After:  ~6.3k

* lmbench - lat_pagefault
Before: ~665k
After:   832 (!)

* lmbench - lat_mmap
Before: ~546k
After:   718 (!)

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/pgtable.h | 16 +++-
 arch/riscv/mm/pgtable.c  | 13 +
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c..50986e4c4601 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -473,6 +473,9 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
struct vm_area_struct *vma, unsigned long address,
pte_t *ptep, unsigned int nr)
 {
+   asm_volatile_goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+ : : : : svvptc);
+
/*
 * The kernel assumes that TLBs don't cache invalid entries, but
 * in RISC-V, SFENCE.VMA specifies an ordering constraint, not a
@@ -482,12 +485,23 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 */
while (nr--)
local_flush_tlb_page(address + nr * PAGE_SIZE);
+
+svvptc:
+   /*
+* Svvptc guarantees that xRET act as a fence, so when the uarch does
+* not cache invalid entries, we don't have to do anything.
+*/
+   ;
 }
 #define update_mmu_cache(vma, addr, ptep) \
update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #define __HAVE_ARCH_UPDATE_MMU_TLB
-#define update_mmu_tlb update_mmu_cache
+static inline void update_mmu_tlb(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
+{
+   flush_tlb_range(vma, address, address + PAGE_SIZE);
+}
 
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
diff --git a/arch/riscv/mm/pgtable.c b/arch/riscv/mm/pgtable.c
index ef887efcb679..99ed389e4c8a 100644
--- a/arch/riscv/mm/pgtable.c
+++ b/arch/riscv/mm/pgtable.c
@@ -9,6 +9,9 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
  unsigned long address, pte_t *ptep,
  pte_t entry, int dirty)
 {
+   asm_volatile_goto(ALTERNATIVE("nop", "j %l[svvptc]", 0, RISCV_ISA_EXT_SVVPTC, 1)
+ : : : : svvptc);
+
if (!pte_same(ptep_get(ptep), entry))
__set_pte_at(ptep, entry);
/*
@@ -16,6 +19,16 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 * the case that the PTE changed and the spurious fault case.
 */
return true;
+
+svvptc:
+   if (!pte_same(ptep_get(ptep), entry)) {
+   __set_pte_at(ptep, entry);
+   /* Here only not svadu is impacted */
+   flush_tlb_page(vma, address);
+   return true;
+   }
+
+   return false;
 }
 
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
-- 
2.39.2



[PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

2024-01-31 Thread Alexandre Ghiti
In 6.5, we removed the vmalloc fault path because that can't work (see
[1] [2]). Then in order to make sure that new page table entries were
seen by the page table walker, we had to preventively emit a sfence.vma
on all harts [3] but this solution is very costly since it relies on IPI.

And even there, we could end up in a loop of vmalloc faults if a vmalloc
allocation is done in the IPI path (for example if it is traced, see
[4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be
  observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access
  could "miss" the new mapping and trap: in that case, we would actually
  only need to retry the access, no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles
the possible (and unlikely) exceptions. And since the kernel stack
mappings lie in the vmalloc area, this handling must be done very early
when the trap is taken, at the very beginning of handle_exception: this
also rules out vmalloc allocations in the fault path.
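
For illustration, a rough C-level sketch of that early check (the real code is
RISC-V assembly at the top of handle_exception; the helper name and the use of
generic atomics here are placeholders, not the patch's actual implementation):

static bool new_vmalloc_check(void)
{
	unsigned int cpu = smp_processor_id();
	u64 *word = &new_vmalloc[cpu / 64];
	u64 bit = BIT_ULL(cpu % 64);

	/* Bit not set: no new vmalloc mapping since our last check. */
	if (!(READ_ONCE(*word) & bit))
		return false;

	/*
	 * Clear only this CPU's bit (the patch does this atomically, with an
	 * amoxor, see the cover letter), then make the new page table
	 * entries visible before retrying the faulting access.
	 */
	atomic64_fetch_xor_relaxed(bit, (atomic64_t *)word);
	local_flush_tlb_all();	/* sfence.vma */
	return true;
}

The new_vmalloc bitmap read here is the one set to -1 by flush_cache_vmap() in
the cacheflush.h hunk below.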

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bj...@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dy...@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexgh...@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-j...@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/cacheflush.h  | 18 +-
 arch/riscv/include/asm/thread_info.h |  5 ++
 arch/riscv/kernel/asm-offsets.c  |  5 ++
 arch/riscv/kernel/entry.S| 84 
 arch/riscv/mm/init.c |  2 +
 5 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
index a129dac4521d..b0d631701757 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
flush_icache_mm(vma->vm_mm, 0)
 
 #ifdef CONFIG_64BIT
-#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
+extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
+extern char _end[];
+#define flush_cache_vmap flush_cache_vmap
+static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+{
+   if (is_vmalloc_or_module_addr((void *)start)) {
+   int i;
+
+   /*
+* We don't care if concurrently a cpu resets this value since
+* the only place this can happen is in handle_exception() where
+* an sfence.vma is emitted.
+*/
+   for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
+   new_vmalloc[i] = -1ULL;
+   }
+}
 #define flush_cache_vmap_early(start, end) local_flush_tlb_kernel_range(start, end)
 #endif
 
diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
index 5d473343634b..32631acdcdd4 100644
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -60,6 +60,11 @@ struct thread_info {
void*scs_base;
void*scs_sp;
 #endif
+   /*
+* Used in handle_exception() to save a0, a1 and a2 before knowing if we
+* can access the kernel stack.
+*/
+   unsigned long   a0, a1, a2;
 };
 
 #ifdef CONFIG_SHADOW_CALL_STACK
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a03129f40c46..939ddc0e3c6e 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -35,6 +35,8 @@ void asm_offsets(void)
OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
+
+   OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
@@ -42,6 +44,9 @@ void asm_offsets(void)
 #ifdef CONFIG_SHADOW_CALL_STACK
OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
 #endif
+   OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
+   OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
+   OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
 
OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 9d1a305d5508..c1ffaeaba7aa 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -19,6 

[PATCH RFC/RFT v2 2/4] dt-bindings: riscv: Add Svvptc ISA extension description

2024-01-31 Thread Alexandre Ghiti
Add description for the Svvptc ISA extension which was ratified recently.

Signed-off-by: Alexandre Ghiti 
---
 Documentation/devicetree/bindings/riscv/extensions.yaml | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/devicetree/bindings/riscv/extensions.yaml b/Documentation/devicetree/bindings/riscv/extensions.yaml
index 63d81dc895e5..59bf14d2c1eb 100644
--- a/Documentation/devicetree/bindings/riscv/extensions.yaml
+++ b/Documentation/devicetree/bindings/riscv/extensions.yaml
@@ -171,6 +171,13 @@ properties:
 memory types as ratified in the 20191213 version of the privileged
 ISA specification.
 
+- const: svvptc
+  description:
+The standard Svvptc supervisor-level extension for
+address-translation cache behaviour with respect to invalid entries
+as ratified in the  version of the privileged ISA
+specification.
+
 - const: zacas
   description: |
 The Zacas extension for Atomic Compare-and-Swap (CAS) instructions
-- 
2.39.2



[PATCH RFC/RFT v2 1/4] riscv: Add ISA extension parsing for Svvptc

2024-01-31 Thread Alexandre Ghiti
Add support for parsing the Svvptc extension from the riscv,isa string.

Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/hwcap.h | 1 +
 arch/riscv/kernel/cpufeature.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index 5340f818746b..2e15192135fb 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -80,6 +80,7 @@
 #define RISCV_ISA_EXT_ZFA  71
 #define RISCV_ISA_EXT_ZTSO 72
 #define RISCV_ISA_EXT_ZACAS    73
+#define RISCV_ISA_EXT_SVVPTC   74
 
 #define RISCV_ISA_EXT_MAX  128
 #define RISCV_ISA_EXT_INVALID  U32_MAX
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 89920f84d0a3..4a8f14bfa0f2 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -307,6 +307,7 @@ const struct riscv_isa_ext_data riscv_isa_ext[] = {
__RISCV_ISA_EXT_DATA(svinval, RISCV_ISA_EXT_SVINVAL),
__RISCV_ISA_EXT_DATA(svnapot, RISCV_ISA_EXT_SVNAPOT),
__RISCV_ISA_EXT_DATA(svpbmt, RISCV_ISA_EXT_SVPBMT),
+   __RISCV_ISA_EXT_DATA(svvptc, RISCV_ISA_EXT_SVVPTC),
 };
 
 const size_t riscv_isa_ext_count = ARRAY_SIZE(riscv_isa_ext);
-- 
2.39.2



[PATCH RFC v2 0/4] Svvptc extension to remove preventive sfence.vma

2024-01-31 Thread Alexandre Ghiti
In RISC-V, after a new mapping is established, a sfence.vma needs to be
emitted for different reasons:

- if the uarch caches invalid entries, we need to invalidate it otherwise
  we would trap on this invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail
  to see the new mapping and then trap (sfence.vma acts as a fence).

We can actually avoid emitting those (mostly) useless and costly sfence.vma
by handling the traps instead:

- for new kernel mappings: only vmalloc mappings need to be taken care of,
  other new mappings are rare and already emit the required sfence.vma if
  needed.
  That must be achieved very early in the exception path as explained in
  patch 3, and this also fixes our fragile way of dealing with vmalloc faults.

- for new user mappings: Svvptc makes update_mmu_cache() a no-op and no
  traps can happen since xRET instructions now act as fences.

Patch 1 and 2 introduce Svvptc extension probing.

It's still an RFC because Svvptc is not ratified yet.

On our uarch that does not cache invalid entries and a 6.5 kernel, the
gains are measurable:

* Kernel boot:  6%
* ltp - mmapstress01:   8%
* lmbench - lat_pagefault:  20%
* lmbench - lat_mmap:   5%

Thanks to Ved and Matt Evans for triggering the discussion that led to
this patchset!

Any feedback, test or relevant benchmark are welcome :)

Changes in v2:
- Rebase on top of 6.8-rc1
- Remove patch with runtime detection of tlb caching and debugfs patch
- Add patch that probes Svvptc
- Add patch that defines the new Svvptc dt-binding
- Leave the behaviour as-is for uarchs that cache invalid TLB entries since
  I don't have any good perf numbers
- Address comments from Christoph on v1
- Fix a race condition in new_vmalloc update:

   ld  a2, 0(a0)   <= this could load something which is != -1
   not a1, a1      <= here or in the instruction after, flush_cache_vmap() could set the whole bitmap to -1
   and a1, a2, a1
   sd  a1, 0(a0)   <= here we would clear bits that should not be cleared!

   Instead, replace the whole sequence with:
   amoxor.w  a0, a1, (a0)

Alexandre Ghiti (4):
  riscv: Add ISA extension parsing for Svvptc
  dt-bindings: riscv: Add Svvptc ISA extension description
  riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  riscv: Stop emitting preventive sfence.vma for new userspace mappings
with Svvptc

 .../devicetree/bindings/riscv/extensions.yaml |  7 ++
 arch/riscv/include/asm/cacheflush.h   | 18 +++-
 arch/riscv/include/asm/hwcap.h|  1 +
 arch/riscv/include/asm/pgtable.h  | 16 +++-
 arch/riscv/include/asm/thread_info.h  |  5 ++
 arch/riscv/kernel/asm-offsets.c   |  5 ++
 arch/riscv/kernel/cpufeature.c|  1 +
 arch/riscv/kernel/entry.S | 84 +++
 arch/riscv/mm/init.c  |  2 +
 arch/riscv/mm/pgtable.c   | 13 +++
 10 files changed, 150 insertions(+), 2 deletions(-)

-- 
2.39.2



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Marek Szyprowski
Hi Christophe,

On 31.01.2024 12:58, Christophe Leroy wrote:
> On 30/01/2024 at 18:48, Marek Szyprowski wrote:
>>
>> On 30.01.2024 12:03, Christophe Leroy wrote:
>>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:

 On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>> Declaring rodata_enabled and mark_rodata_ro() at all time
>> helps removing related #ifdefery in C files.
>>
>> Signed-off-by: Christophe Leroy 
> Very nice cleanup, thanks!, applied and pushed
>
>   Luis
 On next-20240130, which has your modules-next branch, and thus this
 series and the other "module: Use set_memory_rox()" series applied,
 my kernel crashes in some very weird way. Reverting your branch
 makes the crash go away.

 I thought I'd report it right away. Maybe you folks would know what's
 happening here? This is on arm64.
>>> That's strange, it seems to bug in module_bug_finalize() which is
>>> _before_ calls to module_enable_ro() and such.
>>>
>>> Can you try to revert the 6 patches one by one to see which one
>>> introduces the problem ?
>>>
>>> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
>>> more or less only cleanup.
>> I've also run into this issue with today's (20240130) linux-next on my
>> test farm. The issue is not fully reproducible, so it was a bit hard to
>> bisect it automatically. I've spent some time on manual testing and it
>> looks that reverting the following 2 commits on top of linux-next fixes
>> the problem:
>>
>> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
>> rodata_enabled")
>> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
>>
>> This in fact means that commit 677bfb9db8a3 is responsible for this
>> regression, as 65929884f868 has to be reverted only because the latter
>> depends on it. Let me know what I can do to help debugging this issue.
>>
> Thanks for the bisect. I suspect you hit one of the errors and something
> goes wrong in the error path.
>
> To confirm this assumption, could you try with the following change on
> top of everything ?


Yes, this is the problem. I've added printing a mod->name to the log. 
Here is a log from kernel build from next-20240130 (sometimes it even 
boots to shell):

# dmesg | grep module_set_memory
[    8.061525] module_set_memory(6, , 0) name ipv6 returned -22
[    8.067543] WARNING: CPU: 3 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
[    8.097821] pc : module_set_memory+0x9c/0xb8
[    8.102068] lr : module_set_memory+0x9c/0xb8
[    8.183101]  module_set_memory+0x9c/0xb8
[    8.472862] module_set_memory(6, , 0) name x_tables returned -22
[    8.479215] WARNING: CPU: 2 PID: 1 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
[    8.510978] pc : module_set_memory+0x9c/0xb8
[    8.515225] lr : module_set_memory+0x9c/0xb8
[    8.596259]  module_set_memory+0x9c/0xb8
[   10.529879] module_set_memory(6, , 0) name dm_mod returned -22
[   10.536087] WARNING: CPU: 3 PID: 127 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
[   10.568254] pc : module_set_memory+0x9c/0xb8
[   10.572501] lr : module_set_memory+0x9c/0xb8
[   10.653535]  module_set_memory+0x9c/0xb8
[   10.853177] module_set_memory(6, , 0) name fuse returned -22
[   10.859196] WARNING: CPU: 5 PID: 130 at kernel/module/strict_rwx.c:22 module_set_memory+0x9c/0xb8
[   10.891382] pc : module_set_memory+0x9c/0xb8
[   10.895629] lr : module_set_memory+0x9c/0xb8
[   10.976663]  module_set_memory+0x9c/0xb8



> diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
> index a14df9655dbe..fdf8484154dd 100644
> --- a/kernel/module/strict_rwx.c
> +++ b/kernel/module/strict_rwx.c
> @@ -15,9 +15,12 @@ static int module_set_memory(const struct module *mod, enum mod_mem_type type,
> int (*set_memory)(unsigned long start, int num_pages))
>{
>   const struct module_memory *mod_mem = &mod->mem[type];
> + int err;
>
>   set_vm_flush_reset_perms(mod_mem->base);
> - return set_memory((unsigned long)mod_mem->base, mod_mem->size >> PAGE_SHIFT);
> + err = set_memory((unsigned long)mod_mem->base, mod_mem->size >> PAGE_SHIFT);
> + WARN(err, "module_set_memory(%d, %px, %x) returned %d\n", type, mod_mem->base, mod_mem->size, err);
> + return err;
>}
>
>/*
>
>
> Thanks for your help
> Christophe

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

dontneed should hopefully/likely see a speedup.


Yes, but that's almost exactly the same path as munmap, so I'm not sure it really
adds much for this particular series.


Right, that's why I'm not including these measurements. dontneed vs. 
munmap is more about measuring the overhead of VMA modifications + page 
table reclaim.



Anyway, on Altra at least, I'm seeing no
regressions, so:

Tested-by: Ryan Roberts 



Thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 15:05, David Hildenbrand wrote:
> On 31.01.24 16:02, Ryan Roberts wrote:
>> On 31/01/2024 14:29, David Hildenbrand wrote:
> Note that regarding NUMA effects, I mean when some memory access within 
> the
> same
> socket is faster/slower even with only a single node. On AMD EPYC that's
> possible, depending on which core you are running and on which memory
> controller
> the memory you want to access is located. If both are in different 
> quadrants
> IIUC, the access latency will be different.

 I've configured the NUMA to only bring the RAM and CPUs for a single socket
 online, so I shouldn't be seeing any of these effects. Anyway, I've been 
 using
 the Altra as a secondary because its so much slower than the M2. Let me 
 move
 over to it and see if everything looks more straightforward there.
>>>
>>> Better use a system where people will actually run Linux production 
>>> workloads
>>> on, even if it is slower :)
>>>
>>> [...]
>>>
>>
>> I'll continue to mess around with it until the end of the day. But I'm 
>> not
>> making any headway, then I'll change tack; I'll just measure the
>> performance of
>> my contpte changes using your fork/zap stuff as the baseline and post
>> based on
>> that.
>
> You should likely not focus on M2 results. Just pick a representative bare
> metal
> machine where you get consistent, explainable results.
>
> Nothing in the code is fine-tuned for a particular architecture so far, 
> only
> order-0 handling is kept separate.
>
> BTW: I see the exact same speedups for dontneed that I see for munmap. For
> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
> curious why you see a speedup for munmap but not for dontneed.

 Ugh... ok, coming up.
>>>
>>> Hopefully you were just staring at the wrong numbers (e.g., only with fork
>>> patches). Because both (munmap/pte-dontneed) are using the exact same code 
>>> path.
>>>
>>
>> Ahh... I'm doing pte-dontneed, which is the only option in your original
>> benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
>> benchmark has an additional "dontneed" option that does it in one shot. Which
>> option are you running? Assuming the latter, I think that explains it.
> 
> I temporarily removed that option and then re-added it. Guess you got a wrong
> snapshot of the benchmark :D
> 
> pte-dontneed not observing any change is great (no batching possible).

indeed.

> 
> dontneed should hopefully/likely see a speedup.

Yes, but that's almost exactly the same path as munmap, so I'm not sure it really
adds much for this particular series. Anyway, on Altra at least, I'm seeing no
regressions, so:

Tested-by: Ryan Roberts 

> 
> Great!
> 



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 16:02, Ryan Roberts wrote:

On 31/01/2024 14:29, David Hildenbrand wrote:

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.


I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because its so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.


Better use a system where people will actually run Linux production workloads
on, even if it is slower :)

[...]



I'll continue to mess around with it until the end of the day. But I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.


You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.


Ugh... ok, coming up.


Hopefully you were just staring at the wrong numbers (e.g., only with fork
patches). Because both (munmap/pte-dontneed) are using the exact same code path.



Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that does it in one shot. Which
option are you running? Assuming the latter, I think that explains it.


I temporarily removed that option and then re-added it. Guess you got a 
wrong snapshot of the benchmark :D


pte-dontneed not observing any change is great (no batching possible).

dontneed should hopefully/likely see a speedup.

Great!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 14:29, David Hildenbrand wrote:
>>> Note that regarding NUMA effects, I mean when some memory access within the 
>>> same
>>> socket is faster/slower even with only a single node. On AMD EPYC that's
>>> possible, depending on which core you are running and on which memory 
>>> controller
>>> the memory you want to access is located. If both are in different quadrants
>>> IIUC, the access latency will be different.
>>
>> I've configured the NUMA to only bring the RAM and CPUs for a single socket
>> online, so I shouldn't be seeing any of these effects. Anyway, I've been 
>> using
>> the Altra as a secondary because its so much slower than the M2. Let me move
>> over to it and see if everything looks more straightforward there.
> 
> Better use a system where people will actually run Linux production workloads
> on, even if it is slower :)
> 
> [...]
> 

 I'll continue to mess around with it until the end of the day. But I'm not
 making any headway, then I'll change tack; I'll just measure the 
 performance of
 my contpte changes using your fork/zap stuff as the baseline and post 
 based on
 that.
>>>
>>> You should likely not focus on M2 results. Just pick a representative bare 
>>> metal
>>> machine where you get consistent, explainable results.
>>>
>>> Nothing in the code is fine-tuned for a particular architecture so far, only
>>> order-0 handling is kept separate.
>>>
>>> BTW: I see the exact same speedups for dontneed that I see for munmap. For
>>> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
>>> curious why you see a speedup for munmap but not for dontneed.
>>
>> Ugh... ok, coming up.
> 
> Hopefully you were just staring at the wrong numbers (e.g., only with fork
> patches). Because both (munmap/pte-dontneed) are using the exact same code 
> path.
> 

Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that does it in one shot. Which
option are you running? Assuming the latter, I think that explains it.
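
For reference, the difference between the two modes boils down to something like
this minimal userspace sketch (function names are placeholders, not the
benchmark's):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* "pte-dontneed": one madvise() call per base page - no PTE batching possible. */
static void dontneed_per_page(char *buf, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	for (size_t off = 0; off < len; off += page)
		madvise(buf + off, page, MADV_DONTNEED);
}

/* "dontneed": one call covering the whole range - the PTEs of a large folio
 * can be zapped in one batch, the same zap path that munmap() uses. */
static void dontneed_one_shot(char *buf, size_t len)
{
	madvise(buf, len, MADV_DONTNEED);
}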


Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.


I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because its so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.


Better use a system where people will actually run Linux production 
workloads on, even if it is slower :)


[...]



I'll continue to mess around with it until the end of the day. But I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.


You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.


Ugh... ok, coming up.


Hopefully you were just staring at the wrong numbers (e.g., only with 
fork patches). Because both (munmap/pte-dontneed) are using the exact 
same code path.


--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 15:08, Michal Hocko wrote:

On Wed 31-01-24 10:26:13, Ryan Roberts wrote:

IIRC there is an option to zero memory when it is freed back to the buddy? So
that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug only?
So perhaps not a problem in practice?


init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels but by no means it should be
considered a low latency guarantee feature. A lot has changed since
the limit was introduced and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.


It could now be zeroing out ~512 MiB. That shouldn't take double-digit 
seconds unless we are running in a very problematic environment 
(over-committed VM). But then, we might have different problems already.


I'll do some sanity checks with an extremely large process (as much as 
I can fit on my machines), with a !CONFIG_PREEMPT kernel and 
init_on_free, to see if anything pops up.


Thanks Michal!

--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Michal Hocko
On Wed 31-01-24 11:16:01, David Hildenbrand wrote:
[...]
> This 1 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

AFAIR at the time of this patch this was mostly just to put some cap on
the number of batches to collect and free at once. If there is a lot of
free memory and a large process exiting this could grow really high. Now
that those pages^Wfolios can represent larger memory chunks it could
mean more physical memory being freed, which might make the
operation take longer but still far from triggering a soft lockup.

Now latency might suck on !PREEMPT kernels with too many pages to
free in a single batch but I guess this is somehow expected for this
preemption model. The soft lockup has to be avoided because this can
panic the machine in some configurations.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Michal Hocko
On Wed 31-01-24 10:26:13, Ryan Roberts wrote:
> IIRC there is an option to zero memory when it is freed back to the buddy? So
> that could be a place where time is proportional to size rather than
> proportional to folio count? But I think that option is intended for debug 
> only?
> So perhaps not a problem in practice?

init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels but by no means it should be
considered a low latency guarantee feature. A lot has changed since
the limit was introduced and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.
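
For reference, init_on_free is the kernel's zero-pages-on-free hardening option;
it is enabled either at build time or per boot:

    CONFIG_INIT_ON_FREE_DEFAULT_ON=y   # build-time default
    init_on_free=1                     # or on the kernel command line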

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 13:38, David Hildenbrand wrote:
 Nope: looks the same. I've taken my test harness out of the picture and 
 done
 everything manually from the ground up, with the old tests and the new.
 Headline
 is that I see similar numbers from both.
>>>
>>> It took me a while to get really reproducible numbers on Intel. Most 
>>> importantly:
>>> * Set a fixed CPU frequency, disabling any boost and avoiding any
>>>    thermal throttling.
>>> * Pin the test to CPUs and set a nice level.
>>
>> I'm already pinning the test to cpu 0. But for M2, at least, I'm running in 
>> a VM
>> on top of macos, and I don't have a mechanism to pin the QEMU threads to the
>> physical CPUs. Anyway, I don't think these are problems because for a given
>> kernel build I can accurately repro numbers.
> 
> Oh, you do have a layer of virtualization in there. I *suspect* that might
> amplify some odd things regarding code layout, caching effects, etc.
> 
> I guess especially the fork() benchmark is too sensitive (fast) for things like
> that, so I would just focus on bare metal results where you can control the
> environment completely.

Yeah, maybe. OK I'll park M2 for now.

> 
> Note that regarding NUMA effects, I mean when some memory access within the 
> same
> socket is faster/slower even with only a single node. On AMD EPYC that's
> possible, depending on which core you are running and on which memory 
> controller
> the memory you want to access is located. If both are in different quadrants
> IIUC, the access latency will be different.

I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because its so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.

> 
>>> But yes: I was observing something similar on AMD EPYC, where you get
>>> consecutive pages from the buddy, but once you allocate from the PCP it 
>>> might no
>>> longer be consecutive.
>>>
    - test is 5-10% slower when output is printed to terminal vs when
 redirected to
  file. I've always effectively been redirecting. Not sure if this 
 overhead
  could start to dominate the regression and that's why you don't see 
 it?
>>>
>>> That's weird, because we don't print while measuring? Anyhow, 5/10% 
>>> variance on
>>> some system is not the end of the world.
>>
>> I imagine its cache effects? More work to do to print the output could be
>> evicting some code that's in the benchmark path?
> 
> Maybe. Do you also see these oddities on the bare metal system?
> 
>>
>>>

 I'm inclined to run this test for the last N kernel releases and if the 
 number
 moves around significantly, conclude that these tests don't really matter.
 Otherwise its an exercise in randomly refactoring code until it works 
 well, but
 that's just overfitting to the compiler and hw. What do you think?
>>>
>>> Personally, I wouldn't lose sleep if you see weird, unexplainable behavior 
>>> on
>>> some system (not even architecture!). Trying to optimize for that would 
>>> indeed
>>> be random refactorings.
>>>
>>> But I would not be so fast to say that "these tests don't really matter" and
>>> then go wild and degrade them as much as you want. There are use cases that 
>>> care
>>> about fork performance especially with order-0 pages -- such as Redis.
>>
>> Indeed. But also remember that my fork baseline time is ~2.5ms, and I think 
>> you
>> said yours was 14ms :)
> 
> Yes, no idea why M2 is that fast (BTW, which page size? 4k or 16k? ) :)

The guest kernel is using 4K pages. I'm not quite sure what is happening at
stage2; QEMU doesn't expose any options to explicitly request huge pages for
macos AFAICT.

> 
>>
>> I'll continue to mess around with it until the end of the day. But I'm not
>> making any headway, then I'll change tack; I'll just measure the performance 
>> of
>> my contpte changes using your fork/zap stuff as the baseline and post based 
>> on
>> that.
> 
> You should likely not focus on M2 results. Just pick a representative bare 
> metal
> machine where you get consistent, explainable results.
> 
> Nothing in the code is fine-tuned for a particular architecture so far, only
> order-0 handling is kept separate.
> 
> BTW: I see the exact same speedups for dontneed that I see for munmap. For
> example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
> curious why you see a speedup for munmap but not for dontneed.

Ugh... ok, coming up.



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

Nope: looks the same. I've taken my test harness out of the picture and done
everything manually from the ground up, with the old tests and the new. Headline
is that I see similar numbers from both.


It took me a while to get really reproducible numbers on Intel. Most importantly:
* Set a fixed CPU frequency, disabling any boost and avoiding any
   thermal throttling.
* Pin the test to CPUs and set a nice level.
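
For reference, that kind of pinning is typically along these lines (illustrative
commands only; the governor path is an intel_pstate example and the benchmark
name is a placeholder):

    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    sudo taskset -c 0 nice -n -20 ./pte-bench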


I'm already pinning the test to cpu 0. But for M2, at least, I'm running in a VM
on top of macos, and I don't have a mechanism to pin the QEMU threads to the
physical CPUs. Anyway, I don't think these are problems because for a given
kernel build I can accurately repro numbers.


Oh, you do have a layer of virtualization in there. I *suspect* that 
might amplify some odd things regarding code layout, caching effects, etc.


I guess especially the fork() benchmark is too sensitive (fast) for 
things like that, so I would just focus on bare metal results where you 
can control the environment completely.


Note that regarding NUMA effects, I mean when some memory access within 
the same socket is faster/slower even with only a single node. On AMD 
EPYC that's possible, depending on which core you are running and on 
which memory controller the memory you want to access is located. If 
both are in different quadrants IIUC, the access latency will be different.



But yes: I was observing something similar on AMD EPYC, where you get
consecutive pages from the buddy, but once you allocate from the PCP it might no
longer be consecutive.


   - test is 5-10% slower when output is printed to terminal vs when redirected 
to
     file. I've always effectively been redirecting. Not sure if this overhead
     could start to dominate the regression and that's why you don't see it?


That's weird, because we don't print while measuring? Anyhow, 5/10% variance on
some system is not the end of the world.


I imagine its cache effects? More work to do to print the output could be
evicting some code that's in the benchmark path?


Maybe. Do you also see these oddities on the bare metal system?







I'm inclined to run this test for the last N kernel releases and if the number
moves around significantly, conclude that these tests don't really matter.
Otherwise its an exercise in randomly refactoring code until it works well, but
that's just overfitting to the compiler and hw. What do you think?


Personally, I wouldn't lose sleep if you see weird, unexplainable behavior on
some system (not even architecture!). Trying to optimize for that would indeed
be random refactorings.

But I would not be so fast to say that "these tests don't really matter" and
then go wild and degrade them as much as you want. There are use cases that care
about fork performance especially with order-0 pages -- such as Redis.


Indeed. But also remember that my fork baseline time is ~2.5ms, and I think you
said yours was 14ms :)


Yes, no idea why M2 is that fast (BTW, which page size? 4k or 16k? ) :)



I'll continue to mess around with it until the end of the day. But I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on 
that.


You should likely not focus on M2 results. Just pick a representative 
bare metal machine where you get consistent, explainable results.


Nothing in the code is fine-tuned for a particular architecture so far, 
only order-0 handling is kept separate.


BTW: I see the exact same speedups for dontneed that I see for munmap. 
For example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. 
So I'm curious why you see a speedup for munmap but not for dontneed.


--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 12:56, David Hildenbrand wrote:
> On 31.01.24 13:37, Ryan Roberts wrote:
>> On 31/01/2024 11:49, Ryan Roberts wrote:
>>> On 31/01/2024 11:28, David Hildenbrand wrote:
 On 31.01.24 12:16, Ryan Roberts wrote:
> On 31/01/2024 11:06, David Hildenbrand wrote:
>> On 31.01.24 11:43, Ryan Roberts wrote:
>>> On 29/01/2024 12:46, David Hildenbrand wrote:
 Now that the rmap overhaul[1] is upstream that provides a clean 
 interface
 for rmap batching, let's implement PTE batching during fork when 
 processing
 PTE-mapped THPs.

 This series is partially based on Ryan's previous work[2] to implement
 cont-pte support on arm64, but its a complete rewrite based on [1] to
 optimize all architectures independent of any such PTE bits, and to
 use the new rmap batching functions that simplify the code and prepare
 for further rmap accounting changes.

 We collect consecutive PTEs that map consecutive pages of the same 
 large
 folio, making sure that the other PTE bits are compatible, and (a) 
 adjust
 the refcount only once per batch, (b) call rmap handling functions only
 once per batch and (c) perform batch PTE setting/updates.

 While this series should be beneficial for adding cont-pte support on
 ARM64[2], it's one of the requirements for maintaining a total 
 mapcount[3]
 for large folios with minimal added overhead and further changes[4] 
 that
 build up on top of the total mapcount.

 Independent of all that, this series results in a speedup during fork 
 with
 PTE-mapped THP, which is the default with THPs that are smaller than a 
 PMD
 (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

 On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped 
 folios
 of the same size (stddev < 1%) results in the following runtimes
 for fork() (shorter is better):

 Folio Size | v6.8-rc1 |  New | Change
 --
  4KiB | 0.014328 | 0.014035 |   - 2%
     16KiB | 0.014263 | 0.01196  |   -16%
     32KiB | 0.014334 | 0.01094  |   -24%
     64KiB | 0.014046 | 0.010444 |   -26%
    128KiB | 0.014011 | 0.010063 |   -28%
    256KiB | 0.013993 | 0.009938 |   -29%
    512KiB | 0.013983 | 0.00985  |   -30%
   1024KiB | 0.013986 | 0.00982  |   -30%
   2048KiB | 0.014305 | 0.010076 |   -30%
>>>
>>> Just a heads up that I'm seeing some strange results on Apple M2. Fork 
>>> for
>>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was 
>>> pretty
>>> sure I
>>> didn't see this problem with version 1; although that was on a different
>>> baseline and I've thrown the numbers away so will rerun and try to debug
>>> this.
>>>
>>> Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
>>> mm-unstable base as v3 of the series (first 2 rows are from what I just 
>>> posted
>>> for context):
>>>
>>> | kernel |   mean_rel |   std_rel |
>>> |:---|---:|--:|
>>> | mm-unstable (base) |   0.0% |  1.1% |
>>> | mm-unstable + v3   |  16.7% |  0.8% |
>>> | mm-unstable + v1   |  -2.5% |  1.7% |
>>> | v6.8-rc1 + v1  |  -6.6% |  1.1% |
>>>
>>> So all looks good with v1. And seems to suggest mm-unstable has regressed 
>>> by ~4%
>>> vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
>>> fork() syscall really matter? Evidence suggests its moving all over the 
>>> place -
>>> breath on the code and it changes - not a great place to be when using the 
>>> test
>>> for gating purposes!
>>>
>>> Still with the old tests - I'll move to the new ones now.
>>>
>>>
>>>
>>
>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe 
>> this.
>> fork() for order-0 was consistently effectively unchanged. Do you observe
>> that
>> on other ARM systems as well?
>
> Nope; running the exact same kernel binary and user space on Altra, I see
> sensible numbers;
>
> fork order-0: -1.3%
> fork order-9: -7.6%
> dontneed order-0: -0.5%
> dontneed order-9: 0.1%
> munmap order-0: 0.0%
> munmap order-9: -67.9%
>
> So I guess some pipelining issue that causes the M2 to stall more?

 With one effective added folio_test_large(), it could only be a code layout
 problem? Or the compiler does something stupid, but you say that you run 
 the
 exact same kernel binary, so that doesn't make sense.
>>>
>>> Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
>>> difference. So could easily 

[PATCH] papr_vpd.c: calling devfd before get_system_loc_code

2024-01-31 Thread R Nageswara Sastry
Calling get_system_loc_code() before checking devfd and errno fails the test
when the device is not available, where a SKIP is expected.
Change the order of the 'SKIP_IF_MSG' checks so the test correctly SKIPs when
the /dev/papr-vpd device is not available.

without patch: Test FAILED on line 271
with patch: [SKIP] Test skipped on line 266: /dev/papr-vpd not present

Signed-off-by: R Nageswara Sastry 
---
 tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
index 98cbb9109ee6..505294da1b9f 100644
--- a/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
+++ b/tools/testing/selftests/powerpc/papr_vpd/papr_vpd.c
@@ -263,10 +263,10 @@ static int papr_vpd_system_loc_code(void)
off_t size;
int fd;
 
-   SKIP_IF_MSG(get_system_loc_code(),
-   "Cannot determine system location code");
SKIP_IF_MSG(devfd < 0 && errno == ENOENT,
DEVPATH " not present");
+   SKIP_IF_MSG(get_system_loc_code(),
+   "Cannot determine system location code");
 
FAIL_IF(devfd < 0);
 
-- 
2.37.1 (Apple Git-137.1)



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand


I'm also surprised about the dontneed vs. munmap numbers.


You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.



I would expect that dontneed would similarly benefit -- same code path. 
But I focused on munmap measurements for now, I'll try finding time to 
confirm that it's the same on Intel.


--
Cheers,

David / dhildenb
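
For anyone wanting to poke at the dontneed-vs-munmap question without the
full harness, a minimal userspace sketch along the following lines -- not
the benchmark used in this thread -- times both paths over the same
populated region. Folio-size control (MADV_HUGEPAGE, the mTHP sysfs knobs)
is deliberately left out here.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double seconds_since(const struct timespec *t0)
{
	struct timespec t1;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) / 1e9;
}

int main(void)
{
	const size_t size = 512UL << 20;	/* 512 MiB of anonymous memory */
	struct timespec t0;
	char *mem;

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;

	memset(mem, 1, size);			/* fault everything in */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	madvise(mem, size, MADV_DONTNEED);
	printf("dontneed: %f s\n", seconds_since(&t0));

	memset(mem, 1, size);			/* repopulate before unmapping */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	munmap(mem, size);
	printf("munmap:   %f s\n", seconds_since(&t0));
	return 0;
}

Both calls end up in the same zap path, so on a kernel with the batching
applied the per-page cost of the two should track each other for the same
folio size.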



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 13:37, Ryan Roberts wrote:

On 31/01/2024 11:49, Ryan Roberts wrote:

On 31/01/2024 11:28, David Hildenbrand wrote:

On 31.01.24 12:16, Ryan Roberts wrote:

On 31/01/2024 11:06, David Hildenbrand wrote:

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but its a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
--
     4KiB | 0.014328 | 0.014035 |   - 2%
    16KiB | 0.014263 | 0.01196  |   -16%
    32KiB | 0.014334 | 0.01094  |   -24%
    64KiB | 0.014046 | 0.010444 |   -26%
   128KiB | 0.014011 | 0.010063 |   -28%
   256KiB | 0.013993 | 0.009938 |   -29%
   512KiB | 0.013983 | 0.00985  |   -30%
  1024KiB | 0.013986 | 0.00982  |   -30%
  2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.


Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
mm-unstable base as v3 of the series (first 2 rows are from what I just posted
for context):

| kernel |   mean_rel |   std_rel |
|:---|---:|--:|
| mm-unstable (base) |   0.0% |  1.1% |
| mm-unstable + v3   |  16.7% |  0.8% |
| mm-unstable + v1   |  -2.5% |  1.7% |
| v6.8-rc1 + v1  |  -6.6% |  1.1% |

So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
fork() syscall really matter? Evidence suggests it's moving all over the place -
breathe on the code and it changes - not a great place to be when using the test
for gating purposes!

Still with the old tests - I'll move to the new ones now.






So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
fork() for order-0 was consistently effectively unchanged. Do you observe that
on other ARM systems as well?


Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?


With one effective added folio_test_large(), it could only be a code layout
problem? Or the compiler does something stupid, but you say that you run the
exact same kernel binary, so that doesn't make sense.


Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
difference. So could easily be code layout, branch prediction, etc...



I'm also surprised about the dontneed vs. munmap numbers.


You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.

  Doesn't make any sense

(again, there was this VMA merging problem but it would still allow for batching
within a single VMA that spans exactly one large folio).

What are you using as baseline? Really just mm-unstable vs. mm-unstable+patches?


yes. except for "v6.8-rc1 + v1" above.



Let's see if the new test changes the numbers you measure.


Nope: looks the same. I've taken my test harness out of the picture and done
everything manually from the ground up, with the old tests and the new. Headline
is that 

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:49, Ryan Roberts wrote:
> On 31/01/2024 11:28, David Hildenbrand wrote:
>> On 31.01.24 12:16, Ryan Roberts wrote:
>>> On 31/01/2024 11:06, David Hildenbrand wrote:
 On 31.01.24 11:43, Ryan Roberts wrote:
> On 29/01/2024 12:46, David Hildenbrand wrote:
>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>> for rmap batching, let's implement PTE batching during fork when 
>> processing
>> PTE-mapped THPs.
>>
>> This series is partially based on Ryan's previous work[2] to implement
>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>> optimize all architectures independent of any such PTE bits, and to
>> use the new rmap batching functions that simplify the code and prepare
>> for further rmap accounting changes.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch and (c) perform batch PTE setting/updates.
>>
>> While this series should be beneficial for adding cont-pte support on
>> ARM64[2], it's one of the requirements for maintaining a total 
>> mapcount[3]
>> for large folios with minimal added overhead and further changes[4] that
>> build up on top of the total mapcount.
>>
>> Independent of all that, this series results in a speedup during fork 
>> with
>> PTE-mapped THP, which is the default with THPs that are smaller than a 
>> PMD
>> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>>
>> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped 
>> folios
>> of the same size (stddev < 1%) results in the following runtimes
>> for fork() (shorter is better):
>>
>> Folio Size | v6.8-rc1 |  New | Change
>> --
>>     4KiB | 0.014328 | 0.014035 |   - 2%
>>    16KiB | 0.014263 | 0.01196  |   -16%
>>    32KiB | 0.014334 | 0.01094  |   -24%
>>    64KiB | 0.014046 | 0.010444 |   -26%
>>   128KiB | 0.014011 | 0.010063 |   -28%
>>   256KiB | 0.013993 | 0.009938 |   -29%
>>   512KiB | 0.013983 | 0.00985  |   -30%
>>  1024KiB | 0.013986 | 0.00982  |   -30%
>>  2048KiB | 0.014305 | 0.010076 |   -30%
>
> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
> sure I
> didn't see this problem with version 1; although that was on a different
> baseline and I've thrown the numbers away so will rerun and try to debug 
> this.
> 
> Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
> mm-unstable base as v3 of the series (first 2 rows are from what I just posted
> for context):
> 
> | kernel |   mean_rel |   std_rel |
> |:---|---:|--:|
> | mm-unstable (base) |   0.0% |  1.1% |
> | mm-unstable + v3   |  16.7% |  0.8% |
> | mm-unstable + v1   |  -2.5% |  1.7% |
> | v6.8-rc1 + v1  |  -6.6% |  1.1% |
> 
> So all looks good with v1. And seems to suggest mm-unstable has regressed by 
> ~4%
> vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
> fork() syscall really matter? Evidence suggests it's moving all over the place 
> -
> breathe on the code and it changes - not a great place to be when using the 
> test
> for gating purposes!
> 
> Still with the old tests - I'll move to the new ones now.
> 
> 
>

 So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
 fork() for order-0 was consistently effectively unchanged. Do you observe 
 that
 on other ARM systems as well?
>>>
>>> Nope; running the exact same kernel binary and user space on Altra, I see
>>> sensible numbers;
>>>
>>> fork order-0: -1.3%
>>> fork order-9: -7.6%
>>> dontneed order-0: -0.5%
>>> dontneed order-9: 0.1%
>>> munmap order-0: 0.0%
>>> munmap order-9: -67.9%
>>>
>>> So I guess some pipelining issue that causes the M2 to stall more?
>>
>> With one effective added folio_test_large(), it could only be a code layout
>> problem? Or the compiler does something stupid, but you say that you run the
>> exact same kernel binary, so that doesn't make sense.
> 
> Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
> difference. So could easily be code layout, branch prediction, etc...
> 
>>
>> I'm also surprised about the dontneed vs. munmap numbers.
> 
> You mean the ones for Altra that I posted? (I didn't post any for M2). The 
> altra
> numbers look ok to me; dontneed has no change, and munmap has no change for
> order-0 and is massively improved for order-9.
> 
>  Doesn't make any sense
>> (again, there was 

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-31 Thread Christophe Leroy
Hi,

On 30/01/2024 18:48, Marek Szyprowski wrote:
> [You don't often get email from m.szyprow...@samsung.com. 
> Learn why this is important at 
> https://aka.ms/LearnAboutSenderIdentification ]
> 
> Dear All,
> 
> On 30.01.2024 12:03, Christophe Leroy wrote:
>> On 30/01/2024 10:16, Chen-Yu Tsai wrote:
>>> [You don't often get email from we...@chromium.org. Learn why this is 
>>> important at https://aka.ms/LearnAboutSenderIdentification 
>>> ]
>>>
>>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
 On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
>
> Signed-off-by: Christophe Leroy 
 Very nice cleanup, thanks!, applied and pushed

  Luis
>>> On next-20240130, which has your modules-next branch, and thus this
>>> series and the other "module: Use set_memory_rox()" series applied,
>>> my kernel crashes in some very weird way. Reverting your branch
>>> makes the crash go away.
>>>
>>> I thought I'd report it right away. Maybe you folks would know what's
>>> happening here? This is on arm64.
>> That's strange, it seems to bug in module_bug_finalize() which is
>> _before_ calls to module_enable_ro() and such.
>>
>> Can you try to revert the 6 patches one by one to see which one
>> introduces the problem ?
>>
>> In reality, only patch 677bfb9db8a3 really change things. Other ones are
>> more on less only cleanup.
> 
> I've also run into this issue with today's (20240130) linux-next on my
> test farm. The issue is not fully reproducible, so it was a bit hard to
> bisect it automatically. I've spent some time on manual testing and it
> looks that reverting the following 2 commits on top of linux-next fixes
> the problem:
> 
> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
> rodata_enabled")
> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
> 
> This in fact means that commit 677bfb9db8a3 is responsible for this
> regression, as 65929884f868 has to be reverted only because the latter
> depends on it. Let me know what I can do to help debugging this issue.
> 

Thanks for the bisect. I suspect you hit one of the errors and something 
goes wrong in the error path.

To confirm this assumption, could you try with the following change on 
top of everything ?

diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a14df9655dbe..fdf8484154dd 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -15,9 +15,12 @@ static int module_set_memory(const struct module 
*mod, enum mod_mem_type type,
  int (*set_memory)(unsigned long start, int 
num_pages))
  {
const struct module_memory *mod_mem = &mod->mem[type];
+   int err;

set_vm_flush_reset_perms(mod_mem->base);
-   return set_memory((unsigned long)mod_mem->base, mod_mem->size >> 
PAGE_SHIFT);
+   err = set_memory((unsigned long)mod_mem->base, mod_mem->size >> 
PAGE_SHIFT);
+   WARN(err, "module_set_memory(%d, %px, %x) returned %d\n", type, 
mod_mem->base, mod_mem->size, err);
+   return err;
  }

  /*


Thanks for your help
Christophe


Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:28, David Hildenbrand wrote:
> On 31.01.24 12:16, Ryan Roberts wrote:
>> On 31/01/2024 11:06, David Hildenbrand wrote:
>>> On 31.01.24 11:43, Ryan Roberts wrote:
 On 29/01/2024 12:46, David Hildenbrand wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when 
> processing
> PTE-mapped THPs.
>
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but its a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
>
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch and (c) perform batch PTE setting/updates.
>
> While this series should be beneficial for adding cont-pte support on
> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
> for large folios with minimal added overhead and further changes[4] that
> build up on top of the total mapcount.
>
> Independent of all that, this series results in a speedup during fork with
> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>
> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
> of the same size (stddev < 1%) results in the following runtimes
> for fork() (shorter is better):
>
> Folio Size | v6.8-rc1 |  New | Change
> --
>     4KiB | 0.014328 | 0.014035 |   - 2%
>    16KiB | 0.014263 | 0.01196  |   -16%
>    32KiB | 0.014334 | 0.01094  |   -24%
>    64KiB | 0.014046 | 0.010444 |   -26%
>   128KiB | 0.014011 | 0.010063 |   -28%
>   256KiB | 0.013993 | 0.009938 |   -29%
>   512KiB | 0.013983 | 0.00985  |   -30%
>  1024KiB | 0.013986 | 0.00982  |   -30%
>  2048KiB | 0.014305 | 0.010076 |   -30%

 Just a heads up that I'm seeing some strange results on Apple M2. Fork for
 order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
 sure I
 didn't see this problem with version 1; although that was on a different
 baseline and I've thrown the numbers away so will rerun and try to debug 
 this.

Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
mm-unstable base as v3 of the series (first 2 rows are from what I just posted
for context):

| kernel |   mean_rel |   std_rel |
|:---|---:|--:|
| mm-unstable (base) |   0.0% |  1.1% |
| mm-unstable + v3   |  16.7% |  0.8% |
| mm-unstable + v1   |  -2.5% |  1.7% |
| v6.8-rc1 + v1  |  -6.6% |  1.1% |

So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
fork() syscall really matter? Evidence suggests it's moving all over the place -
breathe on the code and it changes - not a great place to be when using the test
for gating purposes!

Still with the old tests - I'll move to the new ones now.



>>>
>>> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
>>> fork() for order-0 was consistently effectively unchanged. Do you observe 
>>> that
>>> on other ARM systems as well?
>>
>> Nope; running the exact same kernel binary and user space on Altra, I see
>> sensible numbers;
>>
>> fork order-0: -1.3%
>> fork order-9: -7.6%
>> dontneed order-0: -0.5%
>> dontneed order-9: 0.1%
>> munmap order-0: 0.0%
>> munmap order-9: -67.9%
>>
>> So I guess some pipelining issue that causes the M2 to stall more?
> 
> With one effective added folio_test_large(), it could only be a code layout
> problem? Or the compiler does something stupid, but you say that you run the
> exact same kernel binary, so that doesn't make sense.

Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
difference. So could easily be code layout, branch prediction, etc...

> 
> I'm also surprised about the dontneed vs. munmap numbers.

You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.

 Doesn't make any sense
> (again, there was this VMA merging problem but it would still allow for 
> batching
> within a single VMA that spans exactly one large folio).
> 
> What are you using as baseline? Really just mm-unstable vs. 
> mm-unstable+patches?

yes. except 

BUG: KASAN: vmalloc-out-of-bounds in memset32 (bpf_prog_pack_free)

2024-01-31 Thread Christophe Leroy
Hi,

Got the following BUG while loading module test_bpf.ko

No time to investigate for now.

root@vgoip:~# insmod test_bpf.ko
[  263.409030] 
==
[  263.416415] BUG: KASAN: vmalloc-out-of-bounds in memset32+0x5c/0xa0
[  263.422952] Write of size 4 at addr c9000e40 by task kworker/0:0/7
[  263.429322]
[  263.430816] CPU: 0 PID: 7 Comm: kworker/0:0 Not tainted 
6.8.0-rc1-s3k-dev-02364-gc626219462a6-dirty #606
[  263.440580] Hardware name: MIAE 8xx 0x50 CMPC885
[  263.445658] Workqueue: events bpf_prog_free_deferred
[  263.450973] Call Trace:
[  263.453411] [c905bd00] [c0c114e8] dump_stack_lvl+0x34/0x50 (unreliable)
[  263.460744] [c905bd20] [c026b9d4] print_report+0x174/0x608
[  263.466853] [c905bd70] [c026c01c] kasan_report+0xc0/0x130
[  263.472788] [c905bdd0] [c0c43cb0] memset32+0x5c/0xa0
[  263.478198] [c905bdf0] [c0030690] patch_instructions+0x70/0x17c
[  263.484656] [c905be30] [c00389b0] bpf_arch_text_invalidate+0xa8/0x120
[  263.491638] [c905be90] [c018e63c] bpf_prog_pack_free+0xec/0x24c
[  263.498096] [c905bec0] [c018ea34] bpf_jit_binary_pack_free+0x3c/0x80
[  263.504991] [c905bee0] [c0038ae8] bpf_jit_free+0xc0/0x128
[  263.510925] [c905bf00] [c00790f8] process_one_work+0x310/0x6e8
[  263.517209] [c905bf50] [c0079b2c] worker_thread+0x65c/0x868
[  263.523232] [c905bfc0] [c0084b78] kthread+0x17c/0x1ac
[  263.528817] [c905bff0] [c00181fc] start_kernel_thread+0x10/0x14
[  263.535279]
[  263.536782] The buggy address belongs to the virtual mapping at
[  263.536782]  [c900, c9008000) created by:
[  263.536782]  text_area_cpu_up+0x28/0x1d4
[  263.551418]
[  263.552902] The buggy address belongs to the physical page:
[  263.560228]
[  263.561713] Memory state around the buggy address:
[  263.566585]  c9000d00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[  263.573307]  c9000d80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[  263.580027] >c9000e00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[  263.586677]^
[  263.591370]  c9000e80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[  263.598093]  c9000f00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[  263.604743] 
==

Christophe



Re: [PATCH] perf/pmu-events/powerpc: Update json mapfile with Power11 PVR

2024-01-31 Thread kajoljain
Patch looks fine to me.

Reviewed-by: Kajol Jain 

Thanks,
Kajol Jain

On 1/29/24 17:38, Madhavan Srinivasan wrote:
> Update the Power11 PVR to json mapfile to enable
> json events. Power11 is PowerISA v3.1 compliant
> and support Power10 events.
> 
> Signed-off-by: Madhavan Srinivasan 
> ---
>  tools/perf/pmu-events/arch/powerpc/mapfile.csv | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/perf/pmu-events/arch/powerpc/mapfile.csv 
> b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
> index 599a588dbeb4..4d5e9138d4cc 100644
> --- a/tools/perf/pmu-events/arch/powerpc/mapfile.csv
> +++ b/tools/perf/pmu-events/arch/powerpc/mapfile.csv
> @@ -15,3 +15,4 @@
>  0x0066[[:xdigit:]]{4},1,power8,core
>  0x004e[[:xdigit:]]{4},1,power9,core
>  0x0080[[:xdigit:]]{4},1,power10,core
> +0x0082[[:xdigit:]]{4},1,power10,core


[PATCH] powerpc/hv-gpci: Fix the hcall return value checks in single_gpci_request function

2024-01-31 Thread Kajol Jain
Running event 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
on one of the systems throws the below error:

 ---Logs---
 # perf list | grep 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles
  
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=?/[Kernel
 PMU event]


 # perf stat -v -e 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
 sleep 2
Using CPUID 00800200
Control descriptor is not initialized
Warning:
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
 event is not supported by the kernel.
failed to read counter 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/

 Performance counter stats for 'system wide':

 
 <not supported>  hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/

   2.000700771 seconds time elapsed

The above error is because of the hcall failure, as the required
permission "Enable Performance Information Collection" is not set.
Based on the current code, the single_gpci_request function does not check
the error type in case the hcall fails and by default returns EINVAL. But
there can be other reasons for hcall failures, like H_AUTHORITY/H_PARAMETER,
which need to be handled accordingly.
Fix this issue by adding new checks in the single_gpci_request function.

Result after fix patch changes:

 # perf stat -e 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
 sleep 2
Error:
No permission to enable 
hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
 event.

Fixes: 220a0c609ad1 ("powerpc/perf: Add support for the hv gpci (get 
performance counter info) interface")
Reported-by: Akanksha J N 
Signed-off-by: Kajol Jain 
---
 arch/powerpc/perf/hv-gpci.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 27f18119fda1..101060facd81 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -695,7 +695,17 @@ static unsigned long single_gpci_request(u32 req, u32 
starting_index,
 
ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
virt_to_phys(arg), HGPCI_REQ_BUFFER_SIZE);
-   if (ret) {
+
+   /*
+* ret value as 'H_PARAMETER' corresponds to 'GEN_BUF_TOO_SMALL',
+* which means that the current buffer size cannot accommodate
+* all the information and a partial buffer is returned.
+* Since in this function we are only accessing data for a given 
starting index,
+* we don't need to accommodate whole data and can get required count by
+* accessing very first entry.
+* Hence the hcall fails only in case the ret value is other than H_SUCCESS 
or H_PARAMETER.
+*/
+   if (ret && (ret != H_PARAMETER)) {
pr_devel("hcall failed: 0x%lx\n", ret);
goto out;
}
@@ -724,7 +734,7 @@ static u64 h_gpci_get_value(struct perf_event *event)
event_get_offset(event),
event_get_length(event),
&count);
-   if (ret)
+   if (ret && (ret != H_PARAMETER))
return 0;
return count;
 }
@@ -759,6 +769,7 @@ static int h_gpci_event_init(struct perf_event *event)
 {
u64 count;
u8 length;
+   unsigned long ret;
 
/* Not our event */
if (event->attr.type != event->pmu->type)
@@ -789,13 +800,23 @@ static int h_gpci_event_init(struct perf_event *event)
}
 
/* check if the request works... */
-   if (single_gpci_request(event_get_request(event),
+   ret = single_gpci_request(event_get_request(event),
event_get_starting_index(event),
event_get_secondary_index(event),
event_get_counter_info_version(event),
event_get_offset(event),
length,
-   &count)) {
+   &count);
+
+   /*
+* ret value as H_AUTHORITY implies that the partition is not permitted to 
retrieve
+* performance information, and is required to set the
+* "Enable Performance Information Collection" option.
+*/
+   if (ret == H_AUTHORITY)
+   return -EPERM;
+
+   if (ret && (ret != H_PARAMETER)) {
pr_devel("gpci hcall failed\n");
return -EINVAL;
}
-- 
2.43.0



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 12:16, Ryan Roberts wrote:

On 31/01/2024 11:06, David Hildenbrand wrote:

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but its a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
--
    4KiB | 0.014328 | 0.014035 |   - 2%
   16KiB | 0.014263 | 0.01196  |   -16%
   32KiB | 0.014334 | 0.01094  |   -24%
   64KiB | 0.014046 | 0.010444 |   -26%
  128KiB | 0.014011 | 0.010063 |   -28%
  256KiB | 0.013993 | 0.009938 |   -29%
  512KiB | 0.013983 | 0.00985  |   -30%
     1024KiB | 0.013986 | 0.00982  |   -30%
     2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.



So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
fork() for order-0 was consistently effectively unchanged. Do you observe that
on other ARM systems as well?


Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?


With one effective added folio_test_large(), it could only be a code 
layout problem? Or the compiler does something stupid, but you say that 
you run the exact same kernel binary, so that doesn't make sense.


I'm also surprised about the dontneed vs. munmap numbers. Doesn't make 
any sense (again, there was this VMA merging problem but it would still 
allow for batching within a single VMA that spans exactly one large folio).


What are you using as baseline? Really just mm-unstable vs. 
mm-unstable+patches?


Let's see if the new test changes the numbers you measure.

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:06, David Hildenbrand wrote:
> On 31.01.24 11:43, Ryan Roberts wrote:
>> On 29/01/2024 12:46, David Hildenbrand wrote:
>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>> for rmap batching, let's implement PTE batching during fork when processing
>>> PTE-mapped THPs.
>>>
>>> This series is partially based on Ryan's previous work[2] to implement
>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>> optimize all architectures independent of any such PTE bits, and to
>>> use the new rmap batching functions that simplify the code and prepare
>>> for further rmap accounting changes.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch and (c) perform batch PTE setting/updates.
>>>
>>> While this series should be beneficial for adding cont-pte support on
>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>> for large folios with minimal added overhead and further changes[4] that
>>> build up on top of the total mapcount.
>>>
>>> Independent of all that, this series results in a speedup during fork with
>>> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
>>> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
>>>
>>> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
>>> of the same size (stddev < 1%) results in the following runtimes
>>> for fork() (shorter is better):
>>>
>>> Folio Size | v6.8-rc1 |  New | Change
>>> --
>>>    4KiB | 0.014328 | 0.014035 |   - 2%
>>>   16KiB | 0.014263 | 0.01196  |   -16%
>>>   32KiB | 0.014334 | 0.01094  |   -24%
>>>   64KiB | 0.014046 | 0.010444 |   -26%
>>>  128KiB | 0.014011 | 0.010063 |   -28%
>>>  256KiB | 0.013993 | 0.009938 |   -29%
>>>  512KiB | 0.013983 | 0.00985  |   -30%
>>>     1024KiB | 0.013986 | 0.00982  |   -30%
>>>     2048KiB | 0.014305 | 0.010076 |   -30%
>>
>> Just a heads up that I'm seeing some strange results on Apple M2. Fork for
>> order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty 
>> sure I
>> didn't see this problem with version 1; although that was on a different
>> baseline and I've thrown the numbers away so will rerun and try to debug 
>> this.
>>
> 
> So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
> fork() for order-0 was consistently effectively unchanged. Do you observe that
> on other ARM systems as well?

Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?

> 
> 
>> | kernel  |   mean_rel |   std_rel |
>> |:|---:|--:|
>> | mm-unstable |   0.0% |  1.1% |
>> | patch 1 |  -2.3% |  1.3% |
>> | patch 10    |  -2.9% |  2.7% |
>> | patch 11    |  13.5% |  0.5% |
>> | patch 12    |  15.2% |  1.2% |
>> | patch 13    |  18.2% |  0.7% |
>> | patch 14    |  20.5% |  1.0% |
>> | patch 15    |  17.1% |  1.6% |
>> | patch 15    |  16.7% |  0.8% |
>>
>> fork for order-9 is looking good (-20%), and for the zap series, munmap is
>> looking good, but dontneed is looking poor for both order-0 and 9. But one 
>> thing
>> at a time... let's concentrate on fork order-0 first.
> 
> munmap and dontneed end up calling the exact same call paths. So a big
> performance difference is rather surprising and might indicate something else.
> 
> (I think I told you that I was running in some kind of VMA merging problem 
> where
> one would suddenly get with my benchmark 1 VMA per page. The new benchmark 
> below
> works around that, but I am not sure if that was fixed in the meantime)
> 
> VMA merging can of course explain a big difference in fork and munmap vs.
> dontneed times, especially when comparing different code base where that VMA
> merging behavior was different.
> 
>>
>> Note that I'm still using the "old" benchmark code. Could you resend me the 
>> link
>> to the new code? Although I don't think there should be any effect for 
>> order-0
>> anyway, if I understood your changes correctly?
> 
> This is the combined one (small and large PTEs):
> 
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?inline=false

I'll have a go with this.

> 
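
For reproducing just the order-0 fork numbers without the full harness, a
minimal sketch of the measurement itself could look like the following --
this is not the linked benchmark, which additionally controls which folio
sizes back the mapping; here it is only the plain fork() timing:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const size_t size = 1UL << 30;		/* 1 GiB of anonymous memory */
	struct timespec t0, t1;
	pid_t pid;
	char *mem;

	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 1, size);			/* fault everything in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pid = fork();
	if (pid == 0)
		_exit(0);			/* child: page tables were copied, leave */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	waitpid(pid, NULL, 0);

	printf("fork: %f s\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	munmap(mem, size);
	return 0;
}

Most of the measured time is spent copying the page tables of the
populated region, which is exactly the loop this series batches.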



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

-    folio_remove_rmap_pte(folio, page, vma);
+    folio_remove_rmap_ptes(folio, page, nr, vma);
+
+    /* Only sanity-check the first page in a batch. */
   if (unlikely(page_mapcount(page) < 0))
   print_bad_pte(vma, addr, ptent, page);


Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.



I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios, we
still perform the exact same checks. Only when batching we don't. So if there is
some problem, there are ways to get it triggered. And these problems are barely
ever seen.

folio_remove_rmap_ptes() feels like the better place -- especially because the
delayed-rmap handling is effectively unchecked. But in there, we cannot
"print_bad_pte()".

[background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
here that the total mapcount does not underflow, instead of checking 
per-subpage]


All good points... perhaps extend the comment to describe how this could be
solved in future with cheap total_mapcount()? Or in the commit log if you 
prefer?


I'll add more meat to the cover letter, thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but its a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
--
   4KiB | 0.014328 | 0.014035 |   - 2%
  16KiB | 0.014263 | 0.01196  |   -16%
  32KiB | 0.014334 | 0.01094  |   -24%
  64KiB | 0.014046 | 0.010444 |   -26%
 128KiB | 0.014011 | 0.010063 |   -28%
 256KiB | 0.013993 | 0.009938 |   -29%
 512KiB | 0.013983 | 0.00985  |   -30%
1024KiB | 0.013986 | 0.00982  |   -30%
2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.



So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe 
this. fork() for order-0 was consistently effectively unchanged. Do you 
observe that on other ARM systems as well?




| kernel  |   mean_rel |   std_rel |
|:|---:|--:|
| mm-unstable |   0.0% |  1.1% |
| patch 1 |  -2.3% |  1.3% |
| patch 10|  -2.9% |  2.7% |
| patch 11|  13.5% |  0.5% |
| patch 12|  15.2% |  1.2% |
| patch 13|  18.2% |  0.7% |
| patch 14|  20.5% |  1.0% |
| patch 15|  17.1% |  1.6% |
| patch 15|  16.7% |  0.8% |

fork for order-9 is looking good (-20%), and for the zap series, munmap is
looking good, but dontneed is looking poor for both order-0 and 9. But one thing
at a time... let's concentrate on fork order-0 first.


munmap and dontneed end up calling the exact same call paths. So a big 
performance difference is rather surprising and might indicate something 
else.


(I think I told you that I was running in some kind of VMA merging 
problem where one would suddenly get with my benchmark 1 VMA per page. 
The new benchmark below works around that, but I am not sure if that was 
fixed in the meantime)


VMA merging can of course explain a big difference in fork and munmap 
vs. dontneed times, especially when comparing different code base where 
that VMA merging behavior was different.




Note that I'm still using the "old" benchmark code. Could you resend me the link
to the new code? Although I don't think there should be any effect for order-0
anyway, if I understood your changes correctly?


This is the combined one (small and large PTEs):

https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?inline=false

--
Cheers,

David / dhildenb



Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] OOps kernel crash after performing dlpar remove test

2024-01-31 Thread Robin Murphy

On 2024-01-31 9:19 am, Tasmiya Nalatwad wrote:

Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] OOps kernel crash after 
performing dlpar remove test


--- Traces ---

[58563.146236] BUG: Unable to handle kernel data access at 
0x6b6b6b6b6b6b6b83

[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio 
dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic xor 
raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io rpaphp 
nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time 
dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache jbd2 
sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc nvmet 
nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 
scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not tainted 
6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 
0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP:  c09c0e60 LR: c09c0e28 CTR: 
c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR:  80009033   CR: 
24822424  XER: 20040006

[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 c1482600 
c3050540
[58563.146360] GPR04:  c0089a6870c0 0001 
fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
0220
[58563.146360] GPR12: 2000 c308  

[58563.146360] GPR16:    
0001
[58563.146360] GPR20: c1281478  c1281490 
c2bfed80
[58563.146360] GPR24: c0089a6870c0   
c2b9ffb8
[58563.146360] GPR28:  c2bac0e8  


[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118


This implies that iommu_device_list has become corrupted. Looks like 
spapr_tce_setup_phb_iommus_initcall() registers an iommu_device which 
pcibios_free_controller() could free if a PCI controller is removed, but 
there's no path anywhere to ever unregister any of those IOMMUs. 
Presumably this also means that if a PCI controller is dynamically added 
after init, its IOMMU won't be set up properly either.


Thanks,
Robin.
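
A rough sketch of the kind of teardown described as missing -- purely
illustrative, not a tested patch; the phb->iommu field name and the exact
place to hook this are assumptions and would need to be checked against
the registration done in spapr_tce_setup_phb_iommus_initcall():

void pcibios_free_controller(struct pci_controller *phb)
{
	/*
	 * Unregister the iommu_device that was registered for this PHB at
	 * init time before the controller structure goes away.  The
	 * "phb->iommu" field name is assumed here for illustration only.
	 */
	iommu_device_unregister(&phb->iommu);

	/* ... existing removal from the hose list and freeing of phb ... */
}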


[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 0xc0007960f560 
(unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] 
__iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] 
iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] 
iommu_bus_notifier+0x4c/0x8c
[58563.146469] [c0007960f600] [c019e3d0] 
notifier_call_chain+0xb8/0x1a0
[58563.146476] [c0007960f660] [c019eea0] 
blocking_notifier_call_chain+0x64/0x94

[58563.146483] [c0007960f6a0] [c09d3c5c] bus_notify+0x50/0x7c
[58563.146491] [c0007960f6e0] [c09cfba4] device_add+0x774/0x9bc
[58563.146498] [c0007960f7a0] [c08abe9c] 
pci_device_add+0x2f4/0x864
[58563.146506] [c0007960f850] [c007d5a0] 
of_create_pci_dev+0x390/0xa08
[58563.146514] [c0007960f930] [c007de68] 
__of_scan_bus+0x250/0x328
[58563.146520] [c0007960fa10] [c007a680] 
pcibios_scan_phb+0x274/0x3c0
[58563.146527] [c0007960fae0] [c0105d58] 
init_phb_dynamic+0xb8/0x110
[58563.146535] [c0007960fb50] [c008217b0380] 
dlpar_add_slot+0x170/0x3b4 [rpadlpar_io]
[58563.146544] [c0007960fbf0] [c008217b0ca0] 
add_slot_store+0xa4/0x140 [rpadlpar_io]
[58563.146551] [c0007960fc80] [c0f3dbec] 
kobj_attr_store+0x30/0x4c
[58563.146559] [c0007960fca0] [c06931fc] 
sysfs_kf_write+0x68/0x7c
[58563.146566] [c0007960fcc0] [c0691b2c] 
kernfs_fop_write_iter+0x1c8/0x278

[58563.146573] [c0007960fd10] [c0599f54] vfs_write+0x340/0x4cc
[58563.146580] [c0007960fdc0] [c059a2bc] ksys_write+0x7c/0x140
[58563.146587] [c0007960fe10] [c0035d74] 
system_call_exception+0x134/0x330
[58563.146595] [c0007960fe50] [c000d6a0] 
system_call_common+0x160/0x2e4

[58563.146602] --- interrupt: c00 at 0x24470cb4
[58563.146606] NIP:  24470cb4 LR: 243e7d04 CTR: 

[58563.146611] REGS: c0007960fe80 TRAP: 0c00   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146616] MSR:  8280f033 
  CR: 24000282  XER: 

[58563.146632] IRQMASK: 0
[58563.146632] GPR00: 

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 29/01/2024 12:46, David Hildenbrand wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when processing
> PTE-mapped THPs.
> 
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but its a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch and (c) perform batch PTE setting/updates.
> 
> While this series should be beneficial for adding cont-pte support on
> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
> for large folios with minimal added overhead and further changes[4] that
> build up on top of the total mapcount.
> 
> Independent of all that, this series results in a speedup during fork with
> PTE-mapped THP, which is the default with THPs that are smaller than a PMD
> (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
> 
> On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
> of the same size (stddev < 1%) results in the following runtimes
> for fork() (shorter is better):
> 
> Folio Size | v6.8-rc1 |  New | Change
> --
>   4KiB | 0.014328 | 0.014035 |   - 2%
>  16KiB | 0.014263 | 0.01196  |   -16%
>  32KiB | 0.014334 | 0.01094  |   -24%
>  64KiB | 0.014046 | 0.010444 |   -26%
> 128KiB | 0.014011 | 0.010063 |   -28%
> 256KiB | 0.013993 | 0.009938 |   -29%
> 512KiB | 0.013983 | 0.00985  |   -30%
>1024KiB | 0.013986 | 0.00982  |   -30%
>2048KiB | 0.014305 | 0.010076 |   -30%

Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.

| kernel  |   mean_rel |   std_rel |
|:|---:|--:|
| mm-unstable |   0.0% |  1.1% |
| patch 1 |  -2.3% |  1.3% |
| patch 10|  -2.9% |  2.7% |
| patch 11|  13.5% |  0.5% |
| patch 12|  15.2% |  1.2% |
| patch 13|  18.2% |  0.7% |
| patch 14|  20.5% |  1.0% |
| patch 15|  17.1% |  1.6% |
| patch 15|  16.7% |  0.8% |

fork for order-9 is looking good (-20%), and for the zap series, munmap is
looking good, but dontneed is looking poor for both order-0 and 9. But one thing
at a time... let's concentrate on fork order-0 first.

Note that I'm still using the "old" benchmark code. Could you resend me the link
to the new code? Although I don't think there should be any effect for order-0
anyway, if I understood your changes correctly?


> 
> Note that these numbers are even better than the ones from v1 (verified
> over multiple reboots), even though there were only minimal code changes.
> Well, I removed a pte_mkclean() call for anon folios, maybe that also
> plays a role.
> 
> But my experience is that fork() is extremely sensitive to code size,
> inlining, ... so I suspect we'll see on other architectures rather a change
> of -20% instead of -30%, and it will be easy to "lose" some of that speedup
> in the future by subtle code changes.
> 
> Next up is PTE batching when unmapping. Only tested on x86-64.
> Compile-tested on most other architectures.
> 
> v2 -> v3:
>  * Rebased on mm-unstable
>  * Picked up RB's
>  * Updated documentation of wrprotect_ptes().
> 
> v1 -> v2:
>  * "arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary"
>   -> Added patch from Ryan
>  * "arm/pgtable: define PFN_PTE_SHIFT"
>   -> Removed the arm64 bits
>  * "mm/pgtable: make pte_next_pfn() independent of set_ptes()"
>  * "arm/mm: use pte_next_pfn() in set_ptes()"
>  * "powerpc/mm: use pte_next_pfn() in set_ptes()"
>   -> Added to use pte_next_pfn() in some arch set_ptes() implementations
>  I tried to make use of pte_next_pfn() also in the others, but it's
>  not trivial because the other archs implement set_ptes() in their
>  asm/pgtable.h. Future work.
>  * "mm/memory: factor out copying the actual PTE in copy_present_pte()"
>   -> Move common folio_get() out of if/else
>  * "mm/memory: optimize fork() with PTE-mapped THP"
>   -> Add doc for wrprotect_ptes
>   -> Extend description to mention handling of pinned folios
>   -> Move common folio_ref_add() out of if/else
>  * "mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()"
>   -> Be more conservative with dirt/soft-dirty, let the caller specify
>   

Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Yin, Fengwei




On 1/31/2024 6:30 PM, David Hildenbrand wrote:

On 31.01.24 03:30, Yin Fengwei wrote:



On 1/29/24 22:32, David Hildenbrand wrote:

+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+    unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+    pte_t pte, tmp_pte;
+
+    pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+    while (--nr) {
+    ptep++;
+    addr += PAGE_SIZE;
+    tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+    if (pte_dirty(tmp_pte))
+    pte = pte_mkdirty(pte);
+    if (pte_young(tmp_pte))
+    pte = pte_mkyoung(pte);
I am wondering whether it's worthy to move the pte_mkdirty() and 
pte_mkyoung()
out of the loop and just do it one time if needed. The worst case is 
that they

are called nr - 1 time. Or it's just too micro?


I also thought about just indicating "any_accessed" or "any_dirty" using 
flags to the caller, to avoid the PTE modifications completely. Felt a 
bit micro-optimized.


Regarding your proposal: I thought about that as well, but my assumption 
was that dirty+young are "cheap" to be set.


On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has 
to handle the saveddirty handling, using some bit trickery.


So at least for pte_mkyoung() there would be no real benefit as far as I 
can see (might be even worse). For pte_mkdirty() there might be a small 
benefit.


Is it going to be measurable? Likely not.

Yeah. We can do more investigation when performance profiling call this
out.


Regards
Yin, Fengwei



Am I missing something?

Thanks!



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:20, Yin Fengwei wrote:

On 1/29/24 22:32, David Hildenbrand wrote:

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.


One possible scenario:
If all the folio is 2M size folio, then one full batch could hold 510M memory.
Is it too much regarding one full batch before just can hold (2M - 4096 * 2)
memory?


Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember 
if init_on_free or init_on_alloc was used in production systems. In 
tlb_batch_pages_flush(), there is a cond_resched() to limit the number 
of entries we process.


So if that is actually problematic, we'd run into a soft-lockup and need 
another cond_resched() [I have some faint recollection that people are 
working on removing cond_resched() completely].


One could do some counting in free_pages_and_swap_cache() (where we 
iterate all entries already) and insert cond_resched+release_pages() for 
every (e.g., 512) pages.


--
Cheers,

David / dhildenb
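
A sketch of that counting idea, shaped after the existing
free_pages_and_swap_cache() loop -- not a tested patch, and it counts
encoded entries rather than pages (with the batching applied, one entry
can stand for many pages), so the threshold is only illustrative:

void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
{
	int i, done = 0, pending = 0;

	lru_add_drain();
	for (i = 0; i < nr; i++) {
		free_swap_cache(encoded_page_ptr(pages[i]));
		if (++pending < 512)
			continue;
		/* Flush a chunk and give the scheduler a chance to run. */
		release_pages(pages + done, pending);
		done += pending;
		pending = 0;
		cond_resched();
	}
	if (pending)
		release_pages(pages + done, pending);
}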



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 10:21, David Hildenbrand wrote:
> 
>>> +
>>> +#ifndef clear_full_ptes
>>> +/**
>>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same 
>>> folio.
>>
>> I know its implied from "pages of the same folio" (and even more so for the
>> above variant due to mention of access/dirty), but I wonder if its useful to
>> explicitly state that "all ptes being cleared are present at the time of the
>> call"?
> 
> "Clear PTEs" -> "Clear present PTEs" ?
> 
> That should make it clearer.

Works for me.

> 
> [...]
> 
>>>   if (!delay_rmap) {
>>> -    folio_remove_rmap_pte(folio, page, vma);
>>> +    folio_remove_rmap_ptes(folio, page, nr, vma);
>>> +
>>> +    /* Only sanity-check the first page in a batch. */
>>>   if (unlikely(page_mapcount(page) < 0))
>>>   print_bad_pte(vma, addr, ptent, page);
>>
>> Is there a case for either removing this all together or moving it into
>> folio_remove_rmap_ptes()? It seems odd to only check some pages.
>>
> 
> I really wanted to avoid another nasty loop here.
> 
> In my thinking, for 4k folios, or when zapping subpages of large folios, we
> still perform the exact same checks. Only when batching we don't. So if there 
> is
> some problem, there are ways to get it triggered. And these problems are 
> barely
> ever seen.
> 
> folio_remove_rmap_ptes() feels like the better place -- especially because the
> delayed-rmap handling is effectively unchecked. But in there, we cannot
> "print_bad_pte()".
> 
> [background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd 
> check
> here that the total mapcount does not underflow, instead of checking 
> per-subpage]

All good points... perhaps extend the comment to describe how this could be
solved in future with cheap total_mapcount()? Or in the commit log if you 
prefer?

> 
>>
>>>   }
>>> -    if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>>> +    if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>>>   *force_flush = true;
>>>   *force_break = true;
>>>   }
>>>   }
>>>   -static inline void zap_present_pte(struct mmu_gather *tlb,
>>> +/*
>>> + * Zap or skip one present PTE, trying to batch-process subsequent PTEs 
>>> that
>>> map
>>
>> Zap or skip *at least* one... ?
> 
> Ack
> 



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:30, Yin Fengwei wrote:



On 1/29/24 22:32, David Hildenbrand wrote:

+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);

I am wondering whether it's worthy to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do it one time if needed. The worst case is that they
are called nr - 1 time. Or it's just too micro?


I also thought about just indicating "any_accessed" or "any_dirty" using 
flags to the caller, to avoid the PTE modifications completely. Felt a 
bit micro-optimized.


Regarding your proposal: I thought about that as well, but my assumption 
was that dirty+young are "cheap" to be set.


On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has 
to handle the saveddirty handling, using some bit trickery.


So at least for pte_mkyoung() there would be no real benefit as far as I 
can see (might be even worse). For pte_mkdirty() there might be a small 
benefit.


Is it going to be measurable? Likely not.

Am I missing something?

Thanks!

--
Cheers,

David / dhildenb
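
For reference, the "set the bits only once" variant discussed above could
look roughly like the following, derived from the get_and_clear_full_ptes()
hunk quoted in this thread -- a sketch, not part of the posted series:

static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
{
	bool any_dirty = false, any_young = false;
	pte_t pte, tmp_pte;

	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		/* Accumulate flags only; modify the returned PTE once below. */
		any_dirty |= pte_dirty(tmp_pte);
		any_young |= pte_young(tmp_pte);
	}
	if (any_dirty)
		pte = pte_mkdirty(pte);
	if (any_young)
		pte = pte_mkyoung(pte);
	return pte;
}

Whether this is measurable is doubtful, as noted above; pte_mkyoung() is
cheap, and only pte_mkdirty() (with its soft-dirty/saveddirty handling)
could plausibly show up.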



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 10:16, David Hildenbrand wrote:
> On 31.01.24 03:20, Yin Fengwei wrote:
>> On 1/29/24 22:32, David Hildenbrand wrote:
>>> This series is based on [1] and must be applied on top of it.
>>> Similar to what we did with fork(), let's implement PTE batching
>>> during unmap/zap when processing PTE-mapped THPs.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
>>> entry removal once per batch.
>>>
>>> Ryan was previously working on this in the context of cont-pte for
>>> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
>>> This series implements the optimization for all architectures, independent
>>> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
>>> large-folio-pages batches as well, and makes use of our new rmap batching
>>> function when removing the rmap.
>>>
>>> To achieve that, we have to enlighten MMU gather / page freeing code
>>> (i.e., everything that consumes encoded_page) to process unmapping
>>> of consecutive pages that all belong to the same large folio. I'm being
>>> very careful to not degrade order-0 performance, and it looks like I
>>> managed to achieve that.
>>
> 
> Let's CC Linus and Michal to make sure I'm not daydreaming.
> 
> Relevant patch:
>   https://lkml.kernel.org/r/20240129143221.263763-8-da...@redhat.com
> 
> Context: I'm adjusting MMU gather code to support batching of consecutive 
> pages
> that belong to the same large folio, when unmapping/zapping PTEs.
> 
> For small folios, there is no (relevant) change.
> 
> Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 PTEs:
> Instead of adding 512 individual encoded_page entries, we add a combined entry
> that expresses "page+nr_pages". That allows for "easily" adding various other
> per-folio batching (refcount, rmap, swap freeing).
> 
> The implication is that we can now effectively batch more pages with large
> folios, exceeding the old 10000 limit. The number of involved *folios* does 
> not
> increase, though.
> 
>> One possible scenario:
>> If all the folios are 2M folios, then one full batch could hold 510M of
>> memory.
>> Is that too much, given that before one full batch could only hold
>> (2M - 4096 * 2) of memory?
> 
> Excellent point, I think there are three parts to it:
> 
> (1) Batch pages / folio fragments per batch page
> 
> Before this change (and with 4k folios) we have exactly one page (4k) per
> encoded_page entry in the batch. Now, we can have (with 2M folios), 512 pages
> for every two encoded_page entries (page+nr_pages) in a batch page. So an
> average ~256 pages per encoded_page entry.
> 
> So one batch page can now store in the worst case ~256 times the number of
> pages, but the number of folio fragments ("pages+nr_pages") would not 
> increase.
> 
> The time it takes to perform the actual page freeing of a batch will not be 
> 256
> times higher -- the time is expected to be much closer to the old time (i.e.,
> not freeing more folios).

IIRC there is an option to zero memory when it is freed back to the buddy? So
that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug only?
So perhaps not a problem in practice?


> 
> (2) Delayed rmap handling
> 
> We limit batching early (see tlb_next_batch()) when we have delayed rmap
> pending. Reason being, that we don't want to check for many entries if they
> require delayed rmap handling, while still holding the page table lock (see
> tlb_flush_rmaps()), because we have to remove the rmap before dropping the 
> PTL.
> 
> Note that we perform the check whether we need delayed rmap handling per
> page+nr_pages entry, not per page. So we won't perform more such checks.
> 
> Once we set tlb->delayed_rmap (because we add one entry that requires it), we
> already force a flush before dropping the PT lock. So once we get a single
> delayed rmap entry in there, we will not batch more than we could have in the
> same page table: so not more than 512 entries (x86-64) in the worst case. So 
> it
> will still be bounded, and not significantly more than what we had before.
> 
> So regarding delayed rmap handling I think this should be fine.
> 
> (3) Total batched pages
> 
> MAX_GATHER_BATCH_COUNT effectively limits the number of pages we allocate 
> (full
> batches), and thereby limits the number of pages we were able to batch.
> 
> The old limit was ~10000 pages, now we could batch ~5000 folio fragments
> (page+nr_pages), resulting in the "times 256" increase in the worst case on
> x86-64 as you point out.
> 
> This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit mmu_gather
> batching to fix soft lockups on !CONFIG_PREEMPT") where we wanted to handle
> soft-lockups.

Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand




+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.


I know its implied from "pages of the same folio" (and even more so for the
above variant due to mention of access/dirty), but I wonder if its useful to
explicitly state that "all ptes being cleared are present at the time of the 
call"?


"Clear PTEs" -> "Clear present PTEs" ?

That should make it clearer.

[...]


if (!delay_rmap) {
-   folio_remove_rmap_pte(folio, page, vma);
+   folio_remove_rmap_ptes(folio, page, nr, vma);
+
+   /* Only sanity-check the first page in a batch. */
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);


Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.



I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios, 
we still perform the exact same checks. Only when batching we don't. So 
if there is some problem, there are ways to get it triggered. And these 
problems are barely ever seen.


folio_remove_rmap_ptes() feels like the better place -- especially 
because the delayed-rmap handling is effectively unchecked. But in 
there, we cannot "print_bad_pte()".


[background: if we had a total mapcount -- iow cheap folio_mapcount(), 
I'd check here that the total mapcount does not underflow, instead of 
checking per-subpage]
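Just for illustration, the hypothetical check could then look like this (assumes a
cheap folio-wide mapcount, which we don't have today):

	/* Hypothetical: sanity-check the whole folio instead of one subpage. */
	if (unlikely(folio_mapcount(folio) < 0))
		print_bad_pte(vma, addr, ptent, page);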





}
-   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+   if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
*force_break = true;
}
  }
  
-static inline void zap_present_pte(struct mmu_gather *tlb,

+/*
+ * Zap or skip one present PTE, trying to batch-process subsequent PTEs that 
map


Zap or skip *at least* one... ?


Ack

--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:20, Yin Fengwei wrote:

On 1/29/24 22:32, David Hildenbrand wrote:

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in its latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.




Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
  https://lkml.kernel.org/r/20240129143221.263763-8-da...@redhat.com

Context: I'm adjusting MMU gather code to support batching of 
consecutive pages that belong to the same large folio, when 
unmapping/zapping PTEs.


For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 
PTEs: Instead of adding 512 individual encoded_page entries, we add a 
combined entry that expresses "page+nr_pages". That allows for "easily" 
adding various other per-folio batching (refcount, rmap, swap freeing).
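Conceptually, a batch slot now describes a folio fragment rather than a single page --
something like the following (illustration only; the series encodes this via two
consecutive encoded_page slots, not a new struct):

struct folio_fragment {			/* illustrative only */
	struct page *first_page;	/* first page of the run */
	unsigned int nr_pages;		/* 1 for order-0, up to 512 for a 2M THP */
};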


The implication is that we can now effectively batch more pages with 
large folios, exceeding the old 10000 limit. The number of involved 
*folios* does not increase, though.



One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Is that too much, given that before one full batch could only hold
(2M - 4096 * 2) of memory?


Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k) 
per encoded_page entry in the batch. Now, we can have (with 2M folios), 
512 pages for every two encoded_page entries (page+nr_pages) in a batch 
page. So an average ~256 pages per encoded_page entry.


So one batch page can now store in the worst case ~256 times the number 
of pages, but the number of folio fragments ("pages+nr_pages") would not 
increase.


The time it takes to perform the actual page freeing of a batch will not 
be 256 times higher -- the time is expected to be much closer to the old 
time (i.e., not freeing more folios).


(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap 
pending. Reason being, that we don't want to check for many entries if 
they require delayed rmap handling, while still holding the page table 
lock (see tlb_flush_rmaps()), because we have to remove the rmap before 
dropping the PTL.
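The early bail-out is roughly the following (paraphrased from tlb_next_batch() in
mm/mmu_gather.c; the exact code may differ):

	/* Limit batching if we have delayed rmaps pending */
	if (tlb->delayed_rmap && tlb->active != &tlb->local)
		return false;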


Note that we perform the check whether we need delayed rmap handling per 
page+nr_pages entry, not per page. So we won't perform more such checks.


Once we set tlb->delayed_rmap (because we add one entry that requires 
it), we already force a flush before dropping the PT lock. So once we 
get a single delayed rmap entry in there, we will not batch more than we 
could have in the same page table: so not more than 512 entries (x86-64) 
in the worst case. So it will still be bounded, and not significantly 
more than what we had before.


So regarding delayed rmap handling I think this should be fine.

(3) Total batched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we 
allocate (full batches), and thereby limits the number of pages we were 
able to batch.


The old limit was ~10000 pages, now we could batch ~5000 folio fragments 
(page+nr_pages), resulting in the "times 256" increase in the worst 
case on x86-64 as you point out.
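Back-of-the-envelope, assuming 4k base pages and memory backed entirely by 2M folios:

  before: ~10000 pages * 4k                ~=  40 MiB per mmu_gather
  now:    ~5000 fragments * 512 pages * 4k ~=  10 GiB worst case (the "times 256" above)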


This 10000 pages limit was introduced in 53a59fc67f97 ("mm: limit 
mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") where we 
wanted to handle soft-lockups.


As the number of effective folios we are freeing does not increase, I 
*think* this should be fine.



If any of that is a problem, we would have to keep track of the total 
number of pages in our batch, and stop as soon as we hit our 10000 limit 
-- independent of page vs. folio fragment. Something I would like to 
avoid if possible.


--
Cheers,

David / dhildenb



[mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] OOps kernel crash after performing dlpar remove test

2024-01-31 Thread Tasmiya Nalatwad

Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] OOps kernel crash after 
performing dlpar remove test


--- Traces ---

[58563.146236] BUG: Unable to handle kernel data access at 
0x6b6b6b6b6b6b6b83

[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio 
dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic xor 
raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io rpaphp 
nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time 
dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache jbd2 
sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc nvmet 
nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 
scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not tainted 
6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 
0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP:  c09c0e60 LR: c09c0e28 CTR: 
c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR:  80009033   CR: 
24822424  XER: 20040006

[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 c1482600 
c3050540
[58563.146360] GPR04:  c0089a6870c0 0001 
fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
0220
[58563.146360] GPR12: 2000 c308  

[58563.146360] GPR16:    
0001
[58563.146360] GPR20: c1281478  c1281490 
c2bfed80
[58563.146360] GPR24: c0089a6870c0   
c2b9ffb8
[58563.146360] GPR28:  c2bac0e8  


[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118
[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 0xc0007960f560 
(unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] 
__iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] 
iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] 
iommu_bus_notifier+0x4c/0x8c
[58563.146469] [c0007960f600] [c019e3d0] 
notifier_call_chain+0xb8/0x1a0
[58563.146476] [c0007960f660] [c019eea0] 
blocking_notifier_call_chain+0x64/0x94

[58563.146483] [c0007960f6a0] [c09d3c5c] bus_notify+0x50/0x7c
[58563.146491] [c0007960f6e0] [c09cfba4] device_add+0x774/0x9bc
[58563.146498] [c0007960f7a0] [c08abe9c] 
pci_device_add+0x2f4/0x864
[58563.146506] [c0007960f850] [c007d5a0] 
of_create_pci_dev+0x390/0xa08
[58563.146514] [c0007960f930] [c007de68] 
__of_scan_bus+0x250/0x328
[58563.146520] [c0007960fa10] [c007a680] 
pcibios_scan_phb+0x274/0x3c0
[58563.146527] [c0007960fae0] [c0105d58] 
init_phb_dynamic+0xb8/0x110
[58563.146535] [c0007960fb50] [c008217b0380] 
dlpar_add_slot+0x170/0x3b4 [rpadlpar_io]
[58563.146544] [c0007960fbf0] [c008217b0ca0] 
add_slot_store+0xa4/0x140 [rpadlpar_io]
[58563.146551] [c0007960fc80] [c0f3dbec] 
kobj_attr_store+0x30/0x4c
[58563.146559] [c0007960fca0] [c06931fc] 
sysfs_kf_write+0x68/0x7c
[58563.146566] [c0007960fcc0] [c0691b2c] 
kernfs_fop_write_iter+0x1c8/0x278

[58563.146573] [c0007960fd10] [c0599f54] vfs_write+0x340/0x4cc
[58563.146580] [c0007960fdc0] [c059a2bc] ksys_write+0x7c/0x140
[58563.146587] [c0007960fe10] [c0035d74] 
system_call_exception+0x134/0x330
[58563.146595] [c0007960fe50] [c000d6a0] 
system_call_common+0x160/0x2e4

[58563.146602] --- interrupt: c00 at 0x24470cb4
[58563.146606] NIP:  24470cb4 LR: 243e7d04 CTR: 

[58563.146611] REGS: c0007960fe80 TRAP: 0c00   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146616] MSR:  8280f033 
  CR: 24000282  XER: 

[58563.146632] IRQMASK: 0
[58563.146632] GPR00: 0004 7fffd3993420 24557300 
0007
[58563.146632] GPR04: 01000d8a5270 0006 fbad2c80 
01000d8a02a0
[58563.146632] GPR08: 0001   

[58563.146632] GPR12:  2422bb50  

[58563.146632] GPR16:    

[58563.146632] GPR20:   

Re: [PATCH v2 1/4] PCI/AER: Store more information in aer_err_info

2024-01-31 Thread Wang, Qingshun
On Tue, Jan 30, 2024 at 06:26:39PM -0800, Kuppuswamy Sathyanarayanan wrote:
> 
> On 1/24/24 10:27 PM, Wang, Qingshun wrote:
> > When Advisory Non-Fatal errors are raised, both correctable and
> 
> Maybe you can start with some info about what Advisory Non-Fatal
> errors are and the specification reference. I know that you included
> it in the cover letter, but it is good to include it in the commit log as well.

Good idea, thanks!

> 
> > uncorrectable error statuses will be set. The current kernel code cannot
> > store both statuses at the same time, thus failing to handle ANFE properly.
> > In addition, to avoid clearing UEs that are not ANFE by accident, UE
> 
> Please add some details about the impact of not clearing them.

Makes sense, will do.

> > severity and Device Status also need to be recorded: any fatal UE cannot
> > be ANFE, and if Fatal/Non-Fatal Error Detected is set in Device Status, do
> > not make any assumptions and let the UE handler clear the UE status.
> >
> > Store status and mask of both correctable and uncorrectable errors in
> > aer_err_info. The severity of UEs and the values of the Device Status
> > register are also recorded, which will be used to determine UEs that should
> > be handled by the ANFE handler. Refactor the rest of the code to use
> > cor/uncor_status and cor/uncor_mask fields instead of status and mask
> > fields.
> >
> > Signed-off-by: "Wang, Qingshun" 
> > ---
> >  drivers/acpi/apei/ghes.c | 10 -
> >  drivers/cxl/core/pci.c   |  6 ++-
> >  drivers/pci/pci.h|  8 +++-
> >  drivers/pci/pcie/aer.c   | 93 ++--
> >  include/linux/aer.h  |  4 +-
> >  include/linux/pci.h  | 27 
> >  6 files changed, 111 insertions(+), 37 deletions(-)
> >
> > ..
> >
> > @@ -1213,38 +1233,49 @@ int aer_get_device_error_info(struct pci_dev *dev, 
> > struct aer_err_info *info)
> > int temp;
> >  
> > /* Must reset in this function */
> > -   info->status = 0;
> > +   info->cor_status = 0;
> > +   info->uncor_status = 0;
> > +   info->uncor_severity = 0;
> > info->tlp_header_valid = 0;
> >  
> > /* The device might not support AER */
> > if (!aer)
> > return 0;
> >  
> > -   if (info->severity == AER_CORRECTABLE) {
> > +   if (info->severity == AER_CORRECTABLE ||
> > +   info->severity == AER_NONFATAL ||
> > +   type == PCI_EXP_TYPE_ROOT_PORT ||
> > +   type == PCI_EXP_TYPE_RC_EC ||
> > +   type == PCI_EXP_TYPE_DOWNSTREAM) {
> 
> 
> It looks like you are reading both uncorrectable and correctable status
> by default for both NONFATAL and CORRECTABLE errors. Why not do
> it conditionally only for ANFE errors?
> 
> 

My initial intent was that the value would be used in the aer_event trace in
PATCH 4 under both conditions, but I can also add checks here to reduce
unnecessary IO and remove the checks in PATCH 4.
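A rough sketch of what I have in mind (illustrative only; aer_could_be_anfe() is a
hypothetical predicate, the register offsets are the existing PCI_ERR_UNCOR_* defines):

	/* Hypothetical: skip the extra reads unless this could be an ANFE. */
	if (aer_could_be_anfe(dev, info)) {
		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
				      &info->uncor_status);
		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
				      &info->uncor_mask);
		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_SEVER,
				      &info->uncor_severity);
	}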

> > +   /* Link is healthy for IO reads */
> > pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> > -   &info->status);
> > + &info->cor_status);
> > pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK,
> > -   &info->mask);
> >  
> > ..
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index add9368e6314..259812620d4d 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -318,6 +318,33 @@ struct pci_sriov;
> >  struct pci_p2pdma;
> >  struct rcec_ea;
> >  
> > +struct pcie_capability_regs {
> > +   u8 pcie_cap_id;
> > +   u8 next_cap_ptr;
> > +   u16 pcie_caps;
> > +   u32 device_caps;
> > +   u16 device_control;
> > +   u16 device_status;
> > +   u32 link_caps;
> > +   u16 link_control;
> > +   u16 link_status;
> > +   u32 slot_caps;
> > +   u16 slot_control;
> > +   u16 slot_status;
> > +   u16 root_control;
> > +   u16 root_caps;
> > +   u32 root_status;
> > +   u32 device_caps_2;
> > +   u16 device_control_2;
> > +   u16 device_status_2;
> > +   u32 link_caps_2;
> > +   u16 link_control_2;
> > +   u16 link_status_2;
> > +   u32 slot_caps_2;
> > +   u16 slot_control_2;
> > +   u16 slot_status_2;
> > +};
> > +
> IIUC, this struct is only used in drivers/acpi/apei/ghes.c. Why not define it 
> in that file?

You are right. Whenever we need it elsewhere, we can move it to the
header file anyway.

> >  /* The pci_dev structure describes PCI devices */
> >  struct pci_dev {
> > struct list_head bus_list;  /* Node in per-bus list */
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
> 

--
Best regards,
Wang, Qingshun