Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Christophe Leroy


On 30/01/2024 at 21:27, Luis Chamberlain wrote:
> On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote:
>> Dear All,
>>
>> On 30.01.2024 12:03, Christophe Leroy wrote:
>>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>>>
>>>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>>>> helps removing related #ifdefery in C files.
>>>>>>
>>>>>> Signed-off-by: Christophe Leroy 
>>>>> Very nice cleanup, thanks!, applied and pushed
>>>>>
>>>>>  Luis
>>>> On next-20240130, which has your modules-next branch, and thus this
>>>> series and the other "module: Use set_memory_rox()" series applied,
>>>> my kernel crashes in some very weird way. Reverting your branch
>>>> makes the crash go away.
>>>>
>>>> I thought I'd report it right away. Maybe you folks would know what's
>>>> happening here? This is on arm64.
>>> That's strange, it seems to crash in module_bug_finalize() which is
>>> _before_ calls to module_enable_ro() and such.
>>>
>>> Can you try to revert the 6 patches one by one to see which one
>>> introduces the problem?
>>>
>>> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
>>> more or less only cleanup.
>>
>> I've also run into this issue with today's (20240130) linux-next on my
>> test farm. The issue is not fully reproducible, so it was a bit hard to
>> bisect it automatically. I've spent some time on manual testing and it
>> looks like reverting the following 2 commits on top of linux-next fixes
>> the problem:
>>
>> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around
>> rodata_enabled")
>> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
>>
>> This in fact means that commit 677bfb9db8a3 is responsible for this
>> regression, as 65929884f868 has to be reverted only because the latter
>> depends on it. Let me know what I can do to help debugging this issue.
> 
> Thanks for the bisect, I've reset my tree to commit
> 3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more
> explicit names") for now then, so to remove those commits.
> 

Since the problem has been identified as coming from commit 677bfb9db8a3 ("module: Don't
ignore errors from set_memory_XX()"), you can keep/re-apply the series
"[PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time".

Christophe


Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Yin Fengwei



On 1/29/24 22:32, David Hildenbrand wrote:
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
I am wondering whether it's worthwhile to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do them once at the end if needed. In the worst case they
are called nr - 1 times. Or is it just too micro an optimization? A sketch of that
alternative follows after the quoted code below.


Regards
Yin, Fengwei

> + }
> + return pte;
> +}
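For illustration, here is a rough, untested sketch of that alternative (same
assumptions as the quoted helper; not part of the patch): accumulate the bits
in the loop and fold them into the returned PTE once at the end.

	static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep, unsigned int nr, int full)
	{
		pte_t pte, tmp_pte;
		bool dirty = false, young = false;

		pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		while (--nr) {
			ptep++;
			addr += PAGE_SIZE;
			tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
			/* Only remember whether any PTE in the batch was dirty/young. */
			dirty |= pte_dirty(tmp_pte);
			young |= pte_young(tmp_pte);
		}
		/* Fold the accumulated bits into the returned PTE once. */
		if (dirty)
			pte = pte_mkdirty(pte);
		if (young)
			pte = pte_mkyoung(pte);
		return pte;
	}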


Re: [PATCH v2 1/4] PCI/AER: Store more information in aer_err_info

2024-01-30 Thread Kuppuswamy Sathyanarayanan


On 1/24/24 10:27 PM, Wang, Qingshun wrote:
> When Advisory Non-Fatal errors are raised, both correctable and

Maybe you can start with some info about what Advisory Non-Fatal
errors are and the specification reference. I know that you included
it in the cover letter, but it is good to include it in the commit log as well.

> uncorrectable error statuses will be set. The current kernel code cannot
> store both statuses at the same time, thus failing to handle ANFE properly.
> In addition, to avoid clearing UEs that are not ANFE by accident, UE

Please add some details about the impact of not clearing them.
> severity and Device Status also need to be recorded: any fatal UE cannot
> be ANFE, and if Fatal/Non-Fatal Error Detected is set in Device Status, do
> not make any assumptions and let the UE handler clear the UE status.
>
> Store status and mask of both correctable and uncorrectable errors in
> aer_err_info. The severity of UEs and the values of the Device Status
> register are also recorded, which will be used to determine UEs that should
> be handled by the ANFE handler. Refactor the rest of the code to use
> cor/uncor_status and cor/uncor_mask fields instead of status and mask
> fields.
>
> Signed-off-by: "Wang, Qingshun" 
> ---
>  drivers/acpi/apei/ghes.c | 10 -
>  drivers/cxl/core/pci.c   |  6 ++-
>  drivers/pci/pci.h|  8 +++-
>  drivers/pci/pcie/aer.c   | 93 ++--
>  include/linux/aer.h  |  4 +-
>  include/linux/pci.h  | 27 
>  6 files changed, 111 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 7b7c605166e0..6034039d5cff 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -593,6 +593,8 @@ static void ghes_handle_aer(struct acpi_hest_generic_data 
> *gdata)
>  
>   if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
>   pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> + struct pcie_capability_regs *pcie_caps;
> + u16 device_status = 0;
>   unsigned int devfn;
>   int aer_severity;
>   u8 *aer_info;
> @@ -615,11 +617,17 @@ static void ghes_handle_aer(struct 
> acpi_hest_generic_data *gdata)
>   return;
>   memcpy(aer_info, pcie_err->aer_info, sizeof(struct 
> aer_capability_regs));
>  
> + if (pcie_err->validation_bits & CPER_PCIE_VALID_CAPABILITY) {
> + pcie_caps = (struct pcie_capability_regs 
> *)pcie_err->capability;
> + device_status = pcie_caps->device_status;
> + }
> +
>   aer_recover_queue(pcie_err->device_id.segment,
> pcie_err->device_id.bus,
> devfn, aer_severity,
> (struct aer_capability_regs *)
> -   aer_info);
> +   aer_info,
> +   device_status);
>   }
>  #endif
>  }
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 6c9c8d92f8f7..9111a4415a63 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -903,6 +903,7 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state 
> *cxlds)
>   struct aer_capability_regs aer_regs;
>   struct cxl_dport *dport;
>   struct cxl_port *port;
> + u16 device_status;
>   int severity;
>  
> 	port = cxl_pci_find_port(pdev, &dport);
> @@ -917,7 +918,10 @@ static void cxl_handle_rdport_errors(struct 
> cxl_dev_state *cxlds)
> 	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
>   return;
>  
> -	pci_print_aer(pdev, severity, &aer_regs);
> +	if (pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &device_status))
> +		return;
> +
> +	pci_print_aer(pdev, severity, &aer_regs, device_status);
>  
>   if (severity == AER_CORRECTABLE)
>   cxl_handle_rdport_cor_ras(cxlds, dport);
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2336a8d1edab..f881a1b42f14 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -407,8 +407,12 @@ struct aer_err_info {
>   unsigned int __pad2:2;
>   unsigned int tlp_header_valid:1;
>  
> - unsigned int status;/* COR/UNCOR Error Status */
> - unsigned int mask;  /* COR/UNCOR Error Mask */
> + u32 cor_mask;   /* COR Error Mask */
> + u32 cor_status; /* COR Error Status */
> + u32 uncor_mask; /* UNCOR Error Mask */
> + u32 uncor_status;   /* UNCOR Error Status */
> + u32 uncor_severity; /* UNCOR Error Severity */
> + u16 device_status;
>   struct aer_header_log_regs tlp; /* TLP Header */
>  };
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 05fc30bb5134..6583dcf50977 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -615,7 +615,7 @@ const struct attribute_group 

Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Yin Fengwei
On 1/29/24 22:32, David Hildenbrand wrote:
> This series is based on [1] and must be applied on top of it.
> Similar to what we did with fork(), let's implement PTE batching
> during unmap/zap when processing PTE-mapped THPs.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
> entry removal once per batch.
> 
> Ryan was previously working on this in the context of cont-pte for
> arm64, in the latest iteration [2] with a focus on arm64 with cont-pte only.
> This series implements the optimization for all architectures, independent
> of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
> large-folio-pages batches as well, and makes use of our new rmap batching
> function when removing the rmap.
> 
> To achieve that, we have to enlighten MMU gather / page freeing code
> (i.e., everything that consumes encoded_page) to process unmapping
> of consecutive pages that all belong to the same large folio. I'm being
> very careful to not degrade order-0 performance, and it looks like I
> managed to achieve that.

One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Isn't that too much, given that before this change a full batch could only hold
(2M - 4096 * 2) of memory?


Regards
Yin, Fengwei



Re: [PATCH] dt-bindings: soc: fsl: narrow regex for unit address to hex numbers

2024-01-30 Thread Rob Herring


On Tue, 23 Jan 2024 09:35:05 +0100, Krzysztof Kozlowski wrote:
> Regular expression used to match the unit address part should not allow
> non-hex numbers.
> 
> Signed-off-by: Krzysztof Kozlowski 
> ---
>  .../devicetree/bindings/soc/fsl/fsl,layerscape-dcfg.yaml| 2 +-
>  .../devicetree/bindings/soc/fsl/fsl,layerscape-scfg.yaml| 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: Rob Herring 



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Luis Chamberlain
On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote:
> Dear All,
> 
> On 30.01.2024 12:03, Christophe Leroy wrote:
> > On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
> >>
> >> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
> >>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> >>>> Declaring rodata_enabled and mark_rodata_ro() at all time
> >>>> helps removing related #ifdefery in C files.
> >>>>
> >>>> Signed-off-by: Christophe Leroy 
> >>> Very nice cleanup, thanks!, applied and pushed
> >>>
> >>> Luis
> >> On next-20240130, which has your modules-next branch, and thus this
> >> series and the other "module: Use set_memory_rox()" series applied,
> >> my kernel crashes in some very weird way. Reverting your branch
> >> makes the crash go away.
> >>
> >> I thought I'd report it right away. Maybe you folks would know what's
> >> happening here? This is on arm64.
> > That's strange, it seems to crash in module_bug_finalize() which is
> > _before_ calls to module_enable_ro() and such.
> >
> > Can you try to revert the 6 patches one by one to see which one
> > introduces the problem?
> >
> > In reality, only patch 677bfb9db8a3 really changes things. The other ones are
> > more or less only cleanup.
> 
> I've also run into this issue with today's (20240130) linux-next on my 
> test farm. The issue is not fully reproducible, so it was a bit hard to 
> bisect it automatically. I've spent some time on manual testing and it 
> looks like reverting the following 2 commits on top of linux-next fixes
> the problem:
> 
> 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around 
> rodata_enabled")
> 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")
> 
> This in fact means that commit 677bfb9db8a3 is responsible for this 
> regression, as 65929884f868 has to be reverted only because the latter 
> depends on it. Let me know what I can do to help debugging this issue.

Thanks for the bisect, I've reset my tree to commit
3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more
explicit names") for now then, so to remove those commits.

  Luis


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Marek Szyprowski
Dear All,

On 30.01.2024 12:03, Christophe Leroy wrote:
>> On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
>>
>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>> helps removing related #ifdefery in C files.
>>>>
>>>> Signed-off-by: Christophe Leroy 
>>> Very nice cleanup, thanks!, applied and pushed
>>>
>>> Luis
>> On next-20240130, which has your modules-next branch, and thus this
>> series and the other "module: Use set_memory_rox()" series applied,
>> my kernel crashes in some very weird way. Reverting your branch
>> makes the crash go away.
>>
>> I thought I'd report it right away. Maybe you folks would know what's
>> happening here? This is on arm64.
> That's strange, it seems to crash in module_bug_finalize() which is
> _before_ calls to module_enable_ro() and such.
>
> Can you try to revert the 6 patches one by one to see which one
> introduces the problem?
>
> In reality, only patch 677bfb9db8a3 really changes things. The other ones are
> more or less only cleanup.

I've also run into this issue with today's (20240130) linux-next on my 
test farm. The issue is not fully reproducible, so it was a bit hard to 
bisect it automatically. I've spent some time on manual testing and it 
looks like reverting the following 2 commits on top of linux-next fixes
the problem:

65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around 
rodata_enabled")
677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")

This in fact means that commit 677bfb9db8a3 is responsible for this 
regression, as 65929884f868 has to be reverted only because the latter 
depends on it. Let me know what I can do to help debugging this issue.

Here is the stack trace I've got on Khadas VIM3 ARM64 board:

Unable to handle kernel paging request at virtual address 80007bfeeb30
Mem abort info:
   ESR = 0x9647
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x07: level 3 translation fault
Data abort info:
   ISV = 0, ISS = 0x0047, ISS2 = 0x
   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0a35a000
[80007bfeeb30] pgd=1000f4806003, p4d=1000f4806003, 
pud=17ed1003, pmd=17ed2003, pte=
Internal error: Oops: 9647 [#1] PREEMPT SMP
Modules linked in:
CPU: 4 PID: 182 Comm: (udev-worker) Not tainted 6.8.0-rc2-next-20240130 
#14391
Hardware name: Khadas VIM3 (DT)
pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : module_bug_finalize+0xb0/0xdc
lr : module_bug_finalize+0x70/0xdc
...
Call trace:
  module_bug_finalize+0xb0/0xdc
  load_module+0x182c/0x1c88
  init_module_from_file+0x84/0xc0
  idempotent_init_module+0x180/0x250
  __arm64_sys_finit_module+0x64/0xa0
  invoke_syscall+0x48/0x114
  el0_svc_common.constprop.0+0xc0/0xe0
  do_el0_svc+0x1c/0x28
  el0_svc+0x4c/0xe4
  el0t_64_sync_handler+0xc0/0xc4
  el0t_64_sync+0x190/0x194
Code: 9116e003 f942dc01 a93e8c41 c89ffc73 (f9000433)
---[ end trace  ]---



Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



Re: [PATCH v10 6/6] arm64: introduce copy_mc_to_kernel() implementation

2024-01-30 Thread Tong Tiangen




On 2024/1/30 18:20, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:52PM +0800, Tong Tiangen wrote:

The copy_mc_to_kernel() helper is a memory copy implementation that handles
source exceptions. It can be used in memory copy scenarios that tolerate
hardware memory errors (e.g. pmem_read/dax_copy_to_iter).

Currently, only x86 and ppc support this helper. Now that arm64 supports the
machine check safe framework, we introduce a copy_mc_to_kernel()
implementation.

Signed-off-by: Tong Tiangen 
---
  arch/arm64/include/asm/string.h  |   5 +
  arch/arm64/include/asm/uaccess.h |  21 +++
  arch/arm64/lib/Makefile  |   2 +-
  arch/arm64/lib/memcpy_mc.S   | 257 +++
  mm/kasan/shadow.c|  12 ++
  5 files changed, 296 insertions(+), 1 deletion(-)
  create mode 100644 arch/arm64/lib/memcpy_mc.S


Looking at the diffstat and code, this duplicates arch/arm64/lib/memcpy.S with
a few annotations. Duplicating that code is not maintainable, and so we cannot
take this as-is.

If you want a version that can handle faults, it *must* be written such that
the code is shared with the regular memcpy. That could be done by using macros
to instantiate two copies (one with fault handling, the other without).

It would also be very helpful to see *any* indication that this has been
tested, which is sorely lacking in the series as-is.

Mark.


OK, that's what I really want to solve right now: the large amount of
duplicate code. I'm going to think about how to deal with that.

Thank you very much for your good advice:)
Tong.
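
For reference, the kind of sharing Mark describes can be illustrated with a
plain C analogue of the asm-macro approach (a toy sketch, not the arm64 code;
all names here are made up): one copy-loop template is instantiated twice,
once with plain accesses and once with accesses that report a fault, so the
two variants cannot drift apart.

	#include <stddef.h>

	/* Plain load: in this toy model it never fails. */
	static inline int load_plain(unsigned char *v, const unsigned char *p)
	{
		*v = *p;
		return 0;
	}

	/* Stand-in for an extable-annotated load that can report a memory error. */
	static inline int load_checked(unsigned char *v, const unsigned char *p)
	{
		*v = *p;	/* a real version would recover from a poisoned source */
		return 0;
	}

	/* One copy-loop template, instantiated twice with different load helpers. */
	#define DEFINE_BYTE_COPY(name, LOAD)				\
	static int name(void *dst, const void *src, size_t len)	\
	{								\
		unsigned char *d = dst;					\
		const unsigned char *s = src;				\
		unsigned char c;					\
		while (len--) {						\
			if (LOAD(&c, s++))				\
				return -1; /* fault: abort the copy */	\
			*d++ = c;					\
		}							\
		return 0;						\
	}

	DEFINE_BYTE_COPY(toy_memcpy, load_plain)	/* regular variant */
	DEFINE_BYTE_COPY(toy_memcpy_mc, load_checked)	/* fault-tolerant variant */

In the actual arm64 assembly the same effect could be achieved with a .macro
that takes the load/store mnemonics (plain vs. extable-annotated) as arguments,
which is what Mark's suggestion amounts to.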




diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a3264ff47b9..995b63c26e99 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
  extern void *memcpy(void *, const void *, __kernel_size_t);
  extern void *__memcpy(void *, const void *, __kernel_size_t);
  
+#define __HAVE_ARCH_MEMCPY_MC

+extern int memcpy_mcs(void *, const void *, __kernel_size_t);
+extern int __memcpy_mcs(void *, const void *, __kernel_size_t);
+
  #define __HAVE_ARCH_MEMMOVE
  extern void *memmove(void *, const void *, __kernel_size_t);
  extern void *__memmove(void *, const void *, __kernel_size_t);
@@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t 
cnt);
   */
  
  #define memcpy(dst, src, len) __memcpy(dst, src, len)

+#define memcpy_mcs(dst, src, len) __memcpy_mcs(dst, src, len)
  #define memmove(dst, src, len) __memmove(dst, src, len)
  #define memset(s, c, n) __memset(s, c, n)
  
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h

index 14be5000c5a0..61e28ef2112a 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -425,4 +425,25 @@ static inline size_t probe_subpage_writeable(const char 
__user *uaddr,
  
  #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
  
+#ifdef CONFIG_ARCH_HAS_COPY_MC

+/**
+ * copy_mc_to_kernel - memory copy that handles source exceptions
+ *
+ * @dst:   destination address
+ * @src:   source address
+ * @len:   number of bytes to copy
+ *
+ * Return 0 for success, or #size if there was an exception.
+ */
+static inline unsigned long __must_check
+copy_mc_to_kernel(void *to, const void *from, unsigned long size)
+{
+   int ret;
+
+   ret = memcpy_mcs(to, from, size);
+   return (ret == -EFAULT) ? size : 0;
+}
+#define copy_mc_to_kernel copy_mc_to_kernel
+#endif
+
  #endif /* __ASM_UACCESS_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index a2fd865b816d..899d6ae9698c 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,7 +3,7 @@ lib-y   := clear_user.o delay.o copy_from_user.o
\
   copy_to_user.o copy_page.o   \
   clear_page.o csum.o insn.o memchr.o memcpy.o \
   memset.o memcmp.o strcmp.o strncmp.o strlen.o\
-  strnlen.o strchr.o strrchr.o tishift.o
+  strnlen.o strchr.o strrchr.o tishift.o memcpy_mc.o
  
  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)

  obj-$(CONFIG_XOR_BLOCKS)  += xor-neon.o
diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
new file mode 100644
index ..7076b500d154
--- /dev/null
+++ b/arch/arm64/lib/memcpy_mc.S
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2012-2021, Arm Limited.
+ *
+ * Adapted from the original at:
+ * 
https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
+ */
+
+#include 
+#include 
+
+/* Assumptions:
+ *
+ * ARMv8-a, AArch64, unaligned accesses.
+ *
+ */
+
+#define L(label) .L ## label
+
+#define dstin  x0
+#define srcx1
+#define count  x2
+#define dstx3
+#define srcend x4
+#define dstend x5
+#define A_lx6
+#define A_lw   w6
+#define A_hx7
+#define B_lx8
+#define B_lw 

Re: [PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()

2024-01-30 Thread Tong Tiangen




On 2024/1/30 18:31, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:51PM +0800, Tong Tiangen wrote:

Currently, many scenarios that can tolerate memory errors when copying a page
have been supported in the kernel[1][2][3], all of which are implemented by
copy_mc_[user]_highpage(). arm64 should also support this mechanism.

Due to mte, arm64 needs to have its own copy_mc_[user]_highpage()
architecture implementation, macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
__HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.

Add a new helper copy_mc_page() which provides a machine-check-safe page copy
implementation. The copy_mc_page() in copy_mc_page.S largely borrows
from copy_page() in copy_page.S; the main difference is that copy_mc_page()
adds an extable entry to every load/store insn to support machine check safety.

Add a new extable type EX_TYPE_COPY_MC_PAGE_ERR_ZERO which is used in
copy_mc_page().

[1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults")
[2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory")
[3]6b970599e807 ("mm: hwpoison: support recovery from ksm_might_need_to_copy()")

Signed-off-by: Tong Tiangen 
---
  arch/arm64/include/asm/asm-extable.h | 15 ++
  arch/arm64/include/asm/assembler.h   |  4 ++
  arch/arm64/include/asm/mte.h |  5 ++
  arch/arm64/include/asm/page.h| 10 
  arch/arm64/lib/Makefile  |  2 +
  arch/arm64/lib/copy_mc_page.S| 78 
  arch/arm64/lib/mte.S | 27 ++
  arch/arm64/mm/copypage.c | 66 ---
  arch/arm64/mm/extable.c  |  7 +--
  include/linux/highmem.h  |  8 +++
  10 files changed, 213 insertions(+), 9 deletions(-)
  create mode 100644 arch/arm64/lib/copy_mc_page.S

diff --git a/arch/arm64/include/asm/asm-extable.h 
b/arch/arm64/include/asm/asm-extable.h
index 980d1dd8e1a3..819044fefbe7 100644
--- a/arch/arm64/include/asm/asm-extable.h
+++ b/arch/arm64/include/asm/asm-extable.h
@@ -10,6 +10,7 @@
  #define EX_TYPE_UACCESS_ERR_ZERO  2
  #define EX_TYPE_KACCESS_ERR_ZERO  3
  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD4
+#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO  5
  
  /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */

  #define EX_DATA_REG_ERR_SHIFT 0
@@ -51,6 +52,16 @@
  #define _ASM_EXTABLE_UACCESS(insn, fixup) \
_ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
  
+#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero)	\

+   __ASM_EXTABLE_RAW(insn, fixup,  \
+ EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\
+ ( \
+   EX_DATA_REG(ERR, err) | \
+   EX_DATA_REG(ZERO, zero) \
+ ))
+
+#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup) \
+   _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr)
  /*
   * Create an exception table entry for uaccess `insn`, which will branch to 
`fixup`
   * when an unhandled fault is taken.
@@ -59,6 +70,10 @@
_ASM_EXTABLE_UACCESS(\insn, \fixup)
.endm
  
+	.macro  _asm_extable_copy_mc_page, insn, fixup

+   _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup)
+   .endm
+


This should share a common EX_TYPE_ with the other "kaccess where memory error
is handled but other faults are fatal" cases.


OK, reasonable.



  /*
   * Create an exception table entry for `insn` if `fixup` is provided. 
Otherwise
   * do nothing.
diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 513787e43329..e1d8ce155878 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -154,6 +154,10 @@ lr .reqx30 // link register
  #define CPU_LE(code...) code
  #endif
  
+#define CPY_MC(l, x...)		\

+:   x; \
+   _asm_extable_copy_mc_pageb, l
+
  /*
   * Define a macro that constructs a 64-bit value by concatenating two
   * 32-bit registers. Note that on big endian systems the order of the
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 91fbd5c8a391..9cdded082dd4 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page)
  void mte_zero_clear_page_tags(void *addr);
  void mte_sync_tags(pte_t pte, unsigned int nr_pages);
  void mte_copy_page_tags(void *kto, const void *kfrom);
+int mte_copy_mc_page_tags(void *kto, const void *kfrom);
  void mte_thread_init_user(void);
  void mte_thread_switch(struct task_struct *next);
  void mte_cpu_setup(void);
@@ -128,6 +129,10 @@ static inline void mte_sync_tags(pte_t pte, unsigned int 
nr_pages)
  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
  {
  }

Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-30 Thread Tong Tiangen




On 2024/1/30 20:01, Mark Rutland wrote:

On Tue, Jan 30, 2024 at 07:14:35PM +0800, Tong Tiangen wrote:

On 2024/1/30 1:43, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
Further, this change will also silently fixup unexpected kernel faults if we
pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.


I think this is better than panicking the kernel, because the real bug
belongs to the user process. Even if a wrong pointer is
passed, the page corresponding to the wrong pointer has a memory
error.


I think you have misunderstood my point; I'm talking about the case of a bad
kernel pointer *without* a memory error.

For example, consider some buggy code such as:

void __user *uptr = some_valid_user_pointer;
void *kptr = NULL; // or any other bad pointer

ret = copy_to_user(uptr, kptr, size);
if (ret)
return -EFAULT;

Before this patch, when copy_to_user() attempted to load from NULL it would
fault, there would be no fixup handler for the LDR, and the kernel would die(),
reporting the bad kernel access.

After this patch (which adds fixup handlers to all the LDR*s in
copy_to_user()), the fault (which is *not* a memory error) would be handled by
the fixup handler, and copy_to_user() would return an error without *any*
indication of the horrible kernel bug.

This will hide kernel bugs, which will make those harder to identify and fix,
and will also potentially make it easier to exploit the kernel: if the user
somehow gains control of the kernel pointer, they can rely on the fixup handler
returning an error, and can scan through memory rather than dying as soon as
they pass a bad pointer.


I see what you mean now. I'll think about this and reply.

Many thanks.
Tong.




In addition, the panic information contains necessary information
for users to check.


There is no panic() in the case I am describing.


So NAK to this change as-is; likewise for the addition of USER() to other ldr*
macros in copy_from_user.S and the addition of USER() str* macros in
copy_to_user.S.

If we want to handle memory errors on some kaccesses, we need a new EX_TYPE_*
separate from the usual EX_TYPE_KACESS_ERR_ZERO that means "handle memory
errors, but treat other faults as fatal". That should come with a rationale and
explanation of why it's actually useful.


This makes sense. I'll add kaccess types that can be processed properly.
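
For illustration, a minimal sketch of what such a dispatch in
fixup_exception_mc() could look like (EX_TYPE_KACCESS_MEM_ERR is a name
invented here purely for illustration, and reusing the err/zero fixup
handler for it is an assumption):

	bool fixup_exception_mc(struct pt_regs *regs)
	{
		const struct exception_table_entry *ex;

		ex = search_exception_tables(instruction_pointer(regs));
		if (!ex)
			return false;

		switch (ex->type) {
		case EX_TYPE_UACCESS_ERR_ZERO:		/* uaccess: safe to fix up */
		case EX_TYPE_KACCESS_MEM_ERR:		/* hypothetical new type */
			return ex_handler_uaccess_err_zero(ex, regs);
		}

		/* Any other extable type stays on the normal (fatal) path. */
		return false;
	}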



[...]


diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 478e639f8680..28ec35e3d210 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
if (!ex)
return false;
-   /*
-* This is not complete, More Machine check safe extable type can
-* be processed here.
-*/
+   switch (ex->type) {
+   case EX_TYPE_UACCESS_ERR_ZERO:
+   return ex_handler_uaccess_err_zero(ex, regs);
+   }


Please fold this part into the prior patch, and start off with *only* handling
errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think that
change would be relatively uncontroversial, and it would be much easier to
build atop that.


OK, the two patches will be merged in the next release.


Thanks.

Mark.
.


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-30 Thread Tong Tiangen




On 2024/1/30 21:07, Mark Rutland wrote:

On Tue, Jan 30, 2024 at 06:57:24PM +0800, Tong Tiangen wrote:

On 2024/1/30 1:51, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:



diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..312932dc100b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, 
struct pt_regs *regs)
return 1; /* "fault" */
   }
+static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
+struct pt_regs *regs, int sig, int code)
+{
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
+   return false;
+
+   if (user_mode(regs))
+   return false;


This function is called "arm64_do_kernel_sea"; surely the caller should *never*
call this for a SEA taken from user mode?


In do_sea(), the processing logic is as follows:
   do_sea()
   {
 [...]
 if (user_mode(regs) && apei_claim_sea(regs) == 0) {
return 0;
 }
 [...]
 //[1]
 if (!arm64_do_kernel_sea()) {
arm64_notify_die();
 }
   }

[1] It is still possible to get here in user_mode(). If user_mode() gets here,
  it indicates that the impact caused by the memory error could not be
  processed correctly by apei_claim_sea().


In this case, only arm64_notify_die() can be used. This also maintains
the original logic of user_mode()'s processing.


My point is that either:

(a) The name means that this should *only* be called for SEAs from a kernel
 context, and the caller should be responsible for ensuring that.

(b) The name is misleading, and the 'kernel' part should be removed from the
 name.

I prefer (a), and if you head down that route it's clear that you can get rid
of a bunch of redundant logic and remove the need for do_kernel_sea(), anyway,
e.g.

| static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
| {
| const struct fault_info *inf = esr_to_fault_info(esr);
| bool claimed = apei_claim_sea(regs) == 0;
| unsigned long siaddr;
|
| if (claimed) {
| if (user_mode(regs)) {
| /*
|  * APEI claimed this as a firmware-first notification.
|  * Some processing deferred to task_work before 
ret_to_user().
|  */
| return 0;
| } else {
| /*
|  * TODO: explain why this is correct.
|  */
| if ((current->flags & PF_KTHREAD) &&
| fixup_exception_mc(regs))
| return 0;
| }
| }


This code is a bit more concise and avoids the misleading function
name; I'll use it in the next version :)



|
| if (esr & ESR_ELx_FnV) {
| siaddr = 0;
| } else {
| /*
|  * The architecture specifies that the tag bits of FAR_EL1 are
|  * UNKNOWN for synchronous external aborts. Mask them out now
|  * so that userspace doesn't see them.
|  */
| siaddr  = untagged_addr(far);
| }
| arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
|
| return 0;
| }


+
+   if (apei_claim_sea(regs) < 0)
+   return false;
+
+   if (!fixup_exception_mc(regs))
+   return false;
+
+   if (current->flags & PF_KTHREAD)
+   return true;


I think this needs a comment; why do we allow kthreads to go on, yet kill user
threads? What about helper threads (e.g. for io_uring)?


If a memory error occurs in a kernel thread, the problem is more
serious than in a user thread. As a result, related kernel
functions, such as khugepaged, cannot run properly. A kernel panic is
a better choice at this time.

Therefore, the processing scope of this framework is limited to the user
thread.


That's reasonable, but needs to be explained in a comment.

Also, as above, I think you haven't conisderd helper threads (e.g. io_uring),
which don't have PF_KTHREAD set but do have PF_USER_WORKER set. I suspect those
need the same treatment as kthreads.


Okay, I'm going to investigate PF_USER_WORKER.




+   set_thread_esr(0, esr);


Why do we set the ESR to 0?


The purpose is to reuse the logic of arm64_notify_die() and set the
following parameters before sending signals to users:
   current->thread.fault_address = 0;
   current->thread.fault_code = err;


Ok, but there's no need to open-code that.

As per my above example, please continue to use the existing call to
arm64_notify_die() rather than open-coding bits of it.


OK.

Many thanks.
Tong.


Mark.
.


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-30 Thread Mark Rutland
On Tue, Jan 30, 2024 at 06:57:24PM +0800, Tong Tiangen wrote:
> On 2024/1/30 1:51, Mark Rutland wrote:
> > On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:

> > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > index 55f6455a8284..312932dc100b 100644
> > > --- a/arch/arm64/mm/fault.c
> > > +++ b/arch/arm64/mm/fault.c
> > > @@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long 
> > > esr, struct pt_regs *regs)
> > >   return 1; /* "fault" */
> > >   }
> > > +static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,
> > > +  struct pt_regs *regs, int sig, int code)
> > > +{
> > > + if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
> > > + return false;
> > > +
> > > + if (user_mode(regs))
> > > + return false;
> > 
> > This function is called "arm64_do_kernel_sea"; surely the caller should 
> > *never*
> > call this for a SEA taken from user mode?
> 
> In do_sea(), the processing logic is as follows:
>   do_sea()
>   {
> [...]
> if (user_mode(regs) && apei_claim_sea(regs) == 0) {
>return 0;
> }
> [...]
> //[1]
> if (!arm64_do_kernel_sea()) {
>arm64_notify_die();
> }
>   }
> 
> [1] It is still possible to get here in user_mode(). If user_mode() gets here,
>  it indicates that the impact caused by the memory error could not be
>  processed correctly by apei_claim_sea().
> 
> 
> In this case, only arm64_notify_die() can be used. This also maintains
> the original logic of user_mode()'s processing.

My point is that either:

(a) The name means that this should *only* be called for SEAs from a kernel
context, and the caller should be responsible for ensuring that.

(b) The name is misleading, and the 'kernel' part should be removed from the
name.

I prefer (a), and if you head down that route it's clear that you can get rid
of a bunch of redundant logic and remove the need for do_kernel_sea(), anyway,
e.g.

| static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
| {
| const struct fault_info *inf = esr_to_fault_info(esr);
| bool claimed = apei_claim_sea(regs) == 0;
| unsigned long siaddr;
| 
| if (claimed) {
| if (user_mode(regs)) {
| /*  
|  * APEI claimed this as a firmware-first notification.
|  * Some processing deferred to task_work before 
ret_to_user().
|  */
| return 0;
| } else {
| /*
|  * TODO: explain why this is correct.
|  */
| if ((current->flags & PF_KTHREAD) &&
| fixup_exception_mc(regs))
| return 0;
| }
| }
| 
| if (esr & ESR_ELx_FnV) {
| siaddr = 0;
| } else {
| /*  
|  * The architecture specifies that the tag bits of FAR_EL1 are
|  * UNKNOWN for synchronous external aborts. Mask them out now
|  * so that userspace doesn't see them.
|  */
| siaddr  = untagged_addr(far);
| }   
| arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
| 
| return 0;
| }

> > > +
> > > + if (apei_claim_sea(regs) < 0)
> > > + return false;
> > > +
> > > + if (!fixup_exception_mc(regs))
> > > + return false;
> > > +
> > > + if (current->flags & PF_KTHREAD)
> > > + return true;
> > 
> > I think this needs a comment; why do we allow kthreads to go on, yet kill 
> > user
> > threads? What about helper threads (e.g. for io_uring)?
> 
> If a memory error occurs in a kernel thread, the problem is more
> serious than in a user thread. As a result, related kernel
> functions, such as khugepaged, cannot run properly. A kernel panic is
> a better choice at this time.
> 
> Therefore, the processing scope of this framework is limited to the user
> thread.

That's reasonable, but needs to be explained in a comment.

Also, as above, I think you haven't conisderd helper threads (e.g. io_uring),
which don't have PF_KTHREAD set but do have PF_USER_WORKER set. I suspect those
need the same treatment as kthreads.

> > > + set_thread_esr(0, esr);
> > 
> > Why do we set the ESR to 0?
> 
> The purpose is to reuse the logic of arm64_notify_die() and set the
> following parameters before sending signals to users:
>   current->thread.fault_address = 0;
>   current->thread.fault_code = err;

Ok, but there's no need to open-code that.

As per my above example, please continue to use the existing call to
arm64_notify_die() rather than open-coding bits of it.

Mark.


Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-30 Thread Mark Rutland
On Tue, Jan 30, 2024 at 07:14:35PM +0800, Tong Tiangen wrote:
> On 2024/1/30 1:43, Mark Rutland wrote:
> > On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:
> > Further, this change will also silently fixup unexpected kernel faults if we
> > pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.
> 
> I think this is better than panicking the kernel, because the real bug
> belongs to the user process. Even if a wrong pointer is
> passed, the page corresponding to the wrong pointer has a memory
> error.

I think you have misunderstood my point; I'm talking about the case of a bad
kernel pointer *without* a memory error.

For example, consider some buggy code such as:

void __user *uptr = some_valid_user_pointer;
void *kptr = NULL; // or any other bad pointer

ret = copy_to_user(uptr, kptr, size);
if (ret)
return -EFAULT;

Before this patch, when copy_to_user() attempted to load from NULL it would
fault, there would be no fixup handler for the LDR, and the kernel would die(),
reporting the bad kernel access.

After this patch (which adds fixup handlers to all the LDR*s in
copy_to_user()), the fault (which is *not* a memory error) would be handled by
the fixup handler, and copy_to_user() would return an error without *any*
indication of the horrible kernel bug.

This will hide kernel bugs, which will make those harder to identify and fix,
and will also potentially make it easier to exploit the kernel: if the user
somehow gains control of the kernel pointer, they can rely on the fixup handler
returning an error, and can scan through memory rather than dying as soon as
> they pass a bad pointer.

> In addition, the panic information contains necessary information
> for users to check.

There is no panic() in the case I am describing.

> > So NAK to this change as-is; likewise for the addition of USER() to other 
> > ldr*
> > macros in copy_from_user.S and the addition of USER() str* macros in
> > copy_to_user.S.
> > 
> > If we want to handle memory errors on some kaccesses, we need a new 
> > EX_TYPE_*
> > separate from the usual EX_TYPE_KACESS_ERR_ZERO that means "handle memory
> > errors, but treat other faults as fatal". That should come with a rationale 
> > and
> > explanation of why it's actually useful.
> 
> This makes sense. I'll add kaccess types that can be processed properly.
> 
> > 
> > [...]
> > 
> > > diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
> > > index 478e639f8680..28ec35e3d210 100644
> > > --- a/arch/arm64/mm/extable.c
> > > +++ b/arch/arm64/mm/extable.c
> > > @@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
> > >   if (!ex)
> > >   return false;
> > > - /*
> > > -  * This is not complete, More Machine check safe extable type can
> > > -  * be processed here.
> > > -  */
> > > + switch (ex->type) {
> > > + case EX_TYPE_UACCESS_ERR_ZERO:
> > > + return ex_handler_uaccess_err_zero(ex, regs);
> > > + }
> > 
> > Please fold this part into the prior patch, and start off with *only*
> > handling
> > errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think 
> > that
> > change would be relatively uncontroversial, and it would be much easier to
> > build atop that.
> 
> OK, the two patches will be merged in the next release.

Thanks.

Mark.


Re: [PATCH v10 3/6] arm64: add uaccess to machine check safe

2024-01-30 Thread Tong Tiangen




On 2024/1/30 1:43, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:49PM +0800, Tong Tiangen wrote:

If user process access memory fails due to hardware memory error, only the
relevant processes are affected, so it is more reasonable to kill the user
process and isolate the corrupt page than to panic the kernel.

Signed-off-by: Tong Tiangen 
---
  arch/arm64/lib/copy_from_user.S | 10 +-
  arch/arm64/lib/copy_to_user.S   | 10 +-
  arch/arm64/mm/extable.c |  8 
  3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S
index 34e317907524..1bf676e9201d 100644
--- a/arch/arm64/lib/copy_from_user.S
+++ b/arch/arm64/lib/copy_from_user.S
@@ -25,7 +25,7 @@
.endm
  
  	.macro strb1 reg, ptr, val

-   strb \reg, [\ptr], \val
+   USER(9998f, strb \reg, [\ptr], \val)
.endm


This is a store to *kernel* memory, not user memory. It should not be marked
with USER().


This does cause some confusion; my original idea was to reuse
the fixup capability of USER().




I understand that you *might* want to handle memory errors on these stores, but
the commit message doesn't describe that and the associated trade-off. For
example, consider that when a copy_from_user fails we'll try to zero the
remaining buffer via memset(); so if a STR* instruction in copy_to_user
faulted, upon handling the fault we'll immediately try to fix that up with some
more stores which will also fault, but won't get fixed up, leading to a panic()
anyway...


When copy_from_user() triggers a memory error, there are two cases: a load
from user memory fails, or a store to kernel memory fails. In the former case
the remaining kernel memory can still be cleared; in the latter it cannot,
because the destination page is poisoned.

The purpose of the memset() is to keep the kernel memory (possibly spanning
multiple subsequent pages) consistent: the data that was not copied
should be set to 0. My consideration here is that since our ultimate
goal is to kill the owner thread of that kernel memory, the
"consistency" of the kernel memory data is not so important, and maintaining it
increases the processing complexity.

The trade-offs do need to be added to the commit message after agreement
is reached :)
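
For context, the tail-zeroing Mark refers to looks roughly like this in the
generic copy_from_user() path (a simplified sketch, not the exact kernel code):
the memset() of the uncopied tail uses plain stores with no fixup entry, so a
faulting kernel destination would fault again, this time fatally.

	static inline unsigned long example_copy_from_user(void *to,
			const void __user *from, unsigned long n)
	{
		unsigned long res = raw_copy_from_user(to, from, n);

		if (unlikely(res))
			memset(to + (n - res), 0, res);	/* plain stores, no extable entry */
		return res;	/* bytes NOT copied */
	}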


Further, this change will also silently fixup unexpected kernel faults if we
pass bad kernel pointers to copy_{to,from}_user, which will hide real bugs.


I think this is better than panicking the kernel, because the real bug
belongs to the user process. Even if a wrong pointer is
passed, the page corresponding to the wrong pointer has a memory
error. In addition, the panic information contains the necessary information
for users to check.



So NAK to this change as-is; likewise for the addition of USER() to other ldr*
macros in copy_from_user.S and the addition of USER() str* macros in
copy_to_user.S.

If we want to handle memory errors on some kaccesses, we need a new EX_TYPE_*
separate from the usual EX_TYPE_KACESS_ERR_ZERO that means "handle memory
errors, but treat other faults as fatal". That should come with a rationale and
explanation of why it's actually useful.


This makes sense. I'll add kaccess types that can be processed properly.



[...]


diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 478e639f8680..28ec35e3d210 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -85,10 +85,10 @@ bool fixup_exception_mc(struct pt_regs *regs)
if (!ex)
return false;
  
-	/*

-* This is not complete, More Machine check safe extable type can
-* be processed here.
-*/
+   switch (ex->type) {
+   case EX_TYPE_UACCESS_ERR_ZERO:
+   return ex_handler_uaccess_err_zero(ex, regs);
+   }


Please fold this part into the prior patch, and start off with *only* handling
errors on accesses already marked with EX_TYPE_UACCESS_ERR_ZERO. I think that
change would be relatively uncontroversial, and it would be much easier to
build atop that.


OK, the two patches will be merged in the next release.

Many thanks.
Tong.



Thanks,
Mark.
.


Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Christophe Leroy


On 30/01/2024 at 10:16, Chen-Yu Tsai wrote:
> 
> Hi,
> 
> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>> helps removing related #ifdefery in C files.
>>>
>>> Signed-off-by: Christophe Leroy 
>>
>> Very nice cleanup, thanks!, applied and pushed
>>
>>Luis
> 
> On next-20240130, which has your modules-next branch, and thus this
> series and the other "module: Use set_memory_rox()" series applied,
> my kernel crashes in some very weird way. Reverting your branch
> makes the crash go away.
> 
> I thought I'd report it right away. Maybe you folks would know what's
> happening here? This is on arm64.

That's strange, it seems to crash in module_bug_finalize() which is
_before_ calls to module_enable_ro() and such.

Can you try to revert the 6 patches one by one to see which one
introduces the problem?

In reality, only patch 677bfb9db8a3 really changes things. The other ones are
more or less only cleanup.

Thanks
Christophe


> 
> [   10.481015] Unable to handle kernel paging request at virtual address 
> ffde85245d30
> [   10.490369] KASAN: maybe wild-memory-access in range 
> [0x00f42922e980-0x00f42922e987]
> [   10.503744] Mem abort info:
> [   10.509383]   ESR = 0x9647
> [   10.514400]   EC = 0x25: DABT (current EL), IL = 32 bits
> [   10.522366]   SET = 0, FnV = 0
> [   10.526343]   EA = 0, S1PTW = 0
> [   10.530695]   FSC = 0x07: level 3 translation fault
> [   10.537081] Data abort info:
> [   10.540839]   ISV = 0, ISS = 0x0047, ISS2 = 0x
> [   10.546456]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [   10.551726]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [   10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000
> [   10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, 
> pud=10023003, pmd=1001121eb003, pte=
> [   10.578887] Internal error: Oops: 00009647 [#1] PREEMPT SMP
> [   10.585815] Modules linked in:
> [   10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB
>   6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 
> 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60
> [   10.607021] Hardware name: Google Tentacruel board (DT)
> [   10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   10.621954] pc : module_bug_finalize+0x118/0x148
> [   10.626823] lr : module_bug_finalize+0x118/0x148
> [   10.631463] sp : ffc0820478d0
> [   10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: 
> ffde8d7d31a0
> [   10.631477] x26: ffde85223780 x25:  x24: 
> ffde8c413cc0
> [   10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: 
> ffde85223ba8
> [   10.631495] x20: ffde85223780 x19: ffde85245d28 x18: 
> 
> [   10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: 
> ffde8aab8124
> [   10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: 
> ffbbd1bf9d91
> [   10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : 
> dfc0
> [   10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : 
> 0001
> [   10.631539] x5 : ffde8dfcec80 x4 :  x3 : 
> ffde8bbadf08
> [   10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : 
> 
> [   10.631556] Call trace:
> [   10.631559]  module_bug_finalize+0x118/0x148
> [   10.631565]  load_module+0x25ec/0x2a78
> [   10.631572]  __do_sys_init_module+0x234/0x418
> [   10.631578]  __arm64_sys_init_module+0x4c/0x68
> [   10.631584]  invoke_syscall+0x68/0x198
> [   10.631589]  el0_svc_common.constprop.0+0x11c/0x150
> [   10.631594]  do_el0_svc+0x38/0x50
> [   10.631598]  el0_svc+0x50/0xa0
> [   10.631604]  el0t_64_sync_handler+0x120/0x130
> [   10.631609]  el0t_64_sync+0x1a8/0x1b0
> [   10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675)
> [   10.631624] ---[ end trace  ]---
> [   10.642965] Kernel panic - not syncing: Oops: Fatal exception
> [   10.642975] SMP: stopping secondary CPUs
> [   10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000
> [   10.648343] PHYS_OFFSET: 0x4000
> [   10.648345] CPU features: 0x0,c061,7002814a,2100720b
> [   10.648350] Memory Limit: none
> 


Re: [PATCH v10 2/6] arm64: add support for machine check error safe

2024-01-30 Thread Tong Tiangen




On 2024/1/30 1:51, Mark Rutland wrote:

On Mon, Jan 29, 2024 at 09:46:48PM +0800, Tong Tiangen wrote:

For the arm64 kernel, when it processes hardware memory errors from
synchronous notifications (do_sea()), if the error is consumed within the
kernel, the current behaviour is to panic. However, that is not optimal.

Take uaccess for example: if the uaccess operation fails due to a memory
error, only the user process will be affected. Killing the user process and
isolating the corrupt page is a better choice.

This patch only enables the machine check safe framework and adds an exception
fixup before the kernel panic in do_sea().

Signed-off-by: Tong Tiangen 
---
  arch/arm64/Kconfig   |  1 +
  arch/arm64/include/asm/extable.h |  1 +
  arch/arm64/mm/extable.c  | 16 
  arch/arm64/mm/fault.c| 29 -
  4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..2cc34b5e7abb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -20,6 +20,7 @@ config ARM64
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_HAS_CACHE_LINE_SIZE
+   select ARCH_HAS_COPY_MC if ACPI_APEI_GHES
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE
diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 72b0e71cc3de..f80ebd0addfd 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -46,4 +46,5 @@ bool ex_handler_bpf(const struct exception_table_entry *ex,
  #endif /* !CONFIG_BPF_JIT */
  
  bool fixup_exception(struct pt_regs *regs);

+bool fixup_exception_mc(struct pt_regs *regs);
  #endif
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index 228d681a8715..478e639f8680 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -76,3 +76,19 @@ bool fixup_exception(struct pt_regs *regs)
  
  	BUG();

  }
+
+bool fixup_exception_mc(struct pt_regs *regs)


Can we please replace 'mc' with something like 'memory_error' ?

There's no "machine check" on arm64, and 'mc' is opaque regardless.


OK, it's more appropriate to use "memory_error" on arm64.




+{
+   const struct exception_table_entry *ex;
+
+   ex = search_exception_tables(instruction_pointer(regs));
+   if (!ex)
+   return false;
+
+   /*
+* This is not complete, More Machine check safe extable type can
+* be processed here.
+*/
+
+   return false;
+}


As with my comment on the subsequent patch, I'd much prefer that we handle
EX_TYPE_UACCESS_ERR_ZERO from the outset.


OK, in the next version the two patches will be merged.






diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..312932dc100b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -730,6 +730,31 @@ static int do_bad(unsigned long far, unsigned long esr, 
struct pt_regs *regs)
return 1; /* "fault" */
  }
  
+static bool arm64_do_kernel_sea(unsigned long addr, unsigned int esr,

+struct pt_regs *regs, int sig, int code)
+{
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_COPY_MC))
+   return false;
+
+   if (user_mode(regs))
+   return false;


This function is called "arm64_do_kernel_sea"; surely the caller should *never*
call this for a SEA taken from user mode?


In do_sea(), the processing logic is as follows:
  do_sea()
  {
[...]
if (user_mode(regs) && apei_claim_sea(regs) == 0) {
   return 0;
}
[...]
//[1]
if (!arm64_do_kernel_sea()) {
   arm64_notify_die();
}
  }

[1] It is still possible to get here in user_mode(). If user_mode() gets here,
 it indicates that the impact caused by the memory error could not be
 processed correctly by apei_claim_sea().


In this case, only arm64_notify_die() can be used. This also maintains
the original logic of user_mode()'s processing.




+
+   if (apei_claim_sea(regs) < 0)
+   return false;
+
+   if (!fixup_exception_mc(regs))
+   return false;
+
+   if (current->flags & PF_KTHREAD)
+   return true;


I think this needs a comment; why do we allow kthreads to go on, yet kill user
threads? What about helper threads (e.g. for io_uring)?


If a memory error occurs in a kernel thread, the problem is more
serious than in a user thread. As a result, related kernel
functions, such as khugepaged, cannot run properly. A kernel panic is
a better choice at this time.

Therefore, the processing scope of this framework is limited to the user
thread.





+
+   set_thread_esr(0, esr);


Why do we set the ESR to 0?


The purpose is to reuse the logic of arm64_notify_die() and set the 
following parameters before sending signals to users:

  

[PATCH v2 5/5] mm: ptdump: add check_wx_pages debugfs attribute

2024-01-30 Thread Christophe Leroy
Add a readable attribute in debugfs to trigger a
W^X pages check at any time.

To trigger the test, just read /sys/kernel/debug/check_wx_pages.
It will report FAILED if the test failed, SUCCESS otherwise.

Detailed results are provided in dmesg.

Signed-off-by: Christophe Leroy 
---
v2: Make it a read attribute which reports SUCCESS/FAILED instead of only 
relying on kernel message log.
---
 mm/ptdump.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/mm/ptdump.c b/mm/ptdump.c
index 03c1bdae4a43..106e1d66e9f9 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #include 
+#include 
 #include 
 #include 
 
@@ -163,3 +164,24 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct 
mm_struct *mm, pgd_t *pgd)
/* Flush out the last page */
st->note_page(st, 0, -1, 0);
 }
+
+static int check_wx_show(struct seq_file *m, void *v)
+{
+   if (ptdump_check_wx())
+   seq_puts(m, "SUCCESS\n");
+   else
+   seq_puts(m, "FAILED\n");
+
+   return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(check_wx);
+
+static int ptdump_debugfs_init(void)
+{
+	debugfs_create_file("check_wx_pages", 0400, NULL, NULL, &check_wx_fops);
+
+   return 0;
+}
+
+device_initcall(ptdump_debugfs_init);
-- 
2.43.0



[PATCH v2 4/5] mm: ptdump: Have ptdump_check_wx() return bool

2024-01-30 Thread Christophe Leroy
Have ptdump_check_wx() return true when the check is successful
or false otherwise.

Signed-off-by: Christophe Leroy 
---
v2: New
---
 arch/arm64/mm/ptdump.c  | 11 ---
 arch/powerpc/mm/ptdump/ptdump.c | 13 +
 arch/riscv/mm/ptdump.c  | 11 ---
 arch/s390/mm/dump_pagetables.c  | 13 +
 arch/x86/include/asm/pgtable.h  |  2 +-
 arch/x86/mm/dump_pagetables.c   | 19 ---
 include/linux/ptdump.h  |  2 +-
 7 files changed, 48 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index e305b6593c4e..696822f75582 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -345,7 +345,7 @@ static struct ptdump_info kernel_ptdump_info = {
.base_addr  = PAGE_OFFSET,
 };
 
-void ptdump_check_wx(void)
+bool ptdump_check_wx(void)
 {
struct pg_state st = {
.seq = NULL,
@@ -366,11 +366,16 @@ void ptdump_check_wx(void)
 
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
 
-   if (st.wx_pages || st.uxn_pages)
+   if (st.wx_pages || st.uxn_pages) {
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu 
non-UXN pages found\n",
st.wx_pages, st.uxn_pages);
-   else
+
+   return false;
+   } else {
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
+
+   return true;
+   }
 }
 
 static int __init ptdump_init(void)
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index b835c80371cd..9dc239967b77 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -327,7 +327,7 @@ static void __init build_pgtable_complete_mask(void)
pg_level[i].mask |= pg_level[i].flag[j].mask;
 }
 
-void ptdump_check_wx(void)
+bool ptdump_check_wx(void)
 {
struct pg_state st = {
.seq = NULL,
@@ -344,15 +344,20 @@ void ptdump_check_wx(void)
};
 
if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && 
!mmu_has_feature(MMU_FTR_KERNEL_RO))
-   return;
+   return true;
 
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
 
-   if (st.wx_pages)
+   if (st.wx_pages) {
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found\n",
st.wx_pages);
-   else
+
+   return false;
+   } else {
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
+
+   return true;
+   }
 }
 
 static int __init ptdump_init(void)
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 075265603313..1289cc6d3700 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -335,7 +335,7 @@ static void ptdump_walk(struct seq_file *s, struct 
ptd_mm_info *pinfo)
ptdump_walk_pgd(, pinfo->mm, NULL);
 }
 
-void ptdump_check_wx(void)
+bool ptdump_check_wx(void)
 {
struct pg_state st = {
.seq = NULL,
@@ -356,11 +356,16 @@ void ptdump_check_wx(void)
 
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
 
-   if (st.wx_pages)
+   if (st.wx_pages) {
pr_warn("Checked W+X mappings: failed, %lu W+X pages found\n",
st.wx_pages);
-   else
+
+   return false;
+   } else {
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
+
+   return true;
+   }
 }
 
 static int ptdump_show(struct seq_file *m, void *v)
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index 99da5a5602a8..ffd07ed7b4af 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -192,7 +192,7 @@ static void note_page(struct ptdump_state *pt_st, unsigned 
long addr, int level,
}
 }
 
-void ptdump_check_wx(void)
+bool ptdump_check_wx(void)
 {
struct pg_state st = {
.ptdump = {
@@ -215,14 +215,19 @@ void ptdump_check_wx(void)
};
 
if (!MACHINE_HAS_NX)
-   return;
+   return true;
	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
-   if (st.wx_pages)
+   if (st.wx_pages) {
pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found\n", 
st.wx_pages);
-   else
+
+   return false;
+   } else {
pr_info("Checked W+X mappings: passed, no %sW+X pages found\n",
(nospec_uses_trampoline() || 
!static_key_enabled(&cpu_has_bear)) ?
"unexpected " : "");
+
+   return true;
+   }
 }
 
 #ifdef CONFIG_PTDUMP_DEBUGFS
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6c979028e521..b50b2ef63672 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -31,7 +31,7 @@ struct seq_file;
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
   

[PATCH v2 3/5] powerpc,s390: ptdump: Define ptdump_check_wx() regardless of CONFIG_DEBUG_WX

2024-01-30 Thread Christophe Leroy
The following patch will use ptdump_check_wx() regardless of
CONFIG_DEBUG_WX, so define it at all times on powerpc and s390
just like other architectures. Though keep the WARN_ON_ONCE()
only when CONFIG_DEBUG_WX is set.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/ptdump/ptdump.c | 7 +++
 arch/s390/mm/dump_pagetables.c  | 7 ++-
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 620d4917ebe8..b835c80371cd 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -184,13 +184,14 @@ static void note_prot_wx(struct pg_state *st, unsigned 
long addr)
 {
pte_t pte = __pte(st->current_flags);
 
-   if (!IS_ENABLED(CONFIG_DEBUG_WX) || !st->check_wx)
+   if (!st->check_wx)
return;
 
if (!pte_write(pte) || !pte_exec(pte))
return;
 
-   WARN_ONCE(1, "powerpc/mm: Found insecure W+X mapping at address 
%p/%pS\n",
+   WARN_ONCE(IS_ENABLED(CONFIG_DEBUG_WX),
+ "powerpc/mm: Found insecure W+X mapping at address %p/%pS\n",
  (void *)st->start_address, (void *)st->start_address);
 
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
@@ -326,7 +327,6 @@ static void __init build_pgtable_complete_mask(void)
pg_level[i].mask |= pg_level[i].flag[j].mask;
 }
 
-#ifdef CONFIG_DEBUG_WX
 void ptdump_check_wx(void)
 {
struct pg_state st = {
@@ -354,7 +354,6 @@ void ptdump_check_wx(void)
else
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
 }
-#endif
 
 static int __init ptdump_init(void)
 {
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index 8dcb4e0c71bd..99da5a5602a8 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -121,7 +121,6 @@ static void print_prot(struct seq_file *m, unsigned int pr, 
int level)
 
 static void note_prot_wx(struct pg_state *st, unsigned long addr)
 {
-#ifdef CONFIG_DEBUG_WX
if (!st->check_wx)
return;
if (st->current_prot & _PAGE_INVALID)
@@ -138,10 +137,10 @@ static void note_prot_wx(struct pg_state *st, unsigned 
long addr)
 */
if (addr == PAGE_SIZE && (nospec_uses_trampoline() || 
!static_key_enabled(&cpu_has_bear)))
return;
-   WARN_ONCE(1, "s390/mm: Found insecure W+X mapping at address %pS\n",
+   WARN_ONCE(IS_ENABLED(CONFIG_DEBUG_WX),
+ "s390/mm: Found insecure W+X mapping at address %pS\n",
  (void *)st->start_address);
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
-#endif /* CONFIG_DEBUG_WX */
 }
 
 static void note_page(struct ptdump_state *pt_st, unsigned long addr, int 
level, u64 val)
@@ -193,7 +192,6 @@ static void note_page(struct ptdump_state *pt_st, unsigned 
long addr, int level,
}
 }
 
-#ifdef CONFIG_DEBUG_WX
 void ptdump_check_wx(void)
 {
struct pg_state st = {
@@ -226,7 +224,6 @@ void ptdump_check_wx(void)
(nospec_uses_trampoline() || 
!static_key_enabled(&cpu_has_bear)) ?
"unexpected " : "");
 }
-#endif /* CONFIG_DEBUG_WX */
 
 #ifdef CONFIG_PTDUMP_DEBUGFS
 static int ptdump_show(struct seq_file *m, void *v)
-- 
2.43.0



[PATCH v2 2/5] arm64, powerpc, riscv, s390, x86: ptdump: Refactor CONFIG_DEBUG_WX

2024-01-30 Thread Christophe Leroy
All architectures using the core ptdump functionality also implement
CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
function called debug_checkwx() that is called by mark_rodata_ro(),
which is a substitute for ptdump_check_wx() when CONFIG_DEBUG_WX is
set and a no-op otherwise.

Refactor by centrally defining debug_checkwx() in linux/ptdump.h and
call debug_checkwx() immediately after calling mark_rodata_ro()
instead of calling it at the end of every mark_rodata_ro().

On x86_32, mark_rodata_ro() first checks __supported_pte_mask has
_PAGE_NX before calling debug_checkwx(). Now the check is inside the
callee ptdump_walk_pgd_level_checkwx().

On powerpc_64, mark_rodata_ro() bails out early before calling
ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The
check is now also done in ptdump_check_wx() as it is called outside
mark_rodata_ro().

Signed-off-by: Christophe Leroy 
Reviewed-by: Alexandre Ghiti 
---
v2: For x86 change macro ptdump_check_wx() to ptdump_check_wx
---
 arch/arm64/include/asm/ptdump.h |  7 ---
 arch/arm64/mm/mmu.c |  2 --
 arch/powerpc/mm/mmu_decl.h  |  6 --
 arch/powerpc/mm/pgtable_32.c|  4 
 arch/powerpc/mm/pgtable_64.c|  3 ---
 arch/powerpc/mm/ptdump/ptdump.c |  3 +++
 arch/riscv/include/asm/ptdump.h | 22 --
 arch/riscv/mm/init.c|  3 ---
 arch/riscv/mm/ptdump.c  |  1 -
 arch/s390/include/asm/ptdump.h  | 14 --
 arch/s390/mm/dump_pagetables.c  |  1 -
 arch/s390/mm/init.c |  2 --
 arch/x86/include/asm/pgtable.h  |  3 +--
 arch/x86/mm/dump_pagetables.c   |  3 +++
 arch/x86/mm/init_32.c   |  2 --
 arch/x86/mm/init_64.c   |  2 --
 include/linux/ptdump.h  |  7 +++
 init/main.c |  2 ++
 18 files changed, 16 insertions(+), 71 deletions(-)
 delete mode 100644 arch/riscv/include/asm/ptdump.h
 delete mode 100644 arch/s390/include/asm/ptdump.h

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 581caac525b0..5b1701c76d1c 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -29,13 +29,6 @@ void __init ptdump_debugfs_register(struct ptdump_info 
*info, const char *name);
 static inline void ptdump_debugfs_register(struct ptdump_info *info,
   const char *name) { }
 #endif
-void ptdump_check_wx(void);
 #endif /* CONFIG_PTDUMP_CORE */
 
-#ifdef CONFIG_DEBUG_WX
-#define debug_checkwx()ptdump_check_wx()
-#else
-#define debug_checkwx()do { } while (0)
-#endif
-
 #endif /* __ASM_PTDUMP_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1ac7467d34c9..3a27d887f7dd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -632,8 +632,6 @@ void mark_rodata_ro(void)
section_size = (unsigned long)__init_begin - (unsigned 
long)__start_rodata;
update_mapping_prot(__pa_symbol(__start_rodata), (unsigned 
long)__start_rodata,
section_size, PAGE_KERNEL_RO);
-
-   debug_checkwx();
 }
 
 static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void 
*va_end,
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index 72341b9fb552..90dcc2844056 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -171,12 +171,6 @@ static inline void mmu_mark_rodata_ro(void) { }
 void __init mmu_mapin_immr(void);
 #endif
 
-#ifdef CONFIG_DEBUG_WX
-void ptdump_check_wx(void);
-#else
-static inline void ptdump_check_wx(void) { }
-#endif
-
 static inline bool debug_pagealloc_enabled_or_kfence(void)
 {
return IS_ENABLED(CONFIG_KFENCE) || debug_pagealloc_enabled();
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 5c02fd08d61e..12498017da8e 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -153,7 +153,6 @@ void mark_rodata_ro(void)
 
if (v_block_mapped((unsigned long)_stext + 1)) {
mmu_mark_rodata_ro();
-   ptdump_check_wx();
return;
}
 
@@ -166,9 +165,6 @@ void mark_rodata_ro(void)
   PFN_DOWN((unsigned long)_stext);
 
set_memory_ro((unsigned long)_stext, numpages);
-
-   // mark_initmem_nx() should have already run by now
-   ptdump_check_wx();
 }
 #endif
 
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 5ac1fd30341b..1b366526f4f2 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -150,9 +150,6 @@ void mark_rodata_ro(void)
radix__mark_rodata_ro();
else
hash__mark_rodata_ro();
-
-   // mark_initmem_nx() should have already run by now
-   ptdump_check_wx();
 }
 
 void mark_initmem_nx(void)
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 2313053fe679..620d4917ebe8 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ 
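
For reference, the central definition the commit message describes would look
roughly like the following sketch (an approximation, not the verbatim hunks
from this patch):

/* include/linux/ptdump.h (sketch) */
#ifdef CONFIG_DEBUG_WX
#define debug_checkwx()	ptdump_check_wx()
#else
#define debug_checkwx()	do { } while (0)
#endif

/* init/main.c, mark_readonly() (sketch) */
	if (rodata_enabled) {
		mark_rodata_ro();
		debug_checkwx();
		rodata_test();
	}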

[PATCH v2 1/5] arm: ptdump: Rename CONFIG_DEBUG_WX to CONFIG_ARM_DEBUG_WX

2024-01-30 Thread Christophe Leroy
CONFIG_DEBUG_WX is a core option defined in mm/Kconfig.debug

To avoid any future conflict, rename ARM version
into CONFIG_ARM_DEBUG_WX.

Signed-off-by: Christophe Leroy 
---
v2: Fixed left-over debug_checkwx() in mark_rodata_ro() and updated defconfigs
---
 arch/arm/Kconfig.debug   | 2 +-
 arch/arm/configs/aspeed_g4_defconfig | 2 +-
 arch/arm/configs/aspeed_g5_defconfig | 2 +-
 arch/arm/include/asm/ptdump.h| 6 +++---
 arch/arm/mm/init.c   | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index 5fbbac1b708b..f1fc278081d0 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -17,7 +17,7 @@ config ARM_PTDUMP_DEBUGFS
  kernel.
  If in doubt, say "N"
 
-config DEBUG_WX
+config ARM_DEBUG_WX
bool "Warn on W+X mappings at boot"
depends on MMU
select ARM_PTDUMP_CORE
diff --git a/arch/arm/configs/aspeed_g4_defconfig 
b/arch/arm/configs/aspeed_g4_defconfig
index b3dc0465796f..28b724d59e7e 100644
--- a/arch/arm/configs/aspeed_g4_defconfig
+++ b/arch/arm/configs/aspeed_g4_defconfig
@@ -252,7 +252,7 @@ CONFIG_DEBUG_INFO_REDUCED=y
 CONFIG_GDB_SCRIPTS=y
 CONFIG_STRIP_ASM_SYMS=y
 CONFIG_DEBUG_FS=y
-CONFIG_DEBUG_WX=y
+CONFIG_ARM_DEBUG_WX=y
 CONFIG_SCHED_STACK_END_CHECK=y
 CONFIG_PANIC_ON_OOPS=y
 CONFIG_PANIC_TIMEOUT=-1
diff --git a/arch/arm/configs/aspeed_g5_defconfig 
b/arch/arm/configs/aspeed_g5_defconfig
index 3fdf4dbfdea5..61cee1e7ebea 100644
--- a/arch/arm/configs/aspeed_g5_defconfig
+++ b/arch/arm/configs/aspeed_g5_defconfig
@@ -302,7 +302,7 @@ CONFIG_DEBUG_INFO_REDUCED=y
 CONFIG_GDB_SCRIPTS=y
 CONFIG_STRIP_ASM_SYMS=y
 CONFIG_DEBUG_FS=y
-CONFIG_DEBUG_WX=y
+CONFIG_ARM_DEBUG_WX=y
 CONFIG_SCHED_STACK_END_CHECK=y
 CONFIG_PANIC_ON_OOPS=y
 CONFIG_PANIC_TIMEOUT=-1
diff --git a/arch/arm/include/asm/ptdump.h b/arch/arm/include/asm/ptdump.h
index aad1d034136c..46a4575146ee 100644
--- a/arch/arm/include/asm/ptdump.h
+++ b/arch/arm/include/asm/ptdump.h
@@ -32,10 +32,10 @@ void ptdump_check_wx(void);
 
 #endif /* CONFIG_ARM_PTDUMP_CORE */
 
-#ifdef CONFIG_DEBUG_WX
-#define debug_checkwx() ptdump_check_wx()
+#ifdef CONFIG_ARM_DEBUG_WX
+#define arm_debug_checkwx() ptdump_check_wx()
 #else
-#define debug_checkwx() do { } while (0)
+#define arm_debug_checkwx() do { } while (0)
 #endif
 
 #endif /* __ASM_PTDUMP_H */
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index a42e4cd11db2..4c3d78691279 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -458,7 +458,7 @@ static int __mark_rodata_ro(void *unused)
 void mark_rodata_ro(void)
 {
stop_machine(__mark_rodata_ro, NULL, NULL);
-   debug_checkwx();
+   arm_debug_checkwx();
 }
 
 #else
-- 
2.43.0



[PATCH v2 0/5] mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute

2024-01-30 Thread Christophe Leroy
Refer to the old discussion at 
https://lore.kernel.org/lkml/20200422152656.GF676@willie-the-truck/T/#m802eaf33efd6f8d575939d157301b35ac0d4a64f
And https://github.com/KSPP/linux/issues/35

This series refactors CONFIG_DEBUG_WX for the 5 architectures
implementing CONFIG_GENERIC_PTDUMP

First rename stuff in ARM which uses similar names while not
implementing CONFIG_GENERIC_PTDUMP.

Then define a generic version of debug_checkwx() that calls
ptdump_check_wx() when CONFIG_DEBUG_WX is set. Call it immediately
after calling mark_rodata_ro() instead of calling it at the end of
every mark_rodata_ro().

Then implement a debugfs attribute that can be used to trigger
a W^X test at any time, regardless of CONFIG_DEBUG_WX

Changes in v2:
- Fixed a few build failures (patch 1 and 2)
- Added patch 4
- Make the attribute return SUCCESS/FAILURE as suggested by Heiko (patch 5)

Christophe Leroy (5):
  arm: ptdump: Rename CONFIG_DEBUG_WX to CONFIG_ARM_DEBUG_WX
  arm64, powerpc, riscv, s390, x86: ptdump: Refactor CONFIG_DEBUG_WX
  powerpc,s390: ptdump: Define ptdump_check_wx() regardless of
CONFIG_DEBUG_WX
  mm: ptdump: Have ptdump_check_wx() return bool
  mm: ptdump: add check_wx_pages debugfs attribute

 arch/arm/Kconfig.debug   |  2 +-
 arch/arm/configs/aspeed_g4_defconfig |  2 +-
 arch/arm/configs/aspeed_g5_defconfig |  2 +-
 arch/arm/include/asm/ptdump.h|  6 +++---
 arch/arm/mm/init.c   |  2 +-
 arch/arm64/include/asm/ptdump.h  |  7 ---
 arch/arm64/mm/mmu.c  |  2 --
 arch/arm64/mm/ptdump.c   | 11 ---
 arch/powerpc/mm/mmu_decl.h   |  6 --
 arch/powerpc/mm/pgtable_32.c |  4 
 arch/powerpc/mm/pgtable_64.c |  3 ---
 arch/powerpc/mm/ptdump/ptdump.c  | 21 ++---
 arch/riscv/include/asm/ptdump.h  | 22 --
 arch/riscv/mm/init.c |  3 ---
 arch/riscv/mm/ptdump.c   | 12 
 arch/s390/include/asm/ptdump.h   | 14 --
 arch/s390/mm/dump_pagetables.c   | 21 +++--
 arch/s390/mm/init.c  |  2 --
 arch/x86/include/asm/pgtable.h   |  5 ++---
 arch/x86/mm/dump_pagetables.c| 20 ++--
 arch/x86/mm/init_32.c|  2 --
 arch/x86/mm/init_64.c|  2 --
 include/linux/ptdump.h   |  7 +++
 init/main.c  |  2 ++
 mm/ptdump.c  | 22 ++
 25 files changed, 95 insertions(+), 107 deletions(-)
 delete mode 100644 arch/riscv/include/asm/ptdump.h
 delete mode 100644 arch/s390/include/asm/ptdump.h

-- 
2.43.0



Re: [PATCH v10 5/6] arm64: support copy_mc_[user]_highpage()

2024-01-30 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:51PM +0800, Tong Tiangen wrote:
> Currently, many scenarios that can tolerate memory errors when copying a page
> have been supported in the kernel[1][2][3], all of which are implemented by
> copy_mc_[user]_highpage(). arm64 should also support this mechanism.
> 
> Due to mte, arm64 needs to have its own copy_mc_[user]_highpage()
> architecture implementation; the macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
> __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.
> 
> Add a new helper, copy_mc_page(), which provides a machine-check-safe page
> copy implementation. copy_mc_page() in copy_mc_page.S largely borrows from
> copy_page() in copy_page.S; the main difference is that copy_mc_page() adds
> an extable entry to every load/store insn to support machine check safety.
> 
> Add a new extable type, EX_TYPE_COPY_MC_PAGE_ERR_ZERO, which is used in
> copy_mc_page().
> 
> [1]a873dfe1032a ("mm, hwpoison: try to recover from copy-on write faults")
> [2]5f2500b93cc9 ("mm/khugepaged: recover from poisoned anonymous memory")
> [3]6b970599e807 ("mm: hwpoison: support recovery from 
> ksm_might_need_to_copy()")
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/asm-extable.h | 15 ++
>  arch/arm64/include/asm/assembler.h   |  4 ++
>  arch/arm64/include/asm/mte.h |  5 ++
>  arch/arm64/include/asm/page.h| 10 
>  arch/arm64/lib/Makefile  |  2 +
>  arch/arm64/lib/copy_mc_page.S| 78 
>  arch/arm64/lib/mte.S | 27 ++
>  arch/arm64/mm/copypage.c | 66 ---
>  arch/arm64/mm/extable.c  |  7 +--
>  include/linux/highmem.h  |  8 +++
>  10 files changed, 213 insertions(+), 9 deletions(-)
>  create mode 100644 arch/arm64/lib/copy_mc_page.S
> 
> diff --git a/arch/arm64/include/asm/asm-extable.h 
> b/arch/arm64/include/asm/asm-extable.h
> index 980d1dd8e1a3..819044fefbe7 100644
> --- a/arch/arm64/include/asm/asm-extable.h
> +++ b/arch/arm64/include/asm/asm-extable.h
> @@ -10,6 +10,7 @@
>  #define EX_TYPE_UACCESS_ERR_ZERO 2
>  #define EX_TYPE_KACCESS_ERR_ZERO 3
>  #define EX_TYPE_LOAD_UNALIGNED_ZEROPAD   4
> +#define EX_TYPE_COPY_MC_PAGE_ERR_ZERO5
>  
>  /* Data fields for EX_TYPE_UACCESS_ERR_ZERO */
>  #define EX_DATA_REG_ERR_SHIFT0
> @@ -51,6 +52,16 @@
>  #define _ASM_EXTABLE_UACCESS(insn, fixup)\
>   _ASM_EXTABLE_UACCESS_ERR_ZERO(insn, fixup, wzr, wzr)
>  
> +#define _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, err, zero)   \
> + __ASM_EXTABLE_RAW(insn, fixup,  \
> +   EX_TYPE_COPY_MC_PAGE_ERR_ZERO,\
> +   ( \
> + EX_DATA_REG(ERR, err) | \
> + EX_DATA_REG(ZERO, zero) \
> +   ))
> +
> +#define _ASM_EXTABLE_COPY_MC_PAGE(insn, fixup)   
> \
> + _ASM_EXTABLE_COPY_MC_PAGE_ERR_ZERO(insn, fixup, wzr, wzr)
>  /*
>   * Create an exception table entry for uaccess `insn`, which will branch to 
> `fixup`
>   * when an unhandled fault is taken.
> @@ -59,6 +70,10 @@
>   _ASM_EXTABLE_UACCESS(\insn, \fixup)
>   .endm
>  
> + .macro  _asm_extable_copy_mc_page, insn, fixup
> + _ASM_EXTABLE_COPY_MC_PAGE(\insn, \fixup)
> + .endm
> +

This should share a common EX_TYPE_ with the other "kaccess where memory error
is handled but other faults are fatal" cases.

>  /*
>   * Create an exception table entry for `insn` if `fixup` is provided. 
> Otherwise
>   * do nothing.
> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index 513787e43329..e1d8ce155878 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -154,6 +154,10 @@ lr   .reqx30 // link register
>  #define CPU_LE(code...) code
>  #endif
>  
> +#define CPY_MC(l, x...)  \
> +:   x;   \
> + _asm_extable_copy_mc_pageb, l
> +
>  /*
>   * Define a macro that constructs a 64-bit value by concatenating two
>   * 32-bit registers. Note that on big endian systems the order of the
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 91fbd5c8a391..9cdded082dd4 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -92,6 +92,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  void mte_zero_clear_page_tags(void *addr);
>  void mte_sync_tags(pte_t pte, unsigned int nr_pages);
>  void mte_copy_page_tags(void *kto, const void *kfrom);
> +int mte_copy_mc_page_tags(void *kto, const void *kfrom);
>  void mte_thread_init_user(void);
>  void mte_thread_switch(struct task_struct *next);
>  void mte_cpu_setup(void);
> @@ -128,6 +129,10 @@ static inline void 

Re: [PATCH v10 6/6] arm64: introduce copy_mc_to_kernel() implementation

2024-01-30 Thread Mark Rutland
On Mon, Jan 29, 2024 at 09:46:52PM +0800, Tong Tiangen wrote:
> The copy_mc_to_kernel() helper is a memory copy implementation that handles
> source exceptions. It can be used in memory copy scenarios that tolerate
> hardware memory errors (e.g. pmem_read/dax_copy_to_iter).
> 
> Currently, only x86 and ppc support this helper. Now that arm64 supports the
> machine check safe framework, we introduce a copy_mc_to_kernel()
> implementation.
> 
> Signed-off-by: Tong Tiangen 
> ---
>  arch/arm64/include/asm/string.h  |   5 +
>  arch/arm64/include/asm/uaccess.h |  21 +++
>  arch/arm64/lib/Makefile  |   2 +-
>  arch/arm64/lib/memcpy_mc.S   | 257 +++
>  mm/kasan/shadow.c|  12 ++
>  5 files changed, 296 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/lib/memcpy_mc.S

Looking at the diffstat and code, this duplicates arch/arm64/lib/memcpy.S with
a few annotations. Duplicating that code is not maintainable, and so we cannot
take this as-is.

If you want a version that can handle faults, it *must* be written such that
the code is shared with the regular memcpy. That could be done by using macros
to instantiate two copies (one with fault handling, the other without).

It would also be very helpful to see *any* indication that this has been
tested, which is sorely lacking in the series as-is.

Mark.

> diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
> index 3a3264ff47b9..995b63c26e99 100644
> --- a/arch/arm64/include/asm/string.h
> +++ b/arch/arm64/include/asm/string.h
> @@ -35,6 +35,10 @@ extern void *memchr(const void *, int, __kernel_size_t);
>  extern void *memcpy(void *, const void *, __kernel_size_t);
>  extern void *__memcpy(void *, const void *, __kernel_size_t);
>  
> +#define __HAVE_ARCH_MEMCPY_MC
> +extern int memcpy_mcs(void *, const void *, __kernel_size_t);
> +extern int __memcpy_mcs(void *, const void *, __kernel_size_t);
> +
>  #define __HAVE_ARCH_MEMMOVE
>  extern void *memmove(void *, const void *, __kernel_size_t);
>  extern void *__memmove(void *, const void *, __kernel_size_t);
> @@ -57,6 +61,7 @@ void memcpy_flushcache(void *dst, const void *src, size_t 
> cnt);
>   */
>  
>  #define memcpy(dst, src, len) __memcpy(dst, src, len)
> +#define memcpy_mcs(dst, src, len) __memcpy_mcs(dst, src, len)
>  #define memmove(dst, src, len) __memmove(dst, src, len)
>  #define memset(s, c, n) __memset(s, c, n)
>  
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 14be5000c5a0..61e28ef2112a 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -425,4 +425,25 @@ static inline size_t probe_subpage_writeable(const char 
> __user *uaddr,
>  
>  #endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
>  
> +#ifdef CONFIG_ARCH_HAS_COPY_MC
> +/**
> + * copy_mc_to_kernel - memory copy that handles source exceptions
> + *
> + * @dst: destination address
> + * @src: source address
> + * @len: number of bytes to copy
> + *
> + * Return 0 for success, or #size if there was an exception.
> + */
> +static inline unsigned long __must_check
> +copy_mc_to_kernel(void *to, const void *from, unsigned long size)
> +{
> + int ret;
> +
> + ret = memcpy_mcs(to, from, size);
> + return (ret == -EFAULT) ? size : 0;
> +}
> +#define copy_mc_to_kernel copy_mc_to_kernel
> +#endif
> +
>  #endif /* __ASM_UACCESS_H */
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index a2fd865b816d..899d6ae9698c 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -3,7 +3,7 @@ lib-y := clear_user.o delay.o copy_from_user.o
> \
>  copy_to_user.o copy_page.o   \
>  clear_page.o csum.o insn.o memchr.o memcpy.o \
>  memset.o memcmp.o strcmp.o strncmp.o strlen.o\
> -strnlen.o strchr.o strrchr.o tishift.o
> +strnlen.o strchr.o strrchr.o tishift.o memcpy_mc.o
>  
>  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
>  obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> diff --git a/arch/arm64/lib/memcpy_mc.S b/arch/arm64/lib/memcpy_mc.S
> new file mode 100644
> index ..7076b500d154
> --- /dev/null
> +++ b/arch/arm64/lib/memcpy_mc.S
> @@ -0,0 +1,257 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2012-2021, Arm Limited.
> + *
> + * Adapted from the original at:
> + * 
> https://github.com/ARM-software/optimized-routines/blob/afd6244a1f8d9229/string/aarch64/memcpy.S
> + */
> +
> +#include 
> +#include 
> +
> +/* Assumptions:
> + *
> + * ARMv8-a, AArch64, unaligned accesses.
> + *
> + */
> +
> +#define L(label) .L ## label
> +
> +#define dstinx0
> +#define src  x1
> +#define countx2
> +#define dst  x3
> +#define srcend   x4
> +#define dstend   x5
> +#define A_l  x6
> +#define A_lw w6
> +#define A_h  x7
> +#define B_l  x8
> +#define B_lw w8
> +#define B_h  x9
> 
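
To make the usage pattern from the commit message concrete, here is a hedged
sketch of a caller that tolerates poisoned source memory (the function name
and the -EIO error handling are illustrative assumptions, not code from the
series):

#include <linux/uaccess.h>
#include <linux/errno.h>

/* Illustrative caller of the copy_mc_to_kernel() helper quoted above. */
static int example_read_from_pmem(void *dst, const void *pmem_src, size_t len)
{
	unsigned long rem;

	/* Returns 0 on success, or the size if a memory error was hit. */
	rem = copy_mc_to_kernel(dst, pmem_src, len);
	if (rem)
		return -EIO;	/* a poisoned source line was consumed */

	return 0;
}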

Re: [PATCH] powerpc: add crtsavres.o to always-y instead of extra-y

2024-01-30 Thread Jan Stancek

On Tue, Nov 21, 2023 at 10:51:34AM +1000, Nicholas Piggin wrote:

On Tue Nov 21, 2023 at 9:23 AM AEST, Masahiro Yamada wrote:

crtsavres.o is linked to modules. However, as explained in commit
d0e628cd817f ("kbuild: doc: clarify the difference between extra-y
and always-y"), 'make modules' does not build extra-y.

For example, the following command fails:

  $ make ARCH=powerpc LLVM=1 KBUILD_MODPOST_WARN=1 mrproper ps3_defconfig 
modules
[snip]
LD [M]  arch/powerpc/platforms/cell/spufs/spufs.ko
  ld.lld: error: cannot open arch/powerpc/lib/crtsavres.o: No such file or 
directory
  make[3]: *** [scripts/Makefile.modfinal:56: 
arch/powerpc/platforms/cell/spufs/spufs.ko] Error 1
  make[2]: *** [Makefile:1844: modules] Error 2
  make[1]: *** [/home/masahiro/workspace/linux-kbuild/Makefile:350: 
__build_one_by_one] Error 2
  make: *** [Makefile:234: __sub-make] Error 2



Thanks. Is this the correct Fixes tag?

Fixes: d0e628cd817f ("powerpc/64: Do not link crtsavres.o in vmlinux")

Hmm, looks like LLD might just do this now automatically for us
too without --save-restore-funcs (https://reviews.llvm.org/D79977).
But we probably still need it for older versions, so we still need
your patch.


Hi,

I'm still seeing the error of crtsavres.o missing when building external modules
after "make LLVM=1 modules_prepare". Should it be built also in archprepare?

Thanks,
Jan


diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 051247027..a62334194 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -57,8 +57,11 @@ ifeq ($(CONFIG_PPC64)$(CONFIG_LD_IS_BFD),yy)
 # Have the linker provide sfpr if possible.
 # There is a corresponding test in arch/powerpc/lib/Makefile
 KBUILD_LDFLAGS_MODULE += --save-restore-funcs
+crtsavres_prepare:
 else
 KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
+crtsavres_prepare:
+   $(MAKE) $(build)=arch/powerpc/lib arch/powerpc/lib/crtsavres.o
 endif

 ifdef CONFIG_CPU_LITTLE_ENDIAN
@@ -389,7 +392,7 @@ vdso_prepare: prepare0
$(build)=arch/powerpc/kernel/vdso 
include/generated/vdso64-offsets.h)
 endif

-archprepare: checkbin
+archprepare: checkbin crtsavres_prepare

 archheaders:
$(Q)$(MAKE) $(build)=arch/powerpc/kernel/syscalls all



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Similar to how we optimized fork(), let's implement PTE batching when
> consecutive (present) PTEs map consecutive pages of the same large
> folio.
> 
> Most infrastructure we need for batching (mmu gather, rmap) is already
> there. We only have to add get_and_clear_full_ptes() and
> clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
> process a PTE range.
> 
> We won't bother sanity-checking the mapcount of all subpages, but only
> check the mapcount of the first subpage we process.
> 
> To keep small folios as fast as possible force inlining of a specialized
> variant using __always_inline with nr=1.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/pgtable.h | 66 +
>  mm/memory.c | 92 +
>  2 files changed, 132 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index aab227e12493..f0feae7f89fb 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -580,6 +580,72 @@ static inline pte_t ptep_get_and_clear_full(struct 
> mm_struct *mm,
>  }
>  #endif
>  
> +#ifndef get_and_clear_full_ptes
> +/**
> + * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the 
> same
> + *folio, collecting dirty/accessed bits.
> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
> + * returned PTE.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For 
> example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> + pte_t pte, tmp_pte;
> +
> + pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + while (--nr) {
> + ptep++;
> + addr += PAGE_SIZE;
> + tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (pte_dirty(tmp_pte))
> + pte = pte_mkdirty(pte);
> + if (pte_young(tmp_pte))
> + pte = pte_mkyoung(pte);
> + }
> + return pte;
> +}
> +#endif
> +
> +#ifndef clear_full_ptes
> +/**
> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.

I know it's implied from "pages of the same folio" (and even more so for the
above variant due to the mention of access/dirty), but I wonder if it's useful
to explicitly state that "all ptes being cleared are present at the time of the
call"?

> + * @mm: Address space the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For 
> example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, unsigned int nr, int full)
> +{
> + for (;;) {
> + ptep_get_and_clear_full(mm, addr, ptep, full);
> + if (--nr == 0)
> + break;
> + ptep++;
> + addr += PAGE_SIZE;
> + }
> +}
> +#endif
>  
>  /*
>   * If two threads concurrently fault at the same page, the thread that
> diff --git a/mm/memory.c b/mm/memory.c
> index a2190d7cfa74..38a010c4d04d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct 
> zap_details *details)
>   */
>  static inline void
>  zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> -   unsigned long addr, pte_t *pte,
> +   unsigned long addr, pte_t *pte, int nr,
> struct zap_details *details, pte_t pteval)
>  {
>   /* Zap on anonymous always means dropping everything */
> @@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
> *vma,
>   if (zap_drop_file_uffd_wp(details))
>   return;
>  
> - pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> + for (;;) {
> + /* the PFN in the PTE is irrelevant. */
> + 

Re: [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's add a helper that lets us batch-process multiple consecutive PTEs.
> 
> Note that the loop will get optimized out on all architectures except on
> powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
> ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
> macro).
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  arch/powerpc/include/asm/tlb.h |  2 ++
>  include/asm-generic/tlb.h  | 20 
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> index b3de6102a907..1ca7d4c4b90d 100644
> --- a/arch/powerpc/include/asm/tlb.h
> +++ b/arch/powerpc/include/asm/tlb.h
> @@ -19,6 +19,8 @@
>  
>  #include 
>  
> +static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t 
> *ptep,
> +   unsigned long address);
>  #define __tlb_remove_tlb_entry   __tlb_remove_tlb_entry
>  
>  #define tlb_flush tlb_flush
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 428c3f93addc..bd00dd238b79 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -616,6 +616,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather 
> *tlb,
>   __tlb_remove_tlb_entry(tlb, ptep, address); \
>   } while (0)
>  
> +/**
> + * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes 
> for
> + *   later tlb invalidation.
> + *
> + * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
> + * consecutive ptes instead of only a single one.
> + */
> +static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
> + pte_t *ptep, unsigned int nr, unsigned long address)
> +{
> + tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
> + for (;;) {
> + __tlb_remove_tlb_entry(tlb, ptep, address);
> + if (--nr == 0)
> + break;
> + ptep++;
> + address += PAGE_SIZE;
> + }
> +}
> +
>  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
>   do {\
>   unsigned long _sz = huge_page_size(h);  \



Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-01-30 Thread David Hildenbrand

On 30.01.24 10:21, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array actually contains the shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straightforward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Signed-off-by: David Hildenbrand 
---
  arch/s390/include/asm/tlb.h | 17 +++
  include/asm-generic/tlb.h   |  8 +
  include/linux/mm_types.h| 20 
  mm/mmu_gather.c | 61 +++--
  mm/swap.c   | 12 ++--
  mm/swap_state.c | 12 ++--
  6 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..abfd2bf29e9e 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
  static inline void tlb_flush(struct mmu_gather *tlb);
  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
  
  #define tlb_flush tlb_flush

  #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
  }
  
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,

+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..428c3f93addc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
   *
   *  - tlb_remove_page() / __tlb_remove_page()
   *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
   *
   *__tlb_remove_page_size() is the basic primitive that queues a page for
   *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
   *tlb_remove_page() and tlb_remove_page_size() imply the call to
   *tlb_flush_mmu() when required and has no return value.
   *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
   *  - tlb_change_page_size()
   *
   *call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
  
  extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,

bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+   unsigned int nr_pages, bool delay_rmap);
  
  #ifdef CONFIG_SMP

  /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..198662b7a39a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
  /* Perform rmap removal after we have flushed the TLB. */
  #define ENCODED_PAGE_BIT_DELAY_RMAP   1ul
  
+/*

+ * The next item in an encoded_page array is the "nr_pages" argument, 
specifying
+ * the number of consecutive pages starting from 
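
For context, the helpers this comment introduces would look roughly like the
following sketch (assuming nr_pages is simply stored shifted past the low
ENCODED_PAGE_BITS flag bits; an approximation, not the verbatim hunk):

/* Sketch of the encode/decode helpers for the "nr_pages" entry. */
#define ENCODED_PAGE_BIT_NR_PAGES_NEXT	2ul

static __always_inline struct encoded_page *encode_nr_pages(unsigned long nr)
{
	/* The low two bits are reserved for the encoded_page flags. */
	VM_WARN_ON_ONCE((nr << 2) >> 2 != nr);
	return (struct encoded_page *)(nr << 2);
}

static __always_inline unsigned long encoded_nr_pages(struct encoded_page *page)
{
	return ((unsigned long)page) >> 2;
}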

Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Add __tlb_remove_folio_pages(), which will remove multiple consecutive
> pages that belong to the same large folio, instead of only a single
> page. We'll be using this function when optimizing unmapping/zapping of
> large folios that are mapped by PTEs.
> 
> We're using the remaining spare bit in an encoded_page to indicate that
> the next encoded page in an array actually contains the shifted "nr_pages".
> Teach swap/freeing code about putting multiple folio references, and
> delayed rmap handling to remove page ranges of a folio.
> 
> This extension allows for still gathering almost as many small folios
> as we used to (-1, because we have to prepare for a possibly bigger next
> entry), but still allows for gathering consecutive pages that belong to the
> same large folio.
> 
> Note that we don't pass the folio pointer, because it is not required for
> now. Further, we don't support page_size != PAGE_SIZE, it won't be
> required for simple PTE batching.
> 
> We have to provide a separate s390 implementation, but it's fairly
> straightforward.
> 
> Another, more invasive and likely more expensive, approach would be to
> use folio+range or a PFN range instead of page+nr_pages. But, we should
> do that consistently for the whole mmu_gather. For now, let's keep it
> simple and add "nr_pages" only.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/s390/include/asm/tlb.h | 17 +++
>  include/asm-generic/tlb.h   |  8 +
>  include/linux/mm_types.h| 20 
>  mm/mmu_gather.c | 61 +++--
>  mm/swap.c   | 12 ++--
>  mm/swap_state.c | 12 ++--
>  6 files changed, 116 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index 48df896d5b79..abfd2bf29e9e 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
>  static inline void tlb_flush(struct mmu_gather *tlb);
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>   struct page *page, bool delay_rmap, int page_size);
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> + struct page *page, unsigned int nr_pages, bool delay_rmap);
>  
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct 
> mmu_gather *tlb,
>   return false;
>  }
>  
> +static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> + struct page *page, unsigned int nr_pages, bool delay_rmap)
> +{
> + struct encoded_page *encoded_pages[] = {
> + encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
> + encode_nr_pages(nr_pages),
> + };
> +
> + VM_WARN_ON_ONCE(delay_rmap);
> + VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
> +
> + free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
> + return false;
> +}
> +
>  static inline void tlb_flush(struct mmu_gather *tlb)
>  {
>   __tlb_flush_mm_lazy(tlb->mm);
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 2eb7b0d4f5d2..428c3f93addc 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -69,6 +69,7 @@
>   *
>   *  - tlb_remove_page() / __tlb_remove_page()
>   *  - tlb_remove_page_size() / __tlb_remove_page_size()
> + *  - __tlb_remove_folio_pages()
>   *
>   *__tlb_remove_page_size() is the basic primitive that queues a page for
>   *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
> @@ -78,6 +79,11 @@
>   *tlb_remove_page() and tlb_remove_page_size() imply the call to
>   *tlb_flush_mmu() when required and has no return value.
>   *
> + *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
> + *instead of removing a single page, remove the given number of 
> consecutive
> + *pages that are all part of the same (large) folio: just like calling
> + *__tlb_remove_page() on each page individually.
> + *
>   *  - tlb_change_page_size()
>   *
>   *call before __tlb_remove_page*() to set the current page-size; implies 
> a
> @@ -262,6 +268,8 @@ struct mmu_gather_batch {
>  
>  extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>   bool delay_rmap, int page_size);
> +bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
> + unsigned int nr_pages, bool delay_rmap);
>  
>  #ifdef CONFIG_SMP
>  /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1b89eec0d6df..198662b7a39a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -226,6 +226,15 @@ struct encoded_page;
>  /* Perform rmap removal after we have flushed the TLB. */
>  #define ENCODED_PAGE_BIT_DELAY_RMAP  1ul
>  
> +/*
> + * The 

Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-30 Thread Chen-Yu Tsai
Hi,

On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> > Declaring rodata_enabled and mark_rodata_ro() at all time
> > helps removing related #ifdefery in C files.
> > 
> > Signed-off-by: Christophe Leroy 
> 
> Very nice cleanup, thanks!, applied and pushed
> 
>   Luis

On next-20240130, which has your modules-next branch, and thus this
series and the other "module: Use set_memory_rox()" series applied,
my kernel crashes in some very weird way. Reverting your branch
makes the crash go away.

I thought I'd report it right away. Maybe you folks would know what's
happening here? This is on arm64.

[   10.481015] Unable to handle kernel paging request at virtual address 
ffde85245d30
[   10.490369] KASAN: maybe wild-memory-access in range 
[0x00f42922e980-0x00f42922e987]
[   10.503744] Mem abort info:
[   10.509383]   ESR = 0x9647
[   10.514400]   EC = 0x25: DABT (current EL), IL = 32 bits
[   10.522366]   SET = 0, FnV = 0
[   10.526343]   EA = 0, S1PTW = 0
[   10.530695]   FSC = 0x07: level 3 translation fault
[   10.537081] Data abort info:
[   10.540839]   ISV = 0, ISS = 0x0047, ISS2 = 0x
[   10.546456]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[   10.551726]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000
[   10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, 
pud=10023003, pmd=1001121eb003, pte=
[   10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP
[   10.585815] Modules linked in:
[   10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB  
6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 
3f2318148ecc5fa70d1092c2b874f9b59bdb7d60
[   10.607021] Hardware name: Google Tentacruel board (DT)
[   10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   10.621954] pc : module_bug_finalize+0x118/0x148
[   10.626823] lr : module_bug_finalize+0x118/0x148
[   10.631463] sp : ffc0820478d0
[   10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: ffde8d7d31a0
[   10.631477] x26: ffde85223780 x25:  x24: ffde8c413cc0
[   10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: ffde85223ba8
[   10.631495] x20: ffde85223780 x19: ffde85245d28 x18: 
[   10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: ffde8aab8124
[   10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: ffbbd1bf9d91
[   10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : dfc0
[   10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : 0001
[   10.631539] x5 : ffde8dfcec80 x4 :  x3 : ffde8bbadf08
[   10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : 
[   10.631556] Call trace:
[   10.631559]  module_bug_finalize+0x118/0x148
[   10.631565]  load_module+0x25ec/0x2a78
[   10.631572]  __do_sys_init_module+0x234/0x418
[   10.631578]  __arm64_sys_init_module+0x4c/0x68
[   10.631584]  invoke_syscall+0x68/0x198
[   10.631589]  el0_svc_common.constprop.0+0x11c/0x150
[   10.631594]  do_el0_svc+0x38/0x50
[   10.631598]  el0_svc+0x50/0xa0
[   10.631604]  el0t_64_sync_handler+0x120/0x130
[   10.631609]  el0t_64_sync+0x1a8/0x1b0
[   10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675)
[   10.631624] ---[ end trace  ]---
[   10.642965] Kernel panic - not syncing: Oops: Fatal exception
[   10.642975] SMP: stopping secondary CPUs
[   10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000
[   10.648343] PHYS_OFFSET: 0x4000
[   10.648345] CPU features: 0x0,c061,7002814a,2100720b
[   10.648350] Memory Limit: none



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread David Hildenbrand

Re-reading the docs myself:


+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
+ *  folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
+ * returned PTE.


"into the"


+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.


Something went missing:

May be overridden by the architecture; otherwise, implemented as a 
simple loop over ptep_get_and_clear_full().



--
Cheers,

David / dhildenb



Re: [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Nowadays, encoded pages are only used in mmu_gather handling. Let's
> update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
> it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
> 
> If encoded page pointers were ever used in another context again, we'd
> likely want to change the defines to reflect their context (e.g.,
> ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.
> 
> This is a preparation for using the remaining spare bit to indicate that
> the next item in an array of encoded pages is a "nr_pages" argument and
> not an encoded page.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  include/linux/mm_types.h | 17 +++--
>  mm/mmu_gather.c  |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8b611e13153e..1b89eec0d6df 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -210,8 +210,8 @@ struct page {
>   *
>   * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
>   * with the low bits of the pointer indicating extra context-dependent
> - * information. Not super-common, but happens in mmu_gather and mlock
> - * handling, and this acts as a type system check on that use.
> + * information. Only used in mmu_gather handling, and this acts as a type
> + * system check on that use.
>   *
>   * We only really have two guaranteed bits in general, although you could
>   * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
> @@ -220,21 +220,26 @@ struct page {
>   * Use the supplied helper functions to endcode/decode the pointer and bits.
>   */
>  struct encoded_page;
> -#define ENCODE_PAGE_BITS 3ul
> +
> +#define ENCODED_PAGE_BITS3ul
> +
> +/* Perform rmap removal after we have flushed the TLB. */
> +#define ENCODED_PAGE_BIT_DELAY_RMAP  1ul
> +
>  static __always_inline struct encoded_page *encode_page(struct page *page, 
> unsigned long flags)
>  {
> - BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
> + BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
>   return (struct encoded_page *)(flags | (unsigned long)page);
>  }
>  
>  static inline unsigned long encoded_page_flags(struct encoded_page *page)
>  {
> - return ENCODE_PAGE_BITS & (unsigned long)page;
> + return ENCODED_PAGE_BITS & (unsigned long)page;
>  }
>  
>  static inline struct page *encoded_page_ptr(struct encoded_page *page)
>  {
> - return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
> + return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
>  }
>  
>  /*
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index ac733d81b112..6540c99c6758 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch 
> *batch, struct vm_area_
>   for (int i = 0; i < batch->nr; i++) {
>   struct encoded_page *enc = batch->encoded_pages[i];
>  
> - if (encoded_page_flags(enc)) {
> + if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
>   struct page *page = encoded_page_ptr(enc);
>   folio_remove_rmap_pte(page_folio(page), page, vma);
>   }
> @@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>  bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
>   bool delay_rmap, int page_size)
>  {
> + int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
>   struct mmu_gather_batch *batch;
>  
>   VM_BUG_ON(!tlb->end);
> @@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
> struct page *page,
>* Add the page and check if we are full. If so
>* force a flush.
>*/
> - batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
> + batch->encoded_pages[batch->nr++] = encode_page(page, flags);
>   if (batch->nr == batch->max) {
>   if (!tlb_next_batch(tlb))
>   return true;



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:46, Ryan Roberts wrote:

On 30/01/2024 08:41, David Hildenbrand wrote:

On 30.01.24 09:13, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
   mm/memory.c | 92 ++---
   1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct
*vma,
   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
   }
   +static inline void zap_present_pte(struct mmu_gather *tlb,
+    struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+    unsigned long addr, struct zap_details *details,
+    int *rss, bool *force_flush, bool *force_break)
+{
+    struct mm_struct *mm = tlb->mm;
+    bool delay_rmap = false;
+    struct folio *folio;


You need to init this to NULL, otherwise it's a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.


Right, and we can stop setting it to NULL in the original function. Patch #2
changes these checks, which is why it's only a problem in this patch.


Yeah I only noticed that after sending out this reply and moving to the next
patch. Still worth fixing this intermediate state I think.


Absolutely, I didn't do patch-by-patch compilation yet (I suspect the
compiler would complain).


--
Cheers,

David / dhildenb



Re: [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's prepare for further changes by factoring it out into a separate
> function.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  mm/memory.c | 53 -
>  1 file changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 20bc13ab8db2..a2190d7cfa74 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
> *vma,
>   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>  }
>  
> -static inline void zap_present_pte(struct mmu_gather *tlb,
> - struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> - unsigned long addr, struct zap_details *details,
> - int *rss, bool *force_flush, bool *force_break)
> +static inline void zap_present_folio_pte(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, struct folio *folio,
> + struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
> + struct zap_details *details, int *rss, bool *force_flush,
> + bool *force_break)
>  {
>   struct mm_struct *mm = tlb->mm;
>   bool delay_rmap = false;
> - struct folio *folio;
> - struct page *page;
> -
> - page = vm_normal_page(vma, addr, ptent);
> - if (!page) {
> - /* We don't need up-to-date accessed/dirty bits. */
> - ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> - arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> - ksm_might_unmap_zero_page(mm, ptent);
> - return;
> - }
> -
> - folio = page_folio(page);
> - if (unlikely(!should_zap_folio(details, folio)))
> - return;
>  
>   if (!folio_test_anon(folio)) {
>   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> @@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather 
> *tlb,
>   }
>  }
>  
> +static inline void zap_present_pte(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> + unsigned long addr, struct zap_details *details,
> + int *rss, bool *force_flush, bool *force_break)
> +{
> + struct mm_struct *mm = tlb->mm;
> + struct folio *folio;
> + struct page *page;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page) {
> + /* We don't need up-to-date accessed/dirty bits. */
> + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + arch_check_zapped_pte(vma, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> + ksm_might_unmap_zero_page(mm, ptent);
> + return;
> + }
> +
> + folio = page_folio(page);
> + if (unlikely(!should_zap_folio(details, folio)))
> + return;
> + zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
> +   rss, force_flush, force_break);
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   struct vm_area_struct *vma, pmd_t *pmd,
>   unsigned long addr, unsigned long end,



Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:45, Ryan Roberts wrote:

On 30/01/2024 08:37, David Hildenbrand wrote:

On 30.01.24 09:31, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Signed-off-by: David Hildenbrand 
---
   mm/memory.c | 16 +++-
   1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69502cdc0a7d..20bc13ab8db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather
*tlb,
   folio = page_folio(page);
   if (unlikely(!should_zap_folio(details, folio)))
   return;
-    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-    arch_check_zapped_pte(vma, ptent);
-    tlb_remove_tlb_entry(tlb, pte, addr);
-    zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
     if (!folio_test_anon(folio)) {
+    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
   if (pte_dirty(ptent)) {
   folio_mark_dirty(folio);
   if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather
*tlb,
   }
   if (pte_young(ptent) && likely(vma_has_recency(vma)))
   folio_mark_accessed(folio);
+    rss[mm_counter(folio)]--;
+    } else {
+    /* We don't need up-to-date accessed/dirty bits. */
+    ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+    rss[MM_ANONPAGES]--;
   }
-    rss[mm_counter(folio)]--;
+    arch_check_zapped_pte(vma, ptent);


Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
that imply you still need get_and_clear for anon? (And in hindsight I think that
logic would apply to the previous patch too?)


x86 uses the encoding !writable && dirty to indicate special shadow stacks. That
is, the hw dirty bit is set by software (to create that combination), not by
hardware.

So you don't have to sync against any hw changes of the hw dirty bit. What you
had in the original PTE you read is sufficient.



Right, got it. In that case:


Thanks a lot for paying that much attention during your reviews! Highly 
appreciated!




Reviewed-by: Ryan Roberts 




--
Cheers,

David / dhildenb



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 30/01/2024 08:41, David Hildenbrand wrote:
> On 30.01.24 09:13, Ryan Roberts wrote:
>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>> Let's prepare for further changes by factoring out processing of present
>>> PTEs.
>>>
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>>   mm/memory.c | 92 ++---
>>>   1 file changed, 52 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index b05fd28dbce1..50a6c79c78fc 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct
>>> *vma,
>>>   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>   }
>>>   +static inline void zap_present_pte(struct mmu_gather *tlb,
>>> +    struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
>>> +    unsigned long addr, struct zap_details *details,
>>> +    int *rss, bool *force_flush, bool *force_break)
>>> +{
>>> +    struct mm_struct *mm = tlb->mm;
>>> +    bool delay_rmap = false;
>>> +    struct folio *folio;
>>
>> You need to init this to NULL otherwise its a random value when calling
>> should_zap_folio() if vm_normal_page() returns NULL.
> 
> Right, and we can stop setting it to NULL in the original function. Patch #2
> changes these checks, which is why it's only a problem in this patch.

Yeah I only noticed that after sending out this reply and moving to the next
patch. Still worth fixing this intermediate state I think.

> 
> Will fix, thanks!
> 



Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 30/01/2024 08:37, David Hildenbrand wrote:
> On 30.01.24 09:31, Ryan Roberts wrote:
>> On 29/01/2024 14:32, David Hildenbrand wrote:
>>> We don't need up-to-date accessed-dirty information for anon folios and can
>>> simply work with the ptent we already have. Also, we know the RSS counter
>>> we want to update.
>>>
>>> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
>>> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
>>>
>>> While at it, only call zap_install_uffd_wp_if_needed() if there is even
>>> any chance that pte_install_uffd_wp_if_needed() would do *something*.
>>> That is, just don't bother if uffd-wp does not apply.
>>>
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>>   mm/memory.c | 16 +++-
>>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 69502cdc0a7d..20bc13ab8db2 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather
>>> *tlb,
>>>   folio = page_folio(page);
>>>   if (unlikely(!should_zap_folio(details, folio)))
>>>   return;
>>> -    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>> -    arch_check_zapped_pte(vma, ptent);
>>> -    tlb_remove_tlb_entry(tlb, pte, addr);
>>> -    zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>>>     if (!folio_test_anon(folio)) {
>>> +    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>   if (pte_dirty(ptent)) {
>>>   folio_mark_dirty(folio);
>>>   if (tlb_delay_rmap(tlb)) {
>>> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather
>>> *tlb,
>>>   }
>>>   if (pte_young(ptent) && likely(vma_has_recency(vma)))
>>>   folio_mark_accessed(folio);
>>> +    rss[mm_counter(folio)]--;
>>> +    } else {
>>> +    /* We don't need up-to-date accessed/dirty bits. */
>>> +    ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>> +    rss[MM_ANONPAGES]--;
>>>   }
>>> -    rss[mm_counter(folio)]--;
>>> +    arch_check_zapped_pte(vma, ptent);
>>
>> Isn't the x86 (only) implementation of this relying on the dirty bit? So 
>> doesn't
>> that imply you still need get_and_clear for anon? (And in hindsight I think 
>> that
>> logic would apply to the previous patch too?)
> 
> x86 uses the encoding !writable && dirty to indicate special shadow stacks. 
> That
> is, the hw dirty bit is set by software (to create that combination), not by
> hardware.
> 
> So you don't have to sync against any hw changes of the hw dirty bit. What you
> had in the original PTE you read is sufficient.
> 

Right, got it. In that case:

Reviewed-by: Ryan Roberts 




Re: [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> We have two bits available in the encoded page pointer to store
> additional information. Currently, we use one bit to request delay of the
> rmap removal until after a TLB flush.
> 
> We want to make use of the remaining bit internally for batching of
> multiple pages of the same folio, specifying that the next encoded page
> pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
> instead of an encoded page, to handle the encoding internally.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  arch/s390/include/asm/tlb.h | 13 ++---
>  include/asm-generic/tlb.h   | 12 ++--
>  mm/mmu_gather.c |  7 ---
>  3 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index d1455a601adc..48df896d5b79 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -25,8 +25,7 @@
>  void __tlb_remove_table(void *_table);
>  static inline void tlb_flush(struct mmu_gather *tlb);
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -   struct encoded_page *page,
> -   int page_size);
> + struct page *page, bool delay_rmap, int page_size);
>  
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct 
> mmu_gather *tlb,
>   * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
>   * has already been freed, so just do free_page_and_swap_cache.
>   *
> - * s390 doesn't delay rmap removal, so there is nothing encoded in
> - * the page pointer.
> + * s390 doesn't delay rmap removal.
>   */
>  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -   struct encoded_page *page,
> -   int page_size)
> + struct page *page, bool delay_rmap, int page_size)
>  {
> - free_page_and_swap_cache(encoded_page_ptr(page));
> + VM_WARN_ON_ONCE(delay_rmap);
> +
> + free_page_and_swap_cache(page);
>   return false;
>  }
>  
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..2eb7b0d4f5d2 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -260,9 +260,8 @@ struct mmu_gather_batch {
>   */
>  #define MAX_GATHER_BATCH_COUNT   (10000UL/MAX_GATHER_BATCH)
>  
> -extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
> -struct encoded_page *page,
> -int page_size);
> +extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> + bool delay_rmap, int page_size);
>  
>  #ifdef CONFIG_SMP
>  /*
> @@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct 
> mmu_gather *tlb)
>  static inline void tlb_remove_page_size(struct mmu_gather *tlb,
>   struct page *page, int page_size)
>  {
> - if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
> + if (__tlb_remove_page_size(tlb, page, false, page_size))
>   tlb_flush_mmu(tlb);
>  }
>  
> -static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct 
> page *page, unsigned int flags)
> +static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
> + struct page *page, bool delay_rmap)
>  {
> - return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
> + return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
>  }
>  
>  /* tlb_remove_page
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 604ddf08affe..ac733d81b112 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
>   tlb->local.next = NULL;
>  }
>  
> -bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page 
> *page, int page_size)
> +bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
> + bool delay_rmap, int page_size)
>  {
>   struct mmu_gather_batch *batch;
>  
> @@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
> struct encoded_page *page, i
>* Add the page and check if we are full. If so
>* force a flush.
>*/
> - batch->encoded_pages[batch->nr++] = page;
> + batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
>   if (batch->nr == batch->max) {
>   if (!tlb_next_batch(tlb))
>   return true;
>   batch = tlb->active;
>   }
> - VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
> + VM_BUG_ON_PAGE(batch->nr > batch->max, page);
>  
>   return false;
>  }
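
The commit message above leans on the usual trick of hiding flag bits in
the low bits of an aligned pointer. A stand-alone sketch of that idea
(plain C; encode_ptr()/decode_ptr() are invented names for illustration,
not the kernel's struct encoded_page helpers):

#include <assert.h>
#include <stdint.h>

/* An aligned pointer has zero low bits, so they can carry flags. */
#define TAG_MASK	0x3UL

static inline void *encode_ptr(void *p, unsigned long flags)
{
	assert(((uintptr_t)p & TAG_MASK) == 0 && flags <= TAG_MASK);
	return (void *)((uintptr_t)p | flags);
}

static inline void *decode_ptr(void *enc)
{
	return (void *)((uintptr_t)enc & ~TAG_MASK);
}

static inline unsigned long decode_flags(void *enc)
{
	return (uintptr_t)enc & TAG_MASK;
}

int main(void)
{
	static int object;			/* stand-in for a struct page */
	void *enc = encode_ptr(&object, 0x1);	/* bit 0: "delay rmap" */

	assert(decode_ptr(enc) == &object);
	assert(decode_flags(enc) == 0x1);
	return 0;
}

With both low bits spoken for, the remaining bit is reserved for the
batching described in the commit message (the next encoded entry being
"nr_pages"), which is why delay_rmap is now passed as a plain flag and
encoded internally.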



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:13, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
  mm/memory.c | 92 ++---
  1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
  }
  
+static inline void zap_present_pte(struct mmu_gather *tlb,

+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   bool delay_rmap = false;
+   struct folio *folio;


You need to init this to NULL otherwise its a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.


Right, and we can stop setting it to NULL in the original function. 
Patch #2 changes these checks, which is why it's only a problem in this 
patch.


Will fix, thanks!

--
Cheers,

David / dhildenb



[PATCH v2 6/6] net: wan: fsl_qmc_hdlc: Add framer support

2024-01-30 Thread Herve Codina
Add framer support in the fsl_qmc_hdlc driver in order to be able to
signal carrier changes to the network stack based on the framer status.
Also use this framer to provide information related to the E1/T1 line
interface on IF_GET_IFACE and configure the line interface according to
IF_IFACE_{E1,T1} information.

Signed-off-by: Herve Codina 
Reviewed-by: Christophe Leroy 
---
 drivers/net/wan/fsl_qmc_hdlc.c | 239 -
 1 file changed, 235 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wan/fsl_qmc_hdlc.c b/drivers/net/wan/fsl_qmc_hdlc.c
index 8316f2984864..388d909ad0c8 100644
--- a/drivers/net/wan/fsl_qmc_hdlc.c
+++ b/drivers/net/wan/fsl_qmc_hdlc.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -28,6 +29,9 @@ struct qmc_hdlc {
struct device *dev;
struct qmc_chan *qmc_chan;
struct net_device *netdev;
+   struct framer *framer;
+   spinlock_t carrier_lock; /* Protect carrier detection */
+   struct notifier_block nb;
bool is_crc32;
spinlock_t tx_lock; /* Protect tx descriptors */
struct qmc_hdlc_desc tx_descs[8];
@@ -41,6 +45,195 @@ static inline struct qmc_hdlc *netdev_to_qmc_hdlc(struct 
net_device *netdev)
return dev_to_hdlc(netdev)->priv;
 }
 
+static int qmc_hdlc_framer_set_carrier(struct qmc_hdlc *qmc_hdlc)
+{
+   struct framer_status framer_status;
+   unsigned long flags;
+   int ret;
+
+   if (!qmc_hdlc->framer)
+   return 0;
+
+   spin_lock_irqsave(&qmc_hdlc->carrier_lock, flags);
+
+   ret = framer_get_status(qmc_hdlc->framer, &framer_status);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "get framer status failed (%d)\n", ret);
+   goto end;
+   }
+   if (framer_status.link_is_on)
+   netif_carrier_on(qmc_hdlc->netdev);
+   else
+   netif_carrier_off(qmc_hdlc->netdev);
+
+end:
+   spin_unlock_irqrestore(&qmc_hdlc->carrier_lock, flags);
+   return ret;
+}
+
+static int qmc_hdlc_framer_notifier(struct notifier_block *nb, unsigned long 
action,
+   void *data)
+{
+   struct qmc_hdlc *qmc_hdlc = container_of(nb, struct qmc_hdlc, nb);
+   int ret;
+
+   if (action != FRAMER_EVENT_STATUS)
+   return NOTIFY_DONE;
+
+   ret = qmc_hdlc_framer_set_carrier(qmc_hdlc);
+   return ret ? NOTIFY_DONE : NOTIFY_OK;
+}
+
+static int qmc_hdlc_framer_start(struct qmc_hdlc *qmc_hdlc)
+{
+   struct framer_status framer_status;
+   int ret;
+
+   if (!qmc_hdlc->framer)
+   return 0;
+
+   ret = framer_power_on(qmc_hdlc->framer);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "framer power-on failed (%d)\n", ret);
+   return ret;
+   }
+
+   /* Be sure that get_status is supported */
+   ret = framer_get_status(qmc_hdlc->framer, &framer_status);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "get framer status failed (%d)\n", ret);
+   goto framer_power_off;
+   }
+
+   qmc_hdlc->nb.notifier_call = qmc_hdlc_framer_notifier;
+   ret = framer_notifier_register(qmc_hdlc->framer, &qmc_hdlc->nb);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "framer notifier register failed 
(%d)\n", ret);
+   goto framer_power_off;
+   }
+
+   return 0;
+
+framer_power_off:
+   framer_power_off(qmc_hdlc->framer);
+   return ret;
+}
+
+static void qmc_hdlc_framer_stop(struct qmc_hdlc *qmc_hdlc)
+{
+   if (!qmc_hdlc->framer)
+   return;
+
+   framer_notifier_unregister(qmc_hdlc->framer, &qmc_hdlc->nb);
+   framer_power_off(qmc_hdlc->framer);
+}
+
+static int qmc_hdlc_framer_set_iface(struct qmc_hdlc *qmc_hdlc, int if_iface,
+const te1_settings *te1)
+{
+   struct framer_config config;
+   int ret;
+
+   if (!qmc_hdlc->framer)
+   return 0;
+
+   ret = framer_get_config(qmc_hdlc->framer, &config);
+   if (ret)
+   return ret;
+
+   switch (if_iface) {
+   case IF_IFACE_E1:
+   config.iface = FRAMER_IFACE_E1;
+   break;
+   case IF_IFACE_T1:
+   config.iface = FRAMER_IFACE_T1;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   switch (te1->clock_type) {
+   case CLOCK_DEFAULT:
+   /* Keep current value */
+   break;
+   case CLOCK_EXT:
+   config.clock_type = FRAMER_CLOCK_EXT;
+   break;
+   case CLOCK_INT:
+   config.clock_type = FRAMER_CLOCK_INT;
+   break;
+   default:
+   return -EINVAL;
+   }
+   config.line_clock_rate = te1->clock_rate;
+
+   return framer_set_config(qmc_hdlc->framer, &config);
+}
+
+static int qmc_hdlc_framer_get_iface(struct qmc_hdlc *qmc_hdlc, int *if_iface, 
te1_settings *te1)
+{
+   struct framer_config config;
+ 

[PATCH v2 0/6] Add support for QMC HDLC

2024-01-30 Thread Herve Codina
Hi,

This series introduces the QMC HDLC support.

Patches were previously sent as part of a full feature series and were
reviewed in that context:
"Add support for QMC HDLC, framer infrastructure and PEF2256 framer" [1]

In order to ease the merge, the full feature series has been split and
needed parts were merged in v6.8-rc1:
 - "Prepare the PowerQUICC QMC and TSA for the HDLC QMC driver" [2]
 - "Add support for framer infrastructure and PEF2256 framer" [3]

This series contains patches related to the QMC HDLC part (QMC HDLC
driver):
 - Introduce the QMC HDLC driver (patches 1 and 2)
 - Add timeslots change support in QMC HDLC (patch 3)
 - Add framer support as a framer consumer in QMC HDLC (patch 4)

Compared to the original full feature series, a modification was done on
patch 3 in order to use a consistent prefix in the commit title.

I kept the patches unsquashed as they were previously sent and reviewed.
Of course, I can squash them if needed.

Compared to the previous iteration:
  https://lore.kernel.org/lkml/20240123164912.249540-1-herve.cod...@bootlin.com/
this v2 series:
- Update the qmc_hdlc initialisation in qmc_hcld_recv_complete()
- Use WARN_ONCE()
- Add a new bitmap function and use bitmap_*() in the QMC HDLC driver.

Best regards,
Hervé

[1]: 
https://lore.kernel.org/linux-kernel/20231115144007.478111-1-herve.cod...@bootlin.com/
[2]: 
https://lore.kernel.org/linux-kernel/20231205152116.122512-1-herve.cod...@bootlin.com/
[3]: 
https://lore.kernel.org/linux-kernel/20231128132534.258459-1-herve.cod...@bootlin.com/

Changes v1 -> v2
  - Patch 1
Use the same qmc_hdlc initialisation in qmc_hcld_recv_complete()
as the one present in qmc_hcld_xmit_complete().
Use WARN_ONCE()

  - Patch 3 (new patch in v2)
Make bitmap_onto() available to users

  - Patch 4 (new patch in v2)
Introduce bitmap_off()

  - Patch 5 (patch 3 in v1)
Use bitmap_*() functions

  - Patch 6 (patch 4 in v1)
No changes

Changes compare to the full feature series:
  - Patch 3
Use 'net: wan: fsl_qmc_hdlc:' as commit title prefix

Patches extracted:
  - Patch 1 : full feature series patch 7
  - Patch 2 : full feature series patch 8
  - Patch 3 : full feature series patch 20
  - Patch 4 : full feature series patch 27

Herve Codina (6):
  net: wan: Add support for QMC HDLC
  MAINTAINERS: Add the Freescale QMC HDLC driver entry
  bitmap: Make bitmap_onto() available to users
  bitmap: Introduce bitmap_off()
  net: wan: fsl_qmc_hdlc: Add runtime timeslots changes support
  net: wan: fsl_qmc_hdlc: Add framer support

 MAINTAINERS|   7 +
 drivers/net/wan/Kconfig|  12 +
 drivers/net/wan/Makefile   |   1 +
 drivers/net/wan/fsl_qmc_hdlc.c | 806 +
 include/linux/bitmap.h |   3 +
 lib/bitmap.c   |  45 +-
 6 files changed, 873 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/wan/fsl_qmc_hdlc.c

-- 
2.43.0



[PATCH v2 5/6] net: wan: fsl_qmc_hdlc: Add runtime timeslots changes support

2024-01-30 Thread Herve Codina
QMC channels support runtime timeslot changes, but nothing is done in
the QMC HDLC driver to handle these changes.

Use the existing IFACE ioctl to configure the timeslots to use.

Signed-off-by: Herve Codina 
Reviewed-by: Christophe Leroy 
Acked-by: Jakub Kicinski 
---
 drivers/net/wan/fsl_qmc_hdlc.c | 155 -
 1 file changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wan/fsl_qmc_hdlc.c b/drivers/net/wan/fsl_qmc_hdlc.c
index e7b2b72a6050..8316f2984864 100644
--- a/drivers/net/wan/fsl_qmc_hdlc.c
+++ b/drivers/net/wan/fsl_qmc_hdlc.c
@@ -7,6 +7,7 @@
  * Author: Herve Codina 
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,7 @@ struct qmc_hdlc {
struct qmc_hdlc_desc tx_descs[8];
unsigned int tx_out;
struct qmc_hdlc_desc rx_descs[4];
+   u32 slot_map;
 };
 
 static inline struct qmc_hdlc *netdev_to_qmc_hdlc(struct net_device *netdev)
@@ -202,6 +204,147 @@ static netdev_tx_t qmc_hdlc_xmit(struct sk_buff *skb, 
struct net_device *netdev)
return NETDEV_TX_OK;
 }
 
+static int qmc_hdlc_xlate_slot_map(struct qmc_hdlc *qmc_hdlc,
+  u32 slot_map, struct qmc_chan_ts_info 
*ts_info)
+{
+   DECLARE_BITMAP(ts_mask_avail, 64);
+   DECLARE_BITMAP(ts_mask, 64);
+   DECLARE_BITMAP(map, 64);
+   u32 array32[2];
+
+   /* Tx and Rx available masks must be identical */
+   if (ts_info->rx_ts_mask_avail != ts_info->tx_ts_mask_avail) {
+   dev_err(qmc_hdlc->dev, "tx and rx available timeslots mismatch 
(0x%llx, 0x%llx)\n",
+   ts_info->rx_ts_mask_avail, ts_info->tx_ts_mask_avail);
+   return -EINVAL;
+   }
+
+   bitmap_from_arr64(ts_mask_avail, &ts_info->rx_ts_mask_avail, 64);
+   array32[0] = slot_map;
+   array32[1] = 0;
+   bitmap_from_arr32(map, array32, 64);
+   bitmap_onto(ts_mask, map, ts_mask_avail, 64);
+
+   if (bitmap_weight(ts_mask, 64) != bitmap_weight(map, 64)) {
+   dev_err(qmc_hdlc->dev, "Cannot translate timeslots %*pb -> 
(%*pb, %*pb)\n",
+   64, map, 64, ts_mask_avail, 64, ts_mask);
+   return -EINVAL;
+   }
+
+   bitmap_to_arr64(&ts_info->tx_ts_mask, ts_mask, 64);
+   ts_info->rx_ts_mask = ts_info->tx_ts_mask;
+   return 0;
+}
+
+static int qmc_hdlc_xlate_ts_info(struct qmc_hdlc *qmc_hdlc,
+ const struct qmc_chan_ts_info *ts_info, u32 
*slot_map)
+{
+   DECLARE_BITMAP(ts_mask_avail, 64);
+   DECLARE_BITMAP(ts_mask, 64);
+   DECLARE_BITMAP(map, 64);
+   u32 array32[2];
+
+   /* Tx and Rx masks and available masks must be identical */
+   if (ts_info->rx_ts_mask_avail != ts_info->tx_ts_mask_avail) {
+   dev_err(qmc_hdlc->dev, "tx and rx available timeslots mismatch 
(0x%llx, 0x%llx)\n",
+   ts_info->rx_ts_mask_avail, ts_info->tx_ts_mask_avail);
+   return -EINVAL;
+   }
+   if (ts_info->rx_ts_mask != ts_info->tx_ts_mask) {
+   dev_err(qmc_hdlc->dev, "tx and rx timeslots mismatch (0x%llx, 
0x%llx)\n",
+   ts_info->rx_ts_mask, ts_info->tx_ts_mask);
+   return -EINVAL;
+   }
+
+   bitmap_from_arr64(ts_mask_avail, &ts_info->rx_ts_mask_avail, 64);
+   bitmap_from_arr64(ts_mask, &ts_info->rx_ts_mask, 64);
+   bitmap_off(map, ts_mask, ts_mask_avail, 64);
+
+   if (bitmap_weight(ts_mask, 64) != bitmap_weight(map, 64)) {
+   dev_err(qmc_hdlc->dev, "Cannot translate timeslots (%*pb, %*pb) 
-> %*pb\n",
+   64, ts_mask_avail, 64, ts_mask, 64, map);
+   return -EINVAL;
+   }
+
+   bitmap_to_arr32(array32, map, 64);
+   if (array32[1]) {
+   dev_err(qmc_hdlc->dev, "Slot map out of 32bit (%*pb, %*pb) -> 
%*pb\n",
+   64, ts_mask_avail, 64, ts_mask, 64, map);
+   return -EINVAL;
+   }
+
+   *slot_map = array32[0];
+   return 0;
+}
+
+static int qmc_hdlc_set_iface(struct qmc_hdlc *qmc_hdlc, int if_iface, const 
te1_settings *te1)
+{
+   struct qmc_chan_ts_info ts_info;
+   int ret;
+
+   ret = qmc_chan_get_ts_info(qmc_hdlc->qmc_chan, &ts_info);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "get QMC channel ts info failed %d\n", 
ret);
+   return ret;
+   }
+   ret = qmc_hdlc_xlate_slot_map(qmc_hdlc, te1->slot_map, &ts_info);
+   if (ret)
+   return ret;
+
+   ret = qmc_chan_set_ts_info(qmc_hdlc->qmc_chan, &ts_info);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "set QMC channel ts info failed %d\n", 
ret);
+   return ret;
+   }
+
+   qmc_hdlc->slot_map = te1->slot_map;
+
+   return 0;
+}
+
+static int qmc_hdlc_ioctl(struct net_device *netdev, struct if_settings *ifs)
+{
+   struct qmc_hdlc *qmc_hdlc = netdev_to_qmc_hdlc(netdev);
+   te1_settings te1;
+
+   switch (ifs->type) 
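
To make the timeslot translation above concrete, a worked example (the
numbers are invented for illustration; an E1 link with TS0 reserved is
assumed): if the QMC channel reports rx_ts_mask_avail == tx_ts_mask_avail
== 0xfffe (timeslots 1..15 usable), then a slot_map of 0x000b (bits 0, 1
and 3) selects the 0th, 1st and 3rd available timeslots, and bitmap_onto()
yields ts_mask == 0x0016 (timeslots 1, 2 and 4). The bitmap_weight() check
rejects a slot_map whose bits have no counterpart in the available mask,
and qmc_hdlc_xlate_ts_info() performs the reverse mapping with
bitmap_off(), turning ts_mask == 0x0016 back into slot_map == 0x000b.
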

[PATCH v2 2/6] MAINTAINERS: Add the Freescale QMC HDLC driver entry

2024-01-30 Thread Herve Codina
After contributing the driver, add myself as the maintainer for the
Freescale QMC HDLC driver.

Signed-off-by: Herve Codina 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8d1052fa6a69..15cd3a8e5866 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8584,6 +8584,13 @@ F:   
Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
 F: drivers/soc/fsl/qe/qmc.c
 F: include/soc/fsl/qe/qmc.h
 
+FREESCALE QUICC ENGINE QMC HDLC DRIVER
+M: Herve Codina 
+L: net...@vger.kernel.org
+L: linuxppc-dev@lists.ozlabs.org
+S: Maintained
+F: drivers/net/wan/fsl_qmc_hdlc.c
+
 FREESCALE QUICC ENGINE TSA DRIVER
 M: Herve Codina 
 L: linuxppc-dev@lists.ozlabs.org
-- 
2.43.0



[PATCH v2 3/6] bitmap: Make bitmap_onto() available to users

2024-01-30 Thread Herve Codina
Currently bitmap_onto() is only available in the CONFIG_NUMA=y case,
while some users may benefit from it independently of the NUMA code.

Make it available to users by moving it out of the ifdeffery and
exporting it for modules.

Signed-off-by: Herve Codina 
---
 lib/bitmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index 09522af227f1..2feccb5047dc 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -547,7 +547,6 @@ int bitmap_bitremap(int oldbit, const unsigned long *old,
 }
 EXPORT_SYMBOL(bitmap_bitremap);
 
-#ifdef CONFIG_NUMA
 /**
  * bitmap_onto - translate one bitmap relative to another
  * @dst: resulting translated bitmap
@@ -681,7 +680,9 @@ void bitmap_onto(unsigned long *dst, const unsigned long 
*orig,
m++;
}
 }
+EXPORT_SYMBOL(bitmap_onto);
 
+#ifdef CONFIG_NUMA
 /**
  * bitmap_fold - fold larger bitmap into smaller, modulo specified size
  * @dst: resulting smaller bitmap
-- 
2.43.0
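
For readers new to bitmap_onto(), a minimal kernel-context sketch of the
translation it performs (the bit positions are invented for illustration;
this is not code from the series):

#include <linux/bitmap.h>
#include <linux/bug.h>

static void bitmap_onto_example(void)
{
	DECLARE_BITMAP(orig, 64);
	DECLARE_BITMAP(relmap, 64);
	DECLARE_BITMAP(onto, 64);

	/* orig selects ordinals 0 and 2 */
	bitmap_zero(orig, 64);
	__set_bit(0, orig);
	__set_bit(2, orig);

	/* relmap offers positions 8..11, i.e. ordinals 0..3 */
	bitmap_zero(relmap, 64);
	bitmap_set(relmap, 8, 4);

	/*
	 * The n-th set bit of orig maps to the n-th set bit of relmap:
	 * ordinal 0 -> position 8, ordinal 2 -> position 10.
	 */
	bitmap_onto(onto, orig, relmap, 64);
	WARN_ON(!test_bit(8, onto) || !test_bit(10, onto));
}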



[PATCH v2 4/6] bitmap: Introduce bitmap_off()

2024-01-30 Thread Herve Codina
The bitmap_onto() function translates one bitmap relative to another but
no function is present to perform the reverse translation.

Introduce bitmap_off() to fill this hole.

Signed-off-by: Herve Codina 
---
 include/linux/bitmap.h |  3 +++
 lib/bitmap.c   | 42 ++
 2 files changed, 45 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 99451431e4d6..5ecfcbbc91f4 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -65,6 +65,7 @@ struct device;
  *  bitmap_remap(dst, src, old, new, nbits) *dst = map(old, new)(src)
  *  bitmap_bitremap(oldbit, old, new, nbits)newbit = map(old, new)(oldbit)
  *  bitmap_onto(dst, orig, relmap, nbits)   *dst = orig relative to relmap
+ *  bitmap_off(dst, orig, relmap, nbits)   *dst = bitmap_onto() reverse operation
  *  bitmap_fold(dst, orig, sz, nbits)   dst bits = orig bits mod sz
  *  bitmap_parse(buf, buflen, dst, nbits)   Parse bitmap dst from kernel 
buf
  *  bitmap_parse_user(ubuf, ulen, dst, nbits)   Parse bitmap dst from user buf
@@ -208,6 +209,8 @@ int bitmap_bitremap(int oldbit,
const unsigned long *old, const unsigned long *new, int bits);
 void bitmap_onto(unsigned long *dst, const unsigned long *orig,
const unsigned long *relmap, unsigned int bits);
+void bitmap_off(unsigned long *dst, const unsigned long *orig,
+   const unsigned long *relmap, unsigned int bits);
 void bitmap_fold(unsigned long *dst, const unsigned long *orig,
unsigned int sz, unsigned int nbits);
 
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 2feccb5047dc..71343967335e 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -682,6 +682,48 @@ void bitmap_onto(unsigned long *dst, const unsigned long 
*orig,
 }
 EXPORT_SYMBOL(bitmap_onto);
 
+/**
+ * bitmap_off - revert operation done by bitmap_onto()
+ * @dst: resulting translated bitmap
+ * @orig: original untranslated bitmap
+ * @relmap: bitmap relative to which translated
+ * @bits: number of bits in each of these bitmaps
+ *
+ * Suppose onto was computed using bitmap_onto(onto, src, relmap, n).
+ * The operation bitmap_off(result, onto, relmap, n) leads to a
+ * result equal or equivalent to src.
+ *
+ * The result can be 'equivalent' because bitmap_onto() and
+ * bitmap_off() are not bijective.
+ * The result and src values are equivalent in the sense that a
+ * call to bitmap_onto(onto, src, relmap, n) and a call to
+ * bitmap_onto(onto, result, relmap, n) will lead to the same onto
+ * value.
+ *
+ * If either of @orig or @relmap is empty (no set bits), then @dst
+ * will be returned empty.
+ *
+ * All bits in @dst not set by the above rule are cleared.
+ */
+void bitmap_off(unsigned long *dst, const unsigned long *orig,
+   const unsigned long *relmap, unsigned int bits)
+{
+   unsigned int n, m;  /* same meaning as in above comment */
+
+   if (dst == orig)/* following doesn't handle inplace mappings */
+   return;
+   bitmap_zero(dst, bits);
+
+   m = 0;
+   for_each_set_bit(n, relmap, bits) {
+   /* m == bitmap_pos_to_ord(relmap, n, bits) */
+   if (test_bit(n, orig))
+   set_bit(m, dst);
+   m++;
+   }
+}
+EXPORT_SYMBOL(bitmap_off);
+
 #ifdef CONFIG_NUMA
 /**
  * bitmap_fold - fold larger bitmap into smaller, modulo specified size
-- 
2.43.0
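
A matching sketch of the round trip the new helper provides, under the
same assumptions as the bitmap_onto() illustration above (invented bit
values, kernel context, not code from the series):

#include <linux/bitmap.h>
#include <linux/bug.h>

static void bitmap_off_roundtrip(void)
{
	DECLARE_BITMAP(src, 64);
	DECLARE_BITMAP(relmap, 64);
	DECLARE_BITMAP(onto, 64);
	DECLARE_BITMAP(back, 64);

	bitmap_zero(src, 64);
	__set_bit(0, src);		/* ordinal 0 */
	__set_bit(3, src);		/* ordinal 3 */

	bitmap_zero(relmap, 64);
	bitmap_set(relmap, 1, 15);	/* positions 1..15 available */

	bitmap_onto(onto, src, relmap, 64);	/* -> positions 1 and 4 */
	bitmap_off(back, onto, relmap, 64);	/* -> ordinals 0 and 3 again */

	WARN_ON(!bitmap_equal(back, src, 64));
}

This is the property the QMC HDLC driver relies on in patch 5/6: slot_map
bits are ordinals into the available-timeslot mask, bitmap_onto() turns
them into actual timeslot numbers, and bitmap_off() turns them back.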



[PATCH v2 1/6] net: wan: Add support for QMC HDLC

2024-01-30 Thread Herve Codina
The QMC HDLC driver provides support for HDLC using the QMC (QUICC
Multichannel Controller) to transfer the HDLC data.

Signed-off-by: Herve Codina 
Reviewed-by: Christophe Leroy 
Acked-by: Jakub Kicinski 
---
 drivers/net/wan/Kconfig|  12 +
 drivers/net/wan/Makefile   |   1 +
 drivers/net/wan/fsl_qmc_hdlc.c | 422 +
 3 files changed, 435 insertions(+)
 create mode 100644 drivers/net/wan/fsl_qmc_hdlc.c

diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig
index 7dda87756d3f..31ab2136cdf1 100644
--- a/drivers/net/wan/Kconfig
+++ b/drivers/net/wan/Kconfig
@@ -197,6 +197,18 @@ config FARSYNC
  To compile this driver as a module, choose M here: the
  module will be called farsync.
 
+config FSL_QMC_HDLC
+   tristate "Freescale QMC HDLC support"
+   depends on HDLC
+   depends on CPM_QMC
+   help
+ HDLC support using the Freescale QUICC Multichannel Controller (QMC).
+
+ To compile this driver as a module, choose M here: the
+ module will be called fsl_qmc_hdlc.
+
+ If unsure, say N.
+
 config FSL_UCC_HDLC
tristate "Freescale QUICC Engine HDLC support"
depends on HDLC
diff --git a/drivers/net/wan/Makefile b/drivers/net/wan/Makefile
index 8119b49d1da9..00e9b7ee1e01 100644
--- a/drivers/net/wan/Makefile
+++ b/drivers/net/wan/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_WANXL)   += wanxl.o
 obj-$(CONFIG_PCI200SYN)+= pci200syn.o
 obj-$(CONFIG_PC300TOO) += pc300too.o
 obj-$(CONFIG_IXP4XX_HSS)   += ixp4xx_hss.o
+obj-$(CONFIG_FSL_QMC_HDLC) += fsl_qmc_hdlc.o
 obj-$(CONFIG_FSL_UCC_HDLC) += fsl_ucc_hdlc.o
 obj-$(CONFIG_SLIC_DS26522) += slic_ds26522.o
 
diff --git a/drivers/net/wan/fsl_qmc_hdlc.c b/drivers/net/wan/fsl_qmc_hdlc.c
new file mode 100644
index ..e7b2b72a6050
--- /dev/null
+++ b/drivers/net/wan/fsl_qmc_hdlc.c
@@ -0,0 +1,422 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Freescale QMC HDLC Device Driver
+ *
+ * Copyright 2023 CS GROUP France
+ *
+ * Author: Herve Codina 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct qmc_hdlc_desc {
+   struct net_device *netdev;
+   struct sk_buff *skb; /* NULL if the descriptor is not in use */
+   dma_addr_t dma_addr;
+   size_t dma_size;
+};
+
+struct qmc_hdlc {
+   struct device *dev;
+   struct qmc_chan *qmc_chan;
+   struct net_device *netdev;
+   bool is_crc32;
+   spinlock_t tx_lock; /* Protect tx descriptors */
+   struct qmc_hdlc_desc tx_descs[8];
+   unsigned int tx_out;
+   struct qmc_hdlc_desc rx_descs[4];
+};
+
+static inline struct qmc_hdlc *netdev_to_qmc_hdlc(struct net_device *netdev)
+{
+   return dev_to_hdlc(netdev)->priv;
+}
+
+static int qmc_hdlc_recv_queue(struct qmc_hdlc *qmc_hdlc, struct qmc_hdlc_desc 
*desc, size_t size);
+
+#define QMC_HDLC_RX_ERROR_FLAGS (QMC_RX_FLAG_HDLC_OVF | \
+QMC_RX_FLAG_HDLC_UNA | \
+QMC_RX_FLAG_HDLC_ABORT | \
+QMC_RX_FLAG_HDLC_CRC)
+
+static void qmc_hcld_recv_complete(void *context, size_t length, unsigned int 
flags)
+{
+   struct qmc_hdlc_desc *desc = context;
+   struct net_device *netdev = desc->netdev;
+   struct qmc_hdlc *qmc_hdlc = netdev_to_qmc_hdlc(netdev);
+   int ret;
+
+   dma_unmap_single(qmc_hdlc->dev, desc->dma_addr, desc->dma_size, 
DMA_FROM_DEVICE);
+
+   if (flags & QMC_HDLC_RX_ERROR_FLAGS) {
+   netdev->stats.rx_errors++;
+   if (flags & QMC_RX_FLAG_HDLC_OVF) /* Data overflow */
+   netdev->stats.rx_over_errors++;
+   if (flags & QMC_RX_FLAG_HDLC_UNA) /* bits received not multiple 
of 8 */
+   netdev->stats.rx_frame_errors++;
+   if (flags & QMC_RX_FLAG_HDLC_ABORT) /* Received an abort 
sequence */
+   netdev->stats.rx_frame_errors++;
+   if (flags & QMC_RX_FLAG_HDLC_CRC) /* CRC error */
+   netdev->stats.rx_crc_errors++;
+   kfree_skb(desc->skb);
+   } else {
+   netdev->stats.rx_packets++;
+   netdev->stats.rx_bytes += length;
+
+   skb_put(desc->skb, length);
+   desc->skb->protocol = hdlc_type_trans(desc->skb, netdev);
+   netif_rx(desc->skb);
+   }
+
+   /* Re-queue a transfer using the same descriptor */
+   ret = qmc_hdlc_recv_queue(qmc_hdlc, desc, desc->dma_size);
+   if (ret) {
+   dev_err(qmc_hdlc->dev, "queue recv desc failed (%d)\n", ret);
+   netdev->stats.rx_errors++;
+   }
+}
+
+static int qmc_hdlc_recv_queue(struct qmc_hdlc *qmc_hdlc, struct qmc_hdlc_desc 
*desc, size_t size)
+{
+   int ret;
+
+   desc->skb = dev_alloc_skb(size);
+   if (!desc->skb)
+   return -ENOMEM;
+
+   desc->dma_size 

Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:31, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Signed-off-by: David Hildenbrand 
---
  mm/memory.c | 16 +++-
  1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69502cdc0a7d..20bc13ab8db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
-   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
  
  	if (!folio_test_anon(folio)) {

+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+   rss[mm_counter(folio)]--;
+   } else {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   rss[MM_ANONPAGES]--;
}
-   rss[mm_counter(folio)]--;
+   arch_check_zapped_pte(vma, ptent);


Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
that imply you still need get_and_clear for anon? (And in hindsight I think that
logic would apply to the previous patch too?)


x86 uses the encoding !writable && dirty to indicate special shadow 
stacks. That is, the hw dirty bit is set by software (to create that 
combination), not by hardware.


So you don't have to sync against any hw changes of the hw dirty bit. 
What you had in the original PTE you read is sufficient.


--
Cheers,

David / dhildenb



Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> We don't need up-to-date accessed-dirty information for anon folios and can
> simply work with the ptent we already have. Also, we know the RSS counter
> we want to update.
> 
> We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
> zap_install_uffd_wp_if_needed() after updating the folio and RSS.
> 
> While at it, only call zap_install_uffd_wp_if_needed() if there is even
> any chance that pte_install_uffd_wp_if_needed() would do *something*.
> That is, just don't bother if uffd-wp does not apply.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 69502cdc0a7d..20bc13ab8db2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather 
> *tlb,
>   folio = page_folio(page);
>   if (unlikely(!should_zap_folio(details, folio)))
>   return;
> - ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> - arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
>  
>   if (!folio_test_anon(folio)) {
> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>   if (pte_dirty(ptent)) {
>   folio_mark_dirty(folio);
>   if (tlb_delay_rmap(tlb)) {
> @@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather 
> *tlb,
>   }
>   if (pte_young(ptent) && likely(vma_has_recency(vma)))
>   folio_mark_accessed(folio);
> + rss[mm_counter(folio)]--;
> + } else {
> + /* We don't need up-to-date accessed/dirty bits. */
> + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + rss[MM_ANONPAGES]--;
>   }
> - rss[mm_counter(folio)]--;
> + arch_check_zapped_pte(vma, ptent);

Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
that imply you still need get_and_clear for anon? (And in hindsight I think that
logic would apply to the previous patch too?)

Impl:

void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
{
/*
 * Hardware before shadow stack can (rarely) set Dirty=1
 * on a Write=0 PTE. So the below condition
 * only indicates a software bug when shadow stack is
 * supported by the HW. This checking is covered in
 * pte_shstk().
 */
VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
pte_shstk(pte));
}

static inline bool pte_shstk(pte_t pte)
{
return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
   (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}


> + tlb_remove_tlb_entry(tlb, pte, addr);
> + if (unlikely(userfaultfd_pte_wp(vma, ptent)))
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> +
>   if (!delay_rmap) {
>   folio_remove_rmap_pte(folio, page, vma);
>   if (unlikely(page_mapcount(page) < 0))



Re: [PATCH v1 2/9] mm/memory: handle !page case in zap_present_pte() separately

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> We don't need uptodate accessed/dirty bits, so in theory we could
> replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
> function. Let's rely on the provided pte.
> 
> Further, there is no scenario where we would have to insert uffd-wp
> markers when zapping something that is not a normal page (i.e., zeropage).
> Add a sanity check to make sure this remains true.
> 
> should_zap_folio() no longer has to handle NULL pointers. This change
> replaces 2/3 "!page/!folio" checks by a single "!page" one.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Ryan Roberts 

> ---
>  mm/memory.c | 20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 50a6c79c78fc..69502cdc0a7d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details 
> *details,
>   if (should_zap_cows(details))
>   return true;
>  
> - /* E.g. the caller passes NULL for the case of a zero folio */
> - if (!folio)
> - return true;
> -
>   /* Otherwise we should only zap non-anon folios */
>   return !folio_test_anon(folio);
>  }
> @@ -1543,19 +1539,23 @@ static inline void zap_present_pte(struct mmu_gather 
> *tlb,
>   struct page *page;
>  
>   page = vm_normal_page(vma, addr, ptent);
> - if (page)
> - folio = page_folio(page);
> + if (!page) {
> + /* We don't need up-to-date accessed/dirty bits. */
> + ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + arch_check_zapped_pte(vma, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + VM_WARN_ON_ONCE(userfaultfd_wp(vma));
> + ksm_might_unmap_zero_page(mm, ptent);
> + return;
> + }
>  
> + folio = page_folio(page);
>   if (unlikely(!should_zap_folio(details, folio)))
>   return;
>   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>   arch_check_zapped_pte(vma, ptent);
>   tlb_remove_tlb_entry(tlb, pte, addr);
>   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> - if (unlikely(!page)) {
> - ksm_might_unmap_zero_page(mm, ptent);
> - return;
> - }
>  
>   if (!folio_test_anon(folio)) {
>   if (pte_dirty(ptent)) {



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote:
> Let's prepare for further changes by factoring out processing of present
> PTEs.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory.c | 92 ++---
>  1 file changed, 52 insertions(+), 40 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b05fd28dbce1..50a6c79c78fc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
> *vma,
>   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>  }
>  
> +static inline void zap_present_pte(struct mmu_gather *tlb,
> + struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
> + unsigned long addr, struct zap_details *details,
> + int *rss, bool *force_flush, bool *force_break)
> +{
> + struct mm_struct *mm = tlb->mm;
> + bool delay_rmap = false;
> + struct folio *folio;

You need to init this to NULL otherwise its a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.

> + struct page *page;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (page)
> + folio = page_folio(page);
> +
> + if (unlikely(!should_zap_folio(details, folio)))
> + return;
> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + arch_check_zapped_pte(vma, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> + if (unlikely(!page)) {
> + ksm_might_unmap_zero_page(mm, ptent);
> + return;
> + }
> +
> + if (!folio_test_anon(folio)) {
> + if (pte_dirty(ptent)) {
> + folio_mark_dirty(folio);
> + if (tlb_delay_rmap(tlb)) {
> + delay_rmap = true;
> + *force_flush = true;
> + }
> + }
> + if (pte_young(ptent) && likely(vma_has_recency(vma)))
> + folio_mark_accessed(folio);
> + }
> + rss[mm_counter(folio)]--;
> + if (!delay_rmap) {
> + folio_remove_rmap_pte(folio, page, vma);
> + if (unlikely(page_mapcount(page) < 0))
> + print_bad_pte(vma, addr, ptent, page);
> + }
> + if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> + *force_flush = true;
> + *force_break = true;
> + }
> +}
> +
>  static unsigned long zap_pte_range(struct mmu_gather *tlb,
>   struct vm_area_struct *vma, pmd_t *pmd,
>   unsigned long addr, unsigned long end,
>   struct zap_details *details)
>  {
> + bool force_flush = false, force_break = false;
>   struct mm_struct *mm = tlb->mm;
> - int force_flush = 0;
>   int rss[NR_MM_COUNTERS];
>   spinlock_t *ptl;
>   pte_t *start_pte;
> @@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather 
> *tlb,
>   break;
>  
>   if (pte_present(ptent)) {
> - unsigned int delay_rmap;
> -
> - page = vm_normal_page(vma, addr, ptent);
> - if (page)
> - folio = page_folio(page);
> -
> - if (unlikely(!should_zap_folio(details, folio)))
> - continue;
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - arch_check_zapped_pte(vma, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> -   ptent);
> - if (unlikely(!page)) {
> - ksm_might_unmap_zero_page(mm, ptent);
> - continue;
> - }
> -
> - delay_rmap = 0;
> - if (!folio_test_anon(folio)) {
> - if (pte_dirty(ptent)) {
> - folio_mark_dirty(folio);
> - if (tlb_delay_rmap(tlb)) {
> - delay_rmap = 1;
> - force_flush = 1;
> - }
> - }
> - if (pte_young(ptent) && 
> likely(vma_has_recency(vma)))
> - folio_mark_accessed(folio);
> - }
> - rss[mm_counter(folio)]--;
> - if (!delay_rmap) {
> - folio_remove_rmap_pte(folio, page, vma);
> - if (unlikely(page_mapcount(page) < 0))
> -