Re: [PATCH v2 5/7] riscv: mm: accelerate pagefault when badaccess
On 2024/4/11 1:28, Alexandre Ghiti wrote: On 10/04/2024 10:07, Kefeng Wang wrote: On 2024/4/10 15:32, Alexandre Ghiti wrote: Hi Kefeng, On 03/04/2024 10:38, Kefeng Wang wrote: The access_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 3ba1d4dde5dd..b3fcf7d67efb 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -292,7 +292,10 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + tsk->thread.bad_cause = SEGV_ACCERR; I think we should use the cause variable here instead of SEGV_ACCERR, as bad_cause is a riscv internal status which describes the real fault that happened. Oh, I see, it is exception causes on riscv, so it should be diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index b3fcf7d67efb..5224f3733802 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -293,8 +293,8 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); count_vm_vma_lock_event(VMA_LOCK_SUCCESS); - tsk->thread.bad_cause = SEGV_ACCERR; - bad_area_nosemaphore(regs, code, addr); + tsk->thread.bad_cause = cause; + bad_area_nosemaphore(regs, SEGV_ACCERR, addr); return; } Hi Alex, could you help to check it? Hi Andrew, please help to squash it after Alex ack it. Thanks both. So I have just tested Kefeng's fixup on my usual CI and with a simple program that triggers such bad access, everything went fine so with the fixup applied: Reviewed-by: Alexandre Ghiti Tested-by: Alexandre Ghiti Great, thanks. Thanks, Alex Thanks, Alex + bad_area_nosemaphore(regs, code, addr); + return; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
Re: [PATCH v2 5/7] riscv: mm: accelerate pagefault when badaccess
On 2024/4/10 15:32, Alexandre Ghiti wrote: Hi Kefeng, On 03/04/2024 10:38, Kefeng Wang wrote: The access_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 3ba1d4dde5dd..b3fcf7d67efb 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -292,7 +292,10 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + tsk->thread.bad_cause = SEGV_ACCERR; I think we should use the cause variable here instead of SEGV_ACCERR, as bad_cause is a riscv internal status which describes the real fault that happened. Oh, I see, it is exception causes on riscv, so it should be diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index b3fcf7d67efb..5224f3733802 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -293,8 +293,8 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); count_vm_vma_lock_event(VMA_LOCK_SUCCESS); - tsk->thread.bad_cause = SEGV_ACCERR; - bad_area_nosemaphore(regs, code, addr); + tsk->thread.bad_cause = cause; + bad_area_nosemaphore(regs, SEGV_ACCERR, addr); return; } Hi Alex, could you help to check it? Hi Andrew, please help to squash it after Alex ack it. Thanks both. Thanks, Alex + bad_area_nosemaphore(regs, code, addr); + return; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
Re: [PATCH v2 0/7] arch/mm/fault: accelerate pagefault when badaccess
On 2024/4/4 4:45, Andrew Morton wrote: On Wed, 3 Apr 2024 16:37:58 +0800 Kefeng Wang wrote: After VMA lock-based page fault handling enabled, if bad access met under per-vma lock, it will fallback to mmap_lock-based handling, so it leads to unnessary mmap lock and vma find again. A test from lmbench shows 34% improve after this changes on arm64, lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 Only build test on other archs except arm64. Thanks. So we now want a bunch of architectures to runtime test this. Do we have a selftest in place which will adequately do this? I don't find such selftest, and badaccess would lead to coredump, the performance should not affect most scene, so no selftest is acceptable. lmbench is easy to use to measure the performance.
[PATCH v2 7/7] x86: mm: accelerate pagefault when badaccess
The access_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_area_access_error(), if mm is NULL, release vma lock, or release mmap_lock. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/x86/mm/fault.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a4cc20d0036d..67b18adc75dd 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -866,14 +866,17 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, static void __bad_area(struct pt_regs *regs, unsigned long error_code, - unsigned long address, u32 pkey, int si_code) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma, u32 pkey, int si_code) { - struct mm_struct *mm = current->mm; /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); } @@ -897,7 +900,8 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code, static noinline void bad_area_access_error(struct pt_regs *regs, unsigned long error_code, - unsigned long address, struct vm_area_struct *vma) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma) { /* * This OSPKE check is not strictly necessary at runtime. @@ -927,9 +931,9 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code, */ u32 pkey = vma_pkey(vma); - __bad_area(regs, error_code, address, pkey, SEGV_PKUERR); + __bad_area(regs, error_code, address, mm, vma, pkey, SEGV_PKUERR); } else { - __bad_area(regs, error_code, address, 0, SEGV_ACCERR); + __bad_area(regs, error_code, address, mm, vma, 0, SEGV_ACCERR); } } @@ -1357,8 +1361,9 @@ void do_user_addr_fault(struct pt_regs *regs, goto lock_mmap; if (unlikely(access_error(error_code, vma))) { - vma_end_read(vma); - goto lock_mmap; + bad_area_access_error(regs, error_code, address, NULL, vma); + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return; } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) @@ -1394,7 +1399,7 @@ void do_user_addr_fault(struct pt_regs *regs, * we can handle it.. */ if (unlikely(access_error(error_code, vma))) { - bad_area_access_error(regs, error_code, address, vma); + bad_area_access_error(regs, error_code, address, mm, vma); return; } -- 2.27.0
[PATCH v2 5/7] riscv: mm: accelerate pagefault when badaccess
The access_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 3ba1d4dde5dd..b3fcf7d67efb 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -292,7 +292,10 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + tsk->thread.bad_cause = SEGV_ACCERR; + bad_area_nosemaphore(regs, code, addr); + return; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); -- 2.27.0
[PATCH v2 6/7] s390: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Signed-off-by: Kefeng Wang --- arch/s390/mm/fault.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index c421dd44ffbe..162ca2576fd4 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -325,7 +325,8 @@ static void do_exception(struct pt_regs *regs, int access) goto lock_mmap; if (!(vma->vm_flags & access)) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return handle_fault_error_nolock(regs, SEGV_ACCERR); } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH v2 3/7] arm: mm: accelerate pagefault when VM_FAULT_BADACCESS
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle error, no need to retry with mmap_lock again. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/arm/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 439dc6a26bb9..5c4b417e24f9 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -294,7 +294,9 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = VM_FAULT_BADACCESS; + goto bad_area; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH v2 4/7] powerpc: mm: accelerate pagefault when badaccess
The access_[pkey]_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_access_pkey()/bad_access(), if mm is NULL, release vma lock, or release mmap_lock. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Signed-off-by: Kefeng Wang --- arch/powerpc/mm/fault.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 53335ae21a40..215690452495 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -71,23 +71,26 @@ static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long add return __bad_area_nosemaphore(regs, address, SEGV_MAPERR); } -static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code) +static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code, + struct mm_struct *mm, struct vm_area_struct *vma) { - struct mm_struct *mm = current->mm; /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); return __bad_area_nosemaphore(regs, address, si_code); } static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, + struct mm_struct *mm, struct vm_area_struct *vma) { - struct mm_struct *mm = current->mm; int pkey; /* @@ -109,7 +112,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, */ pkey = vma_pkey(vma); - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); /* * If we are in kernel mode, bail out with a SEGV, this will @@ -124,9 +130,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, return 0; } -static noinline int bad_access(struct pt_regs *regs, unsigned long address) +static noinline int bad_access(struct pt_regs *regs, unsigned long address, + struct mm_struct *mm, struct vm_area_struct *vma) { - return __bad_area(regs, address, SEGV_ACCERR); + return __bad_area(regs, address, SEGV_ACCERR, mm, vma); } static int do_sigbus(struct pt_regs *regs, unsigned long address, @@ -479,13 +486,13 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma))) { - vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return bad_access_pkey(regs, address, NULL, vma); } if (unlikely(access_error(is_write, is_exec, vma))) { - vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return bad_access(regs, address, NULL, vma); } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); @@ -521,10 +528,10 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma))) - return bad_access_pkey(regs, address, vma); + return bad_access_pkey(regs, address, mm, vma); if (unlikely(access_error(is_write, is_exec, vma))) - return bad_access(regs, address); + return bad_access(regs, address, mm, vma); /* * If for any reason at all we couldn't handle the fault, -- 2.27.0
[PATCH v2 0/7] arch/mm/fault: accelerate pagefault when badaccess
After VMA lock-based page fault handling enabled, if bad access met under per-vma lock, it will fallback to mmap_lock-based handling, so it leads to unnessary mmap lock and vma find again. A test from lmbench shows 34% improve after this changes on arm64, lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 Only build test on other archs except arm64. v2: - a better changelog, and describe the counting changes, suggested by Suren Baghdasaryan - add RB Kefeng Wang (7): arm64: mm: cleanup __do_page_fault() arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS arm: mm: accelerate pagefault when VM_FAULT_BADACCESS powerpc: mm: accelerate pagefault when badaccess riscv: mm: accelerate pagefault when badaccess s390: mm: accelerate pagefault when badaccess x86: mm: accelerate pagefault when badaccess arch/arm/mm/fault.c | 4 +++- arch/arm64/mm/fault.c | 31 ++- arch/powerpc/mm/fault.c | 33 - arch/riscv/mm/fault.c | 5 - arch/s390/mm/fault.c| 3 ++- arch/x86/mm/fault.c | 23 ++- 6 files changed, 53 insertions(+), 46 deletions(-) -- 2.27.0
[PATCH v2 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle error, no need to retry with mmap_lock again, the latency time reduces 34% in 'lat_sig -P 1 prot lat_sig' from lmbench testcase. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 9bb9f395351a..405f9aa831bd 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); - goto lock_mmap; + fault = VM_FAULT_BADACCESS; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; } fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH v2 1/7] arm64: mm: cleanup __do_page_fault()
The __do_page_fault() only calls handle_mm_fault() after vm_flags checked, and it is only called by do_page_fault(), let's squash it into do_page_fault() to cleanup code. Reviewed-by: Suren Baghdasaryan Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 27 +++ 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 8251e2fea9c7..9bb9f395351a 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -486,25 +486,6 @@ static void do_bad_area(unsigned long far, unsigned long esr, } } -#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) -#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) - -static vm_fault_t __do_page_fault(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long addr, - unsigned int mm_flags, unsigned long vm_flags, - struct pt_regs *regs) -{ - /* -* Ok, we have a good vm_area for this memory access, so we can handle -* it. -* Check that the permissions on the VMA allow for the fault which -* occurred. -*/ - if (!(vma->vm_flags & vm_flags)) - return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr, mm_flags, regs); -} - static bool is_el0_instruction_abort(unsigned long esr) { return ESR_ELx_EC(esr) == ESR_ELx_EC_IABT_LOW; @@ -519,6 +500,9 @@ static bool is_write_abort(unsigned long esr) return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM); } +#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) +#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) + static int __kprobes do_page_fault(unsigned long far, unsigned long esr, struct pt_regs *regs) { @@ -617,7 +601,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto done; } - fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs); + if (!(vma->vm_flags & vm_flags)) + fault = VM_FAULT_BADACCESS; + else + fault = handle_mm_fault(vma, addr, mm_flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { -- 2.27.0
Re: [PATCH 7/7] x86: mm: accelerate pagefault when badaccess
On 2024/4/3 13:59, Suren Baghdasaryan wrote: On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error and return, there is no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang Looks safe to me. Using (mm != NULL) to indicate that we are holding mmap_lock is not ideal but I guess that works. Yes, I will add this part it into change too, The access_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_area_access_error(), if mm is NULL, release vma lock, or release mmap_lock. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Thanks. Reviewed-by: Suren Baghdasaryan --- arch/x86/mm/fault.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a4cc20d0036d..67b18adc75dd 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -866,14 +866,17 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, static void __bad_area(struct pt_regs *regs, unsigned long error_code, - unsigned long address, u32 pkey, int si_code) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma, u32 pkey, int si_code) { - struct mm_struct *mm = current->mm; /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); } @@ -897,7 +900,8 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code, static noinline void bad_area_access_error(struct pt_regs *regs, unsigned long error_code, - unsigned long address, struct vm_area_struct *vma) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma) { /* * This OSPKE check is not strictly necessary at runtime. @@ -927,9 +931,9 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code, */ u32 pkey = vma_pkey(vma); - __bad_area(regs, error_code, address, pkey, SEGV_PKUERR); + __bad_area(regs, error_code, address, mm, vma, pkey, SEGV_PKUERR); } else { - __bad_area(regs, error_code, address, 0, SEGV_ACCERR); + __bad_area(regs, error_code, address, mm, vma, 0, SEGV_ACCERR); } } @@ -1357,8 +1361,9 @@ void do_user_addr_fault(struct pt_regs *regs, goto lock_mmap; if (unlikely(access_error(error_code, vma))) { - vma_end_read(vma); - goto lock_mmap; + bad_area_access_error(regs, error_code, address, NULL, vma); + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return; } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) @@ -1394,7 +1399,7 @@ void do_user_addr_fault(struct pt_regs *regs, * we can handle it.. */ if (unlikely(access_error(error_code, vma))) { - bad_area_access_error(regs, error_code, address, vma); + bad_area_access_error(regs, error_code, address, mm, vma); return; } -- 2.27.0
Re: [PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
On 2024/4/3 13:30, Suren Baghdasaryan wrote: On Tue, Apr 2, 2024 at 10:19 PM Suren Baghdasaryan wrote: On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle error, no need to lock_mm_and_find_vma() and check vm_flags again, the latency time reduce 34% in lmbench 'lat_sig -P 1 prot lat_sig'. The change makes sense to me. Per-VMA lock is enough to keep vma->vm_flags stable, so no need to retry with mmap_lock. Signed-off-by: Kefeng Wang Reviewed-by: Suren Baghdasaryan --- arch/arm64/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 9bb9f395351a..405f9aa831bd 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); - goto lock_mmap; + fault = VM_FAULT_BADACCESS; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); nit: VMA_LOCK_SUCCESS accounting here seems correct to me but unrelated to the main change. Either splitting into a separate patch or mentioning this additional fixup in the changelog would be helpful. The above nit applies to all the patches after this one, so I won't comment on each one separately. If you decide to split or adjust the changelog please do that for each patch. I will update the change log for each patch, thank for your review and suggestion. + goto done; } fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH 7/7] x86: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error and return, there is no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang --- arch/x86/mm/fault.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a4cc20d0036d..67b18adc75dd 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -866,14 +866,17 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code, static void __bad_area(struct pt_regs *regs, unsigned long error_code, - unsigned long address, u32 pkey, int si_code) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma, u32 pkey, int si_code) { - struct mm_struct *mm = current->mm; /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); } @@ -897,7 +900,8 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code, static noinline void bad_area_access_error(struct pt_regs *regs, unsigned long error_code, - unsigned long address, struct vm_area_struct *vma) + unsigned long address, struct mm_struct *mm, + struct vm_area_struct *vma) { /* * This OSPKE check is not strictly necessary at runtime. @@ -927,9 +931,9 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code, */ u32 pkey = vma_pkey(vma); - __bad_area(regs, error_code, address, pkey, SEGV_PKUERR); + __bad_area(regs, error_code, address, mm, vma, pkey, SEGV_PKUERR); } else { - __bad_area(regs, error_code, address, 0, SEGV_ACCERR); + __bad_area(regs, error_code, address, mm, vma, 0, SEGV_ACCERR); } } @@ -1357,8 +1361,9 @@ void do_user_addr_fault(struct pt_regs *regs, goto lock_mmap; if (unlikely(access_error(error_code, vma))) { - vma_end_read(vma); - goto lock_mmap; + bad_area_access_error(regs, error_code, address, NULL, vma); + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return; } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) @@ -1394,7 +1399,7 @@ void do_user_addr_fault(struct pt_regs *regs, * we can handle it.. */ if (unlikely(access_error(error_code, vma))) { - bad_area_access_error(regs, error_code, address, vma); + bad_area_access_error(regs, error_code, address, mm, vma); return; } -- 2.27.0
[PATCH 1/7] arm64: mm: cleanup __do_page_fault()
The __do_page_fault() only check vma->flags and call handle_mm_fault(), and only called by do_page_fault(), let's squash it into do_page_fault() to cleanup code. Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 27 +++ 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 8251e2fea9c7..9bb9f395351a 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -486,25 +486,6 @@ static void do_bad_area(unsigned long far, unsigned long esr, } } -#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) -#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) - -static vm_fault_t __do_page_fault(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long addr, - unsigned int mm_flags, unsigned long vm_flags, - struct pt_regs *regs) -{ - /* -* Ok, we have a good vm_area for this memory access, so we can handle -* it. -* Check that the permissions on the VMA allow for the fault which -* occurred. -*/ - if (!(vma->vm_flags & vm_flags)) - return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr, mm_flags, regs); -} - static bool is_el0_instruction_abort(unsigned long esr) { return ESR_ELx_EC(esr) == ESR_ELx_EC_IABT_LOW; @@ -519,6 +500,9 @@ static bool is_write_abort(unsigned long esr) return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM); } +#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) +#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) + static int __kprobes do_page_fault(unsigned long far, unsigned long esr, struct pt_regs *regs) { @@ -617,7 +601,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto done; } - fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs); + if (!(vma->vm_flags & vm_flags)) + fault = VM_FAULT_BADACCESS; + else + fault = handle_mm_fault(vma, addr, mm_flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { -- 2.27.0
[PATCH 6/7] s390: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error and return, there is no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang --- arch/s390/mm/fault.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index c421dd44ffbe..162ca2576fd4 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -325,7 +325,8 @@ static void do_exception(struct pt_regs *regs, int access) goto lock_mmap; if (!(vma->vm_flags & access)) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return handle_fault_error_nolock(regs, SEGV_ACCERR); } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH 0/7] arch/mm/fault: accelerate pagefault when badaccess
After VMA lock-based page fault handling enabled, if bad access met under per-vma lock, it will fallback to mmap_lock-based handling, so it leads to unnessary mmap lock and vma find again. A test from lmbench shows 34% improve after this changes on arm64, lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 Only build test on other archs except arm64. Kefeng Wang (7): arm64: mm: cleanup __do_page_fault() arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS arm: mm: accelerate pagefault when VM_FAULT_BADACCESS powerpc: mm: accelerate pagefault when badaccess riscv: mm: accelerate pagefault when badaccess s390: mm: accelerate pagefault when badaccess x86: mm: accelerate pagefault when badaccess arch/arm/mm/fault.c | 4 +++- arch/arm64/mm/fault.c | 31 ++- arch/powerpc/mm/fault.c | 33 - arch/riscv/mm/fault.c | 5 - arch/s390/mm/fault.c| 3 ++- arch/x86/mm/fault.c | 23 ++- 6 files changed, 53 insertions(+), 46 deletions(-) -- 2.27.0
[PATCH 5/7] riscv: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error and return, there is no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 3ba1d4dde5dd..b3fcf7d67efb 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -292,7 +292,10 @@ void handle_page_fault(struct pt_regs *regs) if (unlikely(access_error(cause, vma))) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + tsk->thread.bad_cause = SEGV_ACCERR; + bad_area_nosemaphore(regs, code, addr); + return; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); -- 2.27.0
[PATCH 4/7] powerpc: mm: accelerate pagefault when badaccess
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly handle error and return, there is no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang --- arch/powerpc/mm/fault.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 53335ae21a40..215690452495 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -71,23 +71,26 @@ static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long add return __bad_area_nosemaphore(regs, address, SEGV_MAPERR); } -static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code) +static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code, + struct mm_struct *mm, struct vm_area_struct *vma) { - struct mm_struct *mm = current->mm; /* * Something tried to access memory that isn't in our memory map.. * Fix it, but check if it's kernel or user first.. */ - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); return __bad_area_nosemaphore(regs, address, si_code); } static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, + struct mm_struct *mm, struct vm_area_struct *vma) { - struct mm_struct *mm = current->mm; int pkey; /* @@ -109,7 +112,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, */ pkey = vma_pkey(vma); - mmap_read_unlock(mm); + if (mm) + mmap_read_unlock(mm); + else + vma_end_read(vma); /* * If we are in kernel mode, bail out with a SEGV, this will @@ -124,9 +130,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address, return 0; } -static noinline int bad_access(struct pt_regs *regs, unsigned long address) +static noinline int bad_access(struct pt_regs *regs, unsigned long address, + struct mm_struct *mm, struct vm_area_struct *vma) { - return __bad_area(regs, address, SEGV_ACCERR); + return __bad_area(regs, address, SEGV_ACCERR, mm, vma); } static int do_sigbus(struct pt_regs *regs, unsigned long address, @@ -479,13 +486,13 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma))) { - vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return bad_access_pkey(regs, address, NULL, vma); } if (unlikely(access_error(is_write, is_exec, vma))) { - vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + return bad_access(regs, address, NULL, vma); } fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); @@ -521,10 +528,10 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (unlikely(access_pkey_error(is_write, is_exec, (error_code & DSISR_KEYFAULT), vma))) - return bad_access_pkey(regs, address, vma); + return bad_access_pkey(regs, address, mm, vma); if (unlikely(access_error(is_write, is_exec, vma))) - return bad_access(regs, address); + return bad_access(regs, address, mm, vma); /* * If for any reason at all we couldn't handle the fault, -- 2.27.0
[PATCH 3/7] arm: mm: accelerate pagefault when VM_FAULT_BADACCESS
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle error, so no need to lock_mm_and_find_vma() and check vm_flags again. Signed-off-by: Kefeng Wang --- arch/arm/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 439dc6a26bb9..5c4b417e24f9 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -294,7 +294,9 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); - goto lock_mmap; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = VM_FAULT_BADACCESS; + goto bad_area; } fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
[PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
The vm_flags of vma already checked under per-VMA lock, if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle error, no need to lock_mm_and_find_vma() and check vm_flags again, the latency time reduce 34% in lmbench 'lat_sig -P 1 prot lat_sig'. Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 9bb9f395351a..405f9aa831bd 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, if (!(vma->vm_flags & vm_flags)) { vma_end_read(vma); - goto lock_mmap; + fault = VM_FAULT_BADACCESS; + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; } fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) -- 2.27.0
Re: [PATCH] asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
On 2023/11/20 14:40, Arnd Bergmann wrote: On Mon, Nov 20, 2023, at 01:39, Kefeng Wang wrote: On 2023/11/20 3:34, Geert Uytterhoeven wrote: On Sat, Nov 18, 2023 at 11:09 AM Kefeng Wang wrote: -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) -#define unxlate_dev_mem_ptr(p, v) do { } while (0) - void __ioread64_copy(void *to, const void __iomem *from, size_t count); Missing #include , according to the build bot report. Will check the bot report. I had planned to pick up the series from https://lore.kernel.org/lkml/20230921110424.215592-3-...@redhat.com/ Good to see it. for v6.7 but didn't make it in the end. I'll try to do it now for v6.8 and apply your v1 patch with the Acks on top. Thanks. Arnd
[PATCH v2] asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
The asm-generic/io.h already has default define, remove unnecessary arch's defination. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Russell King Cc: Brian Cain Cc: "James E.J. Bottomley" Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Yoshinori Sato Cc: Rich Felker Cc: "David S. Miller" Cc: Stanislav Kinsburskii Reviewed-by: Geert Uytterhoeven [m68k] Acked-by: Geert Uytterhoeven [m68k] Reviewed-by: Geert Uytterhoeven[sh] Signed-off-by: Kefeng Wang --- v2: - remove mips change, since it needs more extra works for enable arch/alpha/include/asm/io.h| 6 -- arch/arm/include/asm/io.h | 6 -- arch/hexagon/include/asm/io.h | 6 -- arch/m68k/include/asm/io_mm.h | 6 -- arch/parisc/include/asm/io.h | 6 -- arch/powerpc/include/asm/io.h | 6 -- arch/sh/include/asm/io.h | 7 --- arch/sparc/include/asm/io_64.h | 6 -- 8 files changed, 49 deletions(-) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index 7aeaf7c30a6f..5e5d21ebc584 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -651,12 +651,6 @@ extern void outsl (unsigned long port, const void *src, unsigned long count); #endif #define RTC_ALWAYS_BCD 0 -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - /* * These get provided from since alpha does not * select GENERIC_IOMAP. diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h index 56b08ed6cc3b..1815748f5d2a 100644 --- a/arch/arm/include/asm/io.h +++ b/arch/arm/include/asm/io.h @@ -407,12 +407,6 @@ struct pci_dev; #define pci_iounmap pci_iounmap extern void pci_iounmap(struct pci_dev *dev, void __iomem *addr); -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - #include #ifdef CONFIG_MMU diff --git a/arch/hexagon/include/asm/io.h b/arch/hexagon/include/asm/io.h index e2b308e32a37..97d57751ce3b 100644 --- a/arch/hexagon/include/asm/io.h +++ b/arch/hexagon/include/asm/io.h @@ -58,12 +58,6 @@ static inline void *phys_to_virt(unsigned long address) return __va(address); } -/* - * convert a physical pointer to a virtual kernel pointer for - * /dev/mem access. - */ -#define xlate_dev_mem_ptr(p)__va(p) - /* * IO port access primitives. Hexagon doesn't have special IO access * instructions; all I/O is memory mapped. diff --git a/arch/m68k/include/asm/io_mm.h b/arch/m68k/include/asm/io_mm.h index 47525f2a57e1..090aec54b8fa 100644 --- a/arch/m68k/include/asm/io_mm.h +++ b/arch/m68k/include/asm/io_mm.h @@ -389,12 +389,6 @@ static inline void isa_delay(void) #define __ARCH_HAS_NO_PAGE_ZERO_MAPPED 1 -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - #define readb_relaxed(addr)readb(addr) #define readw_relaxed(addr)readw(addr) #define readl_relaxed(addr)readl(addr) diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h index 366537042465..9c06cafb0e70 100644 --- a/arch/parisc/include/asm/io.h +++ b/arch/parisc/include/asm/io.h @@ -267,12 +267,6 @@ extern void iowrite64be(u64 val, void __iomem *addr); #define iowrite16_rep iowrite16_rep #define iowrite32_rep iowrite32_rep -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - extern int devmem_is_allowed(unsigned long pfn); #include diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 5220274a6277..79421c285066 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -709,12 +709,6 @@ static inline void name at \ #define memcpy_fromio memcpy_fromio #define memcpy_toio memcpy_toio -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - /* * We don't do relaxed operations yet, at least not with this semantic */ diff --git a/arch/sh/include/asm/io.h b/arch/sh/include/asm/io.h index ac521f287fa5..be7ac06423a9 100644 --- a/arch/sh/include/asm/io.h +++ b/arch/sh/include/asm/io.h @@ -304,13 +304,6 @@ unsigned long long poke_real_address_q(unsigned long long addr, #define ioremap_uc ioremap -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) -#define unxlate_dev_mem_ptr(p, v) do { } while (0) - #include #define ARCH_HAS_VALID_PHYS_ADDR_RANGE diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h index 9303270b22f3..75ae9bf3bb7b 100644 --- a/arch/sparc/include/asm/io_64.h +++ b/arch/sparc/include/asm/io_64.h @@ -470,12 +470,6 @@ static inline int sbus_can_burst64(void) struct device; void sbus_set_sbus64(
Re: [PATCH] asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
On 2023/11/20 3:34, Geert Uytterhoeven wrote: On Sat, Nov 18, 2023 at 11:09 AM Kefeng Wang wrote: The asm-generic/io.h already has default definition, remove unnecessary arch's defination. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Russell King Cc: Brian Cain Cc: "James E.J. Bottomley" Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Yoshinori Sato Cc: Rich Felker Cc: "David S. Miller" Cc: Stanislav Kinsburskii Signed-off-by: Kefeng Wang arch/m68k/include/asm/io_mm.h | 6 -- Reviewed-by: Geert Uytterhoeven Acked-by: Geert Uytterhoeven arch/sh/include/asm/io.h | 7 --- Reviewed-by: Geert Uytterhoeven Thanks, --- a/arch/mips/include/asm/io.h +++ b/arch/mips/include/asm/io.h @@ -548,13 +548,6 @@ extern void (*_dma_cache_inv)(unsigned long start, unsigned long size); #define csr_out32(v, a) (*(volatile u32 *)((unsigned long)(a) + __CSR_32_ADJUST) = (v)) #define csr_in32(a)(*(volatile u32 *)((unsigned long)(a) + __CSR_32_ADJUST)) -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) -#define unxlate_dev_mem_ptr(p, v) do { } while (0) - void __ioread64_copy(void *to, const void __iomem *from, size_t count); Missing #include , according to the build bot report. Will check the bot report. #endif /* _ASM_IO_H */ Gr{oetje,eeting}s, Geert
[PATCH] asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
The asm-generic/io.h already has default definition, remove unnecessary arch's defination. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Russell King Cc: Brian Cain Cc: "James E.J. Bottomley" Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Yoshinori Sato Cc: Rich Felker Cc: "David S. Miller" Cc: Stanislav Kinsburskii Signed-off-by: Kefeng Wang --- arch/alpha/include/asm/io.h| 6 -- arch/arm/include/asm/io.h | 6 -- arch/hexagon/include/asm/io.h | 6 -- arch/m68k/include/asm/io_mm.h | 6 -- arch/mips/include/asm/io.h | 7 --- arch/parisc/include/asm/io.h | 6 -- arch/powerpc/include/asm/io.h | 6 -- arch/sh/include/asm/io.h | 7 --- arch/sparc/include/asm/io_64.h | 6 -- 9 files changed, 56 deletions(-) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index 7aeaf7c30a6f..5e5d21ebc584 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -651,12 +651,6 @@ extern void outsl (unsigned long port, const void *src, unsigned long count); #endif #define RTC_ALWAYS_BCD 0 -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - /* * These get provided from since alpha does not * select GENERIC_IOMAP. diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h index 56b08ed6cc3b..1815748f5d2a 100644 --- a/arch/arm/include/asm/io.h +++ b/arch/arm/include/asm/io.h @@ -407,12 +407,6 @@ struct pci_dev; #define pci_iounmap pci_iounmap extern void pci_iounmap(struct pci_dev *dev, void __iomem *addr); -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - #include #ifdef CONFIG_MMU diff --git a/arch/hexagon/include/asm/io.h b/arch/hexagon/include/asm/io.h index e2b308e32a37..97d57751ce3b 100644 --- a/arch/hexagon/include/asm/io.h +++ b/arch/hexagon/include/asm/io.h @@ -58,12 +58,6 @@ static inline void *phys_to_virt(unsigned long address) return __va(address); } -/* - * convert a physical pointer to a virtual kernel pointer for - * /dev/mem access. - */ -#define xlate_dev_mem_ptr(p)__va(p) - /* * IO port access primitives. Hexagon doesn't have special IO access * instructions; all I/O is memory mapped. diff --git a/arch/m68k/include/asm/io_mm.h b/arch/m68k/include/asm/io_mm.h index 47525f2a57e1..090aec54b8fa 100644 --- a/arch/m68k/include/asm/io_mm.h +++ b/arch/m68k/include/asm/io_mm.h @@ -389,12 +389,6 @@ static inline void isa_delay(void) #define __ARCH_HAS_NO_PAGE_ZERO_MAPPED 1 -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - #define readb_relaxed(addr)readb(addr) #define readw_relaxed(addr)readw(addr) #define readl_relaxed(addr)readl(addr) diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h index 062dd4e6b954..2158ff302430 100644 --- a/arch/mips/include/asm/io.h +++ b/arch/mips/include/asm/io.h @@ -548,13 +548,6 @@ extern void (*_dma_cache_inv)(unsigned long start, unsigned long size); #define csr_out32(v, a) (*(volatile u32 *)((unsigned long)(a) + __CSR_32_ADJUST) = (v)) #define csr_in32(a)(*(volatile u32 *)((unsigned long)(a) + __CSR_32_ADJUST)) -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) -#define unxlate_dev_mem_ptr(p, v) do { } while (0) - void __ioread64_copy(void *to, const void __iomem *from, size_t count); #endif /* _ASM_IO_H */ diff --git a/arch/parisc/include/asm/io.h b/arch/parisc/include/asm/io.h index 366537042465..9c06cafb0e70 100644 --- a/arch/parisc/include/asm/io.h +++ b/arch/parisc/include/asm/io.h @@ -267,12 +267,6 @@ extern void iowrite64be(u64 val, void __iomem *addr); #define iowrite16_rep iowrite16_rep #define iowrite32_rep iowrite32_rep -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - extern int devmem_is_allowed(unsigned long pfn); #include diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 5220274a6277..79421c285066 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -709,12 +709,6 @@ static inline void name at \ #define memcpy_fromio memcpy_fromio #define memcpy_toio memcpy_toio -/* - * Convert a physical pointer to a virtual kernel pointer for /dev/mem - * access - */ -#define xlate_dev_mem_ptr(p) __va(p) - /* * We don't do relaxed operations yet, at least not with this semantic */ diff --git a/arch/sh/include/asm/io.h b/arch/sh/include/asm/io.h index ac521f287fa5..be7ac06423a9 100644 --- a/arch/sh/include/asm/io.h +++ b/arch/sh/include/asm/io.h @@ -304,13 +304,6 @@ unsigned long long poke_real_address_q(unsigned long long addr,
Re: [PATCH rfc v2 04/10] s390: mm: use try_vma_locked_page_fault()
On 2023/8/24 16:32, Heiko Carstens wrote: On Thu, Aug 24, 2023 at 10:16:33AM +0200, Alexander Gordeev wrote: On Mon, Aug 21, 2023 at 08:30:50PM +0800, Kefeng Wang wrote: Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/s390/mm/fault.c | 66 ++-- 1 file changed, 27 insertions(+), 39 deletions(-) ... - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); - if (likely(!(fault & VM_FAULT_ERROR))) - fault = 0; This fault fixup is removed in the new version. ... + vmf.vm_flags = VM_WRITE; + if (vmf.vm_flags == VM_WRITE) + vmf.flags |= FAULT_FLAG_WRITE; + + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto lock_mm; Because VM_FAULT_NONE is set to 0 it gets confused with the success code of 0 returned by a fault handler. In the former case we want to continue, while in the latter - successfully return. I think it applies to all archs. ... FWIW, this series ends up with kernel BUG at arch/s390/mm/fault.c:341! I didn't test and only built, this is a RFC to want to know whether the way to add three more numbers into vmf and using vmf in arch's page fault is feasible or not. Without having looked in detail into this patch: all of this is likely because s390's fault handling is quite odd. Not only because fault is set to 0, but also because of the private VM_FAULT values like VM_FAULT_BADCONTEXT. I'm just cleaning up all of this, but it won't make it for the next merge window. Sure, if re-post, will drop the s390's change, but as mentioned above, the abstract of the generic vma locked and changes may be not perfect, let's wait for more response. Thanks all. Therefore I'd like to ask to drop the s390 conversion of this series, and if this series is supposed to be merged the s390 conversion needs to be done later. Let's not waste more time on the current implementation, please.
Re: [PATCH rfc v2 01/10] mm: add a generic VMA lock-based page fault handler
On 2023/8/24 15:12, Alexander Gordeev wrote: On Mon, Aug 21, 2023 at 08:30:47PM +0800, Kefeng Wang wrote: Hi Kefeng, The ARCH_SUPPORTS_PER_VMA_LOCK are enabled by more and more architectures, eg, x86, arm64, powerpc and s390, and riscv, those implementation are very similar which results in some duplicated codes, let's add a generic VMA lock-based page fault handler try_to_vma_locked_page_fault() to eliminate them, and which also make us easy to support this on new architectures. Since different architectures use different way to check vma whether is accessable or not, the struct pt_regs, page fault error code and vma flags are added into struct vm_fault, then, the architecture's page fault code could re-use struct vm_fault to record and check vma accessable by each own implementation. Signed-off-by: Kefeng Wang --- ... + +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf) +{ + vm_fault_t fault = VM_FAULT_NONE; + struct vm_area_struct *vma; + + if (!(vmf->flags & FAULT_FLAG_USER)) + return fault; + + vma = lock_vma_under_rcu(current->mm, vmf->real_address); + if (!vma) + return fault; + + if (arch_vma_access_error(vma, vmf)) { + vma_end_read(vma); + return fault; + } + + fault = handle_mm_fault(vma, vmf->real_address, + vmf->flags | FAULT_FLAG_VMA_LOCK, vmf->regs); + + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) + vma_end_read(vma); Could you please explain how vma_end_read() call could be conditional? The check is added for swap and userfault, see https://lkml.kernel.org/r/20230630211957.1341547-4-sur...@google.com + + if (fault & VM_FAULT_RETRY) + count_vm_vma_lock_event(VMA_LOCK_RETRY); + else + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + + return fault; +} + #endif /* CONFIG_PER_VMA_LOCK */ #ifndef __PAGETABLE_P4D_FOLDED
Re: [PATCH rfc v2 05/10] powerpc: mm: use try_vma_locked_page_fault()
On 2023/8/22 17:38, Christophe Leroy wrote: Le 21/08/2023 à 14:30, Kefeng Wang a écrit : Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Does it really simplifies code ? It's 32 insertions versus 34 deletions so only removing 2 lines. Yes,it is unfriendly for powerpc as the arch's vma access check is much complex than other arch, I don't like the struct vm_fault you are adding because when it was four independant variables it was handled through local registers. Now that it is a struct it has to go via the stack, leading to unnecessary memory read and writes. And going back and forth between architecture code and generic code may also be counter-performant. Because different arch has different var to check vma access, so the easy way to add them into vmf, I don' find a better way. Did you make any performance analysis ? Page faults are really a hot path when dealling with minor faults. no, this is only built and rfc to see the feedback about the conversion. Thanks. Thanks Christophe Signed-off-by: Kefeng Wang --- arch/powerpc/mm/fault.c | 66 - 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b1723094d464..52f9546e020e 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -391,6 +391,22 @@ static int page_fault_is_bad(unsigned long err) #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_32S) #endif +#ifdef CONFIG_PER_VMA_LOCK +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + int is_exec = TRAP(vmf->regs) == INTERRUPT_INST_STORAGE; + int is_write = page_fault_is_write(vmf->fault_code); + + if (unlikely(access_pkey_error(is_write, is_exec, + (vmf->fault_code & DSISR_KEYFAULT), vma))) + return true; + + if (unlikely(access_error(is_write, is_exec, vma))) + return true; + return false; +} +#endif + /* * For 600- and 800-family processors, the error_code parameter is DSISR * for a data fault, SRR1 for an instruction fault. @@ -407,12 +423,18 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; - unsigned int flags = FAULT_FLAG_DEFAULT; int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE; int is_user = user_mode(regs); int is_write = page_fault_is_write(error_code); vm_fault_t fault, major = 0; bool kprobe_fault = kprobe_page_fault(regs, 11); + struct vm_fault vmf = { + .real_address = address, + .fault_code = error_code, + .regs = regs, + .flags = FAULT_FLAG_DEFAULT, + }; + if (unlikely(debugger_fault_handler(regs) || kprobe_fault)) return 0; @@ -463,45 +485,21 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, * mmap_lock held */ if (is_user) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (is_write) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; if (is_exec) - flags |= FAULT_FLAG_INSTRUCTION; + vmf.flags |= FAULT_FLAG_INSTRUCTION; - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - - if (unlikely(access_pkey_error(is_write, is_exec, - (error_code & DSISR_KEYFAULT), vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - if (unlikely(access_error(is_write, is_exec, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); if (fault_signal_pending(fault, regs)) return user_mode(regs) ? 0 : SIGBUS; -lock_mmap: - /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -528,7 +526,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
[PATCH rfc v2 02/10] arm64: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code, also pass struct vmf to __do_page_fault() directly instead of each independent variable. No functional change intended. Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 60 --- 1 file changed, 22 insertions(+), 38 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 2e5d1e238af9..2b7a1e610b3e 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -498,9 +498,8 @@ static void do_bad_area(unsigned long far, unsigned long esr, #define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) static vm_fault_t __do_page_fault(struct mm_struct *mm, - struct vm_area_struct *vma, unsigned long addr, - unsigned int mm_flags, unsigned long vm_flags, - struct pt_regs *regs) + struct vm_area_struct *vma, + struct vm_fault *vmf) { /* * Ok, we have a good vm_area for this memory access, so we can handle @@ -508,9 +507,9 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, * Check that the permissions on the VMA allow for the fault which * occurred. */ - if (!(vma->vm_flags & vm_flags)) + if (!(vma->vm_flags & vmf->vm_flags)) return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr, mm_flags, regs); + return handle_mm_fault(vma, vmf->real_address, vmf->flags, vmf->regs); } static bool is_el0_instruction_abort(unsigned long esr) @@ -533,10 +532,12 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, const struct fault_info *inf; struct mm_struct *mm = current->mm; vm_fault_t fault; - unsigned long vm_flags; - unsigned int mm_flags = FAULT_FLAG_DEFAULT; unsigned long addr = untagged_addr(far); struct vm_area_struct *vma; + struct vm_fault vmf = { + .real_address = addr, + .flags = FAULT_FLAG_DEFAULT, + }; if (kprobe_page_fault(regs, esr)) return 0; @@ -549,7 +550,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto no_context; if (user_mode(regs)) - mm_flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; /* * vm_flags tells us what bits we must have in vma->vm_flags @@ -559,20 +560,20 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, */ if (is_el0_instruction_abort(esr)) { /* It was exec fault */ - vm_flags = VM_EXEC; - mm_flags |= FAULT_FLAG_INSTRUCTION; + vmf.vm_flags = VM_EXEC; + vmf.flags |= FAULT_FLAG_INSTRUCTION; } else if (is_write_abort(esr)) { /* It was write fault */ - vm_flags = VM_WRITE; - mm_flags |= FAULT_FLAG_WRITE; + vmf.vm_flags = VM_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; } else { /* It was read fault */ - vm_flags = VM_READ; + vmf.vm_flags = VM_READ; /* Write implies read */ - vm_flags |= VM_WRITE; + vmf.vm_flags |= VM_WRITE; /* If EPAN is absent then exec implies read */ if (!cpus_have_const_cap(ARM64_HAS_EPAN)) - vm_flags |= VM_EXEC; + vmf.vm_flags |= VM_EXEC; } if (is_ttbr0_addr(addr) && is_el1_permission_fault(addr, esr, regs)) { @@ -587,26 +588,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); - if (!(mm_flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, addr); - if (!vma) - goto lock_mmap; - - if (!(vma->vm_flags & vm_flags)) { - vma_end_read(vma); - goto lock_mmap; - } - fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -614,8 +600,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto no_co
[PATCH rfc v2 04/10] s390: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/s390/mm/fault.c | 66 ++-- 1 file changed, 27 insertions(+), 39 deletions(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 099c4824dd8a..fbbdebde6ea7 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -357,16 +357,18 @@ static noinline void do_fault_error(struct pt_regs *regs, vm_fault_t fault) static inline vm_fault_t do_exception(struct pt_regs *regs, int access) { struct gmap *gmap; - struct task_struct *tsk; - struct mm_struct *mm; struct vm_area_struct *vma; enum fault_type type; - unsigned long address; - unsigned int flags; + struct mm_struct *mm = current->mm; + unsigned long address = get_fault_address(regs); vm_fault_t fault; bool is_write; + struct vm_fault vmf = { + .real_address = address, + .flags = FAULT_FLAG_DEFAULT, + .vm_flags = access, + }; - tsk = current; /* * The instruction that caused the program check has * been nullified. Don't signal single step via SIGTRAP. @@ -376,8 +378,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) if (kprobe_page_fault(regs, 14)) return 0; - mm = tsk->mm; - address = get_fault_address(regs); is_write = fault_is_write(regs); /* @@ -398,45 +398,33 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) } perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - flags = FAULT_FLAG_DEFAULT; if (user_mode(regs)) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (is_write) - access = VM_WRITE; - if (access == VM_WRITE) - flags |= FAULT_FLAG_WRITE; - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - if (!(vma->vm_flags & access)) { - vma_end_read(vma); - goto lock_mmap; - } - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); - if (likely(!(fault & VM_FAULT_ERROR))) - fault = 0; + vmf.vm_flags = VM_WRITE; + if (vmf.vm_flags == VM_WRITE) + vmf.flags |= FAULT_FLAG_WRITE; + + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto lock_mm; + if (!(fault & VM_FAULT_RETRY)) goto out; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); + /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; goto out; } -lock_mmap: + +lock_mm: mmap_read_lock(mm); gmap = NULL; if (IS_ENABLED(CONFIG_PGSTE) && type == GMAP_FAULT) { gmap = (struct gmap *) S390_lowcore.gmap; current->thread.gmap_addr = address; - current->thread.gmap_write_flag = !!(flags & FAULT_FLAG_WRITE); + current->thread.gmap_write_flag = !!(vmf.flags & FAULT_FLAG_WRITE); current->thread.gmap_int_code = regs->int_code & 0x; address = __gmap_translate(gmap, address); if (address == -EFAULT) { @@ -444,7 +432,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) goto out_up; } if (gmap->pfault_enabled) - flags |= FAULT_FLAG_RETRY_NOWAIT; + vmf.flags |= FAULT_FLAG_RETRY_NOWAIT; } retry: @@ -466,7 +454,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) * we can handle it.. */ fault = VM_FAULT_BADACCESS; - if (unlikely(!(vma->vm_flags & access))) + if (unlikely(!(vma->vm_flags & vmf.vm_flags))) goto out_up; /* @@ -474,10 +462,10 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, regs); + fault = handle_mm_fault(vma, address, vmf.flags, regs); if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; - if (flags & FAULT_FLAG_RETRY_NOWAIT) +
[PATCH rfc v2 03/10] x86: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/x86/mm/fault.c | 55 +++-- 1 file changed, 23 insertions(+), 32 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index ab778eac1952..3edc9edc0b28 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1227,6 +1227,13 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, } NOKPROBE_SYMBOL(do_kern_addr_fault); +#ifdef CONFIG_PER_VMA_LOCK +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return access_error(vmf->fault_code, vma); +} +#endif + /* * Handle faults in the user portion of the address space. Nothing in here * should check X86_PF_USER without a specific justification: for almost @@ -1241,13 +1248,13 @@ void do_user_addr_fault(struct pt_regs *regs, unsigned long address) { struct vm_area_struct *vma; - struct task_struct *tsk; - struct mm_struct *mm; + struct mm_struct *mm = current->mm; vm_fault_t fault; - unsigned int flags = FAULT_FLAG_DEFAULT; - - tsk = current; - mm = tsk->mm; + struct vm_fault vmf = { + .real_address = address, + .fault_code = error_code, + .flags = FAULT_FLAG_DEFAULT + }; if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR)) { /* @@ -1311,7 +1318,7 @@ void do_user_addr_fault(struct pt_regs *regs, */ if (user_mode(regs)) { local_irq_enable(); - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; } else { if (regs->flags & X86_EFLAGS_IF) local_irq_enable(); @@ -1326,11 +1333,11 @@ void do_user_addr_fault(struct pt_regs *regs, * maybe_mkwrite() can create a proper shadow stack PTE. */ if (error_code & X86_PF_SHSTK) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; if (error_code & X86_PF_WRITE) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; if (error_code & X86_PF_INSTR) - flags |= FAULT_FLAG_INSTRUCTION; + vmf.flags |= FAULT_FLAG_INSTRUCTION; #ifdef CONFIG_X86_64 /* @@ -1350,26 +1357,11 @@ void do_user_addr_fault(struct pt_regs *regs, } #endif - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - - if (unlikely(access_error(error_code, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -1379,7 +1371,6 @@ void do_user_addr_fault(struct pt_regs *regs, ARCH_DEFAULT_PKEY); return; } -lock_mmap: retry: vma = lock_mm_and_find_vma(mm, address, regs); @@ -1410,7 +1401,7 @@ void do_user_addr_fault(struct pt_regs *regs, * userland). The return to userland is identified whenever * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags. */ - fault = handle_mm_fault(vma, address, flags, regs); + fault = handle_mm_fault(vma, address, vmf.flags, regs); if (fault_signal_pending(fault, regs)) { /* @@ -1434,7 +1425,7 @@ void do_user_addr_fault(struct pt_regs *regs, * that we made any progress. Handle this case first. */ if (unlikely(fault & VM_FAULT_RETRY)) { - flags |= FAULT_FLAG_TRIED; + vmf.flags |= FAULT_FLAG_TRIED; goto retry; } -- 2.27.0
[PATCH rfc v2 07/10] ARM: mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Kefeng Wang --- arch/arm/Kconfig| 1 + arch/arm/mm/fault.c | 35 +-- 2 files changed, 26 insertions(+), 10 deletions(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 1a6a6eb48a15..8b6d4507ccee 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -34,6 +34,7 @@ config ARM select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_HUGETLBFS if ARM_LPAE + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_MEMTEST diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index fef62e4a9edd..d53bb028899a 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -242,8 +242,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) struct vm_area_struct *vma; int sig, code; vm_fault_t fault; - unsigned int flags = FAULT_FLAG_DEFAULT; - unsigned long vm_flags = VM_ACCESS_FLAGS; + struct vm_fault vmf = { + .real_address = addr, + .flags = FAULT_FLAG_DEFAULT, + .vm_flags = VM_ACCESS_FLAGS, + }; if (kprobe_page_fault(regs, fsr)) return 0; @@ -261,15 +264,15 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) goto no_context; if (user_mode(regs)) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (is_write_fault(fsr)) { - flags |= FAULT_FLAG_WRITE; - vm_flags = VM_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; + vmf.vm_flags = VM_WRITE; } if (fsr & FSR_LNX_PF) { - vm_flags = VM_EXEC; + vmf.vm_flags = VM_EXEC; if (is_permission_fault(fsr) && !user_mode(regs)) die_kernel_fault("execution of memory", @@ -278,6 +281,18 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) + goto done; + + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + goto no_context; + return 0; + } + retry: vma = lock_mm_and_find_vma(mm, addr, regs); if (unlikely(!vma)) { @@ -289,10 +304,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) * ok, we have a good vm_area for this memory access, check the * permissions on the VMA allow for the fault which occurred. */ - if (!(vma->vm_flags & vm_flags)) + if (!(vma->vm_flags & vmf.vm_flags)) fault = VM_FAULT_BADACCESS; else - fault = handle_mm_fault(vma, addr & PAGE_MASK, flags, regs); + fault = handle_mm_fault(vma, addr & PAGE_MASK, vmf.flags, regs); /* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_lock because @@ -310,13 +325,13 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if (!(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_RETRY) { - flags |= FAULT_FLAG_TRIED; + vmf.flags |= FAULT_FLAG_TRIED; goto retry; } } mmap_read_unlock(mm); - +done: /* * Handle the "normal" case first - VM_FAULT_MAJOR */ -- 2.27.0
[PATCH rfc v2 08/10] loongarch: mm: cleanup __do_page_fault()
Cleanup __do_page_fault() by reuse bad_area_nosemaphore and bad_area label. Signed-off-by: Kefeng Wang --- arch/loongarch/mm/fault.c | 48 +-- 1 file changed, 16 insertions(+), 32 deletions(-) diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index e6376e3dce86..5d4c742c4bc5 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -157,18 +157,15 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, if (!user_mode(regs)) no_context(regs, write, address); else - do_sigsegv(regs, write, address, si_code); - return; + goto bad_area_nosemaphore; } /* * If we're in an interrupt or have no user * context, we must not take the fault.. */ - if (faulthandler_disabled() || !mm) { - do_sigsegv(regs, write, address, si_code); - return; - } + if (faulthandler_disabled() || !mm) + goto bad_area_nosemaphore; if (user_mode(regs)) flags |= FAULT_FLAG_USER; @@ -178,23 +175,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) goto bad_area_nosemaphore; - goto good_area; - -/* - * Something tried to access memory that isn't in our memory map.. - * Fix it, but check if it's kernel or user first.. - */ -bad_area: - mmap_read_unlock(mm); -bad_area_nosemaphore: - do_sigsegv(regs, write, address, si_code); - return; -/* - * Ok, we have a good vm_area for this memory access, so - * we can handle it.. - */ -good_area: si_code = SEGV_ACCERR; if (write) { @@ -235,22 +216,25 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, */ goto retry; } + + mmap_read_unlock(mm); + if (unlikely(fault & VM_FAULT_ERROR)) { - mmap_read_unlock(mm); - if (fault & VM_FAULT_OOM) { + if (fault & VM_FAULT_OOM) do_out_of_memory(regs, write, address); - return; - } else if (fault & VM_FAULT_SIGSEGV) { - do_sigsegv(regs, write, address, si_code); - return; - } else if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) { + else if (fault & VM_FAULT_SIGSEGV) + goto bad_area_nosemaphore; + else if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) do_sigbus(regs, write, address, si_code); - return; - } - BUG(); + else + BUG(); } + return; +bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: + do_sigsegv(regs, write, address, si_code); } asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, -- 2.27.0
[PATCH rfc v2 09/10] loongarch: mm: add access_error() helper
Add access_error() to check whether vma could be accessible or not, which will be used __do_page_fault() and later vma locked based page fault. Signed-off-by: Kefeng Wang --- arch/loongarch/mm/fault.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index 5d4c742c4bc5..2a45e9f3a485 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -126,6 +126,22 @@ static void __kprobes do_sigsegv(struct pt_regs *regs, force_sig_fault(SIGSEGV, si_code, (void __user *)address); } +static inline bool access_error(unsigned int flags, struct pt_regs *regs, + unsigned long addr, struct vm_area_struct *vma) +{ + if (flags & FAULT_FLAG_WRITE) { + if (!(vma->vm_flags & VM_WRITE)) + return true; + } else { + if (!(vma->vm_flags & VM_READ) && addr != exception_era(regs)) + return true; + if (!(vma->vm_flags & VM_EXEC) && addr == exception_era(regs)) + return true; + } + + return false; +} + /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate @@ -169,6 +185,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: @@ -178,16 +196,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, si_code = SEGV_ACCERR; - if (write) { - flags |= FAULT_FLAG_WRITE; - if (!(vma->vm_flags & VM_WRITE)) - goto bad_area; - } else { - if (!(vma->vm_flags & VM_READ) && address != exception_era(regs)) - goto bad_area; - if (!(vma->vm_flags & VM_EXEC) && address == exception_era(regs)) - goto bad_area; - } + if (access_error(flags, regs, vma)) + goto bad_area; /* * If for any reason at all we couldn't handle the fault, -- 2.27.0
[PATCH rfc -next v2 00/10] mm: convert to generic VMA lock-based page fault
Add a generic VMA lock-based page fault handler in mm core, and convert architectures to use it, which eliminate architectures's duplicated codes. With it, we can avoid multiple changes in architectures's code if we add new feature or bugfix, in the end, enable this feature on ARM32 and Loongarch. This is based on next-20230817, only built test. v2: - convert "int arch_vma_check_access()" to "bool arch_vma_access_error()" still use __weak function for arch_vma_access_error(), which avoid to declare access_error() in architecture's(x86/powerpc/riscv/loongarch) headfile. - re-use struct vm_fault instead of adding new struct vm_locked_fault, per Matthew Wilcox, add necessary pt_regs/fault error code/vm flags into vm_fault since they could be used in arch_vma_access_error() - add special VM_FAULT_NONE and make try_vma_locked_page_fault() to return vm_fault_t Kefeng Wang (10): mm: add a generic VMA lock-based page fault handler arm64: mm: use try_vma_locked_page_fault() x86: mm: use try_vma_locked_page_fault() s390: mm: use try_vma_locked_page_fault() powerpc: mm: use try_vma_locked_page_fault() riscv: mm: use try_vma_locked_page_fault() ARM: mm: try VMA lock-based page fault handling first loongarch: mm: cleanup __do_page_fault() loongarch: mm: add access_error() helper loongarch: mm: try VMA lock-based page fault handling first arch/arm/Kconfig | 1 + arch/arm/mm/fault.c | 35 arch/arm64/mm/fault.c | 60 - arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 111 ++ arch/powerpc/mm/fault.c | 66 +++ arch/riscv/mm/fault.c | 58 +--- arch/s390/mm/fault.c | 66 ++- arch/x86/mm/fault.c | 55 --- include/linux/mm.h| 17 ++ include/linux/mm_types.h | 2 + mm/memory.c | 39 ++ 12 files changed, 278 insertions(+), 233 deletions(-) -- 2.27.0
[PATCH rfc v2 05/10] powerpc: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/powerpc/mm/fault.c | 66 - 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index b1723094d464..52f9546e020e 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -391,6 +391,22 @@ static int page_fault_is_bad(unsigned long err) #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_32S) #endif +#ifdef CONFIG_PER_VMA_LOCK +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + int is_exec = TRAP(vmf->regs) == INTERRUPT_INST_STORAGE; + int is_write = page_fault_is_write(vmf->fault_code); + + if (unlikely(access_pkey_error(is_write, is_exec, + (vmf->fault_code & DSISR_KEYFAULT), vma))) + return true; + + if (unlikely(access_error(is_write, is_exec, vma))) + return true; + return false; +} +#endif + /* * For 600- and 800-family processors, the error_code parameter is DSISR * for a data fault, SRR1 for an instruction fault. @@ -407,12 +423,18 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; - unsigned int flags = FAULT_FLAG_DEFAULT; int is_exec = TRAP(regs) == INTERRUPT_INST_STORAGE; int is_user = user_mode(regs); int is_write = page_fault_is_write(error_code); vm_fault_t fault, major = 0; bool kprobe_fault = kprobe_page_fault(regs, 11); + struct vm_fault vmf = { + .real_address = address, + .fault_code = error_code, + .regs = regs, + .flags = FAULT_FLAG_DEFAULT, + }; + if (unlikely(debugger_fault_handler(regs) || kprobe_fault)) return 0; @@ -463,45 +485,21 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, * mmap_lock held */ if (is_user) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (is_write) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; if (is_exec) - flags |= FAULT_FLAG_INSTRUCTION; + vmf.flags |= FAULT_FLAG_INSTRUCTION; - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - - if (unlikely(access_pkey_error(is_write, is_exec, - (error_code & DSISR_KEYFAULT), vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - if (unlikely(access_error(is_write, is_exec, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); if (fault_signal_pending(fault, regs)) return user_mode(regs) ? 0 : SIGBUS; -lock_mmap: - /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -528,7 +526,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, regs); + fault = handle_mm_fault(vma, address, vmf.flags, regs); major |= fault & VM_FAULT_MAJOR; @@ -544,7 +542,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, * case. */ if (unlikely(fault & VM_FAULT_RETRY)) { - flags |= FAULT_FLAG_TRIED; + vmf.flags |= FAULT_FLAG_TRIED; goto retry; } -- 2.27.0
[PATCH rfc v2 06/10] riscv: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 58 ++- 1 file changed, 24 insertions(+), 34 deletions(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 6115d7514972..b46129b636f2 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -215,6 +215,13 @@ static inline bool access_error(unsigned long cause, struct vm_area_struct *vma) return false; } +#ifdef CONFIG_PER_VMA_LOCK +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return access_error(vmf->fault_code, vma); +} +#endif + /* * This routine handles page faults. It determines the address and the * problem, and then passes it off to one of the appropriate routines. @@ -223,17 +230,16 @@ void handle_page_fault(struct pt_regs *regs) { struct task_struct *tsk; struct vm_area_struct *vma; - struct mm_struct *mm; - unsigned long addr, cause; - unsigned int flags = FAULT_FLAG_DEFAULT; + struct mm_struct *mm = current->mm; + unsigned long addr = regs->badaddr; + unsigned long cause = regs->cause; int code = SEGV_MAPERR; vm_fault_t fault; - - cause = regs->cause; - addr = regs->badaddr; - - tsk = current; - mm = tsk->mm; + struct vm_fault vmf = { + .real_address = addr, + .fault_code = cause, + .flags = FAULT_FLAG_DEFAULT, + }; if (kprobe_page_fault(regs, cause)) return; @@ -268,7 +274,7 @@ void handle_page_fault(struct pt_regs *regs) } if (user_mode(regs)) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (!user_mode(regs) && addr < TASK_SIZE && unlikely(!(regs->status & SR_SUM))) { if (fixup_exception(regs)) @@ -280,37 +286,21 @@ void handle_page_fault(struct pt_regs *regs) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (cause == EXC_STORE_PAGE_FAULT) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; else if (cause == EXC_INST_PAGE_FAULT) - flags |= FAULT_FLAG_INSTRUCTION; - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, addr); - if (!vma) - goto lock_mmap; + vmf.flags |= FAULT_FLAG_INSTRUCTION; - if (unlikely(access_error(cause, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) no_context(regs, addr); return; } -lock_mmap: retry: vma = lock_mm_and_find_vma(mm, addr, regs); @@ -337,7 +327,7 @@ void handle_page_fault(struct pt_regs *regs) * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, addr, flags, regs); + fault = handle_mm_fault(vma, addr, vmf.flags, regs); /* * If we need to retry but a fatal signal is pending, handle the @@ -355,7 +345,7 @@ void handle_page_fault(struct pt_regs *regs) return; if (unlikely(fault & VM_FAULT_RETRY)) { - flags |= FAULT_FLAG_TRIED; + vmf.flags |= FAULT_FLAG_TRIED; /* * No need to mmap_read_unlock(mm) as we would -- 2.27.0
[PATCH rfc v2 01/10] mm: add a generic VMA lock-based page fault handler
The ARCH_SUPPORTS_PER_VMA_LOCK are enabled by more and more architectures, eg, x86, arm64, powerpc and s390, and riscv, those implementation are very similar which results in some duplicated codes, let's add a generic VMA lock-based page fault handler try_to_vma_locked_page_fault() to eliminate them, and which also make us easy to support this on new architectures. Since different architectures use different way to check vma whether is accessable or not, the struct pt_regs, page fault error code and vma flags are added into struct vm_fault, then, the architecture's page fault code could re-use struct vm_fault to record and check vma accessable by each own implementation. Signed-off-by: Kefeng Wang --- include/linux/mm.h | 17 + include/linux/mm_types.h | 2 ++ mm/memory.c | 39 +++ 3 files changed, 58 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3f764e84e567..22a6f4c56ff3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -512,9 +512,12 @@ struct vm_fault { pgoff_t pgoff; /* Logical page offset based on vma */ unsigned long address; /* Faulting virtual address - masked */ unsigned long real_address; /* Faulting virtual address - unmasked */ + unsigned long fault_code; /* Faulting error code during page fault */ + struct pt_regs *regs; /* The registers stored during page fault */ }; enum fault_flag flags; /* FAULT_FLAG_xxx flags * XXX: should really be 'const' */ + vm_flags_t vm_flags;/* VMA flags to be used for access checking */ pmd_t *pmd; /* Pointer to pmd entry matching * the 'address' */ pud_t *pud; /* Pointer to pud entry matching @@ -774,6 +777,9 @@ static inline void assert_fault_locked(struct vm_fault *vmf) struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, unsigned long address); +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf); +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf); + #else /* CONFIG_PER_VMA_LOCK */ static inline bool vma_start_read(struct vm_area_struct *vma) @@ -801,6 +807,17 @@ static inline void assert_fault_locked(struct vm_fault *vmf) mmap_assert_locked(vmf->vma->vm_mm); } +static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address) +{ + return NULL; +} + +static inline vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf) +{ + return VM_FAULT_NONE; +} + #endif /* CONFIG_PER_VMA_LOCK */ extern const struct vm_operations_struct vma_dummy_vm_ops; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index f5ba5b0bc836..702820cea3f9 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1119,6 +1119,7 @@ typedef __bitwise unsigned int vm_fault_t; * fault. Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. * + * @VM_FAULT_NONE: Special case, not starting to handle fault * @VM_FAULT_OOM: Out Of Memory * @VM_FAULT_SIGBUS: Bad access * @VM_FAULT_MAJOR:Page read from storage @@ -1139,6 +1140,7 @@ typedef __bitwise unsigned int vm_fault_t; * */ enum vm_fault_reason { + VM_FAULT_NONE = (__force vm_fault_t)0x00, VM_FAULT_OOM= (__force vm_fault_t)0x01, VM_FAULT_SIGBUS = (__force vm_fault_t)0x02, VM_FAULT_MAJOR = (__force vm_fault_t)0x04, diff --git a/mm/memory.c b/mm/memory.c index 3b4aaa0d2fff..60fe35db5134 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5510,6 +5510,45 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, count_vm_vma_lock_event(VMA_LOCK_ABORT); return NULL; } + +#ifdef CONFIG_PER_VMA_LOCK +bool __weak arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return (vma->vm_flags & vmf->vm_flags) == 0; +} +#endif + +vm_fault_t try_vma_locked_page_fault(struct vm_fault *vmf) +{ + vm_fault_t fault = VM_FAULT_NONE; + struct vm_area_struct *vma; + + if (!(vmf->flags & FAULT_FLAG_USER)) + return fault; + + vma = lock_vma_under_rcu(current->mm, vmf->real_address); + if (!vma) + return fault; + + if (arch_vma_access_error(vma, vmf)) { + vma_end_read(vma); + return fault; + } + + fault = handle_mm_fault(vma, vmf->real_address, + vmf->flags | FAULT_FLAG_VMA_LOCK, vmf->regs); + + if (!(fault
[PATCH rfc v2 10/10] loongarch: mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Kefeng Wang --- arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 37 +++-- 2 files changed, 32 insertions(+), 6 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 2b27b18a63af..6b821f621920 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -56,6 +56,7 @@ config LOONGARCH select ARCH_SUPPORTS_LTO_CLANG select ARCH_SUPPORTS_LTO_CLANG_THIN select ARCH_SUPPORTS_NUMA_BALANCING + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_QUEUED_RWLOCKS diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index 2a45e9f3a485..f7ac3a14bb06 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -142,6 +142,13 @@ static inline bool access_error(unsigned int flags, struct pt_regs *regs, return false; } +#ifdef CONFIG_PER_VMA_LOCK +bool arch_vma_access_error(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return access_error(vmf->flags, vmf->regs, vmf->real_address, vma); +} +#endif + /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate @@ -151,11 +158,15 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, unsigned long address) { int si_code = SEGV_MAPERR; - unsigned int flags = FAULT_FLAG_DEFAULT; struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; struct vm_area_struct *vma = NULL; vm_fault_t fault; + struct vm_fault vmf = { + .real_address = address, + .regs = regs, + .flags = FAULT_FLAG_DEFAULT, + }; if (kprobe_page_fault(regs, current->thread.trap_nr)) return; @@ -184,11 +195,24 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, goto bad_area_nosemaphore; if (user_mode(regs)) - flags |= FAULT_FLAG_USER; + vmf.flags |= FAULT_FLAG_USER; if (write) - flags |= FAULT_FLAG_WRITE; + vmf.flags |= FAULT_FLAG_WRITE; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + + fault = try_vma_locked_page_fault(); + if (fault == VM_FAULT_NONE) + goto retry; + if (!(fault & VM_FAULT_RETRY)) + goto done; + + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + no_context(regs, write, address); + return; + } + retry: vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) @@ -196,7 +220,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, si_code = SEGV_ACCERR; - if (access_error(flags, regs, vma)) + if (access_error(vmf.flags, regs, address, vma)) goto bad_area; /* @@ -204,7 +228,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags, regs); + fault = handle_mm_fault(vma, address, vmf.flags, regs); if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) @@ -217,7 +241,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, return; if (unlikely(fault & VM_FAULT_RETRY)) { - flags |= FAULT_FLAG_TRIED; + vmf.flags |= FAULT_FLAG_TRIED; /* * No need to mmap_read_unlock(mm) as we would @@ -229,6 +253,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, mmap_read_unlock(mm); +done: if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) do_out_of_memory(regs, write, address); -- 2.27.0
Re: [PATCH rfc -next 01/10] mm: add a generic VMA lock-based page fault handler
On 2023/7/14 9:52, Kefeng Wang wrote: On 2023/7/14 4:12, Suren Baghdasaryan wrote: On Thu, Jul 13, 2023 at 9:15 AM Matthew Wilcox wrote: +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret) +{ + struct vm_area_struct *vma; + vm_fault_t fault; On Thu, Jul 13, 2023 at 05:53:29PM +0800, Kefeng Wang wrote: +#define VM_LOCKED_FAULT_INIT(_name, _mm, _address, _fault_flags, _vm_flags, _regs, _fault_code) \ + _name.mm = _mm; \ + _name.address = _address; \ + _name.fault_flags = _fault_flags; \ + _name.vm_flags = _vm_flags; \ + _name.regs = _regs; \ + _name.fault_code = _fault_code More consolidated code is a good idea; no question. But I don't think this is the right way to do it. I agree it is not good enough, but the arch's vma check acess has different implementation, some use vm flags, some need fault code and regs, and some use both :( +int __weak arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf); This should be: #ifndef vma_check_access bool vma_check_access(struct vm_area_struct *vma, ) { return (vma->vm_flags & vm_flags) == 0; } #endif and then arches which want to do something different can just define vma_check_access. Ok, I could convert to use this way. +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret) +{ + struct vm_area_struct *vma; + vm_fault_t fault; Declaring the vmf in this function and then copying it back is just wrong. We need to declare vm_fault_t earlier (in the arch fault handler) and pass it in. Actually I passed the vm_fault_t *ret(in the arch fault handler), we could directly use *ret instead of a new local variable, and no copy. Did you mean to say "we need to declare vmf (struct vm_fault) earlier (in the arch fault handler) and pass it in." ? After recheck the code, I think Matthew' idea is 'declare vmf (struct vm_fault) earlier' like Suren said, not vm_fault_t, right? will try this, thanks. I don't think that creating struct vm_locked_fault is the right idea either. As mentioned above for vma check access, we need many arguments for a function, a new struct looks possible better, is there better solution or any suggestion? Thanks.
Re: [PATCH rfc -next 01/10] mm: add a generic VMA lock-based page fault handler
On 2023/7/14 4:12, Suren Baghdasaryan wrote: On Thu, Jul 13, 2023 at 9:15 AM Matthew Wilcox wrote: +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret) +{ + struct vm_area_struct *vma; + vm_fault_t fault; On Thu, Jul 13, 2023 at 05:53:29PM +0800, Kefeng Wang wrote: +#define VM_LOCKED_FAULT_INIT(_name, _mm, _address, _fault_flags, _vm_flags, _regs, _fault_code) \ + _name.mm= _mm; \ + _name.address = _address; \ + _name.fault_flags = _fault_flags; \ + _name.vm_flags = _vm_flags;\ + _name.regs = _regs;\ + _name.fault_code= _fault_code More consolidated code is a good idea; no question. But I don't think this is the right way to do it. I agree it is not good enough, but the arch's vma check acess has different implementation, some use vm flags, some need fault code and regs, and some use both :( +int __weak arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf); This should be: #ifndef vma_check_access bool vma_check_access(struct vm_area_struct *vma, ) { return (vma->vm_flags & vm_flags) == 0; } #endif and then arches which want to do something different can just define vma_check_access. Ok, I could convert to use this way. +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret) +{ + struct vm_area_struct *vma; + vm_fault_t fault; Declaring the vmf in this function and then copying it back is just wrong. We need to declare vm_fault_t earlier (in the arch fault handler) and pass it in. Actually I passed the vm_fault_t *ret(in the arch fault handler), we could directly use *ret instead of a new local variable, and no copy. Did you mean to say "we need to declare vmf (struct vm_fault) earlier (in the arch fault handler) and pass it in." ? I don't think that creating struct vm_locked_fault is the right idea either. As mentioned above for vma check access, we need many arguments for a function, a new struct looks possible better, is there better solution or any suggestion? Thanks. + if (!(vmlf->fault_flags & FAULT_FLAG_USER)) + return -EINVAL; + + vma = lock_vma_under_rcu(vmlf->mm, vmlf->address); + if (!vma) + return -EINVAL; + + if (arch_vma_check_access(vma, vmlf)) { + vma_end_read(vma); + return -EINVAL; + } + + fault = handle_mm_fault(vma, vmlf->address, + vmlf->fault_flags | FAULT_FLAG_VMA_LOCK, + vmlf->regs); + *ret = fault; + + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) + vma_end_read(vma); + + if ((fault & VM_FAULT_RETRY)) + count_vm_vma_lock_event(VMA_LOCK_RETRY); + else + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + + return 0; +} + #endif /* CONFIG_PER_VMA_LOCK */ #ifndef __PAGETABLE_P4D_FOLDED -- 2.27.0
Re: [PATCH rfc -next 00/10] mm: convert to generic VMA lock-based page fault
Please ignore this one... On 2023/7/13 17:51, Kefeng Wang wrote: Add a generic VMA lock-based page fault handler in mm core, and convert architectures to use it, which eliminate architectures's duplicated codes. With it, we can avoid multiple changes in architectures's code if we add new feature or bugfix. This fixes riscv missing change about commit 38b3aec8e8d2 "mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED", and in the end, we enable this feature on ARM32/Loongarch too. This is based on next-20230713, only built test(no loongarch compiler, so except loongarch). Kefeng Wang (10): mm: add a generic VMA lock-based page fault handler x86: mm: use try_vma_locked_page_fault() arm64: mm: use try_vma_locked_page_fault() s390: mm: use try_vma_locked_page_fault() powerpc: mm: use try_vma_locked_page_fault() riscv: mm: use try_vma_locked_page_fault() ARM: mm: try VMA lock-based page fault handling first loongarch: mm: cleanup __do_page_fault() loongarch: mm: add access_error() helper loongarch: mm: try VMA lock-based page fault handling first arch/arm/Kconfig | 1 + arch/arm/mm/fault.c | 15 ++- arch/arm64/mm/fault.c | 28 +++- arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 92 --- arch/powerpc/mm/fault.c | 54 ++- arch/riscv/mm/fault.c | 38 +++- arch/s390/mm/fault.c | 23 +++--- arch/x86/mm/fault.c | 39 +++-- include/linux/mm.h| 28 mm/memory.c | 42 ++ 11 files changed, 206 insertions(+), 155 deletions(-)
[PATCH rfc -next 07/10] ARM: mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Kefeng Wang --- arch/arm/Kconfig| 1 + arch/arm/mm/fault.c | 15 ++- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 1a6a6eb48a15..8b6d4507ccee 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -34,6 +34,7 @@ config ARM select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT if CPU_V7 select ARCH_SUPPORTS_ATOMIC_RMW select ARCH_SUPPORTS_HUGETLBFS if ARM_LPAE + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_MEMTEST diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index fef62e4a9edd..c44b83841e36 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -244,6 +244,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; unsigned long vm_flags = VM_ACCESS_FLAGS; + struct vm_locked_fault vmlf; if (kprobe_page_fault(regs, fsr)) return 0; @@ -278,6 +279,18 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + VM_LOCKED_FAULT_INIT(vmlf, mm, addr, flags, vm_flags, regs, fsr); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) + goto done; + + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + goto no_context; + return 0; + } + retry: vma = lock_mm_and_find_vma(mm, addr, regs); if (unlikely(!vma)) { @@ -316,7 +329,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) } mmap_read_unlock(mm); - +done: /* * Handle the "normal" case first - VM_FAULT_MAJOR */ -- 2.27.0
[PATCH rfc -next 04/10] s390: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/s390/mm/fault.c | 23 ++- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 40a71063949b..97e511690352 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -362,6 +362,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; + struct vm_locked_fault vmlf; enum fault_type type; unsigned long address; unsigned int flags; @@ -407,31 +408,19 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) access = VM_WRITE; if (access == VM_WRITE) flags |= FAULT_FLAG_WRITE; -#ifdef CONFIG_PER_VMA_LOCK - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - if (!(vma->vm_flags & access)) { - vma_end_read(vma); + + VM_LOCKED_FAULT_INIT(vmlf, mm, address, flags, access, regs, 0); + if (try_vma_locked_page_fault(, )) goto lock_mmap; - } - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + else if (!(fault | VM_FAULT_RETRY)) goto out; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); + /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; goto out; } lock_mmap: -#endif /* CONFIG_PER_VMA_LOCK */ mmap_read_lock(mm); gmap = NULL; -- 2.27.0
[PATCH rfc -next 03/10] arm64: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/arm64/mm/fault.c | 28 +--- 1 file changed, 5 insertions(+), 23 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index b8c80f7b8a5f..614bb53fc1bc 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -537,6 +537,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, unsigned int mm_flags = FAULT_FLAG_DEFAULT; unsigned long addr = untagged_addr(far); struct vm_area_struct *vma; + struct vm_locked_fault vmlf; if (kprobe_page_fault(regs, esr)) return 0; @@ -587,27 +588,11 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); -#ifdef CONFIG_PER_VMA_LOCK - if (!(mm_flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, addr); - if (!vma) - goto lock_mmap; - - if (!(vma->vm_flags & vm_flags)) { - vma_end_read(vma); - goto lock_mmap; - } - fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + VM_LOCKED_FAULT_INIT(vmlf, mm, addr, mm_flags, vm_flags, regs, esr); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -615,9 +600,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto no_context; return 0; } -lock_mmap: -#endif /* CONFIG_PER_VMA_LOCK */ - retry: vma = lock_mm_and_find_vma(mm, addr, regs); if (unlikely(!vma)) { -- 2.27.0
[PATCH rfc -next 01/10] mm: add a generic VMA lock-based page fault handler
There are more and more architectures enabled ARCH_SUPPORTS_PER_VMA_LOCK, eg, x86, arm64, powerpc and s390, and riscv, those implementation are very similar which results in some duplicated codes, let's add a generic VMA lock-based page fault handler to eliminate them, and which also make it easy to support this feature on new architectures. Signed-off-by: Kefeng Wang --- include/linux/mm.h | 28 mm/memory.c| 42 ++ 2 files changed, 70 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index c7886784832b..cba1b7b19c9d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -633,6 +633,15 @@ static inline void vma_numab_state_init(struct vm_area_struct *vma) {} static inline void vma_numab_state_free(struct vm_area_struct *vma) {} #endif /* CONFIG_NUMA_BALANCING */ +struct vm_locked_fault { + struct mm_struct *mm; + unsigned long address; + unsigned int fault_flags; + unsigned long vm_flags; + struct pt_regs *regs; + unsigned long fault_code; +}; + #ifdef CONFIG_PER_VMA_LOCK /* * Try to read-lock a vma. The function is allowed to occasionally yield false @@ -733,6 +742,19 @@ static inline void assert_fault_locked(struct vm_fault *vmf) struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, unsigned long address); +#define VM_LOCKED_FAULT_INIT(_name, _mm, _address, _fault_flags, _vm_flags, _regs, _fault_code) \ + _name.mm= _mm; \ + _name.address = _address; \ + _name.fault_flags = _fault_flags; \ + _name.vm_flags = _vm_flags;\ + _name.regs = _regs;\ + _name.fault_code= _fault_code + +int __weak arch_vma_check_access(struct vm_area_struct *vma, +struct vm_locked_fault *vmlf); + +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret); + #else /* CONFIG_PER_VMA_LOCK */ static inline bool vma_start_read(struct vm_area_struct *vma) @@ -742,6 +764,12 @@ static inline void vma_start_write(struct vm_area_struct *vma) {} static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) {} +#define VM_LOCKED_FAULT_INIT(_name, _mm, _address, _fault_flags, _vm_flags, _regs, _fault_code) +static inline int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, + vm_fault_t *ret) +{ + return -EINVAL; +} static inline void release_fault_lock(struct vm_fault *vmf) { diff --git a/mm/memory.c b/mm/memory.c index ad790394963a..d3f5d1270e7a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5449,6 +5449,48 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, count_vm_vma_lock_event(VMA_LOCK_ABORT); return NULL; } + +int __weak arch_vma_check_access(struct vm_area_struct *vma, +struct vm_locked_fault *vmlf) +{ + if (!(vma->vm_flags & vmlf->vm_flags)) + return -EINVAL; + return 0; +} + +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t *ret) +{ + struct vm_area_struct *vma; + vm_fault_t fault; + + if (!(vmlf->fault_flags & FAULT_FLAG_USER)) + return -EINVAL; + + vma = lock_vma_under_rcu(vmlf->mm, vmlf->address); + if (!vma) + return -EINVAL; + + if (arch_vma_check_access(vma, vmlf)) { + vma_end_read(vma); + return -EINVAL; + } + + fault = handle_mm_fault(vma, vmlf->address, + vmlf->fault_flags | FAULT_FLAG_VMA_LOCK, + vmlf->regs); + *ret = fault; + + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) + vma_end_read(vma); + + if ((fault & VM_FAULT_RETRY)) + count_vm_vma_lock_event(VMA_LOCK_RETRY); + else + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + + return 0; +} + #endif /* CONFIG_PER_VMA_LOCK */ #ifndef __PAGETABLE_P4D_FOLDED -- 2.27.0
[PATCH rfc -next 00/10] mm: convert to generic VMA lock-based page fault
Add a generic VMA lock-based page fault handler in mm core, and convert architectures to use it, which eliminate architectures's duplicated codes. With it, we can avoid multiple changes in architectures's code if we add new feature or bugfix. This fixes riscv missing change about commit 38b3aec8e8d2 "mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED", and in the end, we enable this feature on ARM32/Loongarch too. This is based on next-20230713, only built test(no loongarch compiler, so except loongarch). Kefeng Wang (10): mm: add a generic VMA lock-based page fault handler x86: mm: use try_vma_locked_page_fault() arm64: mm: use try_vma_locked_page_fault() s390: mm: use try_vma_locked_page_fault() powerpc: mm: use try_vma_locked_page_fault() riscv: mm: use try_vma_locked_page_fault() ARM: mm: try VMA lock-based page fault handling first loongarch: mm: cleanup __do_page_fault() loongarch: mm: add access_error() helper loongarch: mm: try VMA lock-based page fault handling first arch/arm/Kconfig | 1 + arch/arm/mm/fault.c | 15 ++- arch/arm64/mm/fault.c | 28 +++- arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 92 --- arch/powerpc/mm/fault.c | 54 ++- arch/riscv/mm/fault.c | 38 +++- arch/s390/mm/fault.c | 23 +++--- arch/x86/mm/fault.c | 39 +++-- include/linux/mm.h| 28 mm/memory.c | 42 ++ 11 files changed, 206 insertions(+), 155 deletions(-) -- 2.27.0
[PATCH rfc -next 10/10] loongarch: mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Kefeng Wang --- arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 26 ++ 2 files changed, 27 insertions(+) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 397203e18800..afb0ccabab97 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -53,6 +53,7 @@ config LOONGARCH select ARCH_SUPPORTS_LTO_CLANG select ARCH_SUPPORTS_LTO_CLANG_THIN select ARCH_SUPPORTS_NUMA_BALANCING + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_QUEUED_RWLOCKS diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index cde2ea0119fa..7e54bc48813e 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -136,6 +136,17 @@ static inline bool access_error(unsigned int flags, struct pt_regs *regs, return false; } +#ifdef CONFIG_PER_VMA_LOCK +int arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf) +{ + if (unlikely(access_error(vmlf->fault_flags, vmlf->regs, vmlf->address, +vma))) + return -EINVAL; + return 0; +} +#endif + /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate @@ -149,6 +160,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; struct vm_area_struct *vma = NULL; + struct vm_locked_fault vmlf; vm_fault_t fault; if (kprobe_page_fault(regs, current->thread.trap_nr)) @@ -183,6 +195,19 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, flags |= FAULT_FLAG_WRITE; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + + VM_LOCKED_FAULT_INIT(vmlf, mm, address, flags, 0, regs, 0); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) + goto done; + + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + no_context(regs, address); + return; + } + retry: vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) @@ -223,6 +248,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, mmap_read_unlock(mm); +done: if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { do_out_of_memory(regs, address); -- 2.27.0
[PATCH rfc -next 09/10] loongarch: mm: add access_error() helper
Add access_error() to check whether vma could be accessible or not, which will be used __do_page_fault() and later vma locked based page fault. Signed-off-by: Kefeng Wang --- arch/loongarch/mm/fault.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index 03d06ee184da..cde2ea0119fa 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -120,6 +120,22 @@ static void __kprobes do_sigsegv(struct pt_regs *regs, force_sig_fault(SIGSEGV, si_code, (void __user *)address); } +static inline bool access_error(unsigned int flags, struct pt_regs *regs, + unsigned long addr, struct vm_area_struct *vma) +{ + if (flags & FAULT_FLAG_WRITE) { + if (!(vma->vm_flags & VM_WRITE)) + return true; + } else { + if (!(vma->vm_flags & VM_READ) && addr != exception_era(regs)) + return true; + if (!(vma->vm_flags & VM_EXEC) && addr == exception_era(regs)) + return true; + } + + return false; +} + /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate @@ -163,6 +179,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + if (write) + flags |= FAULT_FLAG_WRITE; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: @@ -172,16 +190,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, si_code = SEGV_ACCERR; - if (write) { - flags |= FAULT_FLAG_WRITE; - if (!(vma->vm_flags & VM_WRITE)) - goto bad_area; - } else { - if (!(vma->vm_flags & VM_READ) && address != exception_era(regs)) - goto bad_area; - if (!(vma->vm_flags & VM_EXEC) && address == exception_era(regs)) - goto bad_area; - } + if (access_error(flags, regs, vma)) + goto bad_area; /* * If for any reason at all we couldn't handle the fault, -- 2.27.0
[PATCH rfc -next 08/10] loongarch: mm: cleanup __do_page_fault()
Cleanup __do_page_fault() by reuse bad_area_nosemaphore and bad_area label. Signed-off-by: Kefeng Wang --- arch/loongarch/mm/fault.c | 36 +++- 1 file changed, 11 insertions(+), 25 deletions(-) diff --git a/arch/loongarch/mm/fault.c b/arch/loongarch/mm/fault.c index da5b6d518cdb..03d06ee184da 100644 --- a/arch/loongarch/mm/fault.c +++ b/arch/loongarch/mm/fault.c @@ -151,18 +151,15 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, if (!user_mode(regs)) no_context(regs, address); else - do_sigsegv(regs, write, address, si_code); - return; + goto bad_area_nosemaphore; } /* * If we're in an interrupt or have no user * context, we must not take the fault.. */ - if (faulthandler_disabled() || !mm) { - do_sigsegv(regs, write, address, si_code); - return; - } + if (faulthandler_disabled() || !mm) + goto bad_area_nosemaphore; if (user_mode(regs)) flags |= FAULT_FLAG_USER; @@ -172,23 +169,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, vma = lock_mm_and_find_vma(mm, address, regs); if (unlikely(!vma)) goto bad_area_nosemaphore; - goto good_area; - -/* - * Something tried to access memory that isn't in our memory map.. - * Fix it, but check if it's kernel or user first.. - */ -bad_area: - mmap_read_unlock(mm); -bad_area_nosemaphore: - do_sigsegv(regs, write, address, si_code); - return; -/* - * Ok, we have a good vm_area for this memory access, so - * we can handle it.. - */ -good_area: si_code = SEGV_ACCERR; if (write) { @@ -229,14 +210,15 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, */ goto retry; } + + mmap_read_unlock(mm); + if (unlikely(fault & VM_FAULT_ERROR)) { - mmap_read_unlock(mm); if (fault & VM_FAULT_OOM) { do_out_of_memory(regs, address); return; } else if (fault & VM_FAULT_SIGSEGV) { - do_sigsegv(regs, write, address, si_code); - return; + goto bad_area_nosemaphore; } else if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) { do_sigbus(regs, write, address, si_code); return; @@ -244,7 +226,11 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, BUG(); } + return; +bad_area: mmap_read_unlock(mm); +bad_area_nosemaphore: + do_sigsegv(regs, write, address, si_code); } asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, -- 2.27.0
[PATCH rfc -next 05/10] powerpc: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/powerpc/mm/fault.c | 54 + 1 file changed, 22 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 82954d0e6906..dd4832a3cf10 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -391,6 +391,23 @@ static int page_fault_is_bad(unsigned long err) #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_32S) #endif +#ifdef CONFIG_PER_VMA_LOCK +int arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf) +{ + int is_exec = TRAP(vmlf->regs) == INTERRUPT_INST_STORAGE; + int is_write = page_fault_is_write(vmlf->fault_code); + + if (unlikely(access_pkey_error(is_write, is_exec, + (vmlf->fault_code & DSISR_KEYFAULT), vma))) + return -EINVAL; + + if (unlikely(access_error(is_write, is_exec, vma))) + return -EINVAL; + return 0; +} +#endif + /* * For 600- and 800-family processors, the error_code parameter is DSISR * for a data fault, SRR1 for an instruction fault. @@ -413,6 +430,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, int is_write = page_fault_is_write(error_code); vm_fault_t fault, major = 0; bool kprobe_fault = kprobe_page_fault(regs, 11); + struct vm_locked_fault vmlf; if (unlikely(debugger_fault_handler(regs) || kprobe_fault)) return 0; @@ -469,41 +487,15 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (is_exec) flags |= FAULT_FLAG_INSTRUCTION; -#ifdef CONFIG_PER_VMA_LOCK - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - - if (unlikely(access_pkey_error(is_write, is_exec, - (error_code & DSISR_KEYFAULT), vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - if (unlikely(access_error(is_write, is_exec, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + VM_LOCKED_FAULT_INIT(vmlf, mm, address, flags, 0, regs, error_code); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); if (fault_signal_pending(fault, regs)) return user_mode(regs) ? 0 : SIGBUS; -lock_mmap: -#endif /* CONFIG_PER_VMA_LOCK */ - /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -552,9 +544,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, mmap_read_unlock(current->mm); -#ifdef CONFIG_PER_VMA_LOCK done: -#endif if (unlikely(fault & VM_FAULT_ERROR)) return mm_fault_error(regs, address, fault); -- 2.27.0
[PATCH rfc -next 06/10] riscv: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/riscv/mm/fault.c | 38 +++--- 1 file changed, 15 insertions(+), 23 deletions(-) diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 6ea2cce4cc17..13bc60370b5c 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -215,6 +215,16 @@ static inline bool access_error(unsigned long cause, struct vm_area_struct *vma) return false; } +#ifdef CONFIG_PER_VMA_LOCK +int arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf) +{ + if (unlikely(access_error(vmlf->fault_code, vma))) + return -EINVAL; + return 0; +} +#endif + /* * This routine handles page faults. It determines the address and the * problem, and then passes it off to one of the appropriate routines. @@ -228,6 +238,7 @@ void handle_page_fault(struct pt_regs *regs) unsigned int flags = FAULT_FLAG_DEFAULT; int code = SEGV_MAPERR; vm_fault_t fault; + struct vm_locked_fault vmlf; cause = regs->cause; addr = regs->badaddr; @@ -283,35 +294,18 @@ void handle_page_fault(struct pt_regs *regs) flags |= FAULT_FLAG_WRITE; else if (cause == EXC_INST_PAGE_FAULT) flags |= FAULT_FLAG_INSTRUCTION; -#ifdef CONFIG_PER_VMA_LOCK - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - vma = lock_vma_under_rcu(mm, addr); - if (!vma) - goto lock_mmap; - - if (unlikely(access_error(cause, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - - fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + VM_LOCKED_FAULT_INIT(vmlf, mm, addr, flags, 0, regs, cause); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) no_context(regs, addr); return; } -lock_mmap: -#endif /* CONFIG_PER_VMA_LOCK */ retry: vma = lock_mm_and_find_vma(mm, addr, regs); @@ -368,9 +362,7 @@ void handle_page_fault(struct pt_regs *regs) mmap_read_unlock(mm); -#ifdef CONFIG_PER_VMA_LOCK done: -#endif if (unlikely(fault & VM_FAULT_ERROR)) { tsk->thread.bad_cause = cause; mm_fault_error(regs, addr, fault); -- 2.27.0
[PATCH rfc -next 02/10] x86: mm: use try_vma_locked_page_fault()
Use new try_vma_locked_page_fault() helper to simplify code. No functional change intended. Signed-off-by: Kefeng Wang --- arch/x86/mm/fault.c | 39 +++ 1 file changed, 15 insertions(+), 24 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 56b4f9faf8c4..3f3b8b0a87de 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1213,6 +1213,16 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code, } NOKPROBE_SYMBOL(do_kern_addr_fault); +#ifdef CONFIG_PER_VMA_LOCK +int arch_vma_check_access(struct vm_area_struct *vma, + struct vm_locked_fault *vmlf) +{ + if (unlikely(access_error(vmlf->fault_code, vma))) + return -EINVAL; + return 0; +} +#endif + /* * Handle faults in the user portion of the address space. Nothing in here * should check X86_PF_USER without a specific justification: for almost @@ -1231,6 +1241,7 @@ void do_user_addr_fault(struct pt_regs *regs, struct mm_struct *mm; vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; + struct vm_locked_fault vmlf; tsk = current; mm = tsk->mm; @@ -1328,27 +1339,11 @@ void do_user_addr_fault(struct pt_regs *regs, } #endif -#ifdef CONFIG_PER_VMA_LOCK - if (!(flags & FAULT_FLAG_USER)) - goto lock_mmap; - - vma = lock_vma_under_rcu(mm, address); - if (!vma) - goto lock_mmap; - - if (unlikely(access_error(error_code, vma))) { - vma_end_read(vma); - goto lock_mmap; - } - fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); - if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) - vma_end_read(vma); - - if (!(fault & VM_FAULT_RETRY)) { - count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + VM_LOCKED_FAULT_INIT(vmlf, mm, address, flags, 0, regs, error_code); + if (try_vma_locked_page_fault(, )) + goto retry; + else if (!(fault | VM_FAULT_RETRY)) goto done; - } - count_vm_vma_lock_event(VMA_LOCK_RETRY); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -1358,8 +1353,6 @@ void do_user_addr_fault(struct pt_regs *regs, ARCH_DEFAULT_PKEY); return; } -lock_mmap: -#endif /* CONFIG_PER_VMA_LOCK */ retry: vma = lock_mm_and_find_vma(mm, address, regs); @@ -1419,9 +1412,7 @@ void do_user_addr_fault(struct pt_regs *regs, } mmap_read_unlock(mm); -#ifdef CONFIG_PER_VMA_LOCK done: -#endif if (likely(!(fault & VM_FAULT_ERROR))) return; -- 2.27.0
[PATCH rfc -next 00/10] mm: convert to generic VMA lock-based page fault
Add a generic VMA lock-based page fault handler in mm core, and convert architectures to use it, which eliminate architectures's duplicated codes. With it, we can avoid multiple changes in architectures's code if we add new feature or bugfix. This fixes riscv missing change about commit 38b3aec8e8d2 "mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED", and in the end, we enable this feature on ARM32/Loongarch too. This is based on next-20230713, only built test(no loongarch compiler, so except loongarch). Kefeng Wang (10): mm: add a generic VMA lock-based page fault handler x86: mm: use try_vma_locked_page_fault() arm64: mm: use try_vma_locked_page_fault() s390: mm: use try_vma_locked_page_fault() powerpc: mm: use try_vma_locked_page_fault() riscv: mm: use try_vma_locked_page_fault() ARM: mm: try VMA lock-based page fault handling first loongarch: mm: cleanup __do_page_fault() loongarch: mm: add access_error() helper loongarch: mm: try VMA lock-based page fault handling first arch/arm/Kconfig | 1 + arch/arm/mm/fault.c | 15 ++- arch/arm64/mm/fault.c | 28 +++- arch/loongarch/Kconfig| 1 + arch/loongarch/mm/fault.c | 92 --- arch/powerpc/mm/fault.c | 54 ++- arch/riscv/mm/fault.c | 38 +++- arch/s390/mm/fault.c | 23 +++--- arch/x86/mm/fault.c | 39 +++-- include/linux/mm.h| 28 mm/memory.c | 42 ++ 11 files changed, 206 insertions(+), 155 deletions(-) -- 2.27.0
[PATCH v2 1/2] mm: remove arguments of show_mem()
All callers of show_mem() pass 0 and NULL, so we can remove the two arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem(). Signed-off-by: Kefeng Wang --- v2: update commit log arch/powerpc/xmon/xmon.c | 2 +- drivers/tty/sysrq.c | 2 +- drivers/tty/vt/keyboard.c | 2 +- include/linux/mm.h| 4 ++-- init/initramfs.c | 2 +- kernel/panic.c| 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index fae747cc57d2..ee17270d35d0 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -1084,7 +1084,7 @@ cmds(struct pt_regs *excp) memzcan(); break; case 'i': - show_mem(0, NULL); + show_mem(); break; default: termch = cmd; diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index b6e70c5cfa17..e1df63a88aac 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -342,7 +342,7 @@ static const struct sysrq_key_op sysrq_ftrace_dump_op = { static void sysrq_handle_showmem(int key) { - show_mem(0, NULL); + show_mem(); } static const struct sysrq_key_op sysrq_showmem_op = { .handler= sysrq_handle_showmem, diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c index be8313cdbac3..358f216c6cd6 100644 --- a/drivers/tty/vt/keyboard.c +++ b/drivers/tty/vt/keyboard.c @@ -606,7 +606,7 @@ static void fn_scroll_back(struct vc_data *vc) static void fn_show_mem(struct vc_data *vc) { - show_mem(0, NULL); + show_mem(); } static void fn_show_state(struct vc_data *vc) diff --git a/include/linux/mm.h b/include/linux/mm.h index eef34f6a0351..ddb140e14f3a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3047,9 +3047,9 @@ extern void mem_init(void); extern void __init mmap_init(void); extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); -static inline void show_mem(unsigned int flags, nodemask_t *nodemask) +static inline void show_mem(void) { - __show_mem(flags, nodemask, MAX_NR_ZONES - 1); + __show_mem(0, NULL, MAX_NR_ZONES - 1); } extern long si_mem_available(void); extern void si_meminfo(struct sysinfo * val); diff --git a/init/initramfs.c b/init/initramfs.c index e7a01c2ccd1b..8d0fd946cdd2 100644 --- a/init/initramfs.c +++ b/init/initramfs.c @@ -61,7 +61,7 @@ static void __init error(char *x) } #define panic_show_mem(fmt, ...) \ - ({ show_mem(0, NULL); panic(fmt, ##__VA_ARGS__); }) + ({ show_mem(); panic(fmt, ##__VA_ARGS__); }) /* link hash */ diff --git a/kernel/panic.c b/kernel/panic.c index 10effe40a3fa..07239d4ad81e 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -216,7 +216,7 @@ static void panic_print_sys_info(bool console_flush) show_state(); if (panic_print & PANIC_PRINT_MEM_INFO) - show_mem(0, NULL); + show_mem(); if (panic_print & PANIC_PRINT_TIMER_INFO) sysrq_timer_list_show(); -- 2.41.0
[PATCH v2 2/2] mm: make show_free_areas() static
All callers of show_free_areas() pass 0 and NULL, so we can directly use show_mem() instead of show_free_areas(0, NULL), which could make show_free_areas() a static function. Signed-off-by: Kefeng Wang --- v2: update commit log and fix a missing show_free_areas() conversion arch/sparc/kernel/setup_32.c | 2 +- include/linux/mm.h | 12 mm/internal.h| 6 ++ mm/nommu.c | 8 mm/show_mem.c| 4 ++-- 5 files changed, 13 insertions(+), 19 deletions(-) diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c index 1adf5c1c16b8..34ef7febf0d5 100644 --- a/arch/sparc/kernel/setup_32.c +++ b/arch/sparc/kernel/setup_32.c @@ -83,7 +83,7 @@ static void prom_sync_me(void) "nop\n\t" : : "r" ()); prom_printf("PROM SYNC COMMAND...\n"); - show_free_areas(0, NULL); + show_mem(); if (!is_idle_task(current)) { local_irq_enable(); ksys_sync(); diff --git a/include/linux/mm.h b/include/linux/mm.h index ddb140e14f3a..0a1314a3ffae 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2218,18 +2218,6 @@ extern void pagefault_out_of_memory(void); #define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1)) #define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1)) -/* - * Flags passed to show_mem() and show_free_areas() to suppress output in - * various contexts. - */ -#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ - -extern void __show_free_areas(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); -static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodemask) -{ - __show_free_areas(flags, nodemask, MAX_NR_ZONES - 1); -} - /* * Parameter block passed down to zap_pte_range in exceptional cases. */ diff --git a/mm/internal.h b/mm/internal.h index a7d9e980429a..721ed07d7fd6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -61,6 +61,12 @@ void page_writeback_init(void); #define COMPOUND_MAPPED0x80 #define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1) +/* + * Flags passed to __show_mem() and show_free_areas() to suppress output in + * various contexts. + */ +#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ + /* * How many individual pages have an elevated _mapcount. Excludes * the folio's entire_mapcount. diff --git a/mm/nommu.c b/mm/nommu.c index f670d9979a26..bff51d8ec66e 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -990,7 +990,7 @@ static int do_mmap_private(struct vm_area_struct *vma, enomem: pr_err("Allocation of length %lu from process %d (%s) failed\n", len, current->pid, current->comm); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; } @@ -1223,20 +1223,20 @@ unsigned long do_mmap(struct file *file, kmem_cache_free(vm_region_jar, region); pr_warn("Allocation of vma for %lu byte allocation from process %d failed\n", len, current->pid); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; error_getting_region: pr_warn("Allocation of vm region for %lu byte allocation from process %d failed\n", len, current->pid); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; error_vma_iter_prealloc: kmem_cache_free(vm_region_jar, region); vm_area_free(vma); pr_warn("Allocation of vma tree for process %d failed\n", current->pid); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; } diff --git a/mm/show_mem.c b/mm/show_mem.c index 01f8e9905817..09c7d036d49e 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -186,7 +186,7 @@ static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx) * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's * cpuset. */ -void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) +static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) { unsigned long free_pcp = 0; int cpu, nid; @@ -406,7 +406,7 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) struct zone *zone; printk("Mem-Info:\n"); - __show_free_areas(filter, nodemask, max_zone_idx); + show_free_areas(filter, nodemask, max_zone_idx); for_each_populated_zone(zone) { -- 2.41.0
Re: [PATCH 2/2] mm: make show_free_areas() static
Thanks, On 2023/6/29 23:00, kernel test robot wrote: Hi Kefeng, kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] url: https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-make-show_free_areas-static/20230629-182958 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20230629104357.35455-2-wangkefeng.wang%40huawei.com patch subject: [PATCH 2/2] mm: make show_free_areas() static config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20230629/202306292240.rj0dlhfi-...@intel.com/config) compiler: sh4-linux-gcc (GCC) 12.3.0 reproduce: (https://download.01.org/0day-ci/archive/20230629/202306292240.rj0dlhfi-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202306292240.rj0dlhfi-...@intel.com/ All errors (new ones prefixed by >>): mm/nommu.c: In function 'do_mmap': mm/nommu.c:1239:9: error: implicit declaration of function 'show_free_areas' [-Werror=implicit-function-declaration] 1239 | show_free_areas(0, NULL); | ^~~ cc1: some warnings being treated as errors Missing this one in patch-1, will update vim +/show_free_areas +1239 mm/nommu.c
Re: [PATCH 1/2] mm: remove arguments of show_mem()
On 2023/6/29 23:17, Matthew Wilcox wrote: On Thu, Jun 29, 2023 at 06:43:56PM +0800, Kefeng Wang wrote: Directly call __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem() to remove the arguments of show_mem(). Do you mean, "All callers of show_mem() pass 0 and NULL, so we can remove the two arguments"? Yes, will update with above to make it clear, thanks Matthew.
[PATCH 1/2] mm: remove arguments of show_mem()
Directly call __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem() to remove the arguments of show_mem(). Signed-off-by: Kefeng Wang --- arch/powerpc/xmon/xmon.c | 2 +- drivers/tty/sysrq.c | 2 +- drivers/tty/vt/keyboard.c | 2 +- include/linux/mm.h| 4 ++-- init/initramfs.c | 2 +- kernel/panic.c| 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index fae747cc57d2..ee17270d35d0 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -1084,7 +1084,7 @@ cmds(struct pt_regs *excp) memzcan(); break; case 'i': - show_mem(0, NULL); + show_mem(); break; default: termch = cmd; diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index b6e70c5cfa17..e1df63a88aac 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -342,7 +342,7 @@ static const struct sysrq_key_op sysrq_ftrace_dump_op = { static void sysrq_handle_showmem(int key) { - show_mem(0, NULL); + show_mem(); } static const struct sysrq_key_op sysrq_showmem_op = { .handler= sysrq_handle_showmem, diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c index be8313cdbac3..358f216c6cd6 100644 --- a/drivers/tty/vt/keyboard.c +++ b/drivers/tty/vt/keyboard.c @@ -606,7 +606,7 @@ static void fn_scroll_back(struct vc_data *vc) static void fn_show_mem(struct vc_data *vc) { - show_mem(0, NULL); + show_mem(); } static void fn_show_state(struct vc_data *vc) diff --git a/include/linux/mm.h b/include/linux/mm.h index eef34f6a0351..ddb140e14f3a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3047,9 +3047,9 @@ extern void mem_init(void); extern void __init mmap_init(void); extern void __show_mem(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); -static inline void show_mem(unsigned int flags, nodemask_t *nodemask) +static inline void show_mem(void) { - __show_mem(flags, nodemask, MAX_NR_ZONES - 1); + __show_mem(0, NULL, MAX_NR_ZONES - 1); } extern long si_mem_available(void); extern void si_meminfo(struct sysinfo * val); diff --git a/init/initramfs.c b/init/initramfs.c index e7a01c2ccd1b..8d0fd946cdd2 100644 --- a/init/initramfs.c +++ b/init/initramfs.c @@ -61,7 +61,7 @@ static void __init error(char *x) } #define panic_show_mem(fmt, ...) \ - ({ show_mem(0, NULL); panic(fmt, ##__VA_ARGS__); }) + ({ show_mem(); panic(fmt, ##__VA_ARGS__); }) /* link hash */ diff --git a/kernel/panic.c b/kernel/panic.c index 10effe40a3fa..07239d4ad81e 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -216,7 +216,7 @@ static void panic_print_sys_info(bool console_flush) show_state(); if (panic_print & PANIC_PRINT_MEM_INFO) - show_mem(0, NULL); + show_mem(); if (panic_print & PANIC_PRINT_TIMER_INFO) sysrq_timer_list_show(); -- 2.41.0
[PATCH 2/2] mm: make show_free_areas() static
Directly use show_mem() instead of show_free_areas(0, NULL), then make show_free_areas() a static function. Signed-off-by: Kefeng Wang --- arch/sparc/kernel/setup_32.c | 2 +- include/linux/mm.h | 12 mm/internal.h| 6 ++ mm/nommu.c | 6 +++--- mm/show_mem.c| 4 ++-- 5 files changed, 12 insertions(+), 18 deletions(-) diff --git a/arch/sparc/kernel/setup_32.c b/arch/sparc/kernel/setup_32.c index 1adf5c1c16b8..34ef7febf0d5 100644 --- a/arch/sparc/kernel/setup_32.c +++ b/arch/sparc/kernel/setup_32.c @@ -83,7 +83,7 @@ static void prom_sync_me(void) "nop\n\t" : : "r" ()); prom_printf("PROM SYNC COMMAND...\n"); - show_free_areas(0, NULL); + show_mem(); if (!is_idle_task(current)) { local_irq_enable(); ksys_sync(); diff --git a/include/linux/mm.h b/include/linux/mm.h index ddb140e14f3a..0a1314a3ffae 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2218,18 +2218,6 @@ extern void pagefault_out_of_memory(void); #define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1)) #define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1)) -/* - * Flags passed to show_mem() and show_free_areas() to suppress output in - * various contexts. - */ -#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ - -extern void __show_free_areas(unsigned int flags, nodemask_t *nodemask, int max_zone_idx); -static void __maybe_unused show_free_areas(unsigned int flags, nodemask_t *nodemask) -{ - __show_free_areas(flags, nodemask, MAX_NR_ZONES - 1); -} - /* * Parameter block passed down to zap_pte_range in exceptional cases. */ diff --git a/mm/internal.h b/mm/internal.h index a7d9e980429a..721ed07d7fd6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -61,6 +61,12 @@ void page_writeback_init(void); #define COMPOUND_MAPPED0x80 #define FOLIO_PAGES_MAPPED (COMPOUND_MAPPED - 1) +/* + * Flags passed to __show_mem() and show_free_areas() to suppress output in + * various contexts. + */ +#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */ + /* * How many individual pages have an elevated _mapcount. Excludes * the folio's entire_mapcount. diff --git a/mm/nommu.c b/mm/nommu.c index f670d9979a26..5b179234ce89 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -990,7 +990,7 @@ static int do_mmap_private(struct vm_area_struct *vma, enomem: pr_err("Allocation of length %lu from process %d (%s) failed\n", len, current->pid, current->comm); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; } @@ -1223,13 +1223,13 @@ unsigned long do_mmap(struct file *file, kmem_cache_free(vm_region_jar, region); pr_warn("Allocation of vma for %lu byte allocation from process %d failed\n", len, current->pid); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; error_getting_region: pr_warn("Allocation of vm region for %lu byte allocation from process %d failed\n", len, current->pid); - show_free_areas(0, NULL); + show_mem(); return -ENOMEM; error_vma_iter_prealloc: diff --git a/mm/show_mem.c b/mm/show_mem.c index 01f8e9905817..09c7d036d49e 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -186,7 +186,7 @@ static bool node_has_managed_zones(pg_data_t *pgdat, int max_zone_idx) * SHOW_MEM_FILTER_NODES: suppress nodes that are not allowed by current's * cpuset. */ -void __show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) +static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) { unsigned long free_pcp = 0; int cpu, nid; @@ -406,7 +406,7 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx) struct zone *zone; printk("Mem-Info:\n"); - __show_free_areas(filter, nodemask, max_zone_idx); + show_free_areas(filter, nodemask, max_zone_idx); for_each_populated_zone(zone) { -- 2.41.0
Re: [PATCH v3 03/14] arm64: reword ARCH_FORCE_MAX_ORDER prompt and help text
On 2023/3/25 14:08, Mike Rapoport wrote: From: "Mike Rapoport (IBM)" The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to describe this configuration option. Update both to actually describe what this option does. Acked-by: Kirill A. Shutemov Reviewed-by: Zi Yan Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Kefeng Wang --- arch/arm64/Kconfig | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7324032af859..cc11cdcf5a00 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1487,24 +1487,24 @@ config XEN # 16K | 27 | 14 | 13| 11 | # 64K | 29 | 16 | 13| 13 | config ARCH_FORCE_MAX_ORDER - int "Maximum zone order" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES) + int "Order of maximal physically contiguous allocations" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES) default "13" if ARM64_64K_PAGES default "11" if ARM64_16K_PAGES default "10" help - The kernel memory allocator divides physically contiguous memory - blocks into "zones", where each zone is a power of two number of - pages. This option selects the largest power of two that the kernel - keeps in the memory allocator. If you need to allocate very large - blocks of physically contiguous memory, then you may need to - increase this value. + The kernel page allocator limits the size of maximal physically + contiguous allocations. The limit is called MAX_ORDER and it + defines the maximal power of two of number of pages that can be + allocated as a single contiguous block. This option allows + overriding the default setting when ability to allocate very + large blocks of physically contiguous memory is required. - We make sure that we can allocate up to a HugePage size for each configuration. - Hence we have : - MAX_ORDER = PMD_SHIFT - PAGE_SHIFT => PAGE_SHIFT - 3 + The maximal size of allocation cannot exceed the size of the + section, so the value of MAX_ORDER should satisfy - However for 4K, we choose a higher default value, 10 as opposed to 9, giving us - 4M allocations matching the default size used by generic code. + MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS + + Don't change if unsure. config UNMAP_KERNEL_AT_EL0 bool "Unmap kernel when running in userspace (aka \"KAISER\")" if EXPERT
Re: [PATCH v3 02/14] arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER
On 2023/3/25 14:08, Mike Rapoport wrote: From: "Mike Rapoport (IBM)" It is not a good idea to change fundamental parameters of core memory management. Having predefined ranges suggests that the values within those ranges are sensible, but one has to *really* understand implications of changing MAX_ORDER before actually amending it and ranges don't help here. Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt visible only if EXPERT=y Acked-by: Kirill A. Shutemov Reviewed-by: Zi Yan Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Kefeng Wang --- arch/arm64/Kconfig | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e60baf7859d1..7324032af859 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1487,11 +1487,9 @@ config XEN # 16K | 27 | 14 | 13| 11 | # 64K | 29 | 16 | 13| 13 | config ARCH_FORCE_MAX_ORDER - int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES + int "Maximum zone order" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES) default "13" if ARM64_64K_PAGES - range 11 13 if ARM64_16K_PAGES default "11" if ARM64_16K_PAGES - range 10 15 if ARM64_4K_PAGES default "10" help The kernel memory allocator divides physically contiguous memory
Re: [PATCH v3 05/14] ia64: don't allow users to override ARCH_FORCE_MAX_ORDER
On 2023/3/25 14:08, Mike Rapoport wrote: From: "Mike Rapoport (IBM)" It is enough to keep default values for base and huge pages without letting users to override ARCH_FORCE_MAX_ORDER. Drop the prompt to make the option unvisible in *config. Acked-by: Kirill A. Shutemov Reviewed-by: Zi Yan Signed-off-by: Mike Rapoport (IBM) --- arch/ia64/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index 0d2f41fa56ee..b61437cae162 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -202,8 +202,7 @@ config IA64_CYCLONE If you're unsure, answer N. config ARCH_FORCE_MAX_ORDER - int "MAX_ORDER (10 - 16)" if !HUGETLB_PAGE - range 10 16 if !HUGETLB_PAGE + int default "16" if HUGETLB_PAGE default "10" It seems that we could drop the following part? diff --git a/arch/ia64/include/asm/sparsemem.h b/arch/ia64/include/asm/sparsemem.h index a58f8b466d96..18187551b183 100644 --- a/arch/ia64/include/asm/sparsemem.h +++ b/arch/ia64/include/asm/sparsemem.h @@ -11,11 +11,6 @@ #define SECTION_SIZE_BITS (30) #define MAX_PHYSMEM_BITS (50) -#ifdef CONFIG_ARCH_FORCE_MAX_ORDER -#if (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT > SECTION_SIZE_BITS) -#undef SECTION_SIZE_BITS -#define SECTION_SIZE_BITS (CONFIG_ARCH_FORCE_MAX_ORDER + PAGE_SHIFT) -#endif #endif
[PATCH] mm: remove kern_addr_valid() completely
Most architectures(except arm64/x86/sparc) simply return 1 for kern_addr_valid(), which is only used in read_kcore(), and it calls copy_from_kernel_nofault() which could check whether the address is a valid kernel address, so no need kern_addr_valid(), let's remove unneeded kern_addr_valid() completely. Signed-off-by: Kefeng Wang --- arch/alpha/include/asm/pgtable.h | 2 - arch/arc/include/asm/pgtable-bits-arcv2.h | 2 - arch/arm/include/asm/pgtable-nommu.h | 2 - arch/arm/include/asm/pgtable.h| 4 -- arch/arm64/include/asm/pgtable.h | 2 - arch/arm64/mm/mmu.c | 47 --- arch/arm64/mm/pageattr.c | 3 +- arch/csky/include/asm/pgtable.h | 3 -- arch/hexagon/include/asm/page.h | 7 arch/ia64/include/asm/pgtable.h | 16 arch/loongarch/include/asm/pgtable.h | 2 - arch/m68k/include/asm/pgtable_mm.h| 2 - arch/m68k/include/asm/pgtable_no.h| 1 - arch/microblaze/include/asm/pgtable.h | 3 -- arch/mips/include/asm/pgtable.h | 2 - arch/nios2/include/asm/pgtable.h | 2 - arch/openrisc/include/asm/pgtable.h | 2 - arch/parisc/include/asm/pgtable.h | 15 arch/powerpc/include/asm/pgtable.h| 7 arch/riscv/include/asm/pgtable.h | 2 - arch/s390/include/asm/pgtable.h | 2 - arch/sh/include/asm/pgtable.h | 2 - arch/sparc/include/asm/pgtable_32.h | 6 --- arch/sparc/mm/init_32.c | 3 +- arch/sparc/mm/init_64.c | 1 - arch/um/include/asm/pgtable.h | 2 - arch/x86/include/asm/pgtable_32.h | 9 - arch/x86/include/asm/pgtable_64.h | 1 - arch/x86/mm/init_64.c | 41 arch/xtensa/include/asm/pgtable.h | 2 - fs/proc/kcore.c | 26 + 31 files changed, 11 insertions(+), 210 deletions(-) diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index 3ea9661c09ff..9e45f6735d5d 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -313,8 +313,6 @@ extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define kern_addr_valid(addr) (1) - #define pte_ERROR(e) \ printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e)) #define pmd_ERROR(e) \ diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h index b23be557403e..515e82db519f 100644 --- a/arch/arc/include/asm/pgtable-bits-arcv2.h +++ b/arch/arc/include/asm/pgtable-bits-arcv2.h @@ -120,8 +120,6 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val }) -#define kern_addr_valid(addr) (1) - #ifdef CONFIG_TRANSPARENT_HUGEPAGE #include #endif diff --git a/arch/arm/include/asm/pgtable-nommu.h b/arch/arm/include/asm/pgtable-nommu.h index d16aba48fa0a..25d8c7bb07e0 100644 --- a/arch/arm/include/asm/pgtable-nommu.h +++ b/arch/arm/include/asm/pgtable-nommu.h @@ -21,8 +21,6 @@ #define pgd_none(pgd) (0) #define pgd_bad(pgd) (0) #define pgd_clear(pgdp) -#define kern_addr_valid(addr) (1) -/* FIXME */ /* * PMD_SHIFT determines the size of the area a second-level page table can map * PGDIR_SHIFT determines what a third-level page table entry can map diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index 78a532068fec..00954ab1a039 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -298,10 +298,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS) -/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */ -/* FIXME: this is not correct */ -#define kern_addr_valid(addr) (1) - /* * We provide our own arch_get_unmapped_area to cope with VIPT caches. */ diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 71a1af42f0e8..4873c1d6e7d0 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1021,8 +1021,6 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, */ #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > __SWP_TYPE_BITS) -extern int kern_addr_valid(unsigned long addr); - #ifdef CONFIG_ARM64_MTE #define __HAVE_ARCH_PREPARE_TO_SWAP diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 9a7c38965154..556154d821bf 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -814,53 +814,6 @@ void __init
[PATCH v4 04/11] sections: Move is_kernel_inittext() into sections.h
The is_kernel_inittext() and init_kernel_text() are with same functionality, let's just keep is_kernel_inittext() and move it into sections.h, then update all the callers. Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Arnd Bergmann Cc: x...@kernel.org Signed-off-by: Kefeng Wang --- arch/x86/kernel/unwind_orc.c | 2 +- include/asm-generic/sections.h | 14 ++ include/linux/kallsyms.h | 8 include/linux/kernel.h | 1 - kernel/extable.c | 12 ++-- 5 files changed, 17 insertions(+), 20 deletions(-) diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index a1202536fc57..d92ec2ced059 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -175,7 +175,7 @@ static struct orc_entry *orc_find(unsigned long ip) } /* vmlinux .init slow lookup: */ - if (init_kernel_text(ip)) + if (is_kernel_inittext(ip)) return __orc_find(__start_orc_unwind_ip, __start_orc_unwind, __stop_orc_unwind_ip - __start_orc_unwind_ip, ip); diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 24780c0f40b1..811583ca8bd0 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -172,4 +172,18 @@ static inline bool is_kernel_rodata(unsigned long addr) addr < (unsigned long)__end_rodata; } +/** + * is_kernel_inittext - checks if the pointer address is located in the + * .init.text section + * + * @addr: address to check + * + * Returns: true if the address is located in .init.text, false otherwise. + */ +static inline bool is_kernel_inittext(unsigned long addr) +{ + return addr >= (unsigned long)_sinittext && + addr < (unsigned long)_einittext; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index b016c62f30a6..8a9d329c927c 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -24,14 +24,6 @@ struct cred; struct module; -static inline int is_kernel_inittext(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext - && addr < (unsigned long)_einittext) - return 1; - return 0; -} - static inline int is_kernel_text(unsigned long addr) { if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index e5a9af8a4e20..445d0dceefb8 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -229,7 +229,6 @@ extern bool parse_option_str(const char *str, const char *option); extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); -extern int init_kernel_text(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index da26203841d4..98ca627ac5ef 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -62,14 +62,6 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) return e; } -int init_kernel_text(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext && - addr < (unsigned long)_einittext) - return 1; - return 0; -} - int notrace core_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_stext && @@ -77,7 +69,7 @@ int notrace core_kernel_text(unsigned long addr) return 1; if (system_state < SYSTEM_RUNNING && - init_kernel_text(addr)) + is_kernel_inittext(addr)) return 1; return 0; } @@ -94,7 +86,7 @@ int __kernel_text_address(unsigned long addr) * Since we are after the module-symbols check, there's * no danger of address overlap: */ - if (init_kernel_text(addr)) + if (is_kernel_inittext(addr)) return 1; return 0; } -- 2.26.2
[PATCH v4 10/11] microblaze: Use is_kernel_text() helper
Use is_kernel_text() helper to simplify code. Cc: Michal Simek Signed-off-by: Kefeng Wang --- arch/microblaze/mm/pgtable.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c index c1833b159d3b..9f73265aad4e 100644 --- a/arch/microblaze/mm/pgtable.c +++ b/arch/microblaze/mm/pgtable.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -171,7 +172,7 @@ void __init mapin_ram(void) for (s = 0; s < lowmem_size; s += PAGE_SIZE) { f = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_SHARED | _PAGE_HWEXEC; - if ((char *) v < _stext || (char *) v >= _etext) + if (!is_kernel_text(v)) f |= _PAGE_WRENABLE; else /* On the MicroBlaze, no user access -- 2.26.2
[PATCH v4 07/11] mm: kasan: Use is_kernel() helper
Directly use is_kernel() helper in kernel_or_module_addr(). Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Signed-off-by: Kefeng Wang --- mm/kasan/report.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 3239fd8f8747..1c955e1c98d5 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -226,7 +226,7 @@ static void describe_object(struct kmem_cache *cache, void *object, static inline bool kernel_or_module_addr(const void *addr) { - if (addr >= (void *)_stext && addr < (void *)_end) + if (is_kernel((unsigned long)addr)) return true; if (is_module_address((unsigned long)addr)) return true; -- 2.26.2
[PATCH v4 01/11] kallsyms: Remove arch specific text and data check
After commit 4ba66a976072 ("arch: remove blackfin port"), no need arch-specific text/data check. Cc: Arnd Bergmann Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 16 include/linux/kallsyms.h | 3 +-- kernel/locking/lockdep.c | 3 --- 3 files changed, 1 insertion(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index d16302d3eb59..817309e289db 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -64,22 +64,6 @@ extern __visible const void __nosave_begin, __nosave_end; #define dereference_kernel_function_descriptor(p) ((void *)(p)) #endif -/* random extra sections (if any). Override - * in asm/sections.h */ -#ifndef arch_is_kernel_text -static inline int arch_is_kernel_text(unsigned long addr) -{ - return 0; -} -#endif - -#ifndef arch_is_kernel_data -static inline int arch_is_kernel_data(unsigned long addr) -{ - return 0; -} -#endif - /* * Check if an address is part of freed initmem. This is needed on architectures * with virt == phys kernel mapping, for code that wants to check if an address diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 6851c2313cad..2a241e3f063f 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -34,8 +34,7 @@ static inline int is_kernel_inittext(unsigned long addr) static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) || - arch_is_kernel_text(addr)) + if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7096384dc60f..dcdbcee391cd 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -803,9 +803,6 @@ static int static_obj(const void *obj) if ((addr >= start) && (addr < end)) return 1; - if (arch_is_kernel_data(addr)) - return 1; - /* * in-kernel percpu var? */ -- 2.26.2
[PATCH v4 05/11] x86: mm: Rename __is_kernel_text() to is_x86_32_kernel_text()
Commit b56cd05c55a1 ("x86/mm: Rename is_kernel_text to __is_kernel_text"), add '__' prefix not to get in conflict with existing is_kernel_text() in . We will add __is_kernel_text() for the basic kernel text range check in the next patch, so use private is_x86_32_kernel_text() naming for x86 special check. Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Signed-off-by: Kefeng Wang --- arch/x86/mm/init_32.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index bd90b8fe81e4..523743ee9dea 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -238,11 +238,7 @@ page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd_base) } } -/* - * The already defines is_kernel_text, - * using '__' prefix not to get in conflict. - */ -static inline int __is_kernel_text(unsigned long addr) +static inline int is_x86_32_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_text && addr <= (unsigned long)__init_end) return 1; @@ -333,8 +329,8 @@ kernel_physical_mapping_init(unsigned long start, addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1; - if (__is_kernel_text(addr) || - __is_kernel_text(addr2)) + if (is_x86_32_kernel_text(addr) || + is_x86_32_kernel_text(addr2)) prot = PAGE_KERNEL_LARGE_EXEC; pages_2m++; @@ -359,7 +355,7 @@ kernel_physical_mapping_init(unsigned long start, */ pgprot_t init_prot = __pgprot(PTE_IDENT_ATTR); - if (__is_kernel_text(addr)) + if (is_x86_32_kernel_text(addr)) prot = PAGE_KERNEL_EXEC; pages_4k++; @@ -820,7 +816,7 @@ static void mark_nxdata_nx(void) */ unsigned long start = PFN_ALIGN(_etext); /* -* This comes from __is_kernel_text upper limit. Also HPAGE where used: +* This comes from is_x86_32_kernel_text upper limit. Also HPAGE where used: */ unsigned long size = (((unsigned long)__init_end + HPAGE_SIZE) & HPAGE_MASK) - start; -- 2.26.2
[PATCH v4 02/11] kallsyms: Fix address-checks for kernel related range
The is_kernel_inittext/is_kernel_text/is_kernel function should not include the end address(the labels _einittext, _etext and _end) when check the address range, the issue exists since Linux v2.6.12. Cc: Arnd Bergmann Cc: Sergey Senozhatsky Cc: Petr Mladek Reviewed-by: Petr Mladek Reviewed-by: Steven Rostedt (VMware) Acked-by: Sergey Senozhatsky Signed-off-by: Kefeng Wang --- include/linux/kallsyms.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 2a241e3f063f..b016c62f30a6 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -27,21 +27,21 @@ struct module; static inline int is_kernel_inittext(unsigned long addr) { if (addr >= (unsigned long)_sinittext - && addr <= (unsigned long)_einittext) + && addr < (unsigned long)_einittext) return 1; return 0; } static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) + if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end) + if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) return 1; return in_gate_area_no_mm(addr); } -- 2.26.2
[PATCH v4 08/11] extable: Use is_kernel_text() helper
The core_kernel_text() should check the gate area, as it is part of kernel text range, use is_kernel_text() in core_kernel_text(). Cc: Steven Rostedt Signed-off-by: Kefeng Wang --- kernel/extable.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/extable.c b/kernel/extable.c index 98ca627ac5ef..0ba383d850ff 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -64,8 +64,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) int notrace core_kernel_text(unsigned long addr) { - if (addr >= (unsigned long)_stext && - addr < (unsigned long)_etext) + if (is_kernel_text(addr)) return 1; if (system_state < SYSTEM_RUNNING && -- 2.26.2
[PATCH v4 11/11] alpha: Use is_kernel_text() helper
Use is_kernel_text() helper to simplify code. Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Signed-off-by: Kefeng Wang --- arch/alpha/kernel/traps.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/arch/alpha/kernel/traps.c b/arch/alpha/kernel/traps.c index e805106409f7..2ae34702456c 100644 --- a/arch/alpha/kernel/traps.c +++ b/arch/alpha/kernel/traps.c @@ -129,9 +129,7 @@ dik_show_trace(unsigned long *sp, const char *loglvl) extern char _stext[], _etext[]; unsigned long tmp = *sp; sp++; - if (tmp < (unsigned long) &_stext) - continue; - if (tmp >= (unsigned long) &_etext) + if (!is_kernel_text(tmp)) continue; printk("%s[<%lx>] %pSR\n", loglvl, tmp, (void *)tmp); if (i > 40) { -- 2.26.2
[PATCH v4 09/11] powerpc/mm: Use core_kernel_text() helper
Use core_kernel_text() helper to simplify code, also drop etext, _stext, _sinittext, _einittext declaration which already declared in section.h. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Christophe Leroy Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Kefeng Wang --- arch/powerpc/mm/pgtable_32.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index dcf5ecca19d9..079abbf45a33 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -33,8 +33,6 @@ #include -extern char etext[], _stext[], _sinittext[], _einittext[]; - static u8 early_fixmap_pagetable[FIXMAP_PTE_SIZE] __page_aligned_data; notrace void __init early_ioremap_init(void) @@ -104,14 +102,13 @@ static void __init __mapin_ram_chunk(unsigned long offset, unsigned long top) { unsigned long v, s; phys_addr_t p; - int ktext; + bool ktext; s = offset; v = PAGE_OFFSET + s; p = memstart_addr + s; for (; s < top; s += PAGE_SIZE) { - ktext = ((char *)v >= _stext && (char *)v < etext) || - ((char *)v >= _sinittext && (char *)v < _einittext); + ktext = core_kernel_text(v); map_kernel_page(v, p, ktext ? PAGE_KERNEL_TEXT : PAGE_KERNEL); v += PAGE_SIZE; p += PAGE_SIZE; -- 2.26.2
[PATCH v4 06/11] sections: Provide internal __is_kernel() and __is_kernel_text() helper
An internal __is_kernel() helper which only check the kernel address ranges, and an internal __is_kernel_text() helper which only check text section ranges. Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 29 + include/linux/kallsyms.h | 4 ++-- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 811583ca8bd0..a7abeadddc7a 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -186,4 +186,33 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; } +/** + * __is_kernel_text - checks if the pointer address is located in the + *.text section + * + * @addr: address to check + * + * Returns: true if the address is located in .text, false otherwise. + * Note: an internal helper, only check the range of _stext to _etext. + */ +static inline bool __is_kernel_text(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_etext; +} + +/** + * __is_kernel - checks if the pointer address is located in the kernel range + * + * @addr: address to check + * + * Returns: true if the address is located in the kernel range, false otherwise. + * Note: an internal helper, only check the range of _stext to _end. + */ +static inline bool __is_kernel(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_end; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 8a9d329c927c..5fb17dd4b6fa 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -26,14 +26,14 @@ struct module; static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) + if (__is_kernel_text(addr)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) + if (__is_kernel(addr)) return 1; return in_gate_area_no_mm(addr); } -- 2.26.2
[PATCH v4 03/11] sections: Move and rename core_kernel_data() to is_kernel_core_data()
Move core_kernel_data() into sections.h and rename it to is_kernel_core_data(), also make it return bool value, then update all the callers. Cc: Arnd Bergmann Cc: Steven Rostedt Cc: Ingo Molnar Cc: "David S. Miller" Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 16 include/linux/kernel.h | 1 - kernel/extable.c | 18 -- kernel/trace/ftrace.c | 2 +- net/sysctl_net.c | 2 +- 5 files changed, 18 insertions(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 817309e289db..24780c0f40b1 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -142,6 +142,22 @@ static inline bool init_section_intersects(void *virt, size_t size) return memory_intersects(__init_begin, __init_end, virt, size); } +/** + * is_kernel_core_data - checks if the pointer address is located in the + * .data section + * + * @addr: address to check + * + * Returns: true if the address is located in .data, false otherwise. + * Note: On some archs it may return true for core RODATA, and false + * for others. But will always be true for core RW data. + */ +static inline bool is_kernel_core_data(unsigned long addr) +{ + return addr >= (unsigned long)_sdata && + addr < (unsigned long)_edata; +} + /** * is_kernel_rodata - checks if the pointer address is located in the *.rodata section diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 2776423a587e..e5a9af8a4e20 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -230,7 +230,6 @@ extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); extern int init_kernel_text(unsigned long addr); -extern int core_kernel_data(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index b0ea5eb0c3b4..da26203841d4 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -82,24 +82,6 @@ int notrace core_kernel_text(unsigned long addr) return 0; } -/** - * core_kernel_data - tell if addr points to kernel data - * @addr: address to test - * - * Returns true if @addr passed in is from the core kernel data - * section. - * - * Note: On some archs it may return true for core RODATA, and false - * for others. But will always be true for core RW data. - */ -int core_kernel_data(unsigned long addr) -{ - if (addr >= (unsigned long)_sdata && - addr < (unsigned long)_edata) - return 1; - return 0; -} - int __kernel_text_address(unsigned long addr) { if (kernel_text_address(addr)) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 7efbc8aaf7f6..f15badf31f52 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -323,7 +323,7 @@ int __register_ftrace_function(struct ftrace_ops *ops) if (!ftrace_enabled && (ops->flags & FTRACE_OPS_FL_PERMANENT)) return -EBUSY; - if (!core_kernel_data((unsigned long)ops)) + if (!is_kernel_core_data((unsigned long)ops)) ops->flags |= FTRACE_OPS_FL_DYNAMIC; add_ftrace_ops(_ops_list, ops); diff --git a/net/sysctl_net.c b/net/sysctl_net.c index f6cb0d4d114c..4b45ed631eb8 100644 --- a/net/sysctl_net.c +++ b/net/sysctl_net.c @@ -144,7 +144,7 @@ static void ensure_safe_net_sysctl(struct net *net, const char *path, addr = (unsigned long)ent->data; if (is_module_address(addr)) where = "module"; - else if (core_kernel_data(addr)) + else if (is_kernel_core_data(addr)) where = "kernel"; else continue; -- 2.26.2
[PATCH v4 00/11] sections: Unify kernel sections range check and use
There are three head files(kallsyms.h, kernel.h and sections.h) which include the kernel sections range check, let's make some cleanup and unify them. 1. cleanup arch specific text/data check and fix address boundary check in kallsyms.h 2. make all the basic/core kernel range check function into sections.h 3. update all the callers, and use the helper in sections.h to simplify the code After this series, we have 5 APIs about kernel sections range check in sections.h * is_kernel_rodata() --- already in sections.h * is_kernel_core_data()--- come from core_kernel_data() in kernel.h * is_kernel_inittext() --- come from kernel.h and kallsyms.h * __is_kernel_text() --- add new internal helper * __is_kernel()--- add new internal helper Note: For the last two helpers, people should not use directly, consider to use corresponding function in kallsyms.h. v4: - Use core_kernel_text() in powerpc sugguested Christophe Leroy, build test only - Use is_kernel_text() in alpha and microblaze, build test only on next-20210929 v3: https://lore.kernel.org/linux-arch/20210926072048.190336-1-wangkefeng.w...@huawei.com/ - Add Steven's RB to patch2 - Introduce two internal helper, then use is_kernel_text() in core_kernel_text() and is_kernel() in kernel_or_module_addr() suggested by Steven v2: https://lore.kernel.org/linux-arch/20210728081320.20394-1-wangkefeng.w...@huawei.com/ - add ACK/RW to patch2, and drop inappropriate fix tag - keep 'core' to check kernel data, suggestted by Steven Rostedt , rename is_kernel_data() to is_kernel_core_data() - drop patch8 which is merged - drop patch9 which is resend independently v1: https://lore.kernel.org/linux-arch/20210626073439.150586-1-wangkefeng.w...@huawei.com Kefeng Wang (11): kallsyms: Remove arch specific text and data check kallsyms: Fix address-checks for kernel related range sections: Move and rename core_kernel_data() to is_kernel_core_data() sections: Move is_kernel_inittext() into sections.h x86: mm: Rename __is_kernel_text() to is_x86_32_kernel_text() sections: Provide internal __is_kernel() and __is_kernel_text() helper mm: kasan: Use is_kernel() helper extable: Use is_kernel_text() helper powerpc/mm: Use core_kernel_text() helper microblaze: Use is_kernel_text() helper alpha: Use is_kernel_text() helper arch/alpha/kernel/traps.c | 4 +- arch/microblaze/mm/pgtable.c | 3 +- arch/powerpc/mm/pgtable_32.c | 7 +--- arch/x86/kernel/unwind_orc.c | 2 +- arch/x86/mm/init_32.c | 14 +++ include/asm-generic/sections.h | 75 ++ include/linux/kallsyms.h | 13 +- include/linux/kernel.h | 2 - kernel/extable.c | 33 ++- kernel/locking/lockdep.c | 3 -- kernel/trace/ftrace.c | 2 +- mm/kasan/report.c | 2 +- net/sysctl_net.c | 2 +- 13 files changed, 78 insertions(+), 84 deletions(-) -- 2.26.2
Re: [PATCH v3 9/9] powerpc/mm: Use is_kernel_text() and is_kernel_inittext() helper
On 2021/9/29 1:51, Christophe Leroy wrote: Le 26/09/2021 à 09:20, Kefeng Wang a écrit : Use is_kernel_text() and is_kernel_inittext() helper to simplify code, also drop etext, _stext, _sinittext, _einittext declaration which already declared in section.h. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Kefeng Wang --- arch/powerpc/mm/pgtable_32.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index dcf5ecca19d9..13c798308c2e 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -33,8 +33,6 @@ #include -extern char etext[], _stext[], _sinittext[], _einittext[]; - static u8 early_fixmap_pagetable[FIXMAP_PTE_SIZE] __page_aligned_data; notrace void __init early_ioremap_init(void) @@ -104,14 +102,13 @@ static void __init __mapin_ram_chunk(unsigned long offset, unsigned long top) { unsigned long v, s; phys_addr_t p; - int ktext; + bool ktext; s = offset; v = PAGE_OFFSET + s; p = memstart_addr + s; for (; s < top; s += PAGE_SIZE) { - ktext = ((char *)v >= _stext && (char *)v < etext) || - ((char *)v >= _sinittext && (char *)v < _einittext); + ktext = (is_kernel_text(v) || is_kernel_inittext(v)); I think we could use core_kernel_next() instead. Indead. oops, sorry for the build error, will update, thanks. Build failure on mpc885_ads_defconfig arch/powerpc/mm/pgtable_32.c: In function '__mapin_ram_chunk': arch/powerpc/mm/pgtable_32.c:111:26: error: implicit declaration of function 'is_kernel_text'; did you mean 'is_kernel_inittext'? [-Werror=implicit-function-declaration] 111 | ktext = (is_kernel_text(v) || is_kernel_inittext(v)); | ^~ | is_kernel_inittext cc1: all warnings being treated as errors make[2]: *** [scripts/Makefile.build:277: arch/powerpc/mm/pgtable_32.o] Error 1 make[1]: *** [scripts/Makefile.build:540: arch/powerpc/mm] Error 2 make: *** [Makefile:1868: arch/powerpc] Error 2 .
[PATCH v3 4/9] sections: Move is_kernel_inittext() into sections.h
The is_kernel_inittext() and init_kernel_text() are with same functionality, let's just keep is_kernel_inittext() and move it into sections.h, then update all the callers. Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Arnd Bergmann Cc: x...@kernel.org Signed-off-by: Kefeng Wang --- arch/x86/kernel/unwind_orc.c | 2 +- include/asm-generic/sections.h | 14 ++ include/linux/kallsyms.h | 8 include/linux/kernel.h | 1 - kernel/extable.c | 12 ++-- 5 files changed, 17 insertions(+), 20 deletions(-) diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index a1202536fc57..d92ec2ced059 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -175,7 +175,7 @@ static struct orc_entry *orc_find(unsigned long ip) } /* vmlinux .init slow lookup: */ - if (init_kernel_text(ip)) + if (is_kernel_inittext(ip)) return __orc_find(__start_orc_unwind_ip, __start_orc_unwind, __stop_orc_unwind_ip - __start_orc_unwind_ip, ip); diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 24780c0f40b1..811583ca8bd0 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -172,4 +172,18 @@ static inline bool is_kernel_rodata(unsigned long addr) addr < (unsigned long)__end_rodata; } +/** + * is_kernel_inittext - checks if the pointer address is located in the + * .init.text section + * + * @addr: address to check + * + * Returns: true if the address is located in .init.text, false otherwise. + */ +static inline bool is_kernel_inittext(unsigned long addr) +{ + return addr >= (unsigned long)_sinittext && + addr < (unsigned long)_einittext; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index b016c62f30a6..8a9d329c927c 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -24,14 +24,6 @@ struct cred; struct module; -static inline int is_kernel_inittext(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext - && addr < (unsigned long)_einittext) - return 1; - return 0; -} - static inline int is_kernel_text(unsigned long addr) { if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index e5a9af8a4e20..445d0dceefb8 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -229,7 +229,6 @@ extern bool parse_option_str(const char *str, const char *option); extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); -extern int init_kernel_text(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index da26203841d4..98ca627ac5ef 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -62,14 +62,6 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) return e; } -int init_kernel_text(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext && - addr < (unsigned long)_einittext) - return 1; - return 0; -} - int notrace core_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_stext && @@ -77,7 +69,7 @@ int notrace core_kernel_text(unsigned long addr) return 1; if (system_state < SYSTEM_RUNNING && - init_kernel_text(addr)) + is_kernel_inittext(addr)) return 1; return 0; } @@ -94,7 +86,7 @@ int __kernel_text_address(unsigned long addr) * Since we are after the module-symbols check, there's * no danger of address overlap: */ - if (init_kernel_text(addr)) + if (is_kernel_inittext(addr)) return 1; return 0; } -- 2.26.2
[PATCH v3 3/9] sections: Move and rename core_kernel_data() to is_kernel_core_data()
Move core_kernel_data() into sections.h and rename it to is_kernel_core_data(), also make it return bool value, then update all the callers. Cc: Arnd Bergmann Cc: Steven Rostedt Cc: Ingo Molnar Cc: "David S. Miller" Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 16 include/linux/kernel.h | 1 - kernel/extable.c | 18 -- kernel/trace/ftrace.c | 2 +- net/sysctl_net.c | 2 +- 5 files changed, 18 insertions(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 817309e289db..24780c0f40b1 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -142,6 +142,22 @@ static inline bool init_section_intersects(void *virt, size_t size) return memory_intersects(__init_begin, __init_end, virt, size); } +/** + * is_kernel_core_data - checks if the pointer address is located in the + * .data section + * + * @addr: address to check + * + * Returns: true if the address is located in .data, false otherwise. + * Note: On some archs it may return true for core RODATA, and false + * for others. But will always be true for core RW data. + */ +static inline bool is_kernel_core_data(unsigned long addr) +{ + return addr >= (unsigned long)_sdata && + addr < (unsigned long)_edata; +} + /** * is_kernel_rodata - checks if the pointer address is located in the *.rodata section diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 2776423a587e..e5a9af8a4e20 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -230,7 +230,6 @@ extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); extern int init_kernel_text(unsigned long addr); -extern int core_kernel_data(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index b0ea5eb0c3b4..da26203841d4 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -82,24 +82,6 @@ int notrace core_kernel_text(unsigned long addr) return 0; } -/** - * core_kernel_data - tell if addr points to kernel data - * @addr: address to test - * - * Returns true if @addr passed in is from the core kernel data - * section. - * - * Note: On some archs it may return true for core RODATA, and false - * for others. But will always be true for core RW data. - */ -int core_kernel_data(unsigned long addr) -{ - if (addr >= (unsigned long)_sdata && - addr < (unsigned long)_edata) - return 1; - return 0; -} - int __kernel_text_address(unsigned long addr) { if (kernel_text_address(addr)) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 7efbc8aaf7f6..f15badf31f52 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -323,7 +323,7 @@ int __register_ftrace_function(struct ftrace_ops *ops) if (!ftrace_enabled && (ops->flags & FTRACE_OPS_FL_PERMANENT)) return -EBUSY; - if (!core_kernel_data((unsigned long)ops)) + if (!is_kernel_core_data((unsigned long)ops)) ops->flags |= FTRACE_OPS_FL_DYNAMIC; add_ftrace_ops(_ops_list, ops); diff --git a/net/sysctl_net.c b/net/sysctl_net.c index f6cb0d4d114c..4b45ed631eb8 100644 --- a/net/sysctl_net.c +++ b/net/sysctl_net.c @@ -144,7 +144,7 @@ static void ensure_safe_net_sysctl(struct net *net, const char *path, addr = (unsigned long)ent->data; if (is_module_address(addr)) where = "module"; - else if (core_kernel_data(addr)) + else if (is_kernel_core_data(addr)) where = "kernel"; else continue; -- 2.26.2
[PATCH v3 6/9] sections: Provide internal __is_kernel() and __is_kernel_text() helper
An internal __is_kernel() helper which only check the kernel address ranges, and an internal __is_kernel_text() helper which only check text section ranges. Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 29 + include/linux/kallsyms.h | 4 ++-- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 811583ca8bd0..a7abeadddc7a 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -186,4 +186,33 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; } +/** + * __is_kernel_text - checks if the pointer address is located in the + *.text section + * + * @addr: address to check + * + * Returns: true if the address is located in .text, false otherwise. + * Note: an internal helper, only check the range of _stext to _etext. + */ +static inline bool __is_kernel_text(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_etext; +} + +/** + * __is_kernel - checks if the pointer address is located in the kernel range + * + * @addr: address to check + * + * Returns: true if the address is located in the kernel range, false otherwise. + * Note: an internal helper, only check the range of _stext to _end. + */ +static inline bool __is_kernel(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_end; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 8a9d329c927c..5fb17dd4b6fa 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -26,14 +26,14 @@ struct module; static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) + if (__is_kernel_text(addr)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) + if (__is_kernel(addr)) return 1; return in_gate_area_no_mm(addr); } -- 2.26.2
[PATCH v3 9/9] powerpc/mm: Use is_kernel_text() and is_kernel_inittext() helper
Use is_kernel_text() and is_kernel_inittext() helper to simplify code, also drop etext, _stext, _sinittext, _einittext declaration which already declared in section.h. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Kefeng Wang --- arch/powerpc/mm/pgtable_32.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index dcf5ecca19d9..13c798308c2e 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -33,8 +33,6 @@ #include -extern char etext[], _stext[], _sinittext[], _einittext[]; - static u8 early_fixmap_pagetable[FIXMAP_PTE_SIZE] __page_aligned_data; notrace void __init early_ioremap_init(void) @@ -104,14 +102,13 @@ static void __init __mapin_ram_chunk(unsigned long offset, unsigned long top) { unsigned long v, s; phys_addr_t p; - int ktext; + bool ktext; s = offset; v = PAGE_OFFSET + s; p = memstart_addr + s; for (; s < top; s += PAGE_SIZE) { - ktext = ((char *)v >= _stext && (char *)v < etext) || - ((char *)v >= _sinittext && (char *)v < _einittext); + ktext = (is_kernel_text(v) || is_kernel_inittext(v)); map_kernel_page(v, p, ktext ? PAGE_KERNEL_TEXT : PAGE_KERNEL); v += PAGE_SIZE; p += PAGE_SIZE; -- 2.26.2
[PATCH v3 8/9] extable: Use is_kernel_text() helper
The core_kernel_text() should check the gate area, as it is part of kernel text range. Cc: Steven Rostedt Signed-off-by: Kefeng Wang --- kernel/extable.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/extable.c b/kernel/extable.c index 98ca627ac5ef..0ba383d850ff 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -64,8 +64,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) int notrace core_kernel_text(unsigned long addr) { - if (addr >= (unsigned long)_stext && - addr < (unsigned long)_etext) + if (is_kernel_text(addr)) return 1; if (system_state < SYSTEM_RUNNING && -- 2.26.2
[PATCH v3 7/9] mm: kasan: Use is_kernel() helper
Directly use is_kernel() helper in kernel_or_module_addr(). Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Signed-off-by: Kefeng Wang --- mm/kasan/report.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 3239fd8f8747..1c955e1c98d5 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -226,7 +226,7 @@ static void describe_object(struct kmem_cache *cache, void *object, static inline bool kernel_or_module_addr(const void *addr) { - if (addr >= (void *)_stext && addr < (void *)_end) + if (is_kernel((unsigned long)addr)) return true; if (is_module_address((unsigned long)addr)) return true; -- 2.26.2
[PATCH v3 0/9] sections: Unify kernel sections range check and use
There are three head files(kallsyms.h, kernel.h and sections.h) which include the kernel sections range check, let's make some cleanup and unify them. 1. cleanup arch specific text/data check and fix address boundary check in kallsyms.h 2. make all the basic/core kernel range check function into sections.h 3. update all the callers, and use the helper in sections.h to simplify the code After this series, we have 5 APIs about kernel sections range check in sections.h * is_kernel_rodata() --- already in sections.h * is_kernel_core_data()--- come from core_kernel_data() in kernel.h * is_kernel_inittext() --- come from kernel.h and kallsyms.h * __is_kernel_text() --- add new internal helper * __is_kernel()--- add new internal helper Note: For the last two helpers, people should not use directly, consider to use corresponding function in kallsyms.h. v3: - Add Steven's RB to patch2 - Introduce two internal helper, then use is_kernel_text() in core_kernel_text() and is_kernel() in kernel_or_module_addr() suggested by Steven v2: https://lore.kernel.org/linux-arch/20210728081320.20394-1-wangkefeng.w...@huawei.com/ - add ACK/RW to patch2, and drop inappropriate fix tag - keep 'core' to check kernel data, suggestted by Steven Rostedt , rename is_kernel_data() to is_kernel_core_data() - drop patch8 which is merged - drop patch9 which is resend independently v1: https://lore.kernel.org/linux-arch/20210626073439.150586-1-wangkefeng.w...@huawei.com Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-a...@vger.kernel.org Cc: b...@vger.kernel.org Kefeng Wang (9): kallsyms: Remove arch specific text and data check kallsyms: Fix address-checks for kernel related range sections: Move and rename core_kernel_data() to is_kernel_core_data() sections: Move is_kernel_inittext() into sections.h x86: mm: Rename __is_kernel_text() to is_x86_32_kernel_text() sections: Provide internal __is_kernel() and __is_kernel_text() helper mm: kasan: Use is_kernel() helper extable: Use is_kernel_text() helper powerpc/mm: Use is_kernel_text() and is_kernel_inittext() helper arch/powerpc/mm/pgtable_32.c | 7 +--- arch/x86/kernel/unwind_orc.c | 2 +- arch/x86/mm/init_32.c | 14 +++ include/asm-generic/sections.h | 75 ++ include/linux/kallsyms.h | 13 +- include/linux/kernel.h | 2 - kernel/extable.c | 33 ++- kernel/locking/lockdep.c | 3 -- kernel/trace/ftrace.c | 2 +- mm/kasan/report.c | 2 +- net/sysctl_net.c | 2 +- 11 files changed, 75 insertions(+), 80 deletions(-) -- 2.26.2
[PATCH v3 2/9] kallsyms: Fix address-checks for kernel related range
The is_kernel_inittext/is_kernel_text/is_kernel function should not include the end address(the labels _einittext, _etext and _end) when check the address range, the issue exists since Linux v2.6.12. Cc: Arnd Bergmann Cc: Sergey Senozhatsky Cc: Petr Mladek Reviewed-by: Petr Mladek Reviewed-by: Steven Rostedt (VMware) Acked-by: Sergey Senozhatsky Signed-off-by: Kefeng Wang --- include/linux/kallsyms.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 2a241e3f063f..b016c62f30a6 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -27,21 +27,21 @@ struct module; static inline int is_kernel_inittext(unsigned long addr) { if (addr >= (unsigned long)_sinittext - && addr <= (unsigned long)_einittext) + && addr < (unsigned long)_einittext) return 1; return 0; } static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) + if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end) + if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) return 1; return in_gate_area_no_mm(addr); } -- 2.26.2
[PATCH v3 1/9] kallsyms: Remove arch specific text and data check
After commit 4ba66a976072 ("arch: remove blackfin port"), no need arch-specific text/data check. Cc: Arnd Bergmann Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 16 include/linux/kallsyms.h | 3 +-- kernel/locking/lockdep.c | 3 --- 3 files changed, 1 insertion(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index d16302d3eb59..817309e289db 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -64,22 +64,6 @@ extern __visible const void __nosave_begin, __nosave_end; #define dereference_kernel_function_descriptor(p) ((void *)(p)) #endif -/* random extra sections (if any). Override - * in asm/sections.h */ -#ifndef arch_is_kernel_text -static inline int arch_is_kernel_text(unsigned long addr) -{ - return 0; -} -#endif - -#ifndef arch_is_kernel_data -static inline int arch_is_kernel_data(unsigned long addr) -{ - return 0; -} -#endif - /* * Check if an address is part of freed initmem. This is needed on architectures * with virt == phys kernel mapping, for code that wants to check if an address diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 6851c2313cad..2a241e3f063f 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -34,8 +34,7 @@ static inline int is_kernel_inittext(unsigned long addr) static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) || - arch_is_kernel_text(addr)) + if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7096384dc60f..dcdbcee391cd 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -803,9 +803,6 @@ static int static_obj(const void *obj) if ((addr >= start) && (addr < end)) return 1; - if (arch_is_kernel_data(addr)) - return 1; - /* * in-kernel percpu var? */ -- 2.26.2
[PATCH v3 5/9] x86: mm: Rename __is_kernel_text() to is_x86_32_kernel_text()
Commit b56cd05c55a1 ("x86/mm: Rename is_kernel_text to __is_kernel_text"), add '__' prefix not to get in conflict with existing is_kernel_text() in . We will add __is_kernel_text() for the basic kernel text range check in the next patch, so use private is_x86_32_kernel_text() naming for x86 special check. Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Signed-off-by: Kefeng Wang --- arch/x86/mm/init_32.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index bd90b8fe81e4..523743ee9dea 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -238,11 +238,7 @@ page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd_base) } } -/* - * The already defines is_kernel_text, - * using '__' prefix not to get in conflict. - */ -static inline int __is_kernel_text(unsigned long addr) +static inline int is_x86_32_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_text && addr <= (unsigned long)__init_end) return 1; @@ -333,8 +329,8 @@ kernel_physical_mapping_init(unsigned long start, addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1; - if (__is_kernel_text(addr) || - __is_kernel_text(addr2)) + if (is_x86_32_kernel_text(addr) || + is_x86_32_kernel_text(addr2)) prot = PAGE_KERNEL_LARGE_EXEC; pages_2m++; @@ -359,7 +355,7 @@ kernel_physical_mapping_init(unsigned long start, */ pgprot_t init_prot = __pgprot(PTE_IDENT_ATTR); - if (__is_kernel_text(addr)) + if (is_x86_32_kernel_text(addr)) prot = PAGE_KERNEL_EXEC; pages_4k++; @@ -820,7 +816,7 @@ static void mark_nxdata_nx(void) */ unsigned long start = PFN_ALIGN(_etext); /* -* This comes from __is_kernel_text upper limit. Also HPAGE where used: +* This comes from is_x86_32_kernel_text upper limit. Also HPAGE where used: */ unsigned long size = (((unsigned long)__init_end + HPAGE_SIZE) & HPAGE_MASK) - start; -- 2.26.2
[PATCH -next] trap: Cleanup trap_init()
There are some empty trap_init() in different ARCHs, introduce a new weak trap_init() function to cleanup them. Cc: Vineet Gupta Cc: Russell King Cc: Yoshinori Sato Cc: Ley Foon Tan Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: James E.J. Bottomley Cc: Helge Deller Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Paul Walmsley Cc: Jeff Dike Cc: Richard Weinberger Cc: Anton Ivanov Cc: Andrew Morton Signed-off-by: Kefeng Wang --- arch/arc/kernel/traps.c | 5 - arch/arm/kernel/traps.c | 5 - arch/h8300/kernel/traps.c| 4 arch/hexagon/kernel/traps.c | 4 arch/nds32/kernel/traps.c| 5 - arch/nios2/kernel/traps.c| 5 - arch/openrisc/kernel/traps.c | 5 - arch/parisc/kernel/traps.c | 4 arch/powerpc/kernel/traps.c | 5 - arch/riscv/kernel/traps.c| 5 - arch/um/kernel/trap.c| 4 init/main.c | 2 ++ 12 files changed, 2 insertions(+), 51 deletions(-) diff --git a/arch/arc/kernel/traps.c b/arch/arc/kernel/traps.c index 57235e5c0cea..6b83e3f2b41c 100644 --- a/arch/arc/kernel/traps.c +++ b/arch/arc/kernel/traps.c @@ -20,11 +20,6 @@ #include #include -void __init trap_init(void) -{ - return; -} - void die(const char *str, struct pt_regs *regs, unsigned long address) { show_kernel_fault_diag(str, regs, address); diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c index 64308e3a5d0c..e9b4f2b49bd8 100644 --- a/arch/arm/kernel/traps.c +++ b/arch/arm/kernel/traps.c @@ -781,11 +781,6 @@ void abort(void) panic("Oops failed to kill thread"); } -void __init trap_init(void) -{ - return; -} - #ifdef CONFIG_KUSER_HELPERS static void __init kuser_init(void *vectors) { diff --git a/arch/h8300/kernel/traps.c b/arch/h8300/kernel/traps.c index 5d8b969cd8f3..bdbe988d8dbc 100644 --- a/arch/h8300/kernel/traps.c +++ b/arch/h8300/kernel/traps.c @@ -39,10 +39,6 @@ void __init base_trap_init(void) { } -void __init trap_init(void) -{ -} - asmlinkage void set_esp0(unsigned long ssp) { current->thread.esp0 = ssp; diff --git a/arch/hexagon/kernel/traps.c b/arch/hexagon/kernel/traps.c index 904134b37232..edfc35dafeb1 100644 --- a/arch/hexagon/kernel/traps.c +++ b/arch/hexagon/kernel/traps.c @@ -28,10 +28,6 @@ #define TRAP_SYSCALL 1 #define TRAP_DEBUG 0xdb -void __init trap_init(void) -{ -} - #ifdef CONFIG_GENERIC_BUG /* Maybe should resemble arch/sh/kernel/traps.c ?? */ int is_valid_bugaddr(unsigned long addr) diff --git a/arch/nds32/kernel/traps.c b/arch/nds32/kernel/traps.c index ee0d9ae192a5..f06421c645af 100644 --- a/arch/nds32/kernel/traps.c +++ b/arch/nds32/kernel/traps.c @@ -183,11 +183,6 @@ void __pgd_error(const char *file, int line, unsigned long val) } extern char *exception_vector, *exception_vector_end; -void __init trap_init(void) -{ - return; -} - void __init early_trap_init(void) { unsigned long ivb = 0; diff --git a/arch/nios2/kernel/traps.c b/arch/nios2/kernel/traps.c index b172da4eb1a9..596986a74a26 100644 --- a/arch/nios2/kernel/traps.c +++ b/arch/nios2/kernel/traps.c @@ -105,11 +105,6 @@ void show_stack(struct task_struct *task, unsigned long *stack, printk("%s\n", loglvl); } -void __init trap_init(void) -{ - /* Nothing to do here */ -} - /* Breakpoint handler */ asmlinkage void breakpoint_c(struct pt_regs *fp) { diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c index 4d61333c2623..aa1e709405ac 100644 --- a/arch/openrisc/kernel/traps.c +++ b/arch/openrisc/kernel/traps.c @@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs *regs, int ea, int vector) die("Oops", regs, 9); } -void __init trap_init(void) -{ - /* Nothing needs to be done */ -} - asmlinkage void do_trap(struct pt_regs *regs, unsigned long address) { force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc); diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c index 8d8441d4562a..747c328fb886 100644 --- a/arch/parisc/kernel/traps.c +++ b/arch/parisc/kernel/traps.c @@ -859,7 +859,3 @@ void __init early_trap_init(void) initialize_ivt(_vector_20); } - -void __init trap_init(void) -{ -} diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index e103b89234cd..91efb5c6f2f3 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -2209,11 +2209,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stack) die("Bad kernel stack pointer", regs, SIGABRT); } -void __init trap_init(void) -{ -} - - #ifdef CONFIG_PPC_EMULATED_STATS #define WARN_EMULATED_SETUP(type) .type = { .name = #type } diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index 0a98fd0ddfe9..0daaa3e4630d 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -199,11 +199,6 @@ int is_valid_bugaddr(unsigned long pc)
Re: [PATCH v2 5/7] kallsyms: Rename is_kernel() and is_kernel_text()
On 2021/7/29 12:05, Steven Rostedt wrote: On Thu, 29 Jul 2021 10:00:51 +0800 Kefeng Wang wrote: On 2021/7/28 23:28, Steven Rostedt wrote: On Wed, 28 Jul 2021 16:13:18 +0800 Kefeng Wang wrote: The is_kernel[_text]() function check the address whether or not in kernel[_text] ranges, also they will check the address whether or not in gate area, so use better name. Do you know what a gate area is? Because I believe gate area is kernel text, so the rename just makes it redundant and more confusing. Yes, the gate area(eg, vectors part on ARM32, similar on x86/ia64) is kernel text. I want to keep the 'basic' section boundaries check, which only check the start/end of sections, all in section.h, could we use 'generic' or 'basic' or 'core' in the naming? * is_kernel_generic_data() --- come from core_kernel_data() in kernel.h * is_kernel_generic_text() The old helper could remain unchanged, any suggestion, thanks. Because it looks like the check of just being in the range of "_stext" to "_end" is just an internal helper, why not do what we do all over the kernel, and just prefix the function with a couple of underscores, that denote that it's internal? __is_kernel_text() OK, thanks for your advise, there's already a __is_kernel_text() in arch/x86/mm/init_32.c, I will change it to is_x32_kernel_text() to avoid conflict on x86_32. Then you have: static inline int is_kernel_text(unsigned long addr) { if (__is_kernel_text(addr)) return 1; return in_gate_area_no_mm(addr); } -- Steve .
Re: [PATCH v2 2/7] kallsyms: Fix address-checks for kernel related range
On 2021/7/28 22:46, Steven Rostedt wrote: On Wed, 28 Jul 2021 16:13:15 +0800 Kefeng Wang wrote: The is_kernel_inittext/is_kernel_text/is_kernel function should not include the end address(the labels _einittext, _etext and _end) when check the address range, the issue exists since Linux v2.6.12. Cc: Arnd Bergmann Cc: Sergey Senozhatsky Cc: Petr Mladek Acked-by: Sergey Senozhatsky Reviewed-by: Petr Mladek Signed-off-by: Kefeng Wang Reviewed-by: Steven Rostedt (VMware) Thanks. -- Steve
Re: [PATCH v2 6/7] sections: Add new is_kernel() and is_kernel_text()
On 2021/7/28 23:32, Steven Rostedt wrote: On Wed, 28 Jul 2021 16:13:19 +0800 Kefeng Wang wrote: @@ -64,8 +64,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) int notrace core_kernel_text(unsigned long addr) { - if (addr >= (unsigned long)_stext && - addr < (unsigned long)_etext) + if (is_kernel_text(addr)) Perhaps this was a bug, and these functions should be checking the gate area as well, as that is part of kernel text. Ok, I would fix this if patch5 is reviewed well. -- Steve return 1; if (system_state < SYSTEM_RUNNING && diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 884a950c7026..88f5b0c058b7 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -235,7 +235,7 @@ static void describe_object(struct kmem_cache *cache, void *object, static inline bool kernel_or_module_addr(const void *addr) { - if (addr >= (void *)_stext && addr < (void *)_end) + if (is_kernel((unsigned long)addr)) return true; if (is_module_address((unsigned long)addr)) return true; -- .
Re: [PATCH v2 5/7] kallsyms: Rename is_kernel() and is_kernel_text()
On 2021/7/28 23:28, Steven Rostedt wrote: On Wed, 28 Jul 2021 16:13:18 +0800 Kefeng Wang wrote: The is_kernel[_text]() function check the address whether or not in kernel[_text] ranges, also they will check the address whether or not in gate area, so use better name. Do you know what a gate area is? Because I believe gate area is kernel text, so the rename just makes it redundant and more confusing. Yes, the gate area(eg, vectors part on ARM32, similar on x86/ia64) is kernel text. I want to keep the 'basic' section boundaries check, which only check the start/end of sections, all in section.h, could we use 'generic' or 'basic' or 'core' in the naming? * is_kernel_generic_data() --- come from core_kernel_data() in kernel.h * is_kernel_generic_text() The old helper could remain unchanged, any suggestion, thanks. -- Steve .
[PATCH v2 6/7] sections: Add new is_kernel() and is_kernel_text()
The new is_kernel() check the kernel address ranges, and the new is_kernel_text() check the kernel text section ranges. Then use them to make some code clear. Cc: Arnd Bergmann Cc: Andrey Ryabinin Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 27 +++ include/linux/kallsyms.h | 4 ++-- kernel/extable.c | 3 +-- mm/kasan/report.c | 2 +- 4 files changed, 31 insertions(+), 5 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 4f2f32aa2b7a..6b143637ab88 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -170,6 +170,20 @@ static inline bool is_kernel_rodata(unsigned long addr) addr < (unsigned long)__end_rodata; } +/** + * is_kernel_text - checks if the pointer address is located in the + * .text section + * + * @addr: address to check + * + * Returns: true if the address is located in .text, false otherwise. + */ +static inline bool is_kernel_text(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_etext; +} + /** * is_kernel_inittext - checks if the pointer address is located in the *.init.text section @@ -184,4 +198,17 @@ static inline bool is_kernel_inittext(unsigned long addr) addr < (unsigned long)_einittext; } +/** + * is_kernel - checks if the pointer address is located in the kernel range + * + * @addr: address to check + * + * Returns: true if the address is located in kernel range, false otherwise. + */ +static inline bool is_kernel(unsigned long addr) +{ + return addr >= (unsigned long)_stext && + addr < (unsigned long)_end; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 4f501ac9c2c2..897d5720884f 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -26,14 +26,14 @@ struct module; static inline int is_kernel_text_or_gate_area(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) + if (is_kernel_text(addr)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel_or_gate_area(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) + if (is_kernel(addr)) return 1; return in_gate_area_no_mm(addr); } diff --git a/kernel/extable.c b/kernel/extable.c index 98ca627ac5ef..0ba383d850ff 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -64,8 +64,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) int notrace core_kernel_text(unsigned long addr) { - if (addr >= (unsigned long)_stext && - addr < (unsigned long)_etext) + if (is_kernel_text(addr)) return 1; if (system_state < SYSTEM_RUNNING && diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 884a950c7026..88f5b0c058b7 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -235,7 +235,7 @@ static void describe_object(struct kmem_cache *cache, void *object, static inline bool kernel_or_module_addr(const void *addr) { - if (addr >= (void *)_stext && addr < (void *)_end) + if (is_kernel((unsigned long)addr)) return true; if (is_module_address((unsigned long)addr)) return true; -- 2.26.2
[PATCH v2 3/7] sections: Move and rename core_kernel_data() to is_kernel_core_data()
Move core_kernel_data() into sections.h and rename it to is_kernel_core_data(), also make it return bool value, then update all the callers. Cc: Arnd Bergmann Cc: Steven Rostedt Cc: Ingo Molnar Cc: "David S. Miller" Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 14 ++ include/linux/kernel.h | 1 - kernel/extable.c | 18 -- kernel/trace/ftrace.c | 2 +- net/sysctl_net.c | 2 +- 5 files changed, 16 insertions(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 817309e289db..26ed9fc9b4e3 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -142,6 +142,20 @@ static inline bool init_section_intersects(void *virt, size_t size) return memory_intersects(__init_begin, __init_end, virt, size); } +/** + * is_kernel_core_data - checks if the pointer address is located in the + * .data section + * + * @addr: address to check + * + * Returns: true if the address is located in .data, false otherwise. + */ +static inline bool is_kernel_core_data(unsigned long addr) +{ + return addr >= (unsigned long)_sdata && + addr < (unsigned long)_edata; +} + /** * is_kernel_rodata - checks if the pointer address is located in the *.rodata section diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 1b2f0a7e00d6..0622418bafbc 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -230,7 +230,6 @@ extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); extern int init_kernel_text(unsigned long addr); -extern int core_kernel_data(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index b0ea5eb0c3b4..da26203841d4 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -82,24 +82,6 @@ int notrace core_kernel_text(unsigned long addr) return 0; } -/** - * core_kernel_data - tell if addr points to kernel data - * @addr: address to test - * - * Returns true if @addr passed in is from the core kernel data - * section. - * - * Note: On some archs it may return true for core RODATA, and false - * for others. But will always be true for core RW data. - */ -int core_kernel_data(unsigned long addr) -{ - if (addr >= (unsigned long)_sdata && - addr < (unsigned long)_edata) - return 1; - return 0; -} - int __kernel_text_address(unsigned long addr) { if (kernel_text_address(addr)) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index e6fb3e6e1ffc..d01ca1cb2d5f 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -323,7 +323,7 @@ int __register_ftrace_function(struct ftrace_ops *ops) if (!ftrace_enabled && (ops->flags & FTRACE_OPS_FL_PERMANENT)) return -EBUSY; - if (!core_kernel_data((unsigned long)ops)) + if (!is_kernel_core_data((unsigned long)ops)) ops->flags |= FTRACE_OPS_FL_DYNAMIC; add_ftrace_ops(_ops_list, ops); diff --git a/net/sysctl_net.c b/net/sysctl_net.c index f6cb0d4d114c..4b45ed631eb8 100644 --- a/net/sysctl_net.c +++ b/net/sysctl_net.c @@ -144,7 +144,7 @@ static void ensure_safe_net_sysctl(struct net *net, const char *path, addr = (unsigned long)ent->data; if (is_module_address(addr)) where = "module"; - else if (core_kernel_data(addr)) + else if (is_kernel_core_data(addr)) where = "kernel"; else continue; -- 2.26.2
[PATCH v2 7/7] powerpc/mm: Use is_kernel_text() and is_kernel_inittext() helper
Use is_kernel_text() and is_kernel_inittext() helper to simplify code, also drop etext, _stext, _sinittext, _einittext declaration which already declared in section.h. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Kefeng Wang --- arch/powerpc/mm/pgtable_32.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index dcf5ecca19d9..13c798308c2e 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -33,8 +33,6 @@ #include -extern char etext[], _stext[], _sinittext[], _einittext[]; - static u8 early_fixmap_pagetable[FIXMAP_PTE_SIZE] __page_aligned_data; notrace void __init early_ioremap_init(void) @@ -104,14 +102,13 @@ static void __init __mapin_ram_chunk(unsigned long offset, unsigned long top) { unsigned long v, s; phys_addr_t p; - int ktext; + bool ktext; s = offset; v = PAGE_OFFSET + s; p = memstart_addr + s; for (; s < top; s += PAGE_SIZE) { - ktext = ((char *)v >= _stext && (char *)v < etext) || - ((char *)v >= _sinittext && (char *)v < _einittext); + ktext = (is_kernel_text(v) || is_kernel_inittext(v)); map_kernel_page(v, p, ktext ? PAGE_KERNEL_TEXT : PAGE_KERNEL); v += PAGE_SIZE; p += PAGE_SIZE; -- 2.26.2
[PATCH v2 4/7] sections: Move is_kernel_inittext() into sections.h
The is_kernel_inittext() and init_kernel_text() are with same functionality, let's just keep is_kernel_inittext() and move it into sections.h, then update all the callers. Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Arnd Bergmann Cc: x...@kernel.org Signed-off-by: Kefeng Wang --- arch/x86/kernel/unwind_orc.c | 2 +- include/asm-generic/sections.h | 14 ++ include/linux/kallsyms.h | 8 include/linux/kernel.h | 1 - kernel/extable.c | 12 ++-- 5 files changed, 17 insertions(+), 20 deletions(-) diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index a1202536fc57..d92ec2ced059 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -175,7 +175,7 @@ static struct orc_entry *orc_find(unsigned long ip) } /* vmlinux .init slow lookup: */ - if (init_kernel_text(ip)) + if (is_kernel_inittext(ip)) return __orc_find(__start_orc_unwind_ip, __start_orc_unwind, __stop_orc_unwind_ip - __start_orc_unwind_ip, ip); diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index 26ed9fc9b4e3..4f2f32aa2b7a 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -170,4 +170,18 @@ static inline bool is_kernel_rodata(unsigned long addr) addr < (unsigned long)__end_rodata; } +/** + * is_kernel_inittext - checks if the pointer address is located in the + *.init.text section + * + * @addr: address to check + * + * Returns: true if the address is located in .init.text, false otherwise. + */ +static inline bool is_kernel_inittext(unsigned long addr) +{ + return addr >= (unsigned long)_sinittext && + addr < (unsigned long)_einittext; +} + #endif /* _ASM_GENERIC_SECTIONS_H_ */ diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index b016c62f30a6..8a9d329c927c 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -24,14 +24,6 @@ struct cred; struct module; -static inline int is_kernel_inittext(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext - && addr < (unsigned long)_einittext) - return 1; - return 0; -} - static inline int is_kernel_text(unsigned long addr) { if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 0622418bafbc..d4ba46cf4737 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -229,7 +229,6 @@ extern bool parse_option_str(const char *str, const char *option); extern char *next_arg(char *args, char **param, char **val); extern int core_kernel_text(unsigned long addr); -extern int init_kernel_text(unsigned long addr); extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); diff --git a/kernel/extable.c b/kernel/extable.c index da26203841d4..98ca627ac5ef 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -62,14 +62,6 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) return e; } -int init_kernel_text(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext && - addr < (unsigned long)_einittext) - return 1; - return 0; -} - int notrace core_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_stext && @@ -77,7 +69,7 @@ int notrace core_kernel_text(unsigned long addr) return 1; if (system_state < SYSTEM_RUNNING && - init_kernel_text(addr)) + is_kernel_inittext(addr)) return 1; return 0; } @@ -94,7 +86,7 @@ int __kernel_text_address(unsigned long addr) * Since we are after the module-symbols check, there's * no danger of address overlap: */ - if (init_kernel_text(addr)) + if (is_kernel_inittext(addr)) return 1; return 0; } -- 2.26.2
[PATCH v2 5/7] kallsyms: Rename is_kernel() and is_kernel_text()
The is_kernel[_text]() function check the address whether or not in kernel[_text] ranges, also they will check the address whether or not in gate area, so use better name. Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Sami Tolvanen Cc: Nathan Chancellor Cc: Arnd Bergmann Cc: b...@vger.kernel.org Signed-off-by: Kefeng Wang --- arch/x86/net/bpf_jit_comp.c | 2 +- include/linux/kallsyms.h| 8 kernel/cfi.c| 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 333650b9372a..c87d0dd4370d 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -372,7 +372,7 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, void *old_addr, void *new_addr) { - if (!is_kernel_text((long)ip) && + if (!is_kernel_text_or_gate_area((long)ip) && !is_bpf_text_address((long)ip)) /* BPF poking in modules is not supported */ return -EINVAL; diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 8a9d329c927c..4f501ac9c2c2 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -24,14 +24,14 @@ struct cred; struct module; -static inline int is_kernel_text(unsigned long addr) +static inline int is_kernel_text_or_gate_area(unsigned long addr) { if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } -static inline int is_kernel(unsigned long addr) +static inline int is_kernel_or_gate_area(unsigned long addr) { if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) return 1; @@ -41,9 +41,9 @@ static inline int is_kernel(unsigned long addr) static inline int is_ksym_addr(unsigned long addr) { if (IS_ENABLED(CONFIG_KALLSYMS_ALL)) - return is_kernel(addr); + return is_kernel_or_gate_area(addr); - return is_kernel_text(addr) || is_kernel_inittext(addr); + return is_kernel_text_or_gate_area(addr) || is_kernel_inittext(addr); } static inline void *dereference_symbol_descriptor(void *ptr) diff --git a/kernel/cfi.c b/kernel/cfi.c index e17a56639766..e7d90eff4382 100644 --- a/kernel/cfi.c +++ b/kernel/cfi.c @@ -282,7 +282,7 @@ static inline cfi_check_fn find_check_fn(unsigned long ptr) { cfi_check_fn fn = NULL; - if (is_kernel_text(ptr)) + if (is_kernel_text_or_gate_area(ptr)) return __cfi_check; /* -- 2.26.2
[PATCH v2 2/7] kallsyms: Fix address-checks for kernel related range
The is_kernel_inittext/is_kernel_text/is_kernel function should not include the end address(the labels _einittext, _etext and _end) when check the address range, the issue exists since Linux v2.6.12. Cc: Arnd Bergmann Cc: Sergey Senozhatsky Cc: Petr Mladek Acked-by: Sergey Senozhatsky Reviewed-by: Petr Mladek Signed-off-by: Kefeng Wang --- include/linux/kallsyms.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 2a241e3f063f..b016c62f30a6 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -27,21 +27,21 @@ struct module; static inline int is_kernel_inittext(unsigned long addr) { if (addr >= (unsigned long)_sinittext - && addr <= (unsigned long)_einittext) + && addr < (unsigned long)_einittext) return 1; return 0; } static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) + if ((addr >= (unsigned long)_stext && addr < (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } static inline int is_kernel(unsigned long addr) { - if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end) + if (addr >= (unsigned long)_stext && addr < (unsigned long)_end) return 1; return in_gate_area_no_mm(addr); } -- 2.26.2
[PATCH v2 1/7] kallsyms: Remove arch specific text and data check
After commit 4ba66a976072 ("arch: remove blackfin port"), no need arch-specific text/data check. Cc: Arnd Bergmann Signed-off-by: Kefeng Wang --- include/asm-generic/sections.h | 16 include/linux/kallsyms.h | 3 +-- kernel/locking/lockdep.c | 3 --- 3 files changed, 1 insertion(+), 21 deletions(-) diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h index d16302d3eb59..817309e289db 100644 --- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -64,22 +64,6 @@ extern __visible const void __nosave_begin, __nosave_end; #define dereference_kernel_function_descriptor(p) ((void *)(p)) #endif -/* random extra sections (if any). Override - * in asm/sections.h */ -#ifndef arch_is_kernel_text -static inline int arch_is_kernel_text(unsigned long addr) -{ - return 0; -} -#endif - -#ifndef arch_is_kernel_data -static inline int arch_is_kernel_data(unsigned long addr) -{ - return 0; -} -#endif - /* * Check if an address is part of freed initmem. This is needed on architectures * with virt == phys kernel mapping, for code that wants to check if an address diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 6851c2313cad..2a241e3f063f 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -34,8 +34,7 @@ static inline int is_kernel_inittext(unsigned long addr) static inline int is_kernel_text(unsigned long addr) { - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) || - arch_is_kernel_text(addr)) + if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext)) return 1; return in_gate_area_no_mm(addr); } diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index bf1c00c881e4..64b17e995108 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -803,9 +803,6 @@ static int static_obj(const void *obj) if ((addr >= start) && (addr < end)) return 1; - if (arch_is_kernel_data(addr)) - return 1; - /* * in-kernel percpu var? */ -- 2.26.2
[PATCH v2 0/7] sections: Unify kernel sections range check and use
There are three head files(kallsyms.h, kernel.h and sections.h) which include the kernel sections range check, let's make some cleanup and unify them. 1. cleanup arch specific text/data check and fix address boundary check in kallsyms.h 2. make all the basic/core kernel range check function into sections.h 3. update all the callers, and use the helper in sections.h to simplify the code After this series, we have 5 APIs about kernel sections range check in sections.h * is_kernel_core_data()--- come from core_kernel_data() in kernel.h * is_kernel_rodata() --- already in sections.h * is_kernel_text() --- come from kallsyms.h * is_kernel_inittext() --- come from kernel.h and kallsyms.h * is_kernel() --- come from kallsyms.h Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux-a...@vger.kernel.org Cc: io...@lists.linux-foundation.org Cc: b...@vger.kernel.org v2: - add ACK/RW to patch2, and drop inappropriate fix tag - keep 'core' to check kernel data, suggestted by Steven Rostedt , rename is_kernel_data() to is_kernel_core_data() - drop patch8 which is merged - drop patch9 which is resend independently v1: https://lore.kernel.org/linux-arch/20210626073439.150586-1-wangkefeng.w...@huawei.com Kefeng Wang (7): kallsyms: Remove arch specific text and data check kallsyms: Fix address-checks for kernel related range sections: Move and rename core_kernel_data() to is_kernel_core_data() sections: Move is_kernel_inittext() into sections.h kallsyms: Rename is_kernel() and is_kernel_text() sections: Add new is_kernel() and is_kernel_text() powerpc/mm: Use is_kernel_text() and is_kernel_inittext() helper arch/powerpc/mm/pgtable_32.c | 7 +--- arch/x86/kernel/unwind_orc.c | 2 +- arch/x86/net/bpf_jit_comp.c| 2 +- include/asm-generic/sections.h | 71 ++ include/linux/kallsyms.h | 21 +++--- include/linux/kernel.h | 2 - kernel/cfi.c | 2 +- kernel/extable.c | 33 ++-- kernel/locking/lockdep.c | 3 -- kernel/trace/ftrace.c | 2 +- mm/kasan/report.c | 2 +- net/sysctl_net.c | 2 +- 12 files changed, 72 insertions(+), 77 deletions(-) -- 2.26.2